Principal Component Analysis, Visualized

If you have ever taken an online course on Machine Learning, you have probably come across Principal Component Analysis (PCA) for dimensionality reduction, or in simple terms, for compression of data. I had taken such courses too, but I never really understood the graphical significance of PCA because all I saw was matrices and equations. It took me quite a lot of time to understand this concept from various sources, so I decided to compile it all in one place.

In this article, we will take a visual (graphical) approach to understand PCA and how it can be used to compress data. Basic knowledge of linear algebra and matrices is assumed. If you are new to this concept, just follow along; I have tried my best to keep this as simple as possible.

Introduction

These days, datasets containing a large number of dimensions are increasingly common and are often difficult to interpret. One example is a database of face photographs of, let's say, 1,000,000 people. If each face photograph has a dimension of 100x100, then the data of each face is 10,000-dimensional (there are 100x100 = 10,000 unique values to be stored for each face). Now, if 1 byte is required to store the information of each pixel, then 10,000 bytes are required to store 1 face. Since there are 1,000,000 faces in the database, 10,000 x 1,000,000 bytes = 10 GB will be needed to store the dataset.

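As a quick check, here is the same storage arithmetic in a few lines of Python (the variable names are illustrative):

```python
pixels_per_face = 100 * 100            # 10,000 values per face
bytes_per_face = pixels_per_face * 1   # 1 byte per pixel
num_faces = 1_000_000

total_bytes = bytes_per_face * num_faces
print(total_bytes / 10**9)             # 10.0 GB
```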

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, exploiting the fact that the images in these datasets have something in common. For instance, in a dataset consisting of face photographs, each photograph will have facial features like eyes, a nose, and a mouth. Instead of encoding this information pixel by pixel, we could make a template for each type of feature and then just combine these templates to generate any face in the dataset. In this approach, each template will still be 100x100 = 10,000-dimensional, but since we will be reusing these templates (basis functions) to generate each face in the dataset, the number of templates required will be very small. PCA does exactly this.

How does PCA work?

This part is going to be a bit technical, so bear with me! I will try to explain the working of PCA with a simple example. Let's consider the data shown below, containing 100 points, each 2-dimensional (x and y coordinates are needed to represent each point).

[Figure: scatter plot of the 100-point, 2-dimensional dataset. Image by Author]

Currently, we are using 2 values to represent each point. Let's describe this situation in a more technical way. We are currently using 2 basis functions, x as (1, 0) and y as (0, 1). Each point in the dataset is represented as a weighted sum of these basis functions. For instance, the point (2, 3) can be represented as 2(1, 0) + 3(0, 1) = (2, 3). If we omit either of these basis functions, we will not be able to represent the points in the dataset accurately. Therefore, both dimensions are necessary, and we can't just drop one of them to reduce the storage requirement. This set of basis functions is simply the Cartesian coordinate system in 2 dimensions.

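To make the weighted-sum idea concrete, here is a tiny NumPy illustration (the variable names are mine, not from the article):

```python
import numpy as np

x_basis = np.array([1, 0])   # the x basis function
y_basis = np.array([0, 1])   # the y basis function

# The point (2, 3) as a weighted sum of the two basis functions
point = 2 * x_basis + 3 * y_basis
print(point)                 # [2 3]
```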

If we look closely, we can see that the data approximately follows a line, shown in red below.

[Figure: the dataset with the approximating line drawn in red. Image by Author]

Now, let's rotate the coordinate system such that the x-axis lies along the red line. Then, the y-axis (green line) will be perpendicular to this red line. Let's call these new x and y axes the a-axis and b-axis respectively. This is shown below.

[Figure: the rotated coordinate system, with the a-axis in red and the b-axis in green. Image by Author]

Now, if we use a and b as the new set of basis functions (instead of x and y) for this dataset, it wouldn't be wrong to say that most of the variance in the dataset is along the a-axis. If we drop the b-axis, we can still represent the points in the dataset very accurately using just the a-axis. Therefore, we now need only half as much storage to store the dataset and reconstruct it accurately. This is exactly how PCA works.

PCA is a 4-step process. Starting with a dataset containing n dimensions (requiring n axes to represent it):

  • Find a new set of basis functions (n axes) where some axes contribute most of the variance in the dataset while others contribute very little.

  • Arrange these axes in decreasing order of variance contribution.

  • Pick the top k axes and drop the remaining n-k axes.

  • Project the dataset onto these k axes.

After these 4 steps, the dataset will be compressed from n dimensions to just k dimensions (k < n).

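The whole pipeline is also available off the shelf. Below is a minimal sketch using scikit-learn on a synthetic dataset shaped like the one above; the data-generation parameters are my own assumption, not taken from the article's notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the article's dataset: 100 points lying
# roughly along a line in 2-D
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 100)
data = np.column_stack([x, 0.5 * x + rng.normal(0, 0.05, 100)])  # (100, 2)

pca = PCA(n_components=1)                           # keep k = 1 of n = 2 axes
compressed = pca.fit_transform(data)                # shape (100, 1)
reconstructed = pca.inverse_transform(compressed)   # back to (100, 2)

print(pca.explained_variance_ratio_)                # fraction of variance kept
```

The manual steps below do the same thing explicitly.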

Steps

For the sake of simplicity, let's take the above dataset and apply PCA to it. The steps involved will be technical, and basic knowledge of linear algebra is assumed. You can view the Colab Notebook here:

Step 1

Since this is a 2-dimensional dataset, n = 2. The first step is to find the new set of basis functions (a and b). In the explanation above, we saw that the dataset had its maximum variance along a line, and we manually chose that line as the a-axis and the line perpendicular to it as the b-axis. In practice, we want this step to be automated.

To accomplish this, we can find the eigenvalues and eigenvectors of the covariance matrix of the dataset. Since the dataset is 2-dimensional, we will get 2 eigenvalues and their corresponding eigenvectors. The 2 eigenvectors are the two basis functions (new axes), and the two eigenvalues tell us the variance contribution of the corresponding eigenvectors. A large eigenvalue implies that the corresponding eigenvector (axis) contributes more towards the total variance of the dataset.

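A sketch of this step in NumPy, reusing the synthetic `data` array from the scikit-learn sketch above:

```python
import numpy as np

cov = np.cov(data, rowvar=False)   # 2x2 covariance matrix of the (100, 2) data
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# eigh is appropriate because a covariance matrix is symmetric; it returns
# the eigenvalues in ascending order, with one eigenvector per column
```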

[Figure: the computed eigenvalues and eigenvectors of the covariance matrix. Image by Author]

Step 2

Now, sort the eigenvectors (axes) in order of decreasing eigenvalue. Here, we can see that the eigenvalue for the a-axis is much larger than that of the b-axis, meaning that the a-axis contributes more towards the dataset variance.

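Continuing the sketch, sorting is a single argsort:

```python
# Reorder the axes by decreasing eigenvalue (eigh returns them ascending)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column 0 is now the a-axis
```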

[Figure: the eigenvectors sorted by decreasing eigenvalue. Image by Author]

The percentage contribution of each axis towards the total dataset variance can be calculated as:

variance contribution of axis i (%) = λᵢ / (λ₁ + λ₂ + … + λₙ) × 100
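In code:

```python
variance_pct = 100 * eigenvalues / eigenvalues.sum()
print(variance_pct)   # roughly [99.7, 0.3] for a dataset like this one
```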

These numbers show that the a-axis contributes 99.7% of the dataset variance, and that we can drop the b-axis and lose just 0.28% of the variance.

Step 3

Now, we will drop the b-axis and keep only the a-axis.

[Figure: the dataset with only the a-axis retained. Image by Author]
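In the sketch, this amounts to keeping only the first (largest-eigenvalue) eigenvector:

```python
a_axis = eigenvectors[:, 0]   # the a-axis; the b-axis column is discarded
```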

Step 4

Now, reshape the first eigenvector (the a-axis) into a 2x1 matrix, called the projection matrix. It will be used to project the original dataset of shape (100, 2) onto the new basis function (the a-axis), compressing it to (100, 1).

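A sketch of the projection; note that PCA is conventionally applied to mean-centered data, so centering is included here even though the article does not show it explicitly:

```python
projection_matrix = a_axis.reshape(2, 1)         # 2x1 projection matrix

mean = data.mean(axis=0)
compressed = (data - mean) @ projection_matrix   # (100, 2) -> (100, 1)
```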


Reconstruct the data

Now, we can use the projection matrix to expand the data back to its original size, with, of course, a small loss of variance (0.28%).

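Continuing the sketch, reconstruction projects back with the transpose of the projection matrix and restores the mean:

```python
reconstructed = compressed @ projection_matrix.T + mean   # (100, 1) -> (100, 2)
```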


The reconstructed data is shown below:

[Figure: the reconstructed dataset. Image by Author]

Please note that the variance along the b-axis (0.28%) is lost, as is evident from the figure above.

That's all folks!

If you made it this far, hats off to you! In this article, we took a graphical approach to understand how Principal Component Analysis works and how it can be used for data compression. In my next article, I will show how PCA can be used to compress Labelled Faces in the Wild (LFW), a large-scale dataset consisting of 13,233 human-face images.

If you have any suggestions, please leave a comment. I write articles regularly, so consider following me to get more articles like this in your feed.

If you liked this article, you might also like these:

Visit my website to learn more about me and my work.

Translated from: https://towardsdatascience.com/principal-component-analysis-visualized-17701e18f2fa
