Tips on Principal Component Analysis (PCA)

Introduction

Principal Component Analysis (PCA) is an unsupervised technique for dimensionality reduction.


What is dimensionality reduction?


Let us start with an example. In a tabular data set, each column represents a feature, or dimension. It is commonly known that a tabular data set with many columns/features is difficult to manipulate, especially when there are more columns than observations.

Given a linearly modelable problem with p = 40 features, the best subset approach would fit about a trillion (2^p - 1 ≈ 1.1 × 10^12) possible models and submodels, making their computation extremely onerous.

How does PCA come to our aid?

PCA can extract information from a high-dimensional space (i.e., a tabular data set with many columns) by projecting it onto a lower-dimensional subspace. The idea is that the projection space will have dimensions, named principal components, that will explain the majority of the variation of the original data set.


How does PCA work exactly?


PCA is the eigenvalue decomposition of the covariance matrix of the centered features, performed to find the directions of maximum variation. The eigenvalues represent the variance explained by each principal component.

The purpose of PCA is to obtain an easier and faster way to both manipulate the data set (reducing its dimensions) and retain most of the original information through the explained variance.

The question now is


How many components should I use for dimensionality reduction? What is the “right” number?


In this post, we will discuss some tips for selecting the optimal number of principal components, with practical examples in Python, by:

  1. Observing the cumulative ratio of explained variance.
  2. Observing the eigenvalues of the covariance matrix.
  3. Tuning the number of components as a hyper-parameter in a cross-validation framework where PCA is applied in a Machine Learning pipeline.

Finally, we will also apply dimensionality reduction on a new observation, in the scenario where PCA was already applied to a data set, and we would like to project the new observation on the previously obtained subspace.


Environment set-up

At first, we import the modules we will be using, and load the “Breast Cancer Data Set”: it contains 569 observations and 30 features carrying relevant clinical information (such as radius, texture, perimeter, area, etc.) computed from digitized images of aspirates of breast masses. It presents a binary classification problem, as the labels are only 0 or 1 (benign vs malignant), indicating whether a patient has breast cancer or not.

The data set is already available in scikit-learn:

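A minimal loading sketch could look as follows (the variable names X and y are our own choice, not prescribed by the original post):

```python
from sklearn.datasets import load_breast_cancer

# Load the data set: 569 observations, 30 features, binary target (0/1)
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)  # (569, 30)
```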

Without diving deep into the pre-processing task, it is important to mention that PCA is affected by the different scales of the features in the data.

Therefore, before applying PCA the data must be scaled (i.e., converted to have mean=0 and variance=1). This can be easily achieved with the scikit-learn StandardScaler object:

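A minimal sketch of the scaling step, assuming the features are stored in X as in the loading sketch above:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the features to zero mean and unit variance before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print('Mean: ', X_scaled.mean())
print('Standard Deviation:', X_scaled.std())
```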

This returns:


Mean:  -6.118909323768877e-16
Standard Deviation: 1.0

Once the features are scaled, applying PCA is straightforward. In fact, scikit-learn handles almost everything by itself: the user only has to declare the number of components and then fit.

Notably, the scikit-learn user can either declare the number of components to be used, or the ratio of explained variance to be reached:


  • pca = PCA(n_components=5): performs PCA using 5 components.

  • pca = PCA(n_components=.95): performs PCA using the number of components sufficient to explain 95% of the variance.
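As a hedged sketch, both options can be applied to the scaled data (X_scaled comes from the previous step):

```python
from sklearn.decomposition import PCA

# Option 1: fix the number of components
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Option 2: let scikit-learn pick the smallest number of components
# explaining at least 95% of the variance
pca_95 = PCA(n_components=.95)
X_pca_95 = pca_95.fit_transform(X_scaled)

print(X_pca.shape, X_pca_95.shape)
```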

Indeed, this is one way to select the number of components: asking scikit-learn to reach a certain amount of explained variance, such as 95%. But maybe we could have used a significantly lower number of dimensions and reached a similar variance, for example 92%.

So, how do we select the number of components?


1. Observing the ratio of explained variance

PCA achieves dimensionality reduction by projecting the observations on a smaller subspace, but we also want to keep as much information as possible in terms of variance.


So, one heuristic yet effective approach is to see how much variance is explained by adding the principal components one by one, and afterwards select the number of dimensions that meets our expectations.

It is very easy to follow this approach thanks to scikit-learn, which provides the explained_variance_ratio_ attribute on the (fitted) PCA object:
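A possible sketch of this step (here we re-fit PCA keeping all 30 components so that the whole curve can be inspected; the plotting details are our own assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Keep all 30 components to inspect the full explained variance curve
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X_scaled)

# Cumulative ratio of explained variance
cum_explained_variance = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, 31), cum_explained_variance, marker='o')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative ratio of explained variance')
plt.grid(True)
plt.show()
```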

[Plot: cumulative ratio of explained variance vs. number of principal components]

From the plot, we can see that the first 6 components are sufficient to retain the 89% of the original variance.


This is a good result, considering that we started with a data set of 30 features, and that we can limit further analysis to only 6 dimensions without losing too much information.

2. Using the covariance matrix

Covariance is a measure of the “spread” of a set of observations around their mean value. When we apply PCA, what happens behind the curtain is that we apply a rotation to the covariance matrix of our data, in order to achieve a diagonal covariance matrix. In this way, we obtain data whose dimensions are uncorrelated.


The diagonal covariance matrix obtained after transformation is the eigenvalue matrix, where the eigenvalues correspond to the variance explained by each component.


Therefore, another approach to the selection of the ideal number of components is to look for an “elbow” in the plot of the eigenvalues.


Let us observe the first elements of the covariance matrix of the principal components. As said, we expect it to be diagonal:

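A sketch of this check, where X_pca is the projected data from the previous step:

```python
import numpy as np

# Covariance matrix of the data projected onto the principal components
cov_pca = np.cov(X_pca.T)

# First 5x5 block: off-diagonal entries should be numerically close to zero
print(np.round(cov_pca[:5, :5], 4))
```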

[Output: first elements of the covariance matrix of the principal components]

Indeed, at first glance the covariance matrix appears to be diagonal. In order to be sure that the matrix is diagonal, we can verify that all the values outside of the main diagonal are almost equal to zero (up to a certain decimal, as they will not be exactly zero).


We can use the assert_almost_equal statement, which raises an exception if its condition is not met and produces no visible output otherwise. In this case, no exception is raised (up to the tenth decimal):
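A sketch of the check with numpy.testing, using the cov_pca matrix computed above:

```python
import numpy as np
from numpy.testing import assert_almost_equal

# All off-diagonal elements must be zero up to the 10th decimal:
# no exception raised means the covariance matrix is diagonal
off_diagonal = cov_pca - np.diag(np.diag(cov_pca))
assert_almost_equal(off_diagonal, np.zeros(cov_pca.shape), decimal=10)
```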

The matrix is diagonal. Now we can proceed to plot the eigenvalues from the covariance matrix and look for an elbow in the plot.


We use the diag method to extract the eigenvalues from the covariance matrix:

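A sketch of this step, continuing from the previous snippets:

```python
import numpy as np
import matplotlib.pyplot as plt

# The diagonal of the covariance matrix holds the eigenvalues,
# i.e. the variance explained by each principal component
eigenvalues = np.diag(cov_pca)

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (explained variance)')
plt.grid(True)
plt.show()
```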

[Plot: eigenvalues of the covariance matrix by principal component]

We may see an “elbow” around the sixth component, where the slope seems to change significantly.


Actually, all these steps were not needed: scikit-learn provides, among others, the explained_variance_ attribute, defined in the documentation as “The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.”:
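A sketch using the attribute directly on the fitted PCA object:

```python
import matplotlib.pyplot as plt

# Same information, taken directly from the fitted PCA object
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('explained_variance_')
plt.grid(True)
plt.show()
```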

[Plot: explained_variance_ values from the fitted PCA object]

In fact, we notice the same result as from the calculation of the covariance matrix and the eigenvalues.


3. Applying a cross-validation procedure

Although PCA is an unsupervised technique, it might be used together with other techniques in a broader pipeline for a supervised problem.


For instance, we might have a classification (or regression) problem in a large data set, and we might apply PCA before our classification (or regression) model in order to reduce the dimensionality of the input dataset.


In this scenario, we would tune the number of principal components as a hyper-parameter within a cross-validation procedure.


This can be achieved by using two scikit-learn objects:

  • Pipeline: allows the definition of a pipeline of sequential steps in order to cross-validate them together.

  • GridSearchCV: performs a grid search in a cross-validation framework for hyper-parameter tuning (i.e., finding the optimal parameters of the steps in the pipeline).

The process is as follows:


  1. The steps (dimensionality reduction, classification) are chained in a pipeline.
  2. The parameters to search are defined.
  3. The grid search procedure is executed.

In our example, we are facing a binary classification problem. Therefore, we apply PCA followed by logistic regression in a pipeline:

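A hedged sketch of the pipeline and grid search; the parameter ranges below are our own assumptions and not necessarily the ones used in the original experiment:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Chain scaling, PCA and logistic regression into a single estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('log_reg', LogisticRegression(max_iter=10000)),
])

# Hypothetical search space for the number of components and the regularization strength
param_grid = {
    'pca__n_components': list(range(1, 31)),
    'log_reg__C': np.logspace(-4, 4, 81),
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print('Best parameters obtained from Grid Search:')
print(search.best_params_)
```

Keeping the scaler inside the pipeline ensures that scaling is re-fitted on each training fold, avoiding information leakage during cross-validation.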

This returns:


Best parameters obtained from Grid Search:
{'log_reg__C': 1.2589254117941673, 'pca__n_components': 9}

The grid search finds the best number of components for the PCA during the cross-validation procedure.


For our problem and the tested parameter ranges, the best number of components is 9.

The grid search provides more detailed results in the cv_results_ attribute, which can be stored as a pandas dataframe and inspected:
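For instance, assuming the fitted search object from the previous sketch:

```python
import pandas as pd

# Detailed cross-validation results as a dataframe
cv_results = pd.DataFrame(search.cv_results_)
print(cv_results.head())
```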

[Table: some of the columns of the dataframe obtained by storing the cv_results_ attribute output]

As we can see, it contains detailed information on the cross-validated procedure with the grid search.


But we might not be interested in seeing all the iterations performed by the grid search. Therefore, we can get the best validation score (averaged over all folds) for each number of components, and finally plot them together with the cumulative ratio of explained variance:
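A possible sketch, based on the standard cv_results_ columns (cv_results and cum_explained_variance come from the earlier sketches):

```python
import matplotlib.pyplot as plt

# Best mean validation accuracy for each number of components,
# taken over all the other hyper-parameter values
best_scores = cv_results.groupby('param_pca__n_components')['mean_test_score'].max()

fig, ax1 = plt.subplots()
ax1.plot(best_scores.index.astype(int), best_scores.values, marker='o')
ax1.set_xlabel('Number of principal components')
ax1.set_ylabel('Best validation accuracy')

# Overlay the cumulative ratio of explained variance on a secondary axis
ax2 = ax1.twinx()
ax2.plot(range(1, 31), cum_explained_variance, color='gray', linestyle='--')
ax2.set_ylabel('Cumulative ratio of explained variance')

plt.show()
```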

[Plot: best validation accuracy and cumulative ratio of explained variance vs. number of components]

From the plot, we can notice that 6 components are enough to create a model whose validation accuracy reaches 97%, while considering all 30 components would lead to a 98% validation accuracy.

In a scenario with a significant number of features in an input data set, reducing the number of input features with PCA could lead to significant advantages in terms of:

  1. Reduced training and prediction time.

  2. Increased scalability.

  3. Reduced training computational effort.

At the same time, by choosing the optimal number of principal components in the pipeline, tuning it as a hyper-parameter in a cross-validated procedure, we make sure to retain optimal performance for the supervised problem.

However, it must be taken into account that, in a data set with many features, PCA itself may prove computationally expensive.

How to apply PCA to a new observation?

Now, let us suppose that we have applied the PCA to an existing data set and kept (for example) 6 components.


At some point, a new observation is added to the data set and needs to be projected on the reduced subspace obtained by PCA.


How can this be achieved?


We can perform this calculation manually through the projection matrix.


We also estimate the error of the manual calculation by checking whether we obtain the same output as fit_transform on the original data:
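A sketch of both checks; it assumes the scaled data X_scaled and re-fits PCA with 6 components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Re-fit PCA keeping 6 components
pca_6 = PCA(n_components=6)
X_pca_6 = pca_6.fit_transform(X_scaled)

# The rows of components_ are the principal axes: this is the projection matrix
W = pca_6.components_

# Orthogonality check: W @ W.T should be the 6x6 identity matrix
print(np.round(W @ W.T, 4))

# Manual projection of the (centered) data and error with respect to fit_transform
X_manual = (X_scaled - pca_6.mean_) @ W.T
print('Max absolute error:', np.abs(X_manual - X_pca_6).max())
```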

[Output: orthogonality check of the projection matrix and error of the manual projection]

The projection matrix is orthogonal, and the manual reduction introduces only a negligible numerical error.

We can finally obtain the projection by multiplying the new observation (scaled) by the transposed projection matrix:
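A sketch of the operation; purely for illustration, the first row of the original data is re-used here as the "new" observation, so the printed values would differ from the ones reported below:

```python
# A hypothetical new observation with the original 30 features
new_obs = X[0, :].reshape(1, -1)

# Scale it with the previously fitted scaler,
# then multiply by the transposed projection matrix
new_obs_scaled = scaler.transform(new_obs)
new_obs_projected = (new_obs_scaled - pca_6.mean_) @ pca_6.components_.T

print(new_obs_projected)
```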

This returns:


[-3.22877012 -1.17207348  0.26466433 -1.00294458  0.89446764  0.62922496]

That’s it! The new observation is projected onto the 6-dimensional subspace obtained with PCA.

Conclusion

This tutorial is meant to provide a few tips on selecting the number of components to be used for dimensionality reduction with PCA, showing practical demonstrations in Python.

Finally, it also explains how to project a new sample onto the previously obtained reduced subspace, a piece of information that is rarely found in tutorials on the subject.

This is but a brief overview. The topic is far broader and has been deeply investigated in the literature.

Translated from: https://medium.com/@nicolo_albanese/tips-on-principal-component-analysis-7116265971ad
