Statistical and Mathematical Concepts Behind PCA
As I promised in the previous article, Principal Component Analysis (PCA) with Scikit-learn, today I'll discuss the mathematics behind principal component analysis by manually executing the algorithm using the powerful numpy and pandas libraries. This will help you understand how PCA really works behind the scenes.
Before proceeding to read this one, I highly recommend that you read the following article:
Principal Component Analysis (PCA) with Scikit-learn
This is because this article is a continuation of the one above.
In this article, I first review some statistical and mathematical concepts that are required to execute the PCA calculations.
Statistical concepts behind PCA
Mean
The mean (also called the average) is calculated by simply adding all the values and dividing by the number of values.
Standard Deviation
The standard deviation is a measure of how closely the data lies to the mean. It is the square root of the variance.
Covariance
The standard deviation is calculated on a single variable. The covariance, in contrast, measures how one variable varies against another. When the covariance of a variable is computed against itself, the result is the same as simply calculating the variance of that variable.
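Concretely, for two variables X and Y with n paired observations, the sample covariance is:

$$\mathrm{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$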
Covariance Matrix
A covariance matrix is a matrix representation of all the covariance values that can be computed between the features of a dataset. It is required to execute the PCA of a dataset. For a dataset with 3 variables called X, Y and Z, the covariance matrix takes the following form:
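$$\begin{bmatrix} \mathrm{cov}(X,X) & \mathrm{cov}(X,Y) & \mathrm{cov}(X,Z) \\ \mathrm{cov}(Y,X) & \mathrm{cov}(Y,Y) & \mathrm{cov}(Y,Z) \\ \mathrm{cov}(Z,X) & \mathrm{cov}(Z,Y) & \mathrm{cov}(Z,Z) \end{bmatrix}$$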
cov(Y, X) is the covariance of Y with respect to X. It is the same as cov(X, Y). The diagonal elements of the covariance matrix give you the variance of each variable. For example, cov(X, X) is the variance of X.
A large covariance of one variable against another suggests that one feature changes significantly with respect to the other, while a value close to zero signifies very little change.
To calculate the covariance matrix for a given dataset, we can use the numpy cov() function or the pandas DataFrame cov() method.
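As a quick illustration with a hypothetical toy dataset, both approaches produce the same matrix:

```python
import numpy as np
import pandas as pd

# Toy data: 3 variables, 5 observations (made up for illustration)
data = pd.DataFrame({'X': [1, 2, 3, 4, 5],
                     'Y': [2, 4, 5, 4, 5],
                     'Z': [5, 3, 2, 2, 1]})

print(data.cov())       # pandas: covariance between columns
print(np.cov(data.T))   # numpy: expects variables as rows, hence the transpose
```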
Mathematical concepts behind PCA
Eigenvalues and Eigenvectors
Let A be an n x n matrix. A scalar λ is called an eigenvalue of A if there is a non-zero vector x satisfying the following equation:
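$$A\mathbf{x} = \lambda\mathbf{x}$$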
The vector x is called the eigenvector of A corresponding to λ.
The above equation implicitly represents PCA. The following equation, which is the same as the above one but written in matrix terms, directly represents PCA:
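$$AU = U\Sigma$$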
Where,
A is an n x n square matrix. In terms of PCA, A is the covariance matrix.
Σ represents all the eigenvalues in the form of a diagonal matrix whose diagonal elements are the eigenvalues. Each eigenvalue indicates how much of the variability within the dataset its corresponding eigenvector contributes; the larger the eigenvalue, the greater the contribution.
U represents all the eigenvectors in the form of an n x n square matrix.
We have discussed some statistical and mathematical concepts behind PCA. In the next steps, we calculate the eigenvalues and eigenvectors using the covariance matrix of the breast_cancer dataset. Then we perform the PCA. For the entire PCA process, we will use only the numpy and pandas libraries; the scikit-learn library is used only for feature scaling.
Execute PCA manually using numpy and pandas
Step 1: Import libraries and set plot styles
As the first step, we import various Python libraries that are useful for our data analysis, data visualization, calculation and model building tasks. When importing those libraries, we use the following conventions.
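A minimal setup following the usual conventions might look like this (the plot style chosen here is my own assumption):

```python
import numpy as np                 # numerical computing
import pandas as pd                # data analysis and manipulation
import matplotlib.pyplot as plt    # data visualization

plt.style.use('ggplot')                    # hypothetical style choice
plt.rcParams['figure.figsize'] = (8, 6)    # hypothetical default figure size
```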
Step 2: Get and prepare data
The dataset that we use here is available in Scikit-learn, but it is not in the format we want. So, we have to do some manipulation to get the dataset ready for our task. First, we load the dataset using the Scikit-learn load_breast_cancer() function. Then, we convert the data into a pandas DataFrame, the format we are familiar with.
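In code, that step might look like this:

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()    # Bunch object with data, target and feature_names
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
```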
Now, the variable df contains a pandas DataFrame of the breast_cancer dataset. We can see its first 5 rows by calling the head() method.
The full dataset contains 30 columns and 569 observations.
Step 3: Obtain the feature matrix
The feature matrix contains the values of all 30 features in the dataset. It is a 569x30 two-dimensional numpy array, stored in the variable X.
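For example:

```python
X = df.values    # 2d numpy array of shape (569, 30)
```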
Step 4: Standardize the features if necessary
You can see that the values in the dataset are not equally scaled, so we need to apply z-score standardization to get all features onto the same scale. For this, we use the Scikit-learn StandardScaler() class, which is in the preprocessing submodule of Scikit-learn.
First, we import the StandardScaler() class. Then, we create an object of that class and store it in the variable sc. Then we call the sc object's fit() method with the input X (the feature matrix); this calculates the mean and standard deviation of each variable in the dataset. Finally, we do the transformation with the transform() method of the sc object. The transformed (scaled) values of X are stored in the variable X_scaled, which is also a 569x30 two-dimensional numpy array.
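Putting those steps together:

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X)                     # learn the per-feature mean and standard deviation
X_scaled = sc.transform(X)    # z-score standardized copy of X, shape (569, 30)
```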
Step 5: Compute the covariance matrix
Now, we compute the covariance matrix for all features of our dataset. Note that we use the X_scaled matrix, not X. To calculate the covariance matrix for our dataset, we can use the numpy cov() function. We need to take the transpose of X_scaled because the covariance matrix is based on the number of features (30), not observations (569).
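In code:

```python
cov_matrix = np.cov(X_scaled.T)    # transpose because np.cov() treats rows as variables
```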
The covariance matrix of our dataset is a 30x30 2d numpy array.
Step 6: Compute the eigenvalues and eigenvectors of the covariance matrix
We can use the eig() function to calculate the eigenvalues and eigenvectors of the covariance matrix. The eig() function is in the linalg module, which is a subpackage of the numpy library.
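The call looks like this:

```python
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
```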
The variable eigenvalues holds all the eigenvalues.
The variable eigenvectors holds all the eigenvectors in the form of a 30x30 2d numpy array.
Then we take the transpose of eigenvectors, so that each row holds one eigenvector.
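np.linalg.eig() returns the eigenvectors as the columns of the array, so the transpose puts one eigenvector per row:

```python
eigenvectors = eigenvectors.T    # row i is now the eigenvector for eigenvalues[i]
```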
The eigenvector for the first eigenvalue (13.304) is the first row of the eigenvectors array. It has 30 elements.
Step 7: Sort the eigenvalues and eigenvectors from the highest to the lowest
Roughly the first 20 eigenvalues already come out sorted from the highest to the lowest, so we do not need to sort the eigenvalues and eigenvectors here. This is because we need just the first 10 eigenvalues for our PCA process.
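If the eigenvalues did need sorting, a standard numpy idiom (a sketch, not part of the original code) would be:

```python
# Only needed when the eigenvalues are not already in descending order
order = np.argsort(eigenvalues)[::-1]    # indices of the eigenvalues, largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[order]       # reorder the rows to match
```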
Step 8: Compute the eigenvalues as a percentage of the variance within the dataset
From the above eigenvalues, we need only the first 10 to preserve 95.15% of the variability in the data. We then select the corresponding eigenvectors for those first 10 eigenvalues.
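A sketch of this step (the variable names are my own):

```python
# Each eigenvalue as a percentage of the total variance
variance_pct = eigenvalues / eigenvalues.sum() * 100
print(np.cumsum(variance_pct)[:10])    # the cumulative sum reaches ~95.15% at 10

# Keep the eigenvectors (rows) belonging to the first 10 eigenvalues
selected_eigenvectors = eigenvectors[:10]    # shape (10, 30)
```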
Step 9: Multiply the scaled dataset by the selected eigenvectors
The dimensionality reduction process is a matrix multiplication of the selected eigenvectors and the (scaled) data to be transformed. Note that the transpose of X_scaled is required to match the dimensions when executing the matrix multiplication.
Then we take the transpose of the data_new matrix (2d array).
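These two steps might look like this, continuing with the names above:

```python
data_new = selected_eigenvectors @ X_scaled.T    # (10, 30) @ (30, 569) -> (10, 569)
data_new = data_new.T                            # (569, 10): one row per observation
```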
Now the data_new array has the right dimensions. It contains the transformed data with 10 principal components.
Step 10: Convert the transformed data into a pandas DataFrame
Let’s create a pandas DataFrame using the values of all 10 principal components.
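For example (the column names here are my own choice):

```python
df_new = pd.DataFrame(data_new, columns=['PC' + str(i) for i in range(1, 11)])
df_new.head()
```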
The transformed dataset now has 10 features (principal components) and 569 observations. The original dataset contains 30 features and 569 observations. Therefore, we have reduced the dimensionality of the data by 20 features while preserving 95.15% of the variability in the data.
Step 11: Plot the values of the principal components
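One reasonable sketch of such a plot, assuming we scatter the first two principal components and color the points by the diagnosis target, is:

```python
# Sketch: the choice of components, colors and labels are my assumptions
plt.scatter(df_new['PC1'], df_new['PC2'], c=cancer.target, cmap='coolwarm', s=15)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Breast cancer data projected onto the first two principal components')
plt.show()
```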
To verify, you can compare the results obtained here with the results obtained from the Scikit-learn PCA() class by setting n_components=0.95. The results are exactly the same!
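A quick way to run that comparison (a sketch; note that the sign of individual components can differ between implementations):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)     # keep enough components to explain 95% of the variance
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)               # (569, 10)
```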
Thank you for reading! See you in the next article.
This tutorial was designed and created by Rukshan Pramoditha, the author of the Data Science 365 Blog.
Technologies used in this tutorial
Python (High-level programming language)
numpy (Numerical Python library)
pandas (Python data analysis and manipulation library)
matplotlib (Python data visualization library)
Jupyter Notebook (Integrated Development Environment)
Statistical concepts used in this tutorial
Mean
Standard Deviation
Covariance
Covariance Matrix
Mathematical concepts used in this tutorial
Eigenvalues and Eigenvectors
Matrix Multiplication
Matrix Transpose
Machine learning used in this tutorial
Principal Component Analysis (PCA)
2020–08–10
Translated from: https://medium.com/data-science-365/statistical-and-mathematical-concepts-behind-pca-a2cb25940cd4