An overview of Principal Component Analysis

by Moshe Binieli

This article will explain what Principal Component Analysis (PCA) is, why we need it, and how we use it. I will try to keep it as simple as possible while avoiding hard examples or words that can cause a headache.

A moment of honesty: to fully understand this article, a basic understanding of some linear algebra and statistics is essential. If you need to, take a few minutes to review the following topics so that PCA is easy to follow:

  • vectors
  • eigenvectors
  • eigenvalues
  • variance
  • covariance

So how can this algorithm help us? What are the uses of this algorithm?

  • Identifies the most relevant directions of variance in the data.
  • Helps capture the most “important” features.
  • Makes computations on the dataset easier after the dimension reduction, since there is less data to deal with.
  • Makes the data easier to visualize.

Short verbal explanation.

Let’s say we have 10 variables in our dataset, and let’s assume that 3 of those variables capture 90% of the variability in the data, while the other 7 capture only 10%.

Now suppose we want to visualize those 10 variables. Of course we cannot do that directly; we can visualize at most 3 dimensions at a time (maybe in the future we will be able to do more).

So we have a problem: we don’t know which of the variables capture the largest variability in our data. To solve this mystery, we’ll apply the PCA algorithm. Its output will tell us which variables those are. Sounds cool, doesn’t it?

So what are the steps to make PCA work? How do we apply the magic?

  1. Take the dataset you want to apply the algorithm on.
  2. Calculate the covariance matrix.
  3. Calculate the eigenvectors and their eigenvalues.
  4. Sort the eigenvectors by their eigenvalues in descending order.
  5. Choose the first K eigenvectors (where K is the dimension we’d like to end up with).
  6. Build the new reduced dataset (a compact sketch of all six steps follows below).
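
Before we touch real data, here is a minimal NumPy sketch of those six steps gathered into one function. The name pca_reduce and the explicit centering step are my own illustration, not code from the article:

import numpy as np

def pca_reduce(X, k):
    # X is an (n_samples, n_features) array, k is the target dimension
    X_centered = X - X.mean(axis = 0)                 # center each feature (np.cov also centers internally)
    cov = np.cov(X_centered, rowvar = False)          # step 2: covariance matrix, shape (n_features, n_features)
    eigenvalues, eigenvectors = np.linalg.eig(cov)    # step 3: eigenvalues and eigenvectors
    order = np.argsort(eigenvalues)[::-1]             # step 4: sort by eigenvalue, largest first
    components = eigenvectors[:, order[:k]]           # step 5: keep the first k eigenvectors
    return X_centered @ components                    # step 6: the reduced dataset, shape (n_samples, k)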

Time for an example with real data.

1) Load the dataset into a matrix:

Our main goal is to figure out which variables are the most important for us and to keep only those.

For this example, we will use the program “Spyder” to run Python. We’ll also use a pretty cool dataset that is embedded inside “sklearn.datasets”, called “load_iris”. You can read more about this dataset on Wikipedia.

First of all, we will load the iris module and transform the dataset into a matrix. The dataset contains 4 variables with 150 examples, so the dimensions of our data matrix are (150, 4).

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

irisModule = load_iris()
dataset = np.array(irisModule.data)
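
A quick sanity check (my own addition) confirms the shape mentioned above:

print(dataset.shape)              # (150, 4)
print(irisModule.feature_names)   # the 4 variables: sepal/petal length and width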

There are more rows in this dataset than the preview shows: as we said, there are 150 rows, but only the first 17 are visible in the screenshot from the original article.

The concept of PCA is to reduce the dimensionality of the matrix by finding the directions that capture most of the variability in our data matrix. Our job, then, is to find those directions.

2) Calculate the covariance matrix:

It’s time to calculate the covariance matrix of our dataset, but what does that even mean? Why do we need the covariance matrix, and what will it look like?

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how spread out a set of numbers is around their mean. The mathematical definition is:

Var(X) = E[(X − μ)²]

Covariance is a measure of the joint variability of two random variables. In other words, it describes how any 2 features vary with respect to each other. Using the covariance is very common when looking for patterns in data. The mathematical definition is:

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

From this definition, we can conclude that the covariance matrix will be symmetric. This is important because it means that its eigenvalues will be real and non-negative, and its eigenvectors will be real, which makes things easier for us (we dare you to claim that working with complex numbers is easier than real ones!)

After calculating the covariance matrix, it will look like this (schematically, for our 4 variables):

V  C  C  C
C  V  C  C
C  C  V  C
C  C  C  V

As you can see, the main diagonal is written as V (variance) and the rest is written as C (covariance). Why is that?

Because the covariance of a variable with itself is simply its variance (if you’re not sure why, take a few minutes to revisit what variance and covariance are).
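
A two-line check in NumPy (my own addition) makes this concrete: the covariance of a column with itself equals its sample variance.

x = dataset[:, 0]                                             # first feature, sepal length
print(np.allclose(np.cov(x, x)[0, 1], np.var(x, ddof = 1)))   # True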

Let’s calculate the covariance matrix of the dataset in Python using the following code:

covarianceMatrix = pd.DataFrame(data = np.cov(dataset, rowvar = False), columns = irisModule.feature_names, index = irisModule.feature_names)
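
To see what np.cov(dataset, rowvar = False) computes, here is a hand-rolled equivalent (my own illustration, not from the article): center each column, then average the products with the usual n - 1 denominator.

centered = dataset - dataset.mean(axis = 0)                             # subtract each feature's mean
manualCovariance = centered.T @ centered / (len(dataset) - 1)           # (4, 4) covariance matrix
print(np.allclose(manualCovariance, np.cov(dataset, rowvar = False)))   # True
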
  • We’re not interested in the main diagonal, because those entries are just the variance of a single variable. Since we’re trying to find new patterns in the dataset, we’ll ignore the main diagonal.

  • Since the matrix is symmetric, covariance(a, b) = covariance(b, a), we only need to look at the values above the diagonal.

Something important to mention about covariance: if the covariance of variables a and b is positive, they vary in the same direction. If the covariance of a and b is negative, they vary in opposite directions.

3) Calculate the eigenvalues and eigenvectors:

As I mentioned at the beginning, eigenvalues and eigenvectors are basic terms you must know in order to understand this step. Therefore, I won’t explain them here, but will move straight to computing them.

The eigenvector associated with the largest eigenvalue indicates the direction in which the data has the most variance. Hence, using the eigenvalues we will know which eigenvectors capture the most variability in our data.

eigenvalues, eigenvectors = np.linalg.eig(covarianceMatrix)

This gives us a vector of eigenvalues and a matrix of eigenvectors; the i-th entry of the eigenvalues vector is associated with the i-th column of the eigenvectors matrix.
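
As a quick check (my own addition), each eigenvalue/eigenvector pair satisfies the defining relation C·v = λ·v:

C = covarianceMatrix.values                                                        # plain NumPy view of the covariance matrix
print(np.allclose(C @ eigenvectors[:, 0], eigenvalues[0] * eigenvectors[:, 0]))    # True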

(The original article prints the eigenvalues vector and the 4×4 eigenvectors matrix here as screenshots.)

4) Choose the first K eigenvalues (K principal components/axes):

Each eigenvalue tells us the amount of variability in the direction of its corresponding eigenvector. Therefore, the eigenvector with the largest eigenvalue points in the direction of most variability. We call this eigenvector the first principal component (or axis). By the same logic, the eigenvector with the second largest eigenvalue is the second principal component, and so on.

We see the following eigenvalues: [4.224, 0.242, 0.078, 0.023]

Let’s translate those values into percentages and visualize them: the percentage of the total variance that each eigenvalue accounts for.

totalSum = sum(eigenvalues)
variablesExplained = [(i / totalSum) for i in sorted(eigenvalues, reverse = True)]
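
The original article visualizes these percentages as a chart; a minimal matplotlib sketch (my own, not the article’s code) that produces a similar bar plot could look like this:

import matplotlib.pyplot as plt

percentages = [round(v * 100, 1) for v in variablesExplained]   # roughly [92.5, 5.3, 1.7, 0.5]
plt.bar(range(1, len(percentages) + 1), percentages)
plt.xlabel("Principal component")
plt.ylabel("Variance explained (%)")
plt.show()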

As you can clearly see, the first eigenvalue accounts for 92.5% and the second one for 5.3%, while the third and fourth don’t explain much of the total variance. Therefore we can easily decide to keep only 2 components, the first and the second.

featureVector = eigenvectors[:,:2]
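
Taking the first two columns like this assumes the eigenvalues came back in descending order, which np.linalg.eig does not guarantee. A slightly more defensive sketch (my own addition) sorts them explicitly first:

order = np.argsort(eigenvalues)[::-1]         # indices of eigenvalues, largest first
featureVector = eigenvectors[:, order[:2]]    # the two eigenvectors with the largest eigenvalues
print(featureVector.shape)                    # (4, 2)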

Let’s remove the third and fourth components from the dataset. It’s important to say that at this point we lose some information. It is impossible to reduce dimensions without losing some information (under the assumption of general position). The PCA algorithm tells us how to reduce dimensions while keeping the maximum amount of information about our data.

(The original article shows the remaining data as a screenshot here.)

5) Build the new reduced dataset:

We want to build a new reduced dataset from the K chosen principal components.

We’ll take the K chosen principal components (K = 2 here), which gives us a matrix of size (4, 2), and we’ll take the original dataset, which is a matrix of size (150, 4).

We’ll perform the matrix multiplication as follows:

  • The first matrix is the one containing the K principal components we’ve chosen, and we transpose it: shape (2, 4).
  • The second matrix is the original dataset, and we transpose it as well: shape (4, 150).
  • At this point, we multiply those two matrices, which gives a (2, 150) result.
  • After the multiplication, we transpose the result matrix to obtain the new (150, 2) dataset.
featureVectorTranspose = np.transpose(featureVector)
datasetTranspose = np.transpose(dataset)
newDatasetTranspose = np.matmul(featureVectorTranspose, datasetTranspose)
newDataset = np.transpose(newDatasetTranspose)
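
Because transposing twice cancels out, the same result can be computed in a single line (my own simplification), and a shape check confirms the reduction:

newDataset = np.matmul(dataset, featureVector)   # (150, 4) x (4, 2) -> (150, 2)
print(newDataset.shape)                          # (150, 2)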

After performing the matrix multiplication and transposing the result, these are the values we get for the new dataset, which contains only the K principal components we’ve chosen (shown as a screenshot in the original article).
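
One of the uses listed at the start was visualization; now that the data is two-dimensional we can plot it. A simple scatter-plot sketch (my own addition, coloring points by the iris target labels) shows how well the classes separate:

import matplotlib.pyplot as plt

plt.scatter(newDataset[:, 0], newDataset[:, 1], c = irisModule.target)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()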

Conclusion

As (we hope) you can now see, PCA is not that hard. We’ve managed to reduce the dimensions of the dataset pretty easily using Python.

In our dataset the impact was not dramatic, because we removed only 2 variables out of 4. But assume we have 200 variables in our dataset and reduce them to 3 variables; the gain already becomes far more meaningful.

Hopefully, you’ve learned something new today. Feel free to contact Chen Shani or Moshe Binieli on LinkedIn with any questions.

Translated from: https://www.freecodecamp.org/news/an-overview-of-principal-component-analysis-6340e3bc4073/
