Understand Your Data with Principal Component Analysis (PCA) and Discover Underlying Patterns
Save time and resources and stay healthy with data exploration that goes beyond means, distributions and correlations: leverage PCA to see through the surface of variables. It saves time and resources because it uncovers data issues before hours of model training, and it is good for a programmer’s health, since she trades data worries for something more enjoyable. For example, a well-proven machine learning model might fail because the data is effectively one-dimensional, has insufficient variance or suffers from related issues. PCA offers valuable insights that make you confident about the data’s properties and its hidden dimensions.
This article shows how to leverage PCA to understand key properties of a dataset, saving time and resources down the road, which ultimately leads to a happier, more fulfilled coding life. I hope this post helps you apply PCA in a consistent way and understand its results.
TL;DR
PCA provides valuable insights that reach beyond descriptive statistics and help to discover underlying patterns. Two PCA metrics indicate (1) how many components capture the largest share of variance (explained variance), and (2) which features correlate with the most important components (factor loading). These metrics allow you to cross-check previous steps in the project workflow, such as data collection, which can then be adjusted. As a shortcut and ready-to-use tool, I provide the function do_pca(), which conducts a PCA on a prepared dataset so you can inspect its results within seconds in this notebook or this script.
Data exploration as a safety net
When a project structure resembles the one below, the prepared dataset comes under scrutiny in the fourth step (exploration) by looking at descriptive statistics. Among the most common ones are means, distributions and correlations taken across all observations or subgroups.
Common project structure
1. Collection: gather, retrieve or load data
2. Processing: format raw data, handle missing entries
3. Engineering: construct and select features
4. Exploration: inspect descriptives, properties
5. Modelling: train, validate and test models
6. Evaluation: inspect results, compare models
When the moment arrives of having a clean dataset after hours of work, many glances already go towards the exciting step of applying models to the data. At this stage, around 80–90% of the project’s workload is done, provided the data did not fall out of the sky already cleaned and processed. Of course, the urge to start modeling is strong, but here are two reasons why a thorough data exploration saves time down the road:
- Catch coding errors → revise feature engineering (step 3)
- Identify underlying properties → rethink data collection (step 1), preprocessing (step 2) or feature engineering (step 3)
Wondering, a few hours into training, validating and testing, whether underperforming models are caused by underlying data issues is like being a photographer on set who does not know what their models might look like. Therefore, the key message is to see data exploration as an opportunity to get to know your data, understanding its strengths and weaknesses.
Descriptive statistics often reveal coding errors. However, detecting underlying issues likely requires more than that. Decomposition methods such as PCA help to identify them and make it possible to revise previous steps. This ensures a smooth transition to model building.
Look beneath the surface with PCA
Large datasets often require PCA to reduce dimensionality anyway. The method captures the maximum possible variance across features and projects observations onto mutually uncorrelated vectors, called components. Still, PCA serves purposes other than dimensionality reduction: it also helps to discover underlying patterns across features.
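Before moving on, a tiny self-contained sketch (on synthetic data, so the numbers are purely illustrative) makes the phrase “mutually uncorrelated components” concrete: the correlation matrix of the projected data is essentially the identity.

import numpy as np
from sklearn.decomposition import PCA

# synthetic, strongly correlated 2-D data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=500)])

# project observations onto the principal components
X_pca = PCA().fit_transform(X)

# off-diagonal correlations of the projected data are ~0
print(np.round(np.corrcoef(X_pca, rowvar=False), 3))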
To focus on the implementation in Python instead of methodology, I will skip describing how PCA works. There exist many great resources about it, so I refer to those instead:
Animations showing PCA in action: https://setosa.io/ev/principal-component-analysis/
PCA explained in a family conversation: https://stats.stackexchange.com/a/140579
Smith [2]. A tutorial on principal components analysis: Accessible here.
Two metrics are crucial to make sense of PCA for data exploration:
1. Explained variance measures how much of the variance in the whole dataset a model can reflect. Principal components try to capture as much of the variance as possible, and this measure shows to what extent they can do that. Components are sorted by explained variance, with the first one scoring highest and with a total sum of up to 1 across all components.
2. Factor loading indicates how much a variable correlates with a component. Each component is a linear combination of variables, where some carry more weight than others. Factor loadings express this as correlation coefficients, ranging from -1 to 1, and make components interpretable. (Both metrics are written out as compact formulas right after this list.)
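For reference, both metrics can be written compactly (standard PCA notation, not taken from the article itself): let $\lambda_1 \ge \dots \ge \lambda_p$ be the eigenvalues of the correlation matrix of the standardized features and $v_k$ the corresponding eigenvectors. Then

$$\text{explained variance ratio of component } k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}, \qquad \text{loading of variable } j \text{ on component } k = v_{jk}\,\sqrt{\lambda_k}.$$

For standardized data the loading equals the correlation between variable $j$ and the scores of component $k$, which is why it ranges from -1 to 1.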
The upcoming sections apply PCA to exciting data from a behavioral field experiment and show how to use these metrics to enhance data exploration.
Load data: A Randomized Educational Intervention on Grit (Alan et al., 2019)
The iris dataset has served well as a canonical example for many a PCA. In an effort to be diverse and to use novel data from a field study, I rely on replication data from Alan et al. [1]. I hope this is appreciated. It comprises data from behavioral experiments at Turkish schools, where 10-year-olds took part in a curriculum designed to improve a non-cognitive skill called grit, defined as perseverance in pursuing a task. The authors sampled individual characteristics and conducted behavioral experiments to measure a potential treatment effect between those receiving the program (grit == 1) and those taking part in a control treatment (grit == 0).
The following loads the data from a URL and stores it as a pandas dataframe.
# To load data from Harvard Dataverse
import io
import requests
import pandas as pd

# load exciting data from URL (at least something else than Iris)
url = 'https://dataverse.harvard.edu/api/access/datafile/3352340?gbrecs=false'
s = requests.get(url).content

# store as dataframe
df_raw = pd.read_csv(io.StringIO(s.decode('utf-8')), sep='\t')
Preprocessing and feature engineering
For PCA to work, the data needs to be numeric, without missing values, and standardized. I put all steps into one function (clean_data) which returns a dataframe with standardized features and conducts steps 1 to 3 of the project workflow (collection, processing and engineering). To begin with, import the necessary modules and packages.
import pandas as pd
import numpy as np

# sklearn module
from sklearn.decomposition import PCA

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn settings
sns.set_style("whitegrid")
sns.set_context("talk")

# imports for function
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
Next, the clean_data() function is defined. It gives a shortcut to transform the raw data into a prepared dataset with (i.) selected features, (ii.) missing values replaced by column means, and (iii.) standardized variables.
Note about selected features: I selected the features in (iv.) according to the authors’ replication scripts, accessible on Harvard Dataverse, and solely used sample 2 (“sample B” in the publicly accessible working paper). To be concise, refer to the paper for the relevant descriptives (p. 30, Table 2).
Preparing the data takes one line of code (v).
def clean_data(data, select_X=None, impute=False, std=False):
    """Returns dataframe with selected, imputed
    and standardized features

    Input
        data: dataframe
        select_X: list of feature names to be selected (string)
        impute: If True impute np.nan with mean
        std: If True standardize data

    Return
        dataframe: data with selected, imputed
        and standardized features
    """
    # (i.) select features
    if select_X is not None:
        data = data.filter(select_X, axis='columns')
        print("\t>>> Selected features: {}".format(select_X))
    else:
        # store column names
        select_X = list(data.columns)

    # (ii.) impute with mean
    if impute:
        imp = SimpleImputer()
        data = imp.fit_transform(data)
        print("\t>>> Imputed missings")

    # (iii.) standardize
    if std:
        std_scaler = StandardScaler()
        data = std_scaler.fit_transform(data)
        print("\t>>> Standardized data")

    return pd.DataFrame(data, columns=select_X)

# (iv.) select relevant features in line with Alan et al. (2019)
selected_features = ['grit', 'male', 'task_ability', 'raven', 'grit_survey1',
                     'belief_survey1', 'mathscore1', 'verbalscore1', 'risk', 'inconsistent']

# (v.) select features, impute missings and standardize
X_std = clean_data(df_raw, selected_features, impute=True, std=True)
Now, the data is ready for exploration.
Scree plots and factor loadings: Interpret PCA results
A PCA yields two metrics that are relevant for data exploration: Firstly, how much variance each component explains (scree plot), and secondly how much a variable correlates with a component (factor loading). The following sections provide a practical example and guide through the PCA output with a scree plot for explained variance and a heatmap on factor loadings.
Explained variance shows the number of dimensions across variables
Nowadays, data is abundant and the size of datasets continues to grow. Data scientists routinely deal with hundreds of variables. However, are these variables worth their memory? Put differently: Does a variable capture unique patterns or does it measure similar properties already reflected by other variables?
PCA might answer this through the metric of explained variance per component. It details the number of underlying dimensions on which most of the variance is observed.
The code below initializes a PCA object from sklearn and transforms the original data along the calculated components (i.). Thereafter, information on explained variance is retrieved (ii.) and printed (iii.).
# (i.) initialize and compute pca
pca = PCA()
X_pca = pca.fit_transform(X_std)

# (ii.) get basic info
n_components = len(pca.explained_variance_ratio_)
explained_variance = pca.explained_variance_ratio_
cum_explained_variance = np.cumsum(explained_variance)
idx = np.arange(n_components) + 1

df_explained_variance = pd.DataFrame([explained_variance, cum_explained_variance],
                                     index=['explained variance', 'cumulative'],
                                     columns=idx).T

# calculate mean explained variance
mean_explained_variance = df_explained_variance.iloc[:, 0].mean()

# (iii.) Print explained variance as plain text
print('PCA Overview')
print('='*40)
print("Total: {} components".format(n_components))
print('-'*40)
print('Mean explained variance:', round(mean_explained_variance, 3))
print('-'*40)
print(df_explained_variance.head(20))
print('-'*40)

PCA Overview
========================================
Total: 10 components
----------------------------------------
Mean explained variance: 0.1
----------------------------------------
explained variance cumulative
1 0.265261 0.265261
2 0.122700 0.387962
3 0.113990 0.501951
4 0.099139 0.601090
5 0.094357 0.695447
6 0.083412 0.778859
7 0.063117 0.841976
8 0.056386 0.898362
9 0.052588 0.950950
10 0.049050 1.000000
----------------------------------------
Interpretation: The first component accounts for around 27% of the explained variance. This is relatively low compared to other datasets, but not a matter of concern. It simply indicates that a major share (100% − 27% = 73%) of the variance distributes across more than one dimension. Another way to approach the output is to ask: How many components are required to cover more than X% of the variance? For example, suppose I want to reduce the data’s dimensionality and retain at least 90% of the original variance. Then I would have to include 9 components to reach at least 90%, which in this case even covers 95% of the explained variance. With an overall of 10 variables in the original dataset, the scope to reduce dimensionality is limited. Additionally, this shows that each of the 10 original variables adds somewhat unique patterns and only repeats limited information from the other variables.
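To answer such threshold questions programmatically, the cumulative explained variance computed above can be queried directly (a small sketch; the 90% threshold is just an example):

# number of components needed to reach a given share of variance
threshold = 0.90
n_needed = int(np.argmax(cum_explained_variance >= threshold)) + 1
print("{} components capture at least {:.0%} of the variance".format(n_needed, threshold))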
To give another example, I list explained variance of “the” wine dataset:
PCA Overview: Wine dataset
========================================
Total: 13 components
----------------------------------------
Mean explained variance: 0.077
----------------------------------------
explained variance cumulative
1 0.361988 0.361988
2 0.192075 0.554063
3 0.111236 0.665300
4 0.070690 0.735990
5 0.065633 0.801623
6 0.049358 0.850981
7 0.042387 0.893368
8 0.026807 0.920175
9 0.022222 0.942397
10 0.019300 0.961697
11 0.017368 0.979066
12 0.012982 0.992048
13 0.007952 1.000000
----------------------------------------
Here, 8 out of 13 components suffice to capture at least 90% of the original variance. Thus, there is more scope to reduce dimensionality. Furthermore, it indicates that some variables do not contribute much to variance in the data.
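For reference, the wine overview above can be reproduced with the same recipe, swapping in scikit-learn’s built-in copy of the dataset (a sketch; the printing code from the previous section is simply reused on these arrays):

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# standardize the wine features and fit a full PCA
wine = load_wine()
X_wine_std = StandardScaler().fit_transform(wine.data)
pca_wine = PCA().fit(X_wine_std)

# per-component and cumulative explained variance
print(np.round(pca_wine.explained_variance_ratio_, 6))
print(np.round(np.cumsum(pca_wine.explained_variance_ratio_), 6))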
Instead of plain text, a scree plot visualizes explained variance across components and informs about individual and cumulative explained variance for each component. The next code chunk creates such a scree plot and includes an option (limit) to focus on the first X components, which keeps the plot manageable when dealing with hundreds of components for larger datasets.
# limit plot to x PC
limit = int(input("Limit scree plot to nth component (0 for all) > "))
if limit > 0:
    limit_df = limit
else:
    limit_df = n_components

df_explained_variance_limited = df_explained_variance.iloc[:limit_df, :]

# make scree plot
fig, ax1 = plt.subplots(figsize=(15, 6))

ax1.set_title('Explained variance across principal components', fontsize=14)
ax1.set_xlabel('Principal component', fontsize=12)
ax1.set_ylabel('Explained variance', fontsize=12)

ax2 = sns.barplot(x=idx[:limit_df], y='explained variance', data=df_explained_variance_limited, palette='summer')
ax2 = ax1.twinx()
ax2.grid(False)

ax2.set_ylabel('Cumulative', fontsize=14)
ax2 = sns.lineplot(x=idx[:limit_df]-1, y='cumulative', data=df_explained_variance_limited, color='#fc8d59')

ax1.axhline(mean_explained_variance, ls='--', color='#fc8d59')  # plot mean
ax1.text(-.8, mean_explained_variance+(mean_explained_variance*.05), "average", color='#fc8d59', fontsize=14)  # label the mean line

max_y1 = max(df_explained_variance_limited.iloc[:, 0])
max_y2 = max(df_explained_variance_limited.iloc[:, 1])
ax1.set(ylim=(0, max_y1+max_y1*.1))
ax2.set(ylim=(0, max_y2+max_y2*.1))

plt.show()
A scree plot might show distinct jumps from one component to another. For example, when the first component captures disproportionately more variance than others, it could be a sign that variables inform about the same underlying factor or do not add additional dimensions, but say the same thing from a marginally different angle.
To give a direct example and to get a feeling for what distinct jumps might look like, I provide the scree plot of the Boston house prices dataset in the accompanying notebook.
Two reasons why PCA saves time down the road
Assume you have hundreds of variables, apply PCA and discover that most of the explained variance is captured by the first few components. This might hint at a much lower number of underlying dimensions than the number of variables. Most likely, dropping some hundred variables leads to performance gains for training, validation and testing. There will be more time left to select a suitable model and refine it than to wait for the model itself to discover the lack of variance behind several variables.
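If the goal is to actually drop those dimensions, scikit-learn’s PCA accepts a float for n_components and keeps just enough components to reach that share of variance (a sketch reusing the standardized data from above; the 90% threshold is illustrative):

# keep the smallest number of components that explain at least 90% of the variance
pca_90 = PCA(n_components=0.90)
X_reduced = pca_90.fit_transform(X_std)
print("kept {} of {} original dimensions".format(pca_90.n_components_, X_std.shape[1]))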
In addition to this, imagine that the data was constructed by yourself, e.g. through web scraping, where the scraper extracted pre-specified information from a web page. In that case, the retrieved information could be one-dimensional if the developer of the scraper had only a few relevant items in mind but forgot to include items that shed light on further aspects of the problem setting. At this stage, it might be worthwhile to go back to the first step of the workflow and adjust data collection.
Discover underlying factors with correlations between features and components
PCA offers another valuable statistic besides explained variance: the correlation between each principal component and a variable, also called factor loading. This statistic helps to grasp the dimension that lies behind a component. For example, a dataset may include information about individuals such as math score, reaction time and retention span. The overarching dimension would be cognitive skills, and a component that strongly correlates with these variables can be interpreted as the cognitive-skill dimension. Similarly, another dimension could be non-cognitive skills and personality, when the data has features such as self-confidence, patience or conscientiousness. A component that captures this area correlates highly with those features.
The following code creates a heatmap to inspect these correlations, also called the factor loading matrix.
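The snippet below references a dataframe df_c that holds the loadings but is not constructed in the code as published. A minimal reconstruction (assuming loadings defined as eigenvectors scaled by the square root of their eigenvalues, which for standardized data equal the feature–component correlations) could look like this:

# reconstructed (not part of the original snippet): factor loading matrix, features x components
df_c = pd.DataFrame(pca.components_.T * np.sqrt(pca.explained_variance_),
                    index=X_std.columns,
                    columns=['PC{}'.format(i) for i in idx])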
# adjust y-axis size dynamically
size_yaxis = round(X_std.shape[1] * 0.5)
fig, ax = plt.subplots(figsize=(8, size_yaxis))

# plot the first top_pc components
top_pc = 3
sns.heatmap(df_c.iloc[:, :top_pc], annot=True, cmap="YlGnBu", ax=ax)
plt.show()
The first component strongly negatively associates with task ability, reasoning score (raven), math score, verbal score and positively links to beliefs about being gritty (grit_survey1). Summarizing this into a common underlying factor is subjective and requires domain knowledge. In my opinion, the first component mainly captures cognitive skills.
The second component correlates negatively with receiving the treatment (grit) and gender (male), and positively relates to being inconsistent. Interpreting this dimension is less clear-cut and much more challenging. It also only accounts for 12% of the explained variance, compared to 27% for the first component, which makes the dimension harder to interpret as it spans several topical areas. All components that follow are likely to be similarly difficult to interpret.
Evidence that variables capture similar dimensions could be uniformly distributed factor loadings. One example, which inspired this article, is one of my projects where I relied on Google Trends data and self-constructed keywords about a firm’s sustainability. A list of the 15 highest factor loadings for the first principal component revealed values ranging from 0.12 at the top to 0.11 at the bottom of all 15. Such a uniform distribution of factor loadings can be an issue. This especially applies when data is self-collected and someone preselected what is being considered for collection. Adjusting this selection might add dimensionality to your data, which possibly improves model performance in the end.
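A quick way to run this check on any fitted PCA is to sort the absolute loadings of the first component and inspect the top entries (a sketch using the df_c reconstruction from above; the cutoff of 15 is illustrative):

# list the 15 features with the largest absolute loadings on the first component
top_loadings = df_c.iloc[:, 0].abs().sort_values(ascending=False).head(15)
print(top_loadings)
# near-identical values across this list hint at one underlying dimension being measured repeatedly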
Another reason why PCA saves time down the road
If the data was self-constructed, the factor loadings show how each feature contributes to an underlying dimension, which helps to come up with additional perspectives on data collection and on which features or dimensions could add valuable variance. Rather than blindly guessing which features to add, factor loadings lead to informed decisions for data collection. They may even be an inspiration in the search for more advanced features.
Conclusion
All in all, PCA is a flexible instrument in the toolbox for data exploration. Its main purpose is to reduce complexity of large datasets. But it also serves well to look beneath the surface of variables, discover latent dimensions and relate variables to these dimensions, making them interpretable. Key metrics to consider are explained variance and factor loading.
This article shows how to leverage these metrics for data exploration that goes beyond averages, distributions and correlations and build an understanding of underlying properties of the data. Identifying patterns across variables is valuable to rethink previous steps in the project workflow, such as data collection, processing or feature engineering.
Thanks for reading! I hope you find this guide as useful as it was fun to write. I am curious about your thoughts on the matter. If you have any feedback, I highly appreciate it and look forward to receiving your message.
Appendix
Access the Jupyter Notebook
I applied PCA to even more exemplary datasets like the Boston housing market, wine and iris using do_pca(). It illustrates what the PCA output looks like for small datasets. Feel free to download my notebook or script.
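The notebook contains the full implementation; as a rough idea of its shape, do_pca() essentially chains the steps from this article (a condensed sketch, not the original function):

def do_pca(df, select_X=None):
    """Clean a dataframe, fit a PCA and print the explained-variance overview."""
    X = clean_data(df, select_X, impute=True, std=True)
    pca = PCA()
    pca.fit(X)
    overview = pd.DataFrame({
        'explained variance': pca.explained_variance_ratio_,
        'cumulative': np.cumsum(pca.explained_variance_ratio_)
    }, index=np.arange(1, len(pca.explained_variance_ratio_) + 1))
    print(overview)
    return pca, overview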
Note on factor analysis vs. PCA
A rule of thumb formulated here states: Use PCA if you want to reduce your correlated observed variables to a smaller set of uncorrelated variables and use factor analysis to test a model of latent factors on observed variables.
Even though this distinction is scientifically correct, it becomes less relevant in an applied context. PCA relates closely to factor analysis, which often leads to similar conclusions about the data properties we care about. Therefore, the distinction can be relaxed for data exploration. This post gives an example in an applied context, and another example with hands-on code for factor analysis is attached in the notebook.
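For completeness, a minimal factor-analysis counterpart with scikit-learn might look like the following (a sketch on the prepared data; this is not the code from the notebook, and the choice of two factors is arbitrary):

from sklearn.decomposition import FactorAnalysis

# fit a two-factor model on the standardized features
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X_std)

# loadings of each feature on the latent factors
df_fa = pd.DataFrame(fa.components_.T, index=X_std.columns, columns=['Factor 1', 'Factor 2'])
print(df_fa.round(2))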
Finally, for those interested in the differences between factor analysis and PCA, refer to this post. Note that, to be precise, I never used the term latent factor throughout this article.
Original article: https://towardsdatascience.com/understand-your-data-with-principle-component-analysis-pca-and-discover-underlying-patterns-d6cadb020939