Multicollinearity Impacts Your Data Science Project More Than You Know

Multicollinearity is likely far down on your mental list of things to check for, if it is on the list at all. It shows up, however, in almost every real-life dataset, and it's important to know how to address it.

As its name suggests, multicollinearity is when two (or more) features are correlated with each other, or ‘collinear’. This occurs often in real datasets, since one measurement (e.g. family income) may be correlated with another (e.g. school performance). You may be unaware that many algorithms and analysis methods rely on the assumption of no multicollinearity.

Let’s take this dataset for example, which attempts to predict a student’s chance of admission given a variety of factors.

[Figure: a preview of the admissions dataset]

We want to arrive at a statement like "performing research increases your chance of admission by x percent" or "each additional point on the TOEFL increases your chance of admission by y percent". The first thought is to train a linear regression model and interpret its coefficients.

A multiple regression model achieves a mean absolute error of about 4.5 percentage points, which is fairly accurate. The coefficients are interesting to analyze, and we can make a statement like 'each point on the GRE increases your chance of admission by 0.2%, whereas each point on your CGPA increases it by 11.4%.'

[Figure: coefficients of the full regression model]
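As a minimal sketch of this approach (using synthetic stand-ins for the admissions features, not the actual dataset), an ordinary least-squares fit recovers per-feature coefficients that can be read exactly this way:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-ins for two admissions features (not the real dataset):
gre = rng.normal(315, 10, n)                  # GRE score
cgpa = gre / 36 + rng.normal(0, 0.3, n)       # CGPA, correlated with GRE
# Hypothetical "true" effects, chosen to mirror the article's numbers.
chance = -1.0 + 0.002 * gre + 0.114 * cgpa + rng.normal(0, 0.04, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), gre, cgpa])
coef, *_ = np.linalg.lstsq(X, chance, rcond=None)
print(coef[1:])  # roughly recovers the per-point effects 0.002 and 0.114
```

Each fitted coefficient is then interpreted as the change in admission chance per one-point increase in that feature, holding the others fixed.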

Let’s take a look at the correlation matrix to identify which features are correlated with each other. This dataset is full of highly correlated features in general, but CGPA in particular is heavily correlated with the others.

[Figure: correlation matrix of the dataset]
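A correlation matrix like this one can be computed directly. Here is a sketch on synthetic features, where a stand-in TOEFL score is constructed to track the GRE score:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
gre = rng.normal(315, 10, n)
toefl = gre / 3 + rng.normal(0, 2, n)            # built to track the GRE score
research = rng.integers(0, 2, n).astype(float)   # independent binary feature

# Rows/columns: GRE, TOEFL, Research.
corr = np.corrcoef(np.vstack([gre, toefl, research]))
print(np.round(corr, 2))  # the GRE/TOEFL entry comes out far above 0.4
```

The off-diagonal entries are exactly the pairwise correlations one would scan for when hunting for collinear feature pairs.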

Since the TOEFL score is highly correlated with the GRE score, let’s remove this feature and retrain the linear regression model. (Perhaps) Surprisingly, the mean absolute error decreases to 4.3 percentage points. The change in coefficients is also interesting: notably, the importance of University Rating drops by almost half, the importance of research doubles, and so on.

[Figure: coefficients after removing the TOEFL score]

We can take the following away from this:

  • The TOEFL score, like any other feature, can be thought of as having two components: information and noise.
  • Its information component was already represented in other variables (perhaps performing well on the GRE requires much the same skills as performing well on the TOEFL), so it provided no new information.
  • It carried enough noise that the minimal information gain from keeping it was not worth the noise it introduced into the model.

In other words, the TOEFL score was collinear with many of the other features. At a base level, the model's performance was damaged; beyond that, the more sophisticated analyses that make linear models so insightful, such as interpreting the coefficients, need to be adjusted.

It’s worth exploring what the coefficients of regression models mean. For instance, if the coefficient for the GRE score is 0.2%, this means that, holding all other variables fixed, a one-point increase in GRE score translates to a 0.2% increase in the chance of admission. If we include the TOEFL score (and other highly correlated features), however, we can’t assume that those variables will remain fixed.

Hence, the coefficients go haywire and become completely uninterpretable, since there is massive information overlap. When such scenarios arise, modeling capability is limited as well: because there is so much overlap, errors are amplified, and a mistake in one part is likely to propagate through the overlap to several others.

In general, it’s impractical to memorize whether each algorithm or technique works well under multicollinearity, but a useful rule of thumb is that any model that treats features 'the same' (i.e., makes assumptions about relationships between features) or does not measure information content is vulnerable to multicollinearity.

What does this mean?

Take, for instance, the decision tree, which is not vulnerable to multicollinearity because it explicitly measures information content (as a reduction in entropy) and makes no other assumptions about, or measurements of, relationships between features. If columns A and B are correlated with each other, the decision tree will simply choose one and discard the other (or place it very low in the tree). In this case, features are judged by their information content.

On the other hand, K-Nearest Neighbors is affected by multicollinearity because it assumes every point can be represented as a coordinate in multidimensional space (e.g. (3, 2.5, 6.7, 9.8) in a training set with four features). It doesn’t measure information content, and it treats all features the same. Hence, data points along two highly correlated features cluster around a line, which distorts the distances the algorithm relies on.

Principal Component Analysis is an unsupervised method, but we can still evaluate it along these criteria! The goal of PCA is to explicitly retain the variance, or structure (information), of a reduced dataset, which is why it is not only generally immune to multicollinearity but is often used to reduce multicollinearity in datasets.
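A minimal sketch of why this works, on synthetic data with two nearly collinear features: projecting centered data onto its principal components yields scores that are uncorrelated by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
a = rng.normal(0, 1, n)
b = a + rng.normal(0, 0.2, n)     # nearly collinear with a
X = np.column_stack([a, b])

# PCA via SVD of the centered data; the right singular vectors give an
# orthogonal basis, so the projected scores have zero correlation.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                     # scores in the principal-component basis

print(np.corrcoef(X.T)[0, 1])     # near 1: the raw features are collinear
print(np.corrcoef(Z.T)[0, 1])     # ~0: PCA has removed the collinearity
```

Keeping only the leading components then discards the low-variance direction that the two near-duplicate features shared.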

Most efficient solving methods for these algorithms rely on matrix mathematics and linear algebra, essentially representations of high-dimensional spaces, which multicollinearity easily destabilizes: near-duplicate columns make the underlying matrices close to singular.

Common techniques like heavy use of one-hot encoding (dummy variables), in which categorical variables are represented as 0s and 1s, can also be damaging because they create a perfectly linear relationship. Say we have three binary columns A, B, and C indicating whether a row belongs to one of three categories. The columns must sum to 1 in every row, so the perfectly linear relationship A+B+C=1 is established (the classic dummy variable trap, usually avoided by dropping one column).
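This is easy to verify numerically. The sketch below, with a made-up three-category column, shows the resulting rank deficiency and the usual fix of dropping one dummy column:

```python
import numpy as np

# A made-up categorical column with three mutually exclusive levels.
cats = np.array([0, 1, 2, 0, 1, 2, 0, 1])
onehot = np.eye(3)[cats]                     # dummy columns A, B, C

print(onehot.sum(axis=1))                    # every row sums to 1: A + B + C = 1

# With an intercept, the full set of dummies is perfectly collinear:
X_full = np.column_stack([np.ones(len(cats)), onehot])
print(np.linalg.matrix_rank(X_full))         # 3, not 4: rank deficient

# Dropping one dummy column restores full column rank:
X_drop = np.column_stack([np.ones(len(cats)), onehot[:, 1:]])
print(np.linalg.matrix_rank(X_drop))         # 3: full rank for 3 columns
```

This is why encoders commonly offer a drop-one option for dummy variables when the downstream model includes an intercept.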

How can we identify multicollinearity?

  • Use a VIF (Variance Inflation Factor) score on a regression model to identify whether multicollinearity is present in your dataset.
  • If coefficient standard errors are unusually high, it may be a sign that the same error is being repeatedly propagated because of information overlap.
  • Large changes in parameters when adding or removing features indicate heavily duplicated information.
  • Create a correlation matrix. Pairs of features with correlations consistently above 0.4 are indicators of multicollinearity.
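The VIF check in the first bullet can be sketched by hand on synthetic data: regress each feature on the remaining ones and compute 1 / (1 - R²); values above roughly 5-10 are a common rule of thumb for problematic collinearity. (In practice, statsmodels ships a ready-made `variance_inflation_factor` as well.)

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: regress the column
    on all the others (plus an intercept) and return 1 / (1 - R^2)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(3)
n = 500
a = rng.normal(0, 1, n)
b = a + rng.normal(0, 0.3, n)    # collinear with a -> both get high VIFs
c = rng.normal(0, 1, n)          # independent -> VIF close to 1
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

The collinear pair inflates each other's VIF well past the warning threshold, while the independent feature stays near 1.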

There are many solutions to multicollinearity:

  • Use an algorithm that is immune to multicollinearity if it is an inherent aspect of the data and other transformations are not feasible. Ridge regression, principal component regression, and partial least squares regression are all good regression alternatives.
  • Use PCA to reduce the dimensionality of the dataset and retain only the variables that are important for preserving the data's structure. This is beneficial if the dataset is multicollinear overall.
  • Use a feature selection method to remove highly correlated features.
  • Obtain more data; this is the preferred method. More data can allow the model to retain the current amount of information while giving context and perspective to the noise.
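As a sketch of the first option, ridge regression adds an L2 penalty that stabilizes the coefficients even when two columns are nearly identical. The closed form below uses synthetic data; in practice a library implementation such as scikit-learn's `Ridge` would be used.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
a = rng.normal(0, 1, n)
b = a + rng.normal(0, 0.05, n)    # almost a duplicate of a
y = a + rng.normal(0, 0.5, n)     # only 'a' truly drives the target

X = np.column_stack([a, b])

# Ordinary least squares: the near-singular X'X lets the weight split
# erratically between the two near-duplicate columns.
ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge: adding alpha * I to X'X regularizes the inversion, so the two
# columns share the weight and their combined effect stays near the truth.
alpha = 1.0
ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print(np.round(ols, 2), np.round(ridge, 2))
```

The individual ridge coefficients are no longer individually interpretable in the "holding others fixed" sense, but they are stable, and their sum still reflects the combined effect of the duplicated signal.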

Translated from: https://medium.com/@andre_ye/multicollinearity-impacts-your-data-science-project-more-than-you-know-8504efd706f
