Variable Selection in Regression Analysis with a Large Feature Space

Introduction

Performing multiple regression analysis from a large set of independent variables can be a challenging task. Identifying the best subset of regressors for a model involves optimizing against things like bias, multicollinearity, exogeneity/endogeneity, and threats to external validity. Such problems become difficult to understand and control in the presence of a large number of features. Professors will often tell you to “let theory be your guide” when going about feature selection, but that is not always so easy.

This blog considers the issue of multicollinearity and suggests a method of avoiding it. Proposed here is not a “solution” to collinear variables, nor is it a perfect way of identifying them. It is simply one measurement to take into consideration when comparing multiple subsets of variables.

The Problem

There are several ways of identifying the features that are causing problems in a model. The most common approach (and the basis of this post) is to calculate correlations between suspected collinear variables. While effective, it is important to acknowledge the shortcomings of this method. For instance, correlation coefficients are biased estimates at small sample sizes, and bivariate correlation cannot detect two variables that are collinear only in the presence of additional variables. For these reasons, it is a good idea to consider other metrics and methods as well, including the following: compare the significance of individual coefficients against that of the overall model; look for inflated standard errors; calculate variance inflation factors for the different features; conduct principal components analysis; and yes, let theory be your guide.

With all of this in mind, let us now consider a technique that employs a collection of transformed Pearson correlation coefficients in a multiple-criteria evaluation problem (see Multiple-Criteria Decision Analysis). The goal of the technique is to find a subset of independent variables where every pairwise correlation within the set is as low as possible, while simultaneously, each variable’s correlation with the dependent variable is as high as possible. We may represent the problem in the following way:

$$\max\; f\big(\lvert r_{x_i,\,y}\rvert\big) \qquad \min\; f\big(\lvert r_{x_i,\,x_j}\rvert\big), \quad i \neq j$$

Here, r is the Pearson correlation coefficient of two variables, and f(x) is the weighted mean of a set of correlation coefficients. Before applying this function, the coefficients must first be transformed to correct for their bias. Arithmetic operations on raw correlation coefficients are invalid because the sampling variance of r depends on its value, which makes an average of raw coefficients a biased estimate of the population correlation. To address this, we apply the Fisher z-transformation, which normalizes the distribution of correlations and approximately stabilizes their variance. The Fisher z-transformation is denoted as:

$$z = \operatorname{arctanh}(r) = \frac{1}{2}\,\ln\!\left(\frac{1+r}{1-r}\right)$$
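NumPy ships this transformation as np.arctanh; here is a quick sanity check that it matches the closed form (a standalone illustration, not the author's code):

```python
import numpy as np

r = np.array([-0.75, -0.20, 0.00, 0.40, 0.90])  # example correlation coefficients

z = np.arctanh(r)                           # Fisher z-transformation
z_closed = 0.5 * np.log((1 + r) / (1 - r))  # equivalent closed form

assert np.allclose(z, z_closed)
```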

With this in mind, we now consider the “maximizing” and “minimizing” elements of the problem. Because the magnitude of correlation is of concern rather than its direction, the absolute values of the coefficients are used. We can think of maximizing correlation as “getting as close to 1 as possible” and minimizing correlation as “getting as close to 0 as possible.” Getting as close to 1 as possible is less workable after applying the z-transformation, because arctanh(1) = ∞. We can therefore convert the maximization problem into a minimization problem by subtracting the absolute value of each correlation from 1. The problem can now be phrased as follows:

$$\min\; f\Big(\operatorname{arctanh}\!\big(1 - \lvert r_{x_i,\,y}\rvert\big)\Big) \qquad \min\; f\Big(\operatorname{arctanh}\!\big(\lvert r_{x_i,\,x_j}\rvert\big)\Big), \quad i \neq j$$

We find the set of features that minimizes both of these functions by calculating the distance of each set from the theoretical global minimum (0,0). This solution is best represented graphically. The figure below plots the two functions against each other for every set of features in a sample dataset. Each blue point represents one subset of variables, while the red area is an arbitrary frontier that visualizes which point has the shortest Euclidean distance from the theoretical minimum.

[Figure: scatter plot of the two objective functions for every candidate subset; each blue point is one subset, and a red frontier marks the region nearest the origin (0,0).]

The subset corresponding to the point with the shortest distance to the origin can be understood as the set where every pairwise correlation is as low as possible, and simultaneously, each correlation with the dependent variable is as high as possible.
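This selection rule is straightforward to sketch in code. The following is a minimal illustration under my own naming (z_distance, weighted_mean, and best_subset are not the repository's API): it z-transforms the two groups of correlations, takes the weighted mean of each group — weighting each value by its proportion of the group's sum, as described in the application below — and returns the Euclidean distance from (0,0). A small helper then enumerates candidate subsets.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def weighted_mean(z: np.ndarray) -> float:
    """Mean of z, with each value weighted by its proportion of the sum."""
    return float(np.sum(z * (z / z.sum())))

def z_distance(df: pd.DataFrame, subset: list, target: str) -> float:
    """Distance of a subset's two aggregated objectives from the origin."""
    corr = df[list(subset) + [target]].corr().abs()

    # Objective 1: correlations with the target, flipped into a minimization
    # by subtracting from 1 before the Fisher z-transformation.
    z_dep = np.arctanh(1.0 - corr.loc[subset, target].to_numpy())

    # Objective 2: pairwise correlations among the independent variables.
    pairs = np.array([corr.loc[a, b] for a, b in combinations(subset, 2)])
    z_ind = np.arctanh(pairs)

    return float(np.hypot(weighted_mean(z_dep), weighted_mean(z_ind)))

def best_subset(df: pd.DataFrame, target: str, k: int = 5) -> tuple:
    """Exhaustively score every k-variable subset and keep the closest."""
    features = [c for c in df.columns if c != target]
    return min(combinations(features, k),
               key=lambda s: z_distance(df, list(s), target))
```

Note that exhaustive enumeration grows as C(n, k), so for a very large feature space you would cap or sample the candidate sets rather than score all of them.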

An Application

For more clarity, let's now define a real-world example. Consider the popular Boston Housing dataset. The dataset provides information on housing prices in Boston as well as on several features of the houses and the housing market there. Say we want to build a model that captures as much explanatory power over housing prices as possible. There are 506 observations in the dataset, each corresponding to a census tract in the Boston area. There are 13 independent variables, but let's say we only want to consider two different subsets with 5 independent variables each.
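For reference, one way to pull the data today is from OpenML, since scikit-learn's bundled copy has been removed; the dataset name/version and the MEDV-to-PRICE rename below are my assumptions, not taken from the original post.

```python
import pandas as pd
from sklearn.datasets import fetch_openml

# Fetch the Boston Housing data from OpenML (dataset name assumed).
boston = fetch_openml(name="boston", version=1, as_frame=True)

df = boston.frame.astype(float)            # a couple of columns arrive as categoricals
df = df.rename(columns={"MEDV": "PRICE"})  # the OpenML target column is MEDV
print(df.shape)                            # expected: (506, 14)
```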

The first subset consists of the following variables: proportion of non-retail business acres in the area (INDUS); nitric oxides concentration (NOX); proportion of units in the area built before 1940 (AGE); property tax rate (TAX); and accessibility to radial highways (RAD). This subset will be referred to as {INDUS, NOX, AGE, TAX, RAD}.

The second subset consists of the following variables: distance to Boston employment centers (DIS); average number of rooms per dwelling (RM); pupil-to-teacher ratio in the area (PTRATIO); percent of lower-status population in the area (LSTAT); and property tax rate (TAX). This subset will be referred to as {DIS, RM, PTRATIO, LSTAT, TAX}.

These subsets will be used to predict the dependent variable, PRICE. Correlograms of the independent variables as well as the correlations with the dependent variable for both subsets are provided below.

[Figure: correlograms of each subset's independent variables, alongside each variable's correlation with PRICE.]

The first step is to take the absolute value of every correlation coefficient, subtract the correlations with the dependent variable from 1, and apply the Fisher z-transformation to everything.
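In code, that step looks roughly like this for the first subset (reusing the hypothetical df from the loading sketch above):

```python
import numpy as np

# df: the Boston Housing frame from the loading sketch above.
subset = ["INDUS", "NOX", "AGE", "TAX", "RAD"]
corr = df[subset + ["PRICE"]].corr().abs()   # absolute correlation coefficients

z_dep = np.arctanh(1 - corr.loc[subset, "PRICE"])  # target correlations, flipped
upper = np.triu_indices(len(subset), k=1)          # indices of the unique pairs
z_ind = np.arctanh(corr.loc[subset, subset].to_numpy()[upper])

print(z_dep.round(3))
print(np.round(z_ind, 3))
```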

[Figure: the same correlation tables after taking absolute values, subtracting the target correlations from 1, and applying the Fisher z-transformation.]

Next, we calculate the weighted mean of the correlations with the dependent variable as well as of the pairwise correlations within the independent variables. Weights are determined by each coefficient's proportion of the sum of the coefficients. With these aggregations, the distance of each set from the theoretical minimum (0,0) is also calculated. This is done for the {INDUS, NOX, AGE, TAX, RAD} subset as follows:

[Figure: the weighted-mean and distance calculation for {INDUS, NOX, AGE, TAX, RAD}.]

And for the {DIS, RM, PTRATIO, LSTAT, TAX} subset as:

[Figure: the weighted-mean and distance calculation for {DIS, RM, PTRATIO, LSTAT, TAX}.]

These two values indicate that subset {DIS, RM, PTRATIO, LSTAT, TAX} has higher correlation with PRICE and lower correlation within itself than subset {INDUS, NOX, AGE, TAX, RAD}, as demonstrated by their respective distances from the origin. This tentatively suggests that subset {DIS, RM, PTRATIO, LSTAT, TAX} has the better explanatory power over PRICE. It is not a perfect indication, and other metrics must also be assessed.
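With the z_distance sketch from earlier (again, my helper, not the repository's API), the comparison comes down to two calls:

```python
# df: the Boston Housing frame; z_distance: the sketch defined earlier.
subset_a = ["INDUS", "NOX", "AGE", "TAX", "RAD"]
subset_b = ["DIS", "RM", "PTRATIO", "LSTAT", "TAX"]

d_a = z_distance(df, subset_a, "PRICE")
d_b = z_distance(df, subset_b, "PRICE")

# The subset with the smaller distance lies closer to the ideal point (0, 0).
print("{INDUS, NOX, AGE, TAX, RAD}: ", round(d_a, 3))
print("{DIS, RM, PTRATIO, LSTAT, TAX}:", round(d_b, 3))
```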

We can verify which subset is better by actually fitting the models. Below, PRICE has been regressed on DIS, RM, PTRATIO, LSTAT, and TAX. We can immediately recognize that every variable is statistically significant to the model (see P>|t|). We also recognize that the model itself is statistically significant (see P(F)). Take note of the R² values, the F-statistic, the root mean squared error, and the Akaike/Bayes information criteria.
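A regression like the one summarized below can be reproduced with statsmodels (a sketch, assuming the df defined earlier):

```python
import statsmodels.api as sm

X = sm.add_constant(df[["DIS", "RM", "PTRATIO", "LSTAT", "TAX"]])
y = df["PRICE"]

model = sm.OLS(y, X).fit()
print(model.summary())  # reports coefficients, P>|t|, R², F-statistic, AIC/BIC
```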

[Figure: OLS regression results for PRICE on DIS, RM, PTRATIO, LSTAT, and TAX.]

Next, PRICE has been regressed on INDUS, NOX, AGE, TAX, and RAD. In this model, we can see that there are now at least two independent variables that are not statistically significant. The model itself is still significant, but it has a lower F-statistic than the previous model. Additionally, its R² and adjusted R² are both lower than those of the previous model, implying less explanatory power. RMSE, AIC, and BIC are also higher here, implying lower quality. This confirms the findings calculated above.

[Figure: OLS regression results for PRICE on INDUS, NOX, AGE, TAX, and RAD.]

The “z-distance” presented in this blog post has demonstrated its use in this example. The {DIS, RM, PTRATIO, LSTAT, TAX} subset has a shorter distance to the origin than the {INDUS, NOX, AGE, TAX, RAD} subset, and DIS, RM, PTRATIO, LSTAT, and TAX were then shown to be better predictors of PRICE. While it was easy to simply fit these two models and compare them, in a feature space of much higher dimension it can be faster to calculate the distances of several subsets first.

Conclusion

There are many factors to consider in feature selection. This post does not offer a solution to finding the best subset of variables, but merely a way for one to take a step in the right direction by finding sets of features that do not immediately demonstrate collinearity. It is important to remember that one must rely on more than just correlation coefficients when identifying multicollinearity.

A Python script for this solution and for automating feature combinations can be found at the following GitHub repository:

https://github.com/willarliss/z-Distance/

Source: https://towardsdatascience.com/variable-selection-in-regression-analysis-with-a-large-feature-space-2f142f15e5a
