Variable Selection in Regression Analysis with a Large Feature Space
Introduction
Performing multiple regression analysis with a large set of independent variables can be a challenging task. Identifying the best subset of regressors for a model involves optimizing against things like bias, multicollinearity, exogeneity/endogeneity, and threats to external validity. Such problems become difficult to understand and control in the presence of a large number of features. Professors will often tell you to “let theory be your guide” when going about feature selection, but that is not always so easy.
This blog considers the issue of multicollinearity and suggests a method of avoiding it. Proposed here is not a “solution” to collinear variables, nor is it a perfect way of identifying them. It is simply one measurement to take into consideration when comparing multiple subsets of variables.
The Problem
There are several ways of identifying the features that are causing problems in a model. The most common approach (and the basis of this post) is to calculate correlations between suspected collinear variables. While effective, it is important to acknowledge the shortcomings of this method. For instance, correlation coefficients are often biased by sample sizes, and bivariate correlation cannot detect two variables that are collinear only in the presence of additional variables. For these reasons, it is a good idea to consider other metrics/methods as well, some of which include the following: look at the significance of coefficients compared to the overall model; look for high standard error; calculate variance inflation factors for different features; conduct principal components analysis; and yes, let theory be your guide.
With all of this in mind, let us now consider a technique that employs a collection of transformed Pearson correlation coefficients in a multiple-criteria evaluation problem (see Multiple-Criteria Decision Analysis). The goal of the technique is to find a subset of independent variables where every pairwise correlation within the set is as low as possible, while simultaneously, each variable’s correlation with the dependent variable is as high as possible. We may represent the problem in the following way:
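In symbols (reconstructed here, since the original post renders the objective as an image), the task is to choose a subset S of regressors for target y such that:

$$
\max_{S} \; f\!\left(\,|r_{x_i,\,y}|\,\right)
\quad \text{while} \quad
\min_{S} \; f\!\left(\,|r_{x_i,\,x_j}|\,\right),
\qquad x_i, x_j \in S,\; i \neq j
$$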
Here, r is the Pearson correlation coefficient of two variables, and f(x) is the weighted mean of a set of correlation coefficients. Before this function can be applied, the coefficients must be transformed to correct for their bias. Arithmetic operations on raw correlation coefficients are invalid because the variance of r is unstable across its range, making averages of raw coefficients biased estimates of the population value. To address this, we apply the Fisher z-transformation, which normalizes the distribution of correlations and approximately stabilizes their variance. The Fisher z-transformation is denoted as:
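$$
z = \operatorname{arctanh}(r) = \frac{1}{2}\,\ln\!\left(\frac{1+r}{1-r}\right)
$$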
With this in mind, we now consider the “maximizing” and “minimizing” elements of the problem. Because the magnitude, not the direction, of correlation is of concern, the absolute values of the coefficients are used. We might think of maximizing correlation as “getting as close to 1 as possible” and minimizing correlation as “getting as close to 0 as possible”. Getting as close to 1 as possible is less workable after applying the z-transformation, because arctanh(1) = ∞. We therefore convert the maximization problem into a minimization problem by subtracting the absolute value of each correlation from 1. The problem can now be phrased as follows:
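Written out (again a reconstruction of the post's image), both criteria become minimizations of weighted means of transformed coefficients:

$$
\min_{S} \; f\!\left(\operatorname{arctanh}\!\left(1 - |r_{x_i,\,y}|\right)\right)
\quad \text{and} \quad
\min_{S} \; f\!\left(\operatorname{arctanh}\!\left(|r_{x_i,\,x_j}|\right)\right)
$$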
We find the set of features that minimizes both of these functions by calculating the distance of each set from the theoretical global minimum (0,0). This is best represented graphically. The figure below plots the two functions against each other for every candidate set of features in a sample dataset. Each blue point represents one subset of variables, while the red area is an arbitrary frontier for visualizing which point has the shortest Euclidean distance from the theoretical minimum.
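Labeling the dependent-variable criterion $f_{\text{dep}}$ and the pairwise criterion $f_{\text{ind}}$ (names introduced here for clarity), the score for a subset $S$ is its Euclidean distance from the origin:

$$
d(S) = \sqrt{f_{\text{dep}}(S)^2 + f_{\text{ind}}(S)^2}
$$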
The subset corresponding to the point with the shortest distance to the origin can be understood as the set where every pairwise correlation is as low as possible, and simultaneously, each correlation with the dependent variable is as high as possible.
An Application
For more clarity, let's now work through a real-world example. Consider the popular Boston Housing dataset, which provides information on housing prices in Boston alongside several features of the houses and the local housing market. Say we want to build a model with as much explanatory power over housing prices as possible. There are 506 observations in the dataset, each corresponding to a housing unit. There are 14 independent variables, but let's say we only want to consider two different subsets of 5 independent variables each.
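A minimal loading sketch, assuming a local CSV export of the data with the standard column names and the target renamed to PRICE as used in this post (scikit-learn's `load_boston` loader was removed in version 1.2, so a CSV copy is the simplest route). The filename here is hypothetical:

```python
import pandas as pd

# Hypothetical local copy of the Boston Housing data.
df = pd.read_csv("boston_housing.csv")

# The target column is usually named MEDV; this post calls it PRICE.
df = df.rename(columns={"MEDV": "PRICE"})
print(len(df))  # 506 observations
```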
The first subset consists of the following variables: proportion of non-retail business acres in the area (INDUS); nitric oxides concentration (NOX); proportion of units built before 1940 in the area (AGE); property tax rate (TAX); and accessibility to radial highways (RAD). This subset will be referred to as {INDUS, NOX, AGE, TAX, RAD}.
The second subset consists of the following variables: distance to Boston employment centers (DIS); average number of rooms per dwelling (RM); pupil-to-teacher ratio in the area (PTRATIO); percent of lower status population in the area (LSTAT); and property tax-rate (TAX). This subset will be referred to as {DIS, RM, PTRATIO, LSTAT, TAX}.
These subsets will be used to predict the dependent variable, PRICE. Correlograms of the independent variables as well as the correlations with the dependent variable for both subsets are provided below.
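The original correlograms are images; here is a sketch that reproduces them with matplotlib, assuming `df` from the loading step above:

```python
import matplotlib.pyplot as plt

subset_a = ["INDUS", "NOX", "AGE", "TAX", "RAD"]
subset_b = ["DIS", "RM", "PTRATIO", "LSTAT", "TAX"]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, subset in zip(axes, (subset_a, subset_b)):
    corr = df[subset + ["PRICE"]].corr()  # includes correlations with the target
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=45)
    ax.set_yticks(range(len(corr)), labels=corr.columns)
fig.colorbar(im, ax=axes, shrink=0.8)
plt.show()
```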
The first step is to take the absolute value of every correlation coefficient, subtract the correlations with the dependent variable from 1, and apply the Fisher z-transformation to all of them.
Next, we calculate the weighted mean of the correlations with the dependent variable as well as of the correlations among the independent variables. Weights are determined by each coefficient's proportion of the sum of coefficients. With these aggregations, the distance of each set from the theoretical minimum (0,0) is also calculated. This is done for the {INDUS, NOX, AGE, TAX, RAD} subset as follows:
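The original worked figures are images, so here is a Python sketch of the same calculation. The helper name `z_distance` and the exact reading of the weighting rule (each transformed coefficient weighted by its share of the sum) are mine, not the post's:

```python
import numpy as np
import pandas as pd

def z_distance(df: pd.DataFrame, features: list, target: str) -> float:
    """Euclidean distance from (0, 0) of the two weighted-mean Fisher-z criteria."""
    corr = df[features + [target]].corr().abs()

    # Criterion 1: correlations with the target, flipped so that
    # "close to 1" becomes "close to 0" before transforming.
    dep = 1.0 - corr.loc[features, target].to_numpy()

    # Criterion 2: unique pairwise correlations among the features.
    pair = corr.loc[features, features].to_numpy()
    pair = pair[np.triu_indices(len(features), k=1)]

    def weighted_mean_z(r):
        z = np.arctanh(np.clip(r, 0.0, 0.999999))  # guard against arctanh(1) = inf
        return float(np.sum(z / z.sum() * z))      # weights: each z's share of the sum

    return float(np.hypot(weighted_mean_z(dep), weighted_mean_z(pair)))

print(z_distance(df, ["INDUS", "NOX", "AGE", "TAX", "RAD"], "PRICE"))
```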
And for the {DIS, RM, PTRATIO, LSTAT, TAX} subset:
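Using the same sketch helper from above:

```python
print(z_distance(df, ["DIS", "RM", "PTRATIO", "LSTAT", "TAX"], "PRICE"))
```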
These two values indicate that the {DIS, RM, PTRATIO, LSTAT, TAX} subset has higher correlation with PRICE and lower correlation within itself than the {INDUS, NOX, AGE, TAX, RAD} subset, as demonstrated by their respective distances from the origin. This tentatively suggests that {DIS, RM, PTRATIO, LSTAT, TAX} has the better explanatory power over PRICE. This is not a perfect indication, and other metrics must also be assessed.
We can now verify which subset is better by actually fitting models. Below, PRICE has been regressed on DIS, RM, PTRATIO, LSTAT, and TAX. We can immediately recognize that every variable is statistically significant in the model (see P>|t|). We also recognize that the model itself is statistically significant (see P(F)). Take note of the R² values, the F-statistic, the root mean squared error, and the Akaike/Bayes information criteria.
Next, PRICE has been regressed on INDUS, NOX, AGE, TAX, and RAD. In this model, we can see that there are now at least two independent variables that are not statistically significant. The model itself is still significant, but it has a lower F-statistic than the previous model. Additionally, both of its R² values are lower than those of the previous model, implying less explanatory power. RMSE, AIC, and BIC are also higher here, implying lower quality. This confirms the findings calculated above.
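The regression tables shown in the original are images; here is a statsmodels sketch that reproduces both fits, assuming `df` and the subset lists defined earlier:

```python
import numpy as np
import statsmodels.api as sm

for subset in (subset_b, subset_a):  # better-scoring subset first, as above
    X = sm.add_constant(df[subset])
    fit = sm.OLS(df["PRICE"], X).fit()
    print(fit.summary())                    # P>|t|, P(F), R², AIC/BIC, ...
    print("RMSE:", np.sqrt(fit.mse_resid))  # root mean squared error of residuals
```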
The “z-distance” presented in this blog post has demonstrated its use in this example. The {DIS, RM, PTRATIO, LSTAT, TAX} subset has a shorter distance to the origin than the {INDUS, NOX, AGE, TAX, RAD} subset, and DIS, RM, PTRATIO, LSTAT, and TAX were then shown to be better predictors of PRICE. While it was easy to simply fit these two models and compare them, in a feature space of much higher dimension it might be faster to calculate the distances of several subsets first.
Conclusion
There are many factors to consider in feature selection. This post does not offer a solution to finding the best subset of variables, but merely a way for one to take a step in the right direction by finding sets of features that do not immediately demonstrate collinearity. It is important to remember that one must rely on more than just correlation coefficients when identifying multicollinearity.
A Python script for this solution and for automating feature combinations can be found at the following GitHub repository:
https://github.com/willarliss/z-Distance/
Translated from: https://towardsdatascience.com/variable-selection-in-regression-analysis-with-a-large-feature-space-2f142f15e5a