Applied Spatial Statistics (5): Linear Regression I
This notebook demonstrates:
- Linear regression coefficient estimates are unbiased under the model assumptions
- How to construct bootstrap confidence intervals around the coefficient estimates
- Residual plots
- Q-Q plots
1. Linear regression coefficient estimates are unbiased under the assumptions
import numpy as np
import matplotlib.pyplot as plt
Statsmodels is a statistical modeling package for Python
import statsmodels.api as sm
#Randomly draw X and error
x = np.random.randn(1000) # ~ N(0,1)
e = np.random.randn(1000) # ~ N(0,1)
Define a linear relationship with some disturbance (noise)
#b0 = 10
#b1 = 4
y = 10 + 4*x + e
plt.scatter(x, y, alpha=0.4)
plt.axline((0, 10), (3, 22), linewidth=2, color='r')
[Figure: scatter plot of x versus y with the reference line y = 10 + 4x drawn in red]
x_cons = sm.add_constant(x)
model = sm.OLS(y, x_cons).fit()
Summary of the linear regression results
model.summary()
| Dep. Variable: | y | R-squared: | 0.944 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 1.678e+04 |
| Date: | Wed, 07 Feb 2024 | Prob (F-statistic): | 0.00 |
| Time: | 19:25:57 | Log-Likelihood: | -1394.3 |
| No. Observations: | 1000 | AIC: | 2793. |
| Df Residuals: | 998 | BIC: | 2802. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.9372 | 0.031 | 321.228 | 0.000 | 9.876 | 9.998 |
| x1 | 4.0251 | 0.031 | 129.541 | 0.000 | 3.964 | 4.086 |

| Omnibus: | 1.647 | Durbin-Watson: | 1.924 |
|---|---|---|---|
| Prob(Omnibus): | 0.439 | Jarque-Bera (JB): | 1.671 |
| Skew: | 0.098 | Prob(JB): | 0.434 |
| Kurtosis: | 2.963 | Cond. No. | 1.06 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Testing unbiasedness
def reg_unbiased(x, y):
    estimates_list = []
    # Repeat the sampling process 10,000 times
    for i in range(10000):
        # Get a random index
        sample_index = np.random.choice(1000, 300)
        # Use the index to index y and x
        sample_y = y[sample_index]
        sample_x = x[sample_index]
        # Add constant to x
        sample_x = sm.add_constant(sample_x)
        # Fit the OLS
        model = sm.OLS(sample_y, sample_x).fit()
        # Append the parameters
        estimates_list.append(model.params)
    return np.array(estimates_list)
rslt = np.array(reg_unbiased(x,y))
rslt
array([[9.98692129, 4.0672853 ],
       [9.93114785, 4.03665871],
       [9.99781954, 4.10916308],
       ...,
       [9.91294925, 3.98044767],
       [9.88729838, 3.992089  ],
       [9.87920183, 4.12874911]])
np.mean(rslt,axis=0)
array([9.93654559, 4.02654646])
Averaged over the 10,000 replications, the estimates essentially recover the true regression line with an intercept of 10 and a slope of 4, illustrating that the OLS estimators are unbiased.
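As an additional check (not in the original notebook), we can plot the sampling distributions of the two estimates stored in rslt; both histograms should be centred on the true values of 10 and 4.

# Minimal sketch: histograms of the 10,000 intercept and slope estimates
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(rslt[:, 0], bins=30)
axes[0].axvline(10, color="red")   # true intercept
axes[0].set_xlabel("Intercept estimates")
axes[1].hist(rslt[:, 1], bins=30)
axes[1].axvline(4, color="red")    # true slope
axes[1].set_xlabel("Slope estimates")
plt.show()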
2. How to construct bootstrap confidence intervals around the coefficient estimates
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import *
import statsmodels.formula.api as smf
#Use the mpg dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv'
df = pd.read_csv(url)
df = df.dropna()
If your data are in a pandas.DataFrame, you can use this formula interface to fit the model
model = smf.ols(formula='acceleration ~ horsepower', data=df).fit()
subsample_coef = model.params
subsample_coef.values[0]
20.701933699127174
model.summary()
| Dep. Variable: | acceleration | R-squared: | 0.475 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.474 |
| Method: | Least Squares | F-statistic: | 352.8 |
| Date: | Wed, 07 Feb 2024 | Prob (F-statistic): | 1.58e-56 |
| Time: | 19:25:58 | Log-Likelihood: | -827.24 |
| No. Observations: | 392 | AIC: | 1658. |
| Df Residuals: | 390 | BIC: | 1666. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.7019 | 0.293 | 70.717 | 0.000 | 20.126 | 21.277 |
| horsepower | -0.0494 | 0.003 | -18.784 | 0.000 | -0.055 | -0.044 |

| Omnibus: | 29.904 | Durbin-Watson: | 1.483 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 37.989 |
| Skew: | 0.608 | Prob(JB): | 5.63e-09 |
| Kurtosis: | 3.919 | Cond. No. | 322. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Draw bootstrap samples with replacement from the DataFrame
#Define a bootstrap CI function
def bootstrap_ci(df):
    np.random.seed(100)
    bootstrap_b0 = []
    bootstrap_b1 = []
    for i in range(1000):
        # Generate a re-sample with the original sample size, with replacement
        sample = df.sample(len(df), replace=True)
        # Formulate the model
        model = smf.ols(formula='acceleration ~ horsepower', data=sample).fit()
        # Get the intercept and slope estimates
        subsample_coef = model.params
        # Append them to the lists
        bootstrap_b0.append(subsample_coef.values[0])
        bootstrap_b1.append(subsample_coef.values[1])
    # Get the lower and upper bounds for the middle 95%
    b0_CI = [np.percentile(bootstrap_b0, 2.5), np.percentile(bootstrap_b0, 97.5)]
    b1_CI = [np.percentile(bootstrap_b1, 2.5), np.percentile(bootstrap_b1, 97.5)]
    return b0_CI, b1_CI
rslt = bootstrap_ci(df)
print("95% CI for intercept:",rslt[0])
95% CI for intercept: [20.017892056783964, 21.347498114044548]
print("95% CI for slope:",rslt[1])
95% CI for slope: [-0.0548220191942679, -0.043592380423869134]
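For comparison (added here, not part of the original notebook), statsmodels also provides analytic 95% confidence intervals via conf_int() on the model fitted above; they should be close to the bootstrap intervals.

# Analytic 95% CIs from the OLS fit, for comparison with the bootstrap CIs
print(model.conf_int(alpha=0.05))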
import seaborn as sns
sns.regplot(data=df, x="horsepower", y="acceleration", ci=95, truncate=False,
            line_kws={"color": "red"})
#plt.xlim(300)
[Figure: scatter of acceleration versus horsepower with the fitted regression line in red and its 95% confidence band]
3. Residual plot
plt.scatter(model.predict(),model.resid)
plt.hlines(0,8,20,color="black")
plt.ylim(-8,8)
plt.xlabel("Fitted acceleration")
plt.ylabel("Residuals")
[Figure: residuals plotted against fitted acceleration, with a horizontal reference line at zero]
4. Q-Q plot
res = model.resid
fig = sm.qqplot(res,line='q')
#plt.xlim(-8,8)
plt.ylim(-8,8)
plt.show()
plt.hist(model.resid)
(array([  5.,  29.,  69., 106., 102.,  40.,  25.,   8.,   5.,   3.]),
 array([-4.99465643, -3.73465643, -2.47465643, -1.21465643,  0.04534357,
         1.30534357,  2.56534357,  3.82534357,  5.08534357,  6.34534357,
         7.60534357]),
 <BarContainer object of 10 artists>)
5. Linear regression - model specification
This notebook demonstrates:
- Correctly specified model (baseline)
- Underspecified model
  - Uncorrelated predictors
  - Correlated predictors
- Overspecified model
  - Perfect multicollinearity
  - Varying degrees of multicollinearity
  - Including an irrelevant variable
import numpy as np
import statsmodels.api as sm
5.1 Correctly specified model (baseline)
- We randomly generate 100 uncorrelated observations of X1 and X2 from normal distributions.
- We specify a true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We correctly include both X1 and X2 in the regression model: y ~ X1 + X2 (with an intercept)
- We fit 1,000 such regression models and examine the bias and precision of the estimates.
estimates = []
for i in range(1000):
    X1 = np.random.randn(100) + 2
    X2 = np.random.randn(100) + 2
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1, X2]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))
Average of estimates: [3.0037811 2.00347394 3.99617821]
SE of estimates: [0.30086517 0.09971629 0.10145132]
5.2 Underspecified model
Uncorrelated predictors (#1)
- We randomly generate 100 uncorrelated observations of X1 and X2 from normal distributions.
- We specify a true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We include only X1 in the regression model: y ~ X1 (with an intercept)
- We fit 1,000 such regression models and examine the bias.
estimates = []
for i in range(1000):
    means = [2, 2]
    cov = [[1, 0], [0, 1]]
    X = np.random.multivariate_normal(means, cov, 100)
    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))
Average of estimates: [11.01614906 1.99128492]
SE of estimates: [0.95786376 0.42621356]
We can see that the intercept estimate is biased (it would be unbiased only if the omitted variable had a mean of zero), while the coefficient on X1 remains unbiased. The SEs of the estimates are much larger than in the correctly specified model.
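A short derivation (added here for clarity) shows where the intercept of roughly 11 comes from: because the omitted X2 is uncorrelated with X1, its average contribution is absorbed entirely by the intercept,

$$E[\hat b_0] = 3 + 4\,E[X_2] = 3 + 4 \times 2 = 11, \qquad E[\hat b_1] = 2,$$

which matches the simulated averages above.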
Correlated predictors (#2)
- We randomly generate 100 correlated observations of X1 and X2 from a bivariate normal distribution (the correlation between X1 and X2 is 0.4).
- We specify a true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We include only X1 in the regression model: y ~ X1 (with an intercept)
- We fit 1,000 such regression models and examine the bias.
estimates = []
for i in range(1000):
    means = [2, 2]
    cov = [[1, 0.4], [0.4, 1]]
    X = np.random.multivariate_normal(means, cov, 100)
    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
np.mean(np.array(estimates),axis=0)
array([7.77702431, 3.60628815])
#The SEs of estimates across the 1000 replications are:
np.std(np.array(estimates),axis=0)
array([0.85773046, 0.38963273])
We can see that the intercept estimate is biased (it would be unbiased only if both variables had means of zero), and the coefficient on X1 is also biased relative to its true value of 2.
Conclusion: when the predictors are correlated, omitting one of them biases the remaining estimates.
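A worked version of the omitted-variable-bias formula (added here for clarity) reproduces the simulated values: in large samples,

$$E[\hat b_1] = b_1 + b_2\,\frac{\operatorname{Cov}(X_1,X_2)}{\operatorname{Var}(X_1)} = 2 + 4 \times 0.4 = 3.6, \qquad E[\hat b_0] = E[y] - E[\hat b_1]\,E[X_1] = 15 - 3.6 \times 2 = 7.8,$$

which matches the averages of roughly 3.61 and 7.78 above.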
5.3 Overspecified model
Perfect multicollinearity (#1)
- We randomly generate 100 data points for X1 from a normal distribution.
- We set X2 = 1 - X1.
- We specify a true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We then fit the regression model y ~ X1 + X2 (with an intercept).
X1 = np.random.randn(100)
X2 = 1 - X1
e = np.random.randn(100)
y = 3 + 2*X1 + 4*X2 + e
X = sm.add_constant(np.column_stack([X1, X2]))
model = sm.OLS(y, X).fit()
model.summary()
| Dep. Variable: | y | R-squared: | 0.826 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.825 |
| Method: | Least Squares | F-statistic: | 466.2 |
| Date: | Tue, 13 Feb 2024 | Prob (F-statistic): | 4.98e-39 |
| Time: | 09:50:43 | Log-Likelihood: | -141.90 |
| No. Observations: | 100 | AIC: | 287.8 |
| Df Residuals: | 98 | BIC: | 293.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.0724 | 0.075 | 54.100 | 0.000 | 3.923 | 4.222 |
| x1 | 1.0145 | 0.072 | 14.005 | 0.000 | 0.871 | 1.158 |
| x2 | 3.0579 | 0.045 | 67.296 | 0.000 | 2.968 | 3.148 |

| Omnibus: | 1.402 | Durbin-Watson: | 2.156 |
|---|---|---|---|
| Prob(Omnibus): | 0.496 | Jarque-Bera (JB): | 1.313 |
| Skew: | 0.152 | Prob(JB): | 0.519 |
| Kurtosis: | 2.527 | Cond. No. | 1.32e+16 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.91e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
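Because X2 = 1 - X1, the three columns of the design matrix are linearly dependent: the individual coefficients are not identified, statsmodels falls back on a pseudoinverse, and the output reports Df Model = 1 together with an enormous condition number. A minimal check (added here, not in the original notebook):

# The constant, X1 and X2 = 1 - X1 are linearly dependent,
# so the 3-column design matrix has rank 2 rather than 3
print(np.linalg.matrix_rank(X))  # expected: 2
print(np.linalg.cond(X))         # extremely large condition number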
Varying degrees of multicollinearity (#2)
- We randomly generate 100 correlated observations of X1 and X2 from a bivariate normal distribution, letting the correlation between X1 and X2 take different values (0, 0.3, 0.6, 0.9, 0.95, and 0.99).
- We specify a true regression line: $y = 2X_1 + 4X_2 + e$
- We include both X1 and X2 in the regression model: y ~ X1 + X2 (without an intercept, for simplicity)
- For each preset correlation value, we repeat the process 1,000 times to generate the sampling distribution of the regression estimates.
- We summarize the sampling distributions of the estimates.
#Write a function to return the sampling distribution of regression coefficients
# based on different degree of correlation between X1 and X2
def simulation_multi_colinearity(cor):
    params = []
    for i in range(1000):
        means = [2, 2]
        cov = [[1, cor], [cor, 1]]
        X = np.random.multivariate_normal(means, cov, 100)
        e = np.random.randn(100)*2
        y = 2*X[:,0] + 4*X[:,1] + e
        model = sm.OLS(y, X).fit()
        params.append(model.params)
    return params
sampling_dist_0 = simulation_multi_colinearity(0)
sampling_dist_0_3 = simulation_multi_colinearity(0.3)
sampling_dist_0_6 = simulation_multi_colinearity(0.6)
sampling_dist_0_9 = simulation_multi_colinearity(0.9)
sampling_dist_0_95 = simulation_multi_colinearity(0.95)
sampling_dist_0_99 = simulation_multi_colinearity(0.99)
print("SE of estimates when X1 and X2 have a cor of 0:", np.std(sampling_dist_0,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.3:", np.std(sampling_dist_0_3,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.6:", np.std(sampling_dist_0_6,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.9:", np.std(sampling_dist_0_9,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.95:", np.std(sampling_dist_0_95,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.99:", np.std(sampling_dist_0_99,axis=0))
SE of estimates when X1 and X2 have a cor of 0: [0.14924138 0.15362695]
SE of estimates when X1 and X2 have a cor of 0.3: [0.18184739 0.18149639]
SE of estimates when X1 and X2 have a cor of 0.6: [0.23542159 0.23302496]
SE of estimates when X1 and X2 have a cor of 0.9: [0.46005811 0.46522827]
SE of estimates when X1 and X2 have a cor of 0.95: [0.62053723 0.61689469]
SE of estimates when X1 and X2 have a cor of 0.99: [1.5058649 1.50165901]
print("Mean of estimates when X1 and X2 have a cor of 0:", np.mean(sampling_dist_0,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.3:", np.mean(sampling_dist_0_3,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.6:", np.mean(sampling_dist_0_6,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.9:", np.mean(sampling_dist_0_9,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.95:", np.mean(sampling_dist_0_95,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.99:", np.mean(sampling_dist_0_99,axis=0))
Mean of estimates when X1 and X2 have a cor of 0: [1.99528853 4.00220908]
Mean of estimates when X1 and X2 have a cor of 0.3: [1.99374527 4.00339456]
Mean of estimates when X1 and X2 have a cor of 0.6: [2.01461324 3.98911849]
Mean of estimates when X1 and X2 have a cor of 0.9: [2.02487827 3.97569048]
Mean of estimates when X1 and X2 have a cor of 0.95: [2.04199266 3.95822687]
Mean of estimates when X1 and X2 have a cor of 0.99: [1.96516269 4.03487359]
We can observe that, across the different degrees of correlation between the two predictors, the regression coefficients remain unbiased, but the standard errors of the estimates become increasingly inflated as the correlation grows.
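A common way to quantify this inflation is the variance inflation factor, VIF = 1/(1 - R_j^2), where R_j^2 comes from regressing one predictor on the others; with a correlation of 0.99 this is roughly 1/(1 - 0.99^2) ≈ 50. The sketch below (added here, not part of the original notebook) computes VIFs for one simulated design using statsmodels' variance_inflation_factor helper.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# One simulated design with highly correlated predictors (cor = 0.99)
X_vif = sm.add_constant(np.random.multivariate_normal([2, 2], [[1, 0.99], [0.99, 1]], 100))
# VIF of each predictor column (skipping the constant): how much the variance
# of its coefficient is inflated relative to an uncorrelated design
print([variance_inflation_factor(X_vif, i) for i in range(1, X_vif.shape[1])])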
Adding an irrelevant predictor (#3)
- We randomly generate 100 data points each for X1, X2, and X3 from normal distributions.
- We specify a true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We include X1, X2, and X3 in the regression model: y ~ X1 + X2 + X3 (with an intercept). X3 does not belong in the model but is included anyway.
- We fit 1,000 such regression models and examine the bias.
estimates = []
for i in range(1000):
    X1 = np.random.randn(100)
    X2 = np.random.randn(100)
    X3 = np.random.randn(100)
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1, X2, X3]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))
Average of estimates: [ 3.00132214e+00 1.99576168e+00 3.99886082e+00 -2.36659646e-03]
SE of estimates: [0.10146959 0.09980508 0.10321398 0.1018764 ]
We find that the estimates for X1 and X2 are unbiased and their standard errors remain small. The estimate for X3 is essentially zero on average, and if we inspect any single one of the 1,000 models we typically find that it is not statistically significant. In this sense, including an irrelevant random predictor does little harm.
model.summary()
| Dep. Variable: | y | R-squared: | 0.962 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.961 |
| Method: | Least Squares | F-statistic: | 806.9 |
| Date: | Tue, 13 Feb 2024 | Prob (F-statistic): | 6.27e-68 |
| Time: | 09:50:44 | Log-Likelihood: | -138.60 |
| No. Observations: | 100 | AIC: | 285.2 |
| Df Residuals: | 96 | BIC: | 295.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.0338 | 0.099 | 30.704 | 0.000 | 2.838 | 3.230 |
| x1 | 1.9591 | 0.094 | 20.781 | 0.000 | 1.772 | 2.146 |
| x2 | 3.9472 | 0.090 | 43.918 | 0.000 | 3.769 | 4.126 |
| x3 | 0.1564 | 0.099 | 1.580 | 0.117 | -0.040 | 0.353 |

| Omnibus: | 2.692 | Durbin-Watson: | 2.382 |
|---|---|---|---|
| Prob(Omnibus): | 0.260 | Jarque-Bera (JB): | 2.525 |
| Skew: | -0.387 | Prob(JB): | 0.283 |
| Kurtosis: | 2.920 | Cond. No. | 1.20 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
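As a final check (added here, not part of the original notebook), we could also record the p-value of X3 in every replication; since X3 is truly irrelevant, it should appear "significant" at the 5% level in only about 5% of the 1,000 models.

# Sketch: how often does the irrelevant X3 look significant by chance?
pvals = []
for i in range(1000):
    X1, X2, X3 = np.random.randn(100), np.random.randn(100), np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + np.random.randn(100)
    X = sm.add_constant(np.column_stack([X1, X2, X3]))
    pvals.append(sm.OLS(y, X).fit().pvalues[3])  # p-value of X3
print("Share of replications with p < 0.05 for X3:", np.mean(np.array(pvals) < 0.05))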