Statistical Learning Notes (4): Linear Regression (1)

Basic Introduction

In this chapter, we review some of the key ideas underlying the linear regression model, as well as the least squares approach that is most commonly used to fit this model.

Basic form:

Y ≈ β0 + β1X
“≈” means “is approximately modeled as”. To estimate the parameters, by far the most common approach is to minimize the least squares criterion. Let the training samples be (x1,y1),...,(xn,yn), and define the residual sum of squares (RSS) as

RSS = e1² + e2² + ... + en² = Σi (yi − β̂0 − β̂1xi)²

Minimizing the RSS gives

β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,    β̂0 = ȳ − β̂1x̄

About the result: in the ideal case, that is, with unlimited samples, the fit approaches what is called the population regression line.

Population regression line: the best possible fit, obtained from the whole population

Least squares regression line: the fit obtained from limited samples

To evaluate how well our estimate matches the true value, we use the standard error, e.g. when estimating the mean value μ. The variance of the sample mean μ̂ is

Var(μ̂) = SE(μ̂)² = σ²/n,

where σ is the standard deviation of each of the realizations yi (i = 1, 2, ..., n; y1, ..., yn are uncorrelated) and n is the number of samples.

Definition of standard deviation:

Suppose X is a random variable with mean value μ = E[X]. Then the standard deviation is

σ = sqrt(E[(X − μ)²]),

i.e. the square root of the variance. About the computation of the standard deviation from data:

Let x1, ..., xN be samples (here N is the number of limited samples); then we calculate the sample standard deviation with

s = sqrt( (1/(N − 1)) · Σi (xi − x̄)² )

Note: for limited samples we divide by N − 1, which is the common case; for the whole population in the ideal case, we divide by N.

In the following formulas,

SE(β̂0)² = σ² [1/n + x̄² / Σi (xi − x̄)²],    SE(β̂1)² = σ² / Σi (xi − x̄)²,

SE means standard error. The standard error of the mean is the standard deviation divided by sqrt(n); the standard error measures the accuracy of an estimate, while the standard deviation measures the spread of the data.

In general σ is unknown, so we estimate it with the residual standard error (RSE):

RSE = sqrt(RSS / (n − 2))
The 95% confidence intervals for β1 and β0:

β̂1 ± 2·SE(β̂1),    β̂0 ± 2·SE(β̂0)

As an example, we can compute the t-statistic

t = (β̂1 − 0) / SE(β̂1)

to test whether β1 is zero, i.e. the non-existence of a relationship between X (predictor) and Y (response).
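As a concrete illustration (not from the original notes), here is a minimal NumPy sketch that fits a simple linear regression with the closed-form least squares formulas above, then computes SE(β̂1), the t-statistic, and the rule-of-thumb 95% confidence interval. The simulated data and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x + noise, so the true beta1 is 3.
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)

# Closed-form least squares estimates for simple linear regression.
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Residual standard error estimates sigma: RSE = sqrt(RSS / (n - 2)).
resid = y - (beta0 + beta1 * x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard error of beta1 and the t-statistic for H0: beta1 = 0.
se_beta1 = rse / np.sqrt(np.sum((x - x_bar) ** 2))
t_stat = (beta1 - 0) / se_beta1

# 95% confidence interval (the "2 standard errors" rule of thumb).
ci = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)
print(beta1, t_stat, ci)
```

A very large t-statistic here means H0: β1 = 0 would be firmly rejected.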

Assessing the Accuracy of the Model
There are two criteria:

1. Residual Standard Error

RSE = sqrt(RSS / (n − 2)) = sqrt( (1/(n − 2)) · Σi (yi − ŷi)² )

2. R^2 Statistic

R² = (TSS − RSS) / TSS = 1 − RSS/TSS,    where TSS = Σi (yi − ȳ)²

TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression. An R^2 statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. On the other hand, we can use the correlation

Cor(X, Y) = Σi (xi − x̄)(yi − ȳ) / [ sqrt(Σi (xi − x̄)²) · sqrt(Σi (yi − ȳ)²) ]

to assess the fit of the linear model. In the simple linear model, R² = Cor(X,Y)².

R² is normalized (it always lies between 0 and 1): when the actual line is steeper, TSS is larger, and because the scale of the response grows as well, RSS measured on the same scale can also be larger; dividing by TSS makes the statistic comparable across data sets.
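The quantities above can be checked numerically. A minimal sketch on simulated data (all names illustrative) that computes RSS, TSS, RSE, and R², and verifies that in the simple linear case R² equals the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(0, 2, n)
y = 1 + 0.5 * x + rng.normal(0, 0.5, n)

# Least squares fit.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)      # variability left unexplained
tss = np.sum((y - y.mean()) ** 2)   # total variability in Y
rse = np.sqrt(rss / (n - 2))        # estimate of the noise standard deviation
r2 = 1 - rss / tss                  # fraction of variance explained

# In simple linear regression, R^2 equals the squared correlation.
cor = np.corrcoef(x, y)[0, 1]
print(rse, r2, cor ** 2)
```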

Multiple Linear Regression

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

We can use least squares to get the estimates β̂0, β̂1, ..., β̂p that minimize

RSS = Σi (yi − β̂0 − β̂1xi1 − ... − β̂pxip)²
We also use the F-statistic to test whether at least one of the predictors is related to the response:

F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]

An explanation of the above expression: RSS represents the variability left unexplained and TSS is the total variability. Since we have estimated p slope parameters (plus the intercept) from n observations, RSS has n − p − 1 degrees of freedom, and TSS − RSS has p.
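A small sketch of multiple least squares and the overall F-statistic, using numpy.linalg.lstsq on a simulated data set (the data and names are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 3

# Design matrix; only X1 and X2 actually influence the response here.
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1, n)

A = np.column_stack([np.ones(n), X])            # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ beta_hat
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# Overall F-statistic for H0: beta1 = ... = betap = 0.
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(beta_hat, f_stat)
```

With a real signal present, the F-statistic comes out far above 1, as the text's decision rule expects.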

Subtopic 0: Hypothesis Testing in Single and Multiple Linear Regression

[The ANOVA-table figures for the single- and multiple-regression F-tests are missing here. Per the original note, the subscript 0 in Q剩 (the residual sum of squares) in that table should read 1.]


Subtopic 1: whether each of the predictors is useful in predicting the response.

To indicate which one of

H0: β1 = β2 = ... = βp = 0
Ha: at least one βj is non-zero

is true: if the F-statistic is close to 1, the data are consistent with H0; if the F-statistic is much greater than 1, that is evidence for Ha.

It turns out that the answer depends on the values of n and p. When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small.

When H0 is true and the errors εi have a normal distribution, the F-statistic follows an F-distribution. For any given values of n and p, any statistical software package can be used to compute the p-value associated with the F-statistic using this distribution. Based on this p-value, we can determine whether or not to reject H0.

Here the p-value is defined as the probability, under the null hypothesis, of obtaining a result equal to or more extreme than what was actually observed. The reason a small p-value indicates a relationship between at least one of the p predictors and the response is that the F-statistic follows an F-distribution only when H0 is true and the errors εi have a normal distribution; a small p-value means the observed F would be very unlikely under that hypothesis, so H0 does not adequately explain the observation. The smaller the p-value, the more suspect H0 is.

If we only want to test a subset of the parameters, say the last q of them,

H0: βp−q+1 = βp−q+2 = ... = βp = 0

we fit a second model that uses all the variables except those last q. Suppose that the residual sum of squares for that model is RSS0; its number of degrees of freedom is q larger than that of RSS. The corresponding F-statistic is

F = [(RSS0 − RSS) / q] / [RSS / (n − p − 1)]

When q = 1, this is equivalent to a t-test, so it reports the partial effect of adding that variable to the model.

Each F-statistic (or t-statistic) is then converted into a p-value.

Note: we cannot look only at the individual p-values for each predictor; we should also examine the overall F-statistic. With many predictors, some individual p-values will fall below 0.05 by chance even when no predictor is related to the response; in contrast, if H0 is true there is only a 5% chance that the overall F-statistic yields a p-value below 0.05, regardless of the number of predictors or the number of observations.
Subtopic 2: Importance of variables
To make sure of the importance of the predictors, we can try:

Method 1: Forward Selection. We begin with the null model, then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS.

Method 2: Backward Selection. We start with all variables in the model, and repeatedly remove the variable with the largest p-value.

Method 3: Mixed Selection.

Process: we start with no variables in the model and add, one at a time, the variable that provides the best fit.

Stop: If at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.
Comment: Backward selection cannot be used if p > n, while forward selection can always be used. Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.
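The forward-selection procedure of Method 1 can be sketched as follows. This is a minimal illustration on simulated data (names and data made up), not a production implementation:

```python
import numpy as np

def rss_of_fit(A, y):
    """RSS of the least squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_selection(X, y, k):
    """Greedily add the k predictors that most reduce the RSS."""
    n, p = X.shape
    chosen = []
    remaining = list(range(p))
    for _ in range(k):
        # Try each remaining variable; keep the one giving the lowest RSS.
        best = min(remaining,
                   key=lambda j: rss_of_fit(
                       np.column_stack([np.ones(n)] +
                                       [X[:, c] for c in chosen + [j]]), y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 5))
# Only columns 0 and 3 actually drive the response.
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, n)
print(forward_selection(X, y, 2))
```

The greedy nature is visible in the loop: each step conditions only on the variables already chosen, which is why an early pick can later become redundant.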

Subtopic 3: Model fitting quality

Two criteria: R² (the fraction of variance explained) and the RSE. If adding a predictor greatly reduces the RSE, the predictor is useful. But because RSE = sqrt(RSS/(n − p − 1)), a model with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.

In addition to looking at the RSE and R2 statistics just discussed, it can be useful to plot the data.

Graphical summaries can reveal problems with a model that are not visible from numerical statistics. For example, Figure below displays a three-dimensional plot of TV and radio versus sales. We see that some observations lie above and some observations lie below the least squares regression plane. In particular, the linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio. It underestimates sales for instances where the budget was split between the two media. This pronounced non-linear pattern cannot be modeled accurately using linear regression. It suggests a synergy or interaction effect between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium. In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.


Qualitative Predictors
A qualitative predictor can take on 2 values; it is encoded as a binary (0/1) dummy variable, while the response remains quantitative.
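A brief sketch of the 0/1 dummy-variable encoding (the `is_student`/`balance` names are invented for illustration): with an intercept plus one dummy, least squares recovers the two group means exactly:

```python
import numpy as np

rng = np.random.default_rng(4)

# Qualitative predictor with two levels, encoded as a 0/1 dummy variable.
is_student = rng.integers(0, 2, 300)      # 0 = baseline level, 1 = other level
balance = 500 + 380 * is_student + rng.normal(0, 50, 300)

A = np.column_stack([np.ones(300), is_student])
beta, *_ = np.linalg.lstsq(A, balance, rcond=None)

# beta[0] is the mean response of the baseline level (dummy = 0);
# beta[1] is the difference between the two group means.
mean0 = balance[is_student == 0].mean()
mean1 = balance[is_student == 1].mean()
print(beta, mean0, mean1)
```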

Further introduction to the linear model

Two basic assumptions:

1. Additive: the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors.

2. Linear: the change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj.

There are some situations the above assumptions miss, e.g. a synergy effect: equal increases in both predictors can contribute more to the increase in the response than unbalanced increases. For example, we can modify

Y = β0 + β1X1 + β2X2 + ε

to

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

Non-linear relationships:

For example, if the linear model has a low R² value, we can change Y = a + bX1 to Y = a + bX1 + cX1², and represent it as Y = a + bX1 + cX2 where X2 = X1². We can then use standard linear regression software to estimate a, b and c.
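The X2 = X1² trick can be sketched as follows (simulated quadratic data; names illustrative). Note that the model stays linear in the coefficients, so ordinary least squares applies unchanged:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 250
x1 = rng.uniform(-3, 3, n)
# True relationship is quadratic: y = 1 + 2*x1 + 0.5*x1^2 + noise.
y = 1 + 2 * x1 + 0.5 * x1 ** 2 + rng.normal(0, 0.5, n)

x2 = x1 ** 2                    # define X2 = X1^2: a new "linear" predictor
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # estimates of a, b, c
print(coef)
```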

Potential problems:

1. Non-linearity

Inspect the residual plot for a discernible pattern; if there is one, the model should not be linear. (In the figure: left, an obvious discernible pattern; right, no obvious pattern.)


2. Correlation of Error Terms

An important assumption of the linear regression model is that the error terms ε1, ε2, ..., εn are uncorrelated (i indexes the samples). If they are in fact correlated, the estimated standard errors will tend to underestimate the true standard errors. It also helps to plot the residuals as a function of time.


In the top panel, we see the residuals from a linear regression fit to data generated with uncorrelated errors. There is no evidence of a time-related trend in the residuals. In contrast, the residuals in the bottom panel are from a data set in which adjacent errors had a correlation of 0.9.
3. Non-constant Variance of Error Terms


Cause: It is often the case that the variances of the error terms are non-constant. For instance, the variances of the error terms may increase with the value of the response.

Solution: When faced with this problem, one possible solution is to transform the response Y using a concave function such as log Y or √Y. Such a transformation results in a greater amount of shrinkage of the larger responses. Sometimes we instead have a good idea of the variance of each response: for example, if the ith response is an average of ni raw observations that are uncorrelated with common variance σ², then its variance is σi² = σ²/ni. In that case we can fit the model by weighted least squares, with weights proportional to the inverse variances, i.e. wi = ni.
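A minimal weighted least squares sketch under the averaging scenario just described (each response is an average of ni raw observations, so the weights are wi = ni); the usual implementation trick is to rescale each row of the design matrix and response by sqrt(wi). The simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 120
x = rng.uniform(0, 5, n)
# The i-th response is an average of n_i raw observations, so its
# error variance is sigma^2 / n_i: non-constant across observations.
n_i = rng.integers(1, 20, n)
y = 1 + 2 * x + rng.normal(0, 1, n) / np.sqrt(n_i)

# Weighted least squares with weights w_i = n_i, implemented by
# multiplying each row and response entry by sqrt(w_i).
w = np.sqrt(n_i)
A = np.column_stack([np.ones(n), x])
beta_wls, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
print(beta_wls)
```

After the rescaling, every transformed observation has the same error variance, so ordinary least squares on the transformed problem is exactly WLS.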

4. Outliers

We can plot the studentized residuals, obtained by dividing each residual by its estimated standard error; observations whose studentized residuals are large in absolute value (e.g. greater than 3) are possible outliers.

5. High leverage Points

We just saw that outliers are observations for which the response yi is unusual given the predictor xi. In contrast, observations with high leverage have an unusual value for xi.

Phenomenon:



Left: Observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed. Center: The red observation is not unusual in terms of its X1 value or its X2 value, but still falls outside the bulk of the data, and hence has high leverage. Right: Observation 41 has a high leverage and a high residual.

Interpretation:

We observe that removing the high leverage observation has a much more substantial impact on the least squares line than removing the outlier. It is cause for concern if the least squares line is heavily affected by just a couple of observations.

Detection:

In order to quantify an observation’s leverage, we compute the leverage statistic; for a simple linear regression,

hi = 1/n + (xi − x̄)² / Σi′ (xi′ − x̄)²

The average leverage over all observations is always (p + 1)/n, so if a given observation has a leverage statistic that greatly exceeds (p + 1)/n, we may suspect that the corresponding point has high leverage.
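In the general case the leverages are the diagonal entries of the hat matrix H = A(AᵀA)⁻¹Aᵀ, where A is the design matrix with an intercept column. A small sketch on simulated data, confirming that the leverages average to (p + 1)/n:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 2
X = rng.normal(size=(n, p))
A = np.column_stack([np.ones(n), X])

# Hat matrix H = A (A^T A)^{-1} A^T; its diagonal holds the leverages h_i.
H = A @ np.linalg.inv(A.T @ A) @ A.T
leverage = np.diag(H)

# The leverages always average to (p + 1) / n.
print(leverage.mean(), (p + 1) / n)
```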

6. Collinearity


Compared with the left image, the right one shows collinearity: the two predictors are correlated with each other.

The difficulties arising from collinearity are:


Left: Contours of the RSS associated with different possible coefficient estimates for the regression of balance (response) on limit (predictor 1) and age (predictor 2).

The result is that a small change in the data could cause the pair of coefficient values that yield the smallest RSS—that is, the least squares estimates—to move anywhere along this valley. Collinearity thus reduces the accuracy of the estimates of the regression coefficients: it causes the standard error for β̂j to grow.

The power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is reduced by collinearity. A simple way to detect collinearity is to look at the correlation matrix of the predictors.


Multicollinearity: cannot be detected by looking at the correlation matrix
Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity.

Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF).

The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

For each predictor,

VIF(β̂j) = 1 / (1 − R²_{Xj|X−j})

where R²_{Xj|X−j} is the R² from a regression of Xj (acting as the response) onto all of the other predictors.
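A sketch of computing the VIF directly from its definition (regress each Xj on the remaining predictors, then apply 1/(1 − R²)). The data are simulated so that X3 is nearly a linear combination of X1 and X2, the multicollinearity case a pairwise correlation matrix can miss:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2 of X_j ~ X_{-j})."""
    n = X.shape[0]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(n), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

rng = np.random.default_rng(8)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(0, 0.1, n)   # nearly a linear combination of x1, x2
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])
```

All three VIFs come out large even though no single pairwise correlation is extreme between x1 and x2 alone, which is exactly the multicollinearity situation.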

Solution
The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor.





















