Statistical Learning Notes (1): Introduction to Supervised Learning (1)

Source texts: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani

Brief Introduction to Statistical Learning

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. We want to increase the sales of the product by controlling the advertising expenditure in each of the three media, so we determine the associations between the media budgets and sales.

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted by the symbol X, with a subscript to distinguish them. So X1 might be the TV budget, X2 the radio budget, and X3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable (in this case, sales) is often called the response or dependent variable, and is typically denoted by the symbol Y.

Reducible and irreducible error

For a quantitative response Y and p different predictors X1, ..., Xp, we assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

Y = f(X) + ε

Here f is a fixed but unknown function of X. The second term ε is a random error with mean zero and is independent of X.

Interpretations of f:

Sales is a response or target that we wish to predict; we generically refer to the response as Y. TV is a feature, or input, or predictor; we name it X1. Likewise, we name radio X2, and so on.

There is an ideal f(X). In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is f(4) = E(Y | X = 4), the expected value (average) of Y given X = 4. This ideal f(x) = E(Y | X = x) is called the regression function.


We can understand this as follows: if we observe the response several times at the same input x, the observed values y1, y2, ..., ym still differ even though the input is unchanged, because of the random error. The statistical average of these values converges to f(x), which is also the expected value of Y at x. So we can make the prediction

f^(x) = E(Y | X = x)

(Different components of X may have different importance.) E(Y | X = x) means the expected value (average) of Y given X = x. This ideal f(x) = E(Y | X = x) is called the regression function.
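To make this concrete, here is a minimal sketch (toy data, not from the book) that estimates the regression function at a point by averaging the observed Y values at that input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): Y = 2X + eps, eps ~ N(0, 1), X in {0, ..., 9}.
n = 200_000
x = rng.integers(0, 10, size=n).astype(float)
y = 2 * x + rng.normal(0, 1, size=n)

# The regression function at X = 4 is f(4) = E(Y | X = 4) = 8.
# Estimate it by averaging the Y values observed at X = 4.
f_hat_4 = y[x == 4].mean()
print(f_hat_4)  # close to 8.0
```

With enough observations at X = 4, the sample average approaches the conditional mean, which is exactly the regression function at that point.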

We can certainly make errors in prediction, and the error is divided into two parts: one is reducible and the other is irreducible. The reducible error can be decreased by improving our estimate f^ of f, but the irreducible error, which comes from the random error ε, cannot be eliminated no matter how well we estimate f.

E(Y − Y^)² = [f(X) − f^(X)]² + Var(ε)

where Y^ = f^(X), the first term is the reducible error, and Var(ε) is the irreducible error.

This error is reducible because we can potentially improve the accuracy of f^ by using the most appropriate statistical learning technique to estimate f. The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction.
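A small simulation (hypothetical numbers throughout) can confirm the decomposition of the expected squared error into a reducible and an irreducible part:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: true f(x) = sin(x), Var(eps) = 0.25, and a fixed,
# deliberately imperfect prediction f_hat(x0) = 0.7 at x0 = 1.0.
f = np.sin
var_eps = 0.25
x0 = 1.0
f_hat_x0 = 0.7

# Monte Carlo estimate of E[(Y - f_hat(x0))^2] at X = x0.
eps = rng.normal(0.0, np.sqrt(var_eps), size=1_000_000)
y = f(x0) + eps
mse = np.mean((y - f_hat_x0) ** 2)

# Decomposition: squared reducible error + irreducible error Var(eps).
decomposed = (f(x0) - f_hat_x0) ** 2 + var_eps
print(mse, decomposed)  # the two agree to Monte Carlo accuracy
```

However good the estimate f_hat_x0 becomes, the simulated squared error can never fall below Var(ε) = 0.25, which is the irreducible floor.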

Parametric and Non-parametric methods

We can apply parametric models, such as the linear model

f(X) = β0 + β1X1 + β2X2 + ... + βpXp

so that estimating f reduces to estimating the p + 1 coefficients β0, β1, ..., βp.
However, any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f.
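The sketch below illustrates this risk on toy data: the true f is quadratic, but we assume a linear form, so even the best-fitting line leaves a large systematic error:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the true f is quadratic, f(x) = x^2, with small noise.
x = np.linspace(-2.0, 2.0, 100)
y = x ** 2 + rng.normal(0.0, 0.1, size=x.size)

# Parametric assumption: f(X) = beta0 + beta1 * X, fit by least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta

# The assumed linear form is far from the true quadratic f, so the
# residual mean square stays large no matter how the line is chosen.
print(beta, np.mean(residual ** 2))
```

Here the fitted slope is near zero (the data are symmetric), and the residual error dwarfs the noise level 0.1: the gap is pure model-misspecification error.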

Alternatively, we can use a non-parametric method such as a thin-plate spline. In order to fit a thin-plate spline, the data analyst must select a level of smoothness.


But overfitting might occur. To avoid overfitting, refer to Resampling Methods in "An Introduction to Statistical Learning". (For an introduction to thin-plate splines, see the paper "Thin-Plate Spline Interpolation" on SpringerLink and the code at http://elonen.iki.fi/code/tpsdemo/.)

Trade-offs between Prediction Accuracy and Model Interpretability in curve fitting

Several considerations when choosing models:

1. Linear models are easy to interpret, while thin-plate splines are not.

2. Good fit versus overfitting.

3. We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Following is a representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.


Methods in assessing model accuracy:

MSE:

MSE = (1/n) Σ (yi − f^(xi))², summing over i = 1, ..., n

Here i is the index of the observation and f^(xi) is the prediction that f^ gives for the ith observation. Computed on the training data, this is the training MSE, but we really care about accuracy on previously unseen test data. In practice, we use

Ave(y0 − f^(x0))²

the average squared prediction error for the test observations (x0, y0).
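As a sketch (hypothetical `mse` helper and simulated data), the training and test MSE of an overly flexible fit can be computed like this:

```python
import numpy as np

rng = np.random.default_rng(3)

def mse(y, y_hat):
    """Training/test MSE: (1/n) * sum((y_i - y_hat_i)^2)."""
    return np.mean((y - y_hat) ** 2)

# Simulated data (assumed): true f(x) = 3x with noise of std 0.5.
x_train = rng.uniform(0.0, 1.0, 50)
y_train = 3.0 * x_train + rng.normal(0.0, 0.5, 50)
x_test = rng.uniform(0.0, 1.0, 50)
y_test = 3.0 * x_test + rng.normal(0.0, 0.5, 50)

# A flexible degree-10 polynomial can chase the training noise.
coef = np.polyfit(x_train, y_train, deg=10)
train_mse = mse(y_train, np.polyval(coef, x_train))
test_mse = mse(y_test, np.polyval(coef, x_test))
print(train_mse, test_mse)
```

On such a flexible fit the training MSE typically comes out well below the test MSE, which is exactly the gap between the two quantities defined above.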

Choosing test data:

How can we go about trying to select a method that minimizes the test MSE? In some settings, we may have a test data set available—that is, we may have access to a set of observations that were not used to train the statistical learning method.

Evaluate model:

We choose 3 examples:

The first figure represents a comprehensive example, with high fluctuation and high noise level. The second figure has low fluctuation but high noise level, the third has high fluctuation and low noise level.

The horizontal dashed line indicates Var(ε), the irreducible error, which corresponds to the lowest achievable test MSE among all methods. As shown in the following graphs, the training error can fall below this bound, but the expected test error cannot.


When we evaluate on fresh test data to detect overfitting, two effects appear. First, the fluctuation of the true f matters when the flexibility is low: comparing the low-flexibility parts of the second and third figures, a model with low flexibility cannot follow a strongly fluctuating f, so in the third figure the test MSE drops as flexibility increases from low values; this is the advantage of flexible methods. Second, the noise level matters when the flexibility is high: the second figure has strong noise and the third weak noise. At very high flexibility, the model in the second figure fits the noise in the training data well, but this is useless for prediction; it is overfitting. When the noise is weak, as in the third figure, very high flexibility brings almost no benefit either. So a moderate flexibility in the middle is usually enough. In each figure, the red curve is the error evaluated on the fresh test set and the gray curve is the error evaluated on the training set.

Bias-Variance Trade-off

Suppose we have fit a model f^(x) to the training set. The true model is

Y = f(X) + ε

with f(x) = E(Y | X = x), and the expected test MSE at a point x0 decomposes as

E(y0 − f^(x0))² = Var(f^(x0)) + [Bias(f^(x0))]² + Var(ε)

The left side of the equation is the expected test MSE: the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets and tested each estimate at x0.

Variance: variance refers to the amount by which f^ would change if we estimated it using a different training data set. If a method has high variance, then small changes in the training set can result in large changes in f^.

Bias:

Bias(f^(x0)) = E[f^(x0)] − f(x0)
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, . . . , Xp. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f.

As the flexibility (order of model) increases, the variance of estimation increases, the bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
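This trade-off can be observed directly in a Monte Carlo sketch (assumed toy model, 500 simulated training sets): repeatedly refit models of low and high flexibility and measure the squared bias and variance of their predictions at a fixed point:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed true model: f(x) = sin(2x), noise std 0.3.
f = lambda x: np.sin(2.0 * x)
sigma = 0.3
x0 = 1.5  # fixed point at which we study the predictions

def fit_and_predict(deg):
    """Draw a fresh training set, fit a degree-`deg` polynomial,
    and return its prediction at x0."""
    x = rng.uniform(0.0, 3.0, 30)
    y = f(x) + rng.normal(0.0, sigma, 30)
    return np.polyval(np.polyfit(x, y, deg), x0)

results = {}
for deg in (1, 5):
    preds = np.array([fit_and_predict(deg) for _ in range(500)])
    bias_sq = (preds.mean() - f(x0)) ** 2  # [E f^(x0) - f(x0)]^2
    var = preds.var()                      # Var(f^(x0))
    results[deg] = (bias_sq, var)
    print(deg, bias_sq, var)
```

The inflexible degree-1 fit shows a large squared bias because a line cannot track sin(2x), while the flexible degree-5 fit drives the bias down at the cost of more variability across training sets.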

Corresponding to the above 3 graphs: blue curve, squared bias; orange curve, variance; red curve, test MSE; dashed line, Var(ε) (the irreducible error).

To help us evaluate the test MSE with only training data, we can refer to cross-validation in "Resampling Methods".

Classification

Suppose that we seek to estimate f on the basis of training observations {(x1, y1), . . . , (xn, yn)}, where now y1, . . . , yn are qualitative. The training error rate is

(1/n) Σ I(yi ≠ y^i), summing over i = 1, ..., n

where y^i is the predicted class label for the ith observation and I(yi ≠ y^i) is an indicator variable that equals 1 if yi ≠ y^i and 0 otherwise.

Bayes Classifier:

It is possible to show that the test error rate is minimized, on average, by a very simple classifier that assigns each observation to its most likely class, given its predictor values.

As to classification problems, consider a classifier C(x) that assigns a class label to an observation with X = x. Denote

pk(x) = Pr(Y = k | X = x), for k = 1, 2, ..., K

The Bayes optimal classifier assigns each observation to the class with the largest conditional probability:

C(x) = j if pj(x) = max{p1(x), p2(x), ..., pK(x)}


Typically we measure the estimation performance using the test misclassification error rate

Ave(I(y0 ≠ C^(x0)))

the average over test observations of the indicator that the predicted class differs from the true class.
The Bayes classifier achieves the smallest possible test error rate, called the Bayes error rate.
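Since the Bayes classifier needs the true conditional probabilities, it can only be demonstrated on simulated data where those are known. A minimal sketch (assumed two-Gaussian setup):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed toy problem: two classes with equal priors,
# X | Y=0 ~ N(-1, 1) and X | Y=1 ~ N(+1, 1).
n = 100_000
labels = rng.integers(0, 2, n)
x = rng.normal(2.0 * labels - 1.0, 1.0)

# The Bayes classifier assigns the most likely class given X = x;
# for this symmetric setup that reduces to predicting 1 when x > 0.
bayes_pred = (x > 0.0).astype(int)
bayes_err = np.mean(bayes_pred != labels)

# Any other decision rule, e.g. thresholding at 0.5, has a higher error rate.
other_pred = (x > 0.5).astype(int)
other_err = np.mean(other_pred != labels)
print(bayes_err, other_err)
```

In this setup the Bayes error rate is Φ(−1) ≈ 0.159, and no classifier, however flexible, can do better on average.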

SVMs build structured models for C(x) directly; other methods build structured models for the conditional probabilities pk(x), e.g. logistic regression, generalized additive models, and K-nearest neighbors.
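K-nearest neighbors, for example, estimates pk(x) by the class frequencies among the k training points closest to x. A minimal one-dimensional sketch (toy data, hypothetical `knn_prob` helper):

```python
import numpy as np

rng = np.random.default_rng(6)

# Same toy two-class setup: X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(+1, 1).
n = 2000
labels = rng.integers(0, 2, n)
x = rng.normal(2.0 * labels - 1.0, 1.0)

def knn_prob(x0, k=25):
    """Estimate p1(x0) = Pr(Y = 1 | X = x0) as the fraction of class-1
    points among the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return labels[nearest].mean()

# Near the midpoint the classes are about equally likely;
# far to the right, class 1 dominates.
print(knn_prob(0.0), knn_prob(2.5))
```

Classifying by the larger estimated probability then mimics the Bayes rule, with accuracy limited by how well the local frequencies approximate pk(x).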








