Origins of AutoML: Best Subset Selection

As there is a lot of buzz about AutoML, I decided to write about the original AutoML: step-wise regression and best subset selection. Then I decided to ignore step-wise regression because it is bad and should probably stop being taught. That leaves best subset selection to discuss.

The idea behind best subset selection is to choose the “best” subset of variables to include in a model, looking at groups of variables together, as opposed to step-wise regression, which compares them one at a time. We determine which set of variables is “best” by assessing which sub-model fits the data best while penalizing for the number of independent variables in the model to avoid over-fitting. There are multiple metrics for assessing how well a model fits: adjusted 𝑅-squared, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Mallows’ 𝐶𝑝 are probably the best known.

The formulas for each are below.

$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k}$

$\text{AIC} = 2k - 2\ln\hat{L}$

$\text{BIC} = k\ln(n) - 2\ln\hat{L}$

$\text{Mallows' } C_p = \frac{\text{RSS}}{\hat{\sigma}^2} - n + 2k$

where $n$ is the sample size, $k$ is the number of estimated coefficients in the candidate model (including the intercept), $\hat{L}$ is the candidate model's maximized likelihood, $\text{RSS}$ is its residual sum of squares, and $\hat{\sigma}^2$ is the error-variance estimate from the full model.

With Adjusted R-squared, you want to find the model with the largest Adjusted R-squared, because it explains the most variance in the dependent variable after penalizing for model complexity. For the others, you want to find the model with the smallest Information Criterion, because it is the model with the least unexplained variance in the dependent variable after penalizing for model complexity. They’re the same idea, i.e. maximizing something good versus minimizing something bad.
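
As a concrete illustration (a minimal sketch, not code from the original post, assuming numpy and statsmodels; the data and variable names are illustrative), these metrics can be read directly off a fitted OLS model:

```python
# Minimal sketch: fit one candidate sub-model and read off its fit metrics.
# Assumes numpy and statsmodels; the data and names here are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # three candidate predictors
y = 5 + 4 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=100)   # only two predictors matter

candidate = sm.OLS(y, sm.add_constant(X[:, :2])).fit()     # one candidate subset
print(candidate.rsquared_adj)  # larger is better
print(candidate.aic)           # smaller is better
print(candidate.bic)           # smaller is better
```

Mallows’ 𝐶𝑝 can be computed by hand from the candidate model's residual sum of squares and the full model's error-variance estimate.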

Both the AIC and Mallows’ 𝐶𝑝 tend to give better predictive models, while BIC tends to give models with fewer independent variables because it penalizes complex models more than the other two.

Like most things in life, automating model selection comes at a cost. If you use your data to select a linear model, the coefficients of the selected variables will be biased away from zero! The null hypotheses of both the individual t-tests for each coefficient and the F-test for overall model significance are based on the assumption that each coefficient is normally distributed with mean 0. Since we have introduced bias into our coefficients, the Type I error level increases for these tests! This may not be an issue if you just need a predictive model, but it completely invalidates any statistical inferences made with the selected model. AutoML may be able to generate decent predictive models, but inference still requires a person to think carefully about the problem and follow the scientific method.

Demonstrating the Bias of Best Subset Selection

I performed a simulation study to demonstrate the bias caused by best subset selection. Instead of looking at the bias in the coefficients, we will look at the bias in the estimated standard deviation of the error term in the model

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$

where the error terms $\epsilon_i$ are independently and identically distributed 𝑁(0,𝜎) random variables.

At each round of the simulation, a sample of 100 observations is generated from the same distribution. The true model, which contains only the truly significant variables, is estimated, along with the best subset models selected by AIC and by BIC. From each model, I estimate the 𝜎 of the error term using the formula

$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - k}}$

where $k$ is the number of estimated coefficients, including the intercept.

This is performed 500 times.

The particular parameters of my simulation are as follows: 𝑛 = 100, # of independent variables = 6, 𝜎 = 1, and the number of significant independent variables is 2. The intercept is significant as well, so 3 coefficients are non-zero. The non-zero coefficients are selected using 𝑁(5,1) random numbers because I am too lazy to define fixed numbers, but they remain fixed for all rounds of the simulation.
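
A minimal sketch of one round of data generation and the true-model fit under these parameters (assuming numpy and statsmodels; the variable names, seeds, and intercept draw are mine, not taken from the original code):

```python
# Sketch of one simulation round under the stated parameters.
# Assumed setup: numpy for data generation, statsmodels for the OLS fits.
import numpy as np
import statsmodels.api as sm

n, p, sigma, n_signif = 100, 6, 1.0, 2

rng = np.random.default_rng(42)
beta = np.zeros(p)
beta[:n_signif] = rng.normal(5, 1, size=n_signif)  # non-zero slopes, fixed for all rounds
intercept = rng.normal(5, 1)                       # non-zero intercept, also fixed

X = rng.normal(size=(n, p))
y = intercept + X @ beta + rng.normal(0, sigma, size=n)

# The "true" model uses only the intercept and the truly significant variables;
# sigma is estimated as sqrt(RSS / (n - k)), with k = number of estimated coefficients.
true_fit = sm.OLS(y, sm.add_constant(X[:, :n_signif])).fit()
sigma_hat_true = np.sqrt(true_fit.ssr / true_fit.df_resid)
```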

I first defined my own function to perform best subset selection with AIC or BIC using a naive approach: looking at every combination of variables. It only works for a small number of variables because the number of models it has to consider blows up as the number of variables increases. The number of models considered is

$\sum_{k=0}^{p} \binom{p}{k} = 2^p$

but smarter implementations of best subset selection use a tree search to reduce the number of models considered.

The graphs of interest are below these chunks of code for the best subset selection function and for the simulation.
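
Here is a hedged sketch of what the naive exhaustive search described above could look like, assuming numpy and statsmodels (the function name and interface are mine, not the author's):

```python
# Sketch of naive best subset selection: score every combination of columns
# with AIC or BIC and keep the best-scoring model. Assumed implementation,
# not the author's original code.
from itertools import combinations

import numpy as np
import statsmodels.api as sm


def best_subset(X, y, criterion="aic"):
    """Fit all 2^p sub-models of the columns of X and return the best one."""
    n, p = X.shape
    best_score, best_cols, best_fit = np.inf, (), None
    for k in range(p + 1):
        for cols in combinations(range(p), k):
            # Intercept-only model when no columns are selected.
            design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
            fit = sm.OLS(y, design).fit()
            score = fit.aic if criterion == "aic" else fit.bic
            if score < best_score:
                best_score, best_cols, best_fit = score, cols, fit
    return best_cols, best_fit
```

The 500-round simulation would then call this function with "aic" and with "bic" on each generated sample and record σ̂ = sqrt(RSS / (n − k)) from the true, best-AIC, and best-BIC fits, which is what the plots below compare.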

The red line is the line where the y-axis equals the x-axis; the x-axis is the unbiased estimate of 𝜎 from the true model. As you can see in the plots below, the estimates of 𝜎 from the models selected by best AIC and best BIC are biased. In fact, they will always be less than or equal to the unbiased estimate of 𝜎 from the true model. This demonstrates why models selected via best subset selection are invalid for inference.

[Plots: 𝜎 estimates from the best-AIC and best-BIC models plotted against the unbiased estimate from the true model, with the red y = x reference line.]

Bonus Section: Investigating Bias in the Estimated Standard Deviation of the Error Term in LASSO and Ridge Regression

While working on the simulation study above, I became interested in the potential bias that regularization methods introduce into estimates of the standard deviation of the error term in a linear model, although one wouldn’t use a regularized model to estimate a parameter for the purposes of inference. As you most likely know, LASSO and Ridge regression intentionally bias estimated coefficients towards zero to reduce the amount of variance in the model (how much estimated coefficients change from sample to sample drawn from the same population). The LASSO can set coefficients equal to zero, performing variable selection. Ridge regression biases coefficients towards zero but will not set them equal to zero, so it isn’t a variable selection tool like best subset selection or the LASSO.

I used the same setup as before, but upped the sample size from 100 to 200, the number of independent variables from 6 to 100, and the number of significant independent variables from 2 to 50. The shrinkage parameter in both the LASSO and Ridge models was chosen from 0.01, 0.1, 1.0, and 10.0 using 3-fold cross-validation. For the purposes of calculating 𝜎̂, I counted the number of non-zero coefficients in the LASSO model, and used all 100 coefficients, plus 1 for the intercept, for the Ridge model, since it biases coefficients towards zero but doesn’t set them to zero.

Obviously, regularized linear models are not valid for the purposes of inference because they bias estimates of coefficients. I still thought investigating any bias in the estimated standard deviation of the error term was worth writing a little code.

The plots are below this code chunk for the simulations.
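
A hedged sketch of one round of this bonus simulation, assuming scikit-learn for the penalized fits (the grid of shrinkage values matches the ones listed above; the names, seeds, and intercept value are illustrative, not the author's):

```python
# Sketch of one round of the bonus simulation (assumed setup, not the author's
# original code): 200 observations, 100 candidate predictors, 50 truly non-zero
# slopes, with the shrinkage parameter chosen by 3-fold cross-validation.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n, p, sigma, n_signif = 200, 100, 1.0, 50

beta = np.zeros(p)
beta[:n_signif] = rng.normal(5, 1, size=n_signif)   # illustrative non-zero slopes
X = rng.normal(size=(n, p))
y = 5 + X @ beta + rng.normal(0, sigma, size=n)     # illustrative intercept of 5

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
lasso = GridSearchCV(Lasso(max_iter=10_000), param_grid, cv=3).fit(X, y).best_estimator_
ridge = GridSearchCV(Ridge(), param_grid, cv=3).fit(X, y).best_estimator_

def sigma_hat(model, X, y, k):
    """sqrt(RSS / (n - k)), with k the number of coefficients counted for the model."""
    resid = y - model.predict(X)
    return np.sqrt(np.sum(resid ** 2) / (len(y) - k))

k_lasso = np.count_nonzero(lasso.coef_) + 1   # non-zero coefficients plus the intercept
k_ridge = p + 1                               # Ridge keeps all 100 coefficients plus the intercept
print(sigma_hat(lasso, X, y, k_lasso), sigma_hat(ridge, X, y, k_ridge))
```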

By visual inspection, 𝜎̂ appears biased downwards in the LASSO models, but the unbiased estimate doesn’t form an upper bound as it does with the best AIC and BIC models. The Ridge models do not show obvious bias in estimating this parameter. Let’s investigate with a paired t-test, since the estimates are derived from the same sample at each iteration. I’m using the standard p-value cutoff of 0.05, because I’m too lazy to decide on my desired power for the test.
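
The test itself is a one-liner with scipy; the sketch below uses stand-in arrays (not the actual simulation output) just to show the shape of the comparison:

```python
# Toy sketch of the paired t-test; the arrays here are stand-ins, not the
# 500 per-round estimates from the simulation.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
sigma_true = rng.normal(1.00, 0.05, size=500)                         # stand-in: true-model estimates
sigma_lasso = sigma_true - np.abs(rng.normal(0.02, 0.01, size=500))   # stand-in: LASSO estimates

t_stat, p_value = ttest_rel(sigma_true, sigma_lasso)
print(t_stat, p_value)   # p < 0.05 -> evidence of a systematic difference in means
```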

[Plots and paired t-test output: 𝜎̂ from the LASSO and Ridge models compared with the unbiased estimate from the true model.]

As the visual inspection suggested, there is insufficient evidence of a difference in means between the estimates of 𝜎̂ from the true and Ridge models. However, there is sufficient evidence at the 0.05 significance level to conclude that the LASSO models tended to make downwardly biased estimates of 𝜎̂. Whether or not this is a generalizable fact is unknown; reaching a conclusion would require a formal proof.

Thanks for making it to the end. Although using the data to select a model invalidates classical inference assumptions, post-selection inference is a hot area of statistical research. Perhaps we’ll be talking about AutoInference in a few years.

All of my code for this project can be found here.

Source: https://towardsdatascience.com/origins-of-automl-best-subset-selection-1c40144d86df
