Origins of AutoML: Best Subset Selection

As there is a lot of buzz about AutoML, I decided to write about the original AutoML: step-wise regression and best subset selection. Then I decided to ignore step-wise regression because it is bad and should probably stop being taught. That leaves best subset selection to discuss.

The idea behind best subset selection is to choose the “best” subset of variables to include in a model, evaluating groups of variables together rather than one at a time as step-wise regression does. We determine which set of variables is “best” by assessing which sub-model fits the data best while penalizing the number of independent variables in the model to avoid over-fitting. There are multiple metrics for assessing how well a model fits: adjusted 𝑅-squared, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Mallows’s 𝐶𝑝 are probably the best known.

The formulas for each are below.

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

$$\text{AIC} = 2k - 2\ln(\hat{L}) \qquad \text{BIC} = k\ln(n) - 2\ln(\hat{L}) \qquad C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2k$$

where $n$ is the sample size, $p$ the number of independent variables, $k$ the number of estimated coefficients (intercept included), $\hat{L}$ the maximized likelihood, $SSE_p$ the candidate model’s sum of squared errors, and $\hat{\sigma}^2$ the error variance estimated from the full model.

With adjusted 𝑅-squared, you want to find the model with the largest adjusted 𝑅-squared, because it explains the most variance in the dependent variable after being penalized for model complexity. For the others, you want to find the model with the smallest information criterion, because it is the model with the least unexplained variance in the dependent variable after being penalized for model complexity. It is the same idea either way: maximizing something good versus minimizing something bad.
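
With statsmodels, the first three metrics are available directly on a fitted OLS model, and Mallows’s 𝐶𝑝 takes one extra line. Below is a minimal sketch on toy data of my own (the data and names are illustrative, not from the article):

```python
import numpy as np
import statsmodels.api as sm

# Toy data, purely illustrative: 100 observations, 3 candidate predictors,
# only the first of which actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# statsmodels exposes adjusted R-squared, AIC, and BIC directly.
print(fit.rsquared_adj, fit.aic, fit.bic)

# Mallows's Cp compares a candidate model's SSE against sigma^2 estimated
# from the full model; k is the number of estimated coefficients,
# intercept included. Here the candidate IS the full model, so Cp = k.
n, k = len(y), X.shape[1] + 1
sse = np.sum(fit.resid ** 2)
cp = sse / fit.mse_resid - n + 2 * k
print(cp)
```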

Both the AIC and Mallows’s 𝐶𝑝 tend to give better predictive models, while the BIC tends to give models with fewer independent variables because it penalizes complex models more heavily than the other two.

Like most things in life, automating model selection comes at a cost. If you use your data to select a linear model, the coefficients of the selected variables will be biased away from zero! The null hypotheses of both the individual t-tests for each coefficient and the F-test for overall model significance assume that each estimated coefficient is normally distributed with mean 0. Since we have introduced bias into our coefficients, the Type I error rate of these tests increases! This may not be an issue if you just need a predictive model, but it completely invalidates any statistical inferences made with the selected model. AutoML may be able to generate decent predictive models, but inference still requires a person to think carefully about the problem and follow the scientific method.

Demonstrating the Bias of Best Subset Selection

I performed a simulation study to demonstrate the bias caused by best subset selection. Instead of looking at the bias in the coefficients, we will look at the bias in the estimated standard deviation of the error term in the model

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

where the error terms are independently and identically distributed 𝑁(0, 𝜎) random variables.

At each round of the simulation, a sample of 100 observations is generated from the same distribution. The true model, which contains only the truly significant variables, is estimated, along with the best subset models selected by AIC and BIC. From each model, I estimate the 𝜎 of the error term using the formula

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - k}}$$

where $k$ is the number of coefficients estimated in that model, intercept included.

This is performed 500 times.

The particular parameters of my simulation are as follows: 𝑛 = 100, the number of independent variables is 6, 𝜎 = 1, and the number of significant independent variables is 2. The intercept is significant as well, so 3 coefficients are non-zero. The non-zero coefficients are drawn as 𝑁(5, 1) random numbers because I am too lazy to define fixed numbers, but they remain fixed for all rounds of the simulation.
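
Here is a sketch of one simulation round under these settings; the structure and names are my own reconstruction, not the author’s code:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

n, p, sigma = 100, 6, 1.0
beta = np.zeros(p + 1)
beta[:3] = rng.normal(5, 1, size=3)  # intercept + 2 significant slopes,
                                     # drawn once and fixed for all rounds

def one_round():
    X = sm.add_constant(rng.normal(size=(n, p)))
    y = X @ beta + rng.normal(0, sigma, size=n)
    # True model: intercept plus the two truly significant variables only.
    true_fit = sm.OLS(y, X[:, :3]).fit()
    # Estimate sigma as sqrt(SSE / (n - k)), with k = 3 estimated coefficients.
    return np.sqrt(np.sum(true_fit.resid ** 2) / (n - 3))

sigma_hats = [one_round() for _ in range(500)]
```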

I first defined my own function to perform best subset selection with AIC or BIC, taking the naive approach of fitting every combination of variables. It only works for a small number of variables, because the number of models it has to consider blows up as the number of variables increases. The number of models considered is

$$\sum_{k=0}^{p} \binom{p}{k} = 2^p$$

but smarter implementations of best subset selection use a tree search to reduce the number of models considered.

The graphs of interest are below these chunks of code for the best subset selection function and for the simulation.
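
A minimal sketch of such a naive selector, assuming statsmodels OLS (illustrative, not the author’s exact code):

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

def best_subset(X, y, criterion="aic"):
    """Naive best subset selection: fit OLS on every subset of the
    columns of X and return the fit with the lowest AIC or BIC."""
    n, p = X.shape
    best_fit, best_score = None, np.inf
    for k in range(p + 1):                      # subset sizes 0 through p
        for cols in combinations(range(p), k):  # every subset of size k
            design = (sm.add_constant(X[:, list(cols)])
                      if cols else np.ones((n, 1)))  # intercept-only model
            fit = sm.OLS(y, design).fit()
            score = fit.aic if criterion == "aic" else fit.bic
            if score < best_score:
                best_fit, best_score = fit, score
    return best_fit
```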

The red line is the line where the y-axis equals the x-axis, which is the unbiased estimate of 𝜎 from the true model. As you can see in the plots below, the estimates of 𝜎 from the models selected by best AIC and BIC are biased. In fact, they will always be less than or equal to the unbiased estimate of 𝜎 from the true model. This demonstrates why models selected via best subset selection are invalid for inference.

[Figure: 𝜎̂ from the best-AIC and best-BIC models plotted against the unbiased estimate from the true model; the red line marks y = x]

Bonus Section: Investigating Bias in the Estimated Standard Deviation of the Error Term in LASSO and Ridge Regression

While working on the simulation study above, I became interested in the potential bias of regularization methods on estimates of the standard deviation of the error term in a linear model, although one wouldn’t use a regularized model to estimate a parameter for the purposes of inference. As you most likely know, LASSO and Ridge regression intentionally bias estimated coefficients towards zero to reduce the amount of variance in the model (how much estimated coefficients change from sample to sample from the same population). The LASSO can set coefficients equal to zero, performing variable selection. Ridge regression biases coefficients towards zero, but will not set them equal to zero, so it isn’t a variable selection tool like best subset selection or the LASSO.
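
That difference is easy to see with scikit-learn on toy data of my own (illustrative only):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: 200 observations, 20 predictors, only the first one matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] + rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # LASSO zeroes out many coefficients
print(np.sum(ridge.coef_ == 0))  # Ridge shrinks them but leaves none at zero
```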

I used the same setup as before, but upped the sample size from 100 to 200, the number of independent variables from 6 to 100, and the number of significant independent variables from 2 to 50. The shrinkage parameter in both the LASSO and Ridge models was chosen from among 0.01, 0.1, 1.0, and 10.0 using 3-fold cross-validation. For the purposes of calculating 𝜎̂, I counted the number of non-zero coefficients in the LASSO model, and used all 100 variables, plus 1 for the intercept, for the Ridge model, since Ridge biases coefficients towards zero but doesn’t set them to zero.

Obviously, regularized linear models are not valid for the purposes of inference because they bias estimates of coefficients. I still thought investigating any bias in the estimated standard deviation of the error term was worth writing a little code.

The plots are below this code chunk for the simulations.
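
A sketch of what one round of this simulation might look like with scikit-learn’s LassoCV and RidgeCV; this is my own reconstruction under the stated settings, not the author’s code:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
n, p, sigma = 200, 100, 1.0
alphas = [0.01, 0.1, 1.0, 10.0]

intercept = rng.normal(5, 1)
beta = np.zeros(p)
beta[:50] = rng.normal(5, 1, size=50)  # 50 truly significant variables

def one_round():
    X = rng.normal(size=(n, p))
    y = intercept + X @ beta + rng.normal(0, sigma, size=n)

    # Shrinkage parameter chosen among the four candidates by 3-fold CV.
    lasso = LassoCV(alphas=alphas, cv=3).fit(X, y)
    ridge = RidgeCV(alphas=alphas, cv=3).fit(X, y)

    # k counts estimated coefficients, intercept included; ridge never
    # zeroes a coefficient, so all 100 variables count.
    k_lasso = np.sum(lasso.coef_ != 0) + 1
    k_ridge = p + 1
    sse_lasso = np.sum((y - lasso.predict(X)) ** 2)
    sse_ridge = np.sum((y - ridge.predict(X)) ** 2)
    return (np.sqrt(sse_lasso / (n - k_lasso)),
            np.sqrt(sse_ridge / (n - k_ridge)))

estimates = [one_round() for _ in range(500)]
```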

By visual inspection, 𝜎̂ appears biased downwards in the LASSO models, but the unbiased estimate doesn’t form an upper bound as it does with the best AIC and BIC models. The Ridge models do not show obvious bias in estimating this parameter. Let’s investigate with a paired t-test, since the estimates are derived from the same sample at each iteration. I’m using the standard p-value cutoff of 0.05, because I’m too lazy to decide my desired power of the test.
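
The test itself is a single call to scipy. The array names below are assumptions carried over from the sketch above, with dummy data standing in for the real 500 paired estimates so the snippet runs on its own:

```python
import numpy as np
from scipy import stats

# sigma_true, sigma_lasso, sigma_ridge would be the 500 paired estimates
# from the simulations above; dummy arrays stand in here.
rng = np.random.default_rng(2)
sigma_true = rng.normal(1.0, 0.05, size=500)
sigma_lasso = sigma_true - np.abs(rng.normal(0.02, 0.01, size=500))
sigma_ridge = sigma_true + rng.normal(0.0, 0.01, size=500)

# Paired t-test: each pair of estimates comes from the same simulated sample.
print(stats.ttest_rel(sigma_true, sigma_lasso))
print(stats.ttest_rel(sigma_true, sigma_ridge))
```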

[Figures: 𝜎̂ from the LASSO and Ridge models plotted against the unbiased estimate from the true model, with paired t-test results]

As guessed from the visual inspection, there is insufficient evidence for a difference in means between the estimates of 𝜎̂ from the true and Ridge models. However, there is sufficient evidence at the 0.05 significance level to conclude that the LASSO models tended to make downwardly biased estimates of 𝜎̂. Whether or not this is a generalizable fact is unknown. It would require a formal proof to make a conclusion.

Thanks for making it to the end. Although using the data to select a model invalidates classical inference assumptions, post-selection inference is a hot area of statistical research. Perhaps we’ll be talking about AutoInference in a few years.

All of my code for this project can be found here.

Translated from: https://towardsdatascience.com/origins-of-automl-best-subset-selection-1c40144d86df
