Univariate linear regression models: learn how to select the best-performing linear regression for univariate models

by Björn Hartmann


Find out which linear regression model is the best fit for your data

Inspired by a question after my previous article, I want to tackle an issue that often comes up after trying different linear models: you need to choose which model to use. More specifically, Khalifa Ardi Sidqi asked:

“How to determine which model suits best to my data? Do I just look at the R square, SSE, etc.?
As the interpretation of that model (quadratic, root, etc.) will be very different, won’t it be an issue?”

The second part of the question can be answered easily. First, find the model that best suits your data, and then interpret its results. It helps if you have ideas about how your data might be explained. However, interpret only the best model.

The rest of this article addresses the first part of his question. Please note that I will share my own approach to selecting a model. There are multiple ways, and others might do it differently. But I will describe the way that works best for me.

In addition, this approach only applies to univariate models. Univariate models have just one input variable. I am planning a further article in which I will show you how to assess multivariate models with more input variables. For today, however, let us focus on the basics and univariate models.

To practice and get a feeling for this, I wrote a small Shiny app. Use it and play around with different datasets and models. Notice how the parameters change, and become more confident in assessing simple linear models. Finally, you can also use the app as a framework for your own data. Just copy it from GitHub.

Use the adjusted R² for univariate models

If you only use one input variable, the adjusted R² value gives you a good indication of how well your model performs. It indicates how much of the variation in your data is explained by your model.

In contrast to the simple R², the adjusted R² takes the number of input factors into account. It penalizes too many input factors and favors parsimonious models.
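To make the difference between the two measures concrete, here is a minimal sketch in Python with NumPy. It is not taken from the Shiny app: the data are synthetic, the helper name `r2_scores` is hypothetical, and counting each polynomial term as one input factor is an assumption made for illustration.

```python
import numpy as np

def r2_scores(y, y_hat, n_inputs):
    """Return (R2, adjusted R2) for predictions y_hat built from n_inputs terms."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)        # sum of squared errors
    sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    r2 = 1.0 - sse / sst
    # The adjustment penalizes every additional input term.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_inputs - 1)
    return r2, adj_r2

# Synthetic data (an assumption for illustration): a cubic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 60)
y = 0.5 * x**3 - 4 * x**2 + 3 * x + rng.normal(0, 10, x.size)

results = {}
for degree in (1, 3):
    coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    results[degree] = r2_scores(y, y_hat, n_inputs=degree)
    print(f"degree {degree}: R2={results[degree][0]:.3f}, "
          f"adjusted R2={results[degree][1]:.3f}")
```

On the same dataset, the model with the higher adjusted R² is the better candidate; the adjusted value is never larger than the plain R².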

In the screenshot above, you can see two models with values of 71.3% and 84.32%. By this measure, the second model is better than the first one. Models with low values, however, can still be useful because the adjusted R² is sensitive to the amount of noise in your data. As such, compare this indicator only between models for the same dataset, not across different datasets.

Usually, there is little need for the SSE

Before you read on, let’s make sure we are talking about the same SSE. On Wikipedia, SSE refers to the sum of squared errors. In some statistics textbooks, however, SSE can refer to the explained sum of squares (the exact opposite). So for now, suppose SSE refers to the sum of squared errors.

With that definition, the adjusted R² is approximately 1 - SSE/SST, where SST refers to the total sum of squares.
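For reference, the standard textbook relationship can be written out as follows. The symbols n (number of observations) and p (number of input factors) are not used in the article and are introduced here only for illustration.

```latex
R^2 = 1 - \frac{SSE}{SST}, \qquad
R^2_{\text{adj}} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}
                 = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}
```

When n is large relative to p, the correction factor is close to one, which is why the adjusted R² is approximately 1 - SSE/SST.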

I do not want to dive deeper into the math behind this. What I want to show you is that the adjusted R² is computed from the SSE. So the SSE usually does not give you any additional information.

Furthermore, the adjusted R² is normalized: it never exceeds one, and for any reasonably fitting model it lies between zero and one (a very poor fit can push it below zero). So it is easier for you and others to interpret an unfamiliar model with an adjusted R² of 75% than one with an SSE of 394, even though both figures might describe the same model.

Have a look at the residuals or error terms!

Error terms, the so-called residuals, are often ignored. They frequently tell you more than you might think.

The residuals are the differences between the actual values and your predicted values.

Their benefit is that they show you both the magnitude and the direction of your errors. Let’s have a look at an example:

Here, I tried to predict a polynomial dataset with a linear function. Analyzing the residuals shows that there are areas where the model has an upward or downward bias.

For 50 < x < 100, the residuals are above zero. So in this area, the actual values are higher than the predicted values: our model has a downward bias.

For 100 < x < 150, however, the residuals are below zero. Thus, the actual values are lower than the predicted values: the model has an upward bias.

It is always good to know whether your model suggests values that are too high or too low. But you usually do not want to see patterns like this.
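The pattern described above can be reproduced with a small sketch. The data are synthetic (a cubic curve centred at x = 100, chosen so that the two bands behave as in the example), not the dataset from the screenshots, and the helper `mean_residual` is a hypothetical name.

```python
import numpy as np

# Synthetic data (assumed for illustration): a cubic trend with noise.
rng = np.random.default_rng(1)
x = np.arange(0, 201, dtype=float)
y = 1e-3 * (x - 100) ** 3 + rng.normal(0, 20, x.size)

# Fit a straight line and compute residuals as actual minus predicted.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

def mean_residual(lo, hi):
    """Average residual for lo < x < hi."""
    mask = (x > lo) & (x < hi)
    return residuals[mask].mean()

low_band = mean_residual(50, 100)    # positive: model predicts too low here
high_band = mean_residual(100, 150)  # negative: model predicts too high here
overall = residuals.mean()           # close to zero for a fit with intercept

print(f"50 < x < 100:  {low_band:.1f}")
print(f"100 < x < 150: {high_band:.1f}")
print(f"overall mean:  {overall:.2e}")
```

The overall mean is close to zero even though whole regions are biased, which is why looking at the residuals by region (or plotting them against x) is more informative than the mean alone.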

The residuals should be zero on average (as indicated by the mean), and they should be evenly distributed around zero. Predicting the same dataset with a polynomial function of degree 3 suggests a much better fit:

In addition, you can observe whether the variance of your errors increases across the range of x. In statistics, this is called heteroscedasticity. You can fix this easily with robust standard errors. Otherwise, your hypothesis tests are likely to be wrong.
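As a rough illustration of both ideas, the sketch below generates data whose noise grows with x, compares the residual variance in the two halves of the data, and computes White (HC0) robust standard errors next to the classical ones. The data and the half-split check are assumptions for illustration; a dedicated statistics package would normally be used in practice.

```python
import numpy as np

# Synthetic data (assumed): the noise standard deviation grows with x.
rng = np.random.default_rng(2)
n = 400
x = np.linspace(1, 100, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1 * x)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
resid = y - X @ beta

# Crude check: residual variance for large x relative to small x.
half = n // 2
var_ratio = resid[half:].var() / resid[:half].var()

XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors assume a constant error variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 robust standard errors: sandwich estimator with squared residuals.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(f"variance ratio (upper/lower half): {var_ratio:.1f}")
print("classical SE:", se_classical)
print("robust SE:   ", se_robust)
```

A variance ratio well above one signals that the errors spread out as x grows. The coefficients themselves are unchanged; only the standard errors, and therefore the hypothesis tests, differ.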

Histogram of residuals

Finally, the histogram summarizes the magnitude of your error terms. It shows the range (bandwidth) of the errors and how often errors of each size occurred.

The above screenshots show two models for the same dataset. In the left histogram, the errors range from -338 to 520.

In the right histogram, the errors range from -293 to 401. So the most extreme errors are considerably smaller. Furthermore, most errors in the model of the right histogram are closer to zero. So I would favor the right model.
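The same comparison can be made numerically. The sketch below uses two made-up residual vectors (not the values from the screenshots) and a hypothetical helper `summarize_errors` that reports the error range, the histogram counts, and the share of errors close to zero.

```python
import numpy as np

def summarize_errors(residuals, bins=10, near=100.0):
    """Summarize residuals: range, histogram, and share within +/- near."""
    counts, edges = np.histogram(residuals, bins=bins)
    return {
        "min": float(np.min(residuals)),
        "max": float(np.max(residuals)),
        "share_near_zero": float(np.mean(np.abs(residuals) <= near)),
        "counts": counts,
        "edges": edges,
    }

# Hypothetical residuals of two models on the same dataset.
rng = np.random.default_rng(3)
model_a = rng.normal(0, 150, 500)   # wider spread of errors
model_b = rng.normal(0, 100, 500)   # tighter spread of errors

summary_a = summarize_errors(model_a)
summary_b = summarize_errors(model_b)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(f"model {name}: range [{s['min']:.0f}, {s['max']:.0f}], "
          f"share within +/-100: {s['share_near_zero']:.0%}")
```

The model with the narrower range and the larger share of errors near zero is the one to prefer, mirroring the visual comparison of the two histograms.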

Summary

When choosing a linear model, keep these factors in mind:

  • Only compare linear models for the same dataset.
  • Find a model with a high adjusted R².
  • Make sure this model has residuals that are evenly distributed around zero.
  • Make sure the errors of this model are within a small bandwidth.

If you have any questions, write a comment below or contact me. I appreciate your feedback.

Translated from: https://www.freecodecamp.org/news/learn-how-to-select-the-best-performing-linear-regression-for-univariate-models-e9d429c40581/
