What Does “Zero” Really Mean?

When I was an undergraduate student studying Data Science, one of my professors always asked the same question for every data set we worked with — “What does zero mean?”

On the surface, this seems trivial. If the data set records how many apples each student has, zero means that a student has no apples.

Then why ask the question?

Well, zero can mean zero… but it can also mean a slew of other things, and if you’re not careful, ignoring it could come back to haunt you.

It is easy to disregard missing data. Whether the values are 0, NULL, NA, or blank, we are often quick to ignore these records because they “lack information”. However, these data points can be critical to our problem, and the “lack” of information is itself information.
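
One way to keep that distinction visible is to parse raw fields so that a missing value never silently becomes 0. The following is a minimal sketch, not a prescribed method: the function name `parse_amount`, the set of tokens treated as missing, and the sample values are all assumptions made for illustration.

```python
from typing import Optional

# Tokens treated as "missing" are an assumption; adjust them to the source system.
MISSING_TOKENS = {"", "null", "na", "n/a", "none"}

def parse_amount(raw: Optional[str]) -> Optional[float]:
    """Return None for a missing field and a float (possibly 0.0) otherwise."""
    if raw is None:
        return None
    text = raw.strip()
    if text.lower() in MISSING_TOKENS:
        return None
    return float(text)

# Hypothetical raw field values, as they might arrive from an export.
raw_values = ["0", "", "NA", "1500", None, "NULL"]
parsed = [parse_amount(v) for v in raw_values]

# Count zeros and missing values separately instead of lumping them together.
zero_count = sum(1 for v in parsed if v == 0)
missing_count = sum(1 for v in parsed if v is None)
print(parsed, zero_count, missing_count)
```

Keeping `None` distinct from `0.0` means the question "what does zero mean?" can still be asked later, column by column.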

Let’s consider the following scenario. We are consulting for a bank and they want us to determine if a customer is likely to default on their credit card payments. Below is a sample of the data we are given to evaluate this problem.

We see that there are five variables we could use to make predictions and of those five, four contain values of 0, blank, or in some cases both.

It might be easy to ignore these values or chalk them up to some kind of error in the bank’s system, but let’s take a closer look and see if there might be more to the story.

Our first variable with missing data is Credit Score. Two of the eight customers have no credit score value. While it may seem like these values were skipped over, notice that these two customers are 23 and 20 years old. Since they are relatively young, there is a good chance they only recently opened a credit card and consequently do not have a credit score yet. It might be tempting to backfill these records with a value of 0, but that would not make sense either: a 0 would read as a very poor score, when in fact we don’t know how these customers will perform. How would we handle this in a real-life scenario? One approach would be to find the average score of individuals of similar age and use that value for our missing records.
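
The averaging approach described above could look roughly like the sketch below. The records are hypothetical (they are not the bank's data), and the ten-year age bands, the field names, and the `score_was_missing` flag are assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical records, not the bank's data. None marks a missing credit score.
customers = [
    {"age": 20, "credit_score": None},
    {"age": 23, "credit_score": None},
    {"age": 22, "credit_score": 640},
    {"age": 24, "credit_score": 680},
    {"age": 41, "credit_score": 720},
    {"age": 45, "credit_score": 760},
]

def age_band(age, width=10):
    """Lower bound of the age band, e.g. 23 -> 20 for a width of 10."""
    return (age // width) * width

# Collect the known scores in each age band.
scores_by_band = {}
for c in customers:
    if c["credit_score"] is not None:
        scores_by_band.setdefault(age_band(c["age"]), []).append(c["credit_score"])

band_means = {band: mean(scores) for band, scores in scores_by_band.items()}

imputed = []
for c in customers:
    row = dict(c)
    # Remember that the value was imputed so the information is not lost.
    row["score_was_missing"] = c["credit_score"] is None
    if row["score_was_missing"]:
        # Stays None if the band has no known scores to average.
        row["credit_score"] = band_means.get(age_band(c["age"]))
    imputed.append(row)

for row in imputed:
    print(row)
```

The extra flag is a design choice: it lets a later model or analyst see that a score was filled in, so the fact that it was missing is kept alongside the estimate.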

The next column with missing data is Missed Payments. This time we have records with both missing data and values of 0. Intuitively, a value of 0 indicates that a customer has never missed a payment. In this case, 0 really does mean 0. What about the missing values? As we discussed for Credit Score, there might be other factors affecting this variable. Notice again that our missing record is for a customer who is only 20 years old. Given that they also do not have a credit score, we might infer that they have never missed a payment because they have not yet had the chance to make one. If our customer only opened their account this month, then they would not have had a payment due and consequently could not have missed one either.

Moving on to our final two variables, Credit Limit and Payment Due, we again see both missing values and values of 0. The values of 0 are fairly intuitive: $0 is a plausible amount in these cases. Our missing data, however, poses a bigger question. How can an individual have no value for a credit limit? Are they allowed to spend as much as they want? This same individual also has no payment due… how does that work?

Let’s approach this the same way as our other scenarios, by looking at the rest of the customer’s information. First, we can see this individual has a much lower credit score than all our other customers, including the 23-year-old (higher scores are better for credit). We also see that they have missed 8 payments, double that of the next-highest customer. Hmmm… so why would they have no credit limit?

One possible answer: this individual performed so poorly that the bank decided to terminate their account. As a result, the customer is still in our database but is no longer active with the bank. Consequently, they cannot spend any money and would not be required to make payments either.
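
If that explanation were plausible, such records could be set aside for review rather than silently filled in. The sketch below uses hypothetical records and field names, and the rule in `looks_closed` is only a heuristic assumption; whether blank fields really indicate a closed account would need to be confirmed with the bank.

```python
# Hypothetical records; None marks a blank field.
accounts = [
    {"id": 1, "credit_limit": 5000, "payment_due": 0},
    {"id": 2, "credit_limit": 0, "payment_due": 0},
    {"id": 3, "credit_limit": None, "payment_due": None},
]

def looks_closed(account):
    """Heuristic (an assumption): both fields blank suggests no active account."""
    return account["credit_limit"] is None and account["payment_due"] is None

# Separate records to question from records that can be used as they are.
needs_review = [a["id"] for a in accounts if looks_closed(a)]
active = [a["id"] for a in accounts if not looks_closed(a)]
print(needs_review, active)
```

Note that account 2, with a limit and payment of 0, stays in the active group: its zeros are real values, not blanks.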

All of the above are obviously hypothetical scenarios and explanations as to why data might be missing, why it might be zero, and how we might handle it. Data can be missing for a number of reasons, but understanding why it is missing or zero can be critical to learning more and making better decisions. In a real-world scenario, you might be able to go back to the bank and ask clarifying questions about the data to verify whether your assumptions are correct. Of course, there are plenty of times when that will not be an option either.

In the modern world of data science and machine learning, we often work with models that cannot handle missing values, so we are forced to deal with this data in another way. While it can be easy to simply drop these records or impute averages or medians, we should also take time to consider what these missing values represent. Imagine if we had simply imputed a value of 0 for every blank credit score in our bank example. A model might have made horrible predictions because we falsely assumed that 0 and missing were the same when, in this case, they were not. Similarly, if we had backfilled our missing credit limit values with 0, we would have included in the model at least one customer who had already defaulted, while trying to predict whether that same customer would default.
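
To see how much a careless fill can distort even a simple summary, the sketch below compares the average credit score computed from known values only with the average after treating blanks as 0. The scores are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical credit scores; None marks a blank value.
scores = [700, 650, None, 720, None, 580]

# Option 1: average only the scores we actually know.
known = [s for s in scores if s is not None]
mean_known = mean(known)

# Option 2: pretend every blank is a score of 0.
zero_filled = [0 if s is None else s for s in scores]
mean_zero_filled = mean(zero_filled)

print(mean_known)                  # 662.5
print(round(mean_zero_filled, 2))  # 441.67
```

Two blanks out of six records pull the average down by more than 200 points here, and any model fed the zero-filled column would see the same distortion.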

Sometimes zero really is zero. Sometimes a missing value is simply human error from a failed data entry job. Sometimes, there’s a much deeper story going on. To quote my professor, “What the heck does zero mean?”

Translated from: https://medium.com/@ivanecky/what-the-heck-does-zero-mean-8c5f42266dc6
