All About Outliers

An outlier is a data point that lies far from the other observations in a data set, outside the overall distribution of the data. In lay terms, an outlier is a value that behaves differently from the rest of the data.

Outliers can be very informative about the subject area and the data collection process. It is essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. To understand outliers, we need to cover these points:

  • What causes outliers?
  • Impact of outliers
  • Methods to identify outliers

What causes outliers?

Before dealing with outliers, one should know what causes them. There are three main causes: data entry or experimental measurement errors, sampling problems, and natural variation.

1. Data entry or experimental measurement errors

Errors can occur while running an experiment or entering data. During data entry, a typo can record the wrong value. Consider a dataset of ages in which one person's age is recorded as 356, which is impossible. This is a data entry error.

These types of errors are usually easy to identify. If you determine that an outlier value is an error, you can fix it by deleting the data point, because you know the value is incorrect.
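A simple plausibility check can catch data entry errors like the age of 356 mentioned above. The sketch below is illustrative only: the function name `find_implausible_ages`, the sample values, and the 0 to 120 range are assumptions, not part of the original article.

```python
def find_implausible_ages(ages, min_age=0, max_age=120):
    """Return (index, value) pairs for ages outside the assumed plausible range."""
    return [(i, age) for i, age in enumerate(ages)
            if not (min_age <= age <= max_age)]


# Hypothetical sample; 356 mimics the typo described in the text.
ages = [23, 45, 31, 356, 52, 18]
flagged = find_implausible_ages(ages)
print(flagged)  # [(3, 356)]
```

Returning the index along with the value makes it easy to go back to the source record and decide whether to correct or delete it.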

2. Sampling problems


Outliers can occur when collecting random samples if a sample includes a subject from outside the target population. Suppose we have bone density records for a group of subjects, and one subject shows unusual bone growth. On investigation, it turns out that the subject had diabetes, which affects bone health. The goal of the study was to model bone density growth in girls with no health conditions that affect bone growth. Because this subject is not part of the target population, we exclude the data point from the analysis.

3. Natural variation


Suppose we need to check the reliability of a machine. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during part of the study, such as a power failure or a machine setting drifting from its standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect the target population of products from the normal process, so you can legitimately remove these data points from your dataset. By contrast, an extreme value produced by the normal process itself is genuine natural variation and should generally be kept.

Impact of outliers

Outliers can change the results of data analysis and statistical modeling. Some of their effects on a data set are:

  • They can have a significant impact on the mean and the standard deviation.
  • If the outliers are non-randomly distributed, they can decrease normality.
  • They can bias or influence estimates that may be of substantive interest.
  • They can violate the basic assumptions of regression, ANOVA, and other statistical models.

To see the impact concretely, let's compare the summary statistics of a data set with and without an outlier. Take the sample data set:

1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4

We find the following mean, median, mode, and (sample) standard deviation:

Mean = 2.45

Median = 2

Mode = 2

Standard Deviation = 1.04

If we add an outlier to the data set:

1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400

The new values of our statistics are:

Mean = 35.58

Median = 2.5

Mode = 2

Standard Deviation = 114.77

As you can see, a single outlier has a large effect on the mean and the standard deviation, while the median and the mode are barely affected.
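The comparison can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the helper name `summarize` is made up for illustration, and it assumes the sample standard deviation (`statistics.stdev`, which divides by n - 1) is the one intended.

```python
import statistics

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
with_outlier = data + [400]


def summarize(values):
    """Return mean, median, mode, and sample standard deviation of values."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": round(statistics.stdev(values), 2),  # sample SD (n - 1)
    }


print("without outlier:", summarize(data))
print("with outlier:   ", summarize(with_outlier))
```

Comparing the two printed summaries shows which statistics are sensitive to the extreme value and which are robust.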

Methods to identify outliers

There are various ways to identify outliers in a dataset. Some of them are:

  1. Sorting the data
  2. Using graphical methods
  3. Using the z-score
  4. Using the interquartile range (IQR)

Sorting the data

Sorting the dataset is the simplest way to check for unusual values, and it is often effective. Consider an example age dataset:

[Figure: sorted age dataset; image not available]

In the dataset above, sorting the ages reveals that 398 is an outlier. Sorting is most effective on small datasets.
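As a rough illustration of the sorting approach, the sketch below sorts a list of ages and prints both ends so that extreme values stand out. The values are hypothetical, chosen only so that 398 appears as in the text; the original dataset is not available.

```python
# Hypothetical ages; 398 mirrors the outlier mentioned in the text.
ages = [25, 31, 398, 22, 47, 36, 29]

sorted_ages = sorted(ages)

# Extreme values collect at the two ends of the sorted list.
print("smallest:", sorted_ages[:3])
print("largest: ", sorted_ages[-3:])
```

Inspecting only the first and last few values keeps the check practical, although it still relies on a human judging what counts as unusual.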

Using graphical methods

We can detect outliers with the help of graphical representations such as scatter plots and box plots.

1. Scatter plot

Scatter plots often show a pattern. We call a data point an outlier if it does not fit the pattern. Consider a scatter plot of weight vs. height (figure not reproduced here) in which two of the points do not fit the pattern very well. There is no special rule that tells us whether a point is an outlier in a scatter plot. For more advanced statistics, it may be helpful to adopt a precise definition of "outlier".

2. Box plot

A box plot is one of the most effective ways of identifying outliers in a dataset. In a box plot, an outlier is a data point that lies beyond the whiskers, that is, more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. Such points are drawn individually, as in a box plot of bill vs. days (figure not reproduced here). The box plot thus uses the IQR to detect outliers.

Using the z-score

The z-score (also called the standard score) tells you how many standard deviations a data point lies from the mean. More precisely, it measures how many standard deviations below or above the population mean a raw score is.

Z-score = (x - mean) / standard deviation

In a normal distribution, approximately:

68% of the data points lie within ±1 standard deviation of the mean.

95% of the data points lie within ±2 standard deviations of the mean.

99.7% of the data points lie within ±3 standard deviations of the mean.
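These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). The helper name `fraction_within` below is made up for this sketch.

```python
import math


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k} SD: {fraction_within(k):.4f}")
```

The printed values round to the familiar 68%, 95%, and 99.7% figures.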

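In symbols, with observation X, population mean μ, and population standard deviation σ:

z = (X - μ) / σ

Because about 99.7% of normally distributed data lies within three standard deviations of the mean, a common rule of thumb treats points with |z| > 3 as potential outliers.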
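A minimal sketch of z-score based detection follows. The function name `zscore_outliers`, the threshold of 3, and the use of the population standard deviation (`statistics.pstdev`) are assumptions made for illustration; the data is the small example set with the value 400 used earlier in the article.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population SD, matching sigma in the formula
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in values if abs((x - mean) / std) > threshold]


data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400]
print(zscore_outliers(data))  # [400]
```

Note that in very small samples the z-score of a single extreme point is bounded, so a threshold of 3 may never trigger; a lower threshold or a different method may be more suitable there.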

Let us consider a dataset:

[Figure: example dataset; image not available]

Using the interquartile range (IQR)

The interquartile range (IQR) is the width of the box in a box plot, and it measures how spread out the middle half of the values is. An outlier is any value that lies more than one and a half times the length of the box beyond either end of the box.

Steps

  1. Arrange the data in increasing order.
  2. Calculate the first quartile (q1) and the third quartile (q3).
  3. Find the interquartile range: IQR = q3 - q1.
  4. Find the lower bound: q1 - 1.5 * IQR.
  5. Find the upper bound: q3 + 1.5 * IQR.

Anything that lies outside the lower and upper bounds is an outlier.
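The IQR procedure can be sketched with `statistics.quantiles` from the standard library. The function name `iqr_bounds` and the choice of the "inclusive" quartile method are assumptions; different quartile conventions can give slightly different bounds. The data is the small example set with the value 400 used earlier in the article.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: q1 - k * IQR and q3 + k * IQR."""
    # quantiles() sorts the data internally and returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400]
lower, upper = iqr_bounds(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # 0.125 5.125
print(outliers)      # [400]
```

Because the quartiles are barely affected by the extreme value, the bounds stay close to the bulk of the data, which is why this method tends to be more robust than the z-score on small samples.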

Let us take the same example as for the z-score:

[Figure: IQR calculation for the example dataset; image not available]

In this example the lower and upper bounds are 7.5 and 19.5, so any value outside this range is an outlier.

That is all we have on outliers. I hope you enjoyed reading. Thank you.

Translated from: https://medium.com/analytics-vidhya/its-all-about-outliers-cbe172aa1309
