Different Types of Missing Values and Approaches to Deal with Them
Before handling missing values, we need to know which types of missing data exist. Three types are commonly described across the web, and some research papers add a fourth. Here is a brief introduction to each.
Structurally Missing Data: Suppose we have the results of a university's students for a particular semester, and some of the result values are missing. This can happen when a student dropped out before the exams or was absent. Such a value is structurally missing: there is a logical reason why it does not exist. In this case, a straightforward fix is to insert 0 at those missing places, provided a score of 0 is a sensible stand-in for the analysis at hand.
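As a minimal sketch of the zero-fill idea, assuming the results sit in a pandas DataFrame: the `results` table, its column names, and the `was_missing` flag are hypothetical and only illustrate the step.

```python
import numpy as np
import pandas as pd

# Hypothetical semester results; NaN marks students who dropped out or were absent.
results = pd.DataFrame({
    "student": ["A", "B", "C", "D"],
    "score": [78.0, np.nan, 91.0, np.nan],
})

# Keep a flag so the filled zeros can still be told apart from genuine scores of 0.
results["was_missing"] = results["score"].isna()

# Insert 0 at the structurally missing places.
results["score"] = results["score"].fillna(0)
print(results)
```

Keeping the flag column is a design choice rather than a requirement: it lets a later analysis exclude the filled rows if a zero turns out to distort averages.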
MCAR (Missing Completely at Random): The missing values are randomly distributed over the entire dataset. MCAR occurs when the missingness is unrelated both to the values of the variable in question and to the values of any other variable under analysis. An example is data missing for respondents whose questionnaires were lost. Say you have complete data for 15 questions and incomplete data for 10. We can then split the records into those with and without missing values and compare the two groups with a test such as a t-test; if we find no difference in means between the two samples, it is reasonable to assume the data are MCAR, although such a test cannot prove it.
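The t-test comparison described above could look roughly like the following sketch. The data are simulated, the column names `age` and `income` are made up for illustration, and Welch's t-test from SciPy is one possible choice of test, not the only one.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data (an assumption for illustration): income values are removed
# completely at random, independently of age.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 8_000, n),
})
df.loc[rng.random(n) < 0.2, "income"] = np.nan

# Split the records by whether income is missing, then compare mean age.
is_missing = df["income"].isna()
t_stat, p_value = stats.ttest_ind(
    df.loc[is_missing, "age"],
    df.loc[~is_missing, "age"],
    equal_var=False,  # Welch's t-test: does not assume equal variances
)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A large p-value is consistent with MCAR for this pair of variables but does not establish it; in practice the comparison would be repeated for every other observed variable.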
MAR (Missing at Random): The data are not missing randomly across the entire dataset, but are missing randomly within subsamples of it. Data are MAR when the probability of a missing value on a variable is related to some other measured variable in the model, but not to the value of the variable itself. For example, in an IQ dataset, suppose only older people have missing values; the probability of missing IQ data is then related to age. MAR is difficult to assume with confidence because it cannot be tested from the observed data alone.
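To make the IQ example concrete, here is a small sketch that compares the share of missing IQ values across age groups. The numbers, the column names, and the 65-year cutoff are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical IQ dataset in which only older respondents lack a score.
df = pd.DataFrame({
    "age": [23, 31, 38, 45, 52, 67, 71, 74, 80, 85],
    "iq": [102, 98, 110, 95, 105, np.nan, 99, np.nan, np.nan, 101],
})

# Group respondents by an assumed age cutoff and compute the missing share per group.
df["age_group"] = np.where(df["age"] >= 65, "65+", "under 65")
missing_rate = df["iq"].isna().groupby(df["age_group"]).mean()
print(missing_rate)
```

A clearly higher missing rate in one group shows that missingness depends on an observed variable, which rules out MCAR. It does not confirm MAR, since the missingness could still depend on the unobserved IQ values themselves.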
NMAR (Not Missing at Random): The probability that a value is missing depends on the missing value itself or on something that was not measured, so we can't treat the data as missing at random. In such cases we may not be able to draw reliable conclusions about the missing values from the observed data.
Some common approaches to dealing with these types of missing data:
1. The simple one: drop the corresponding column or row.
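df.isnull().sum()  # df is a pandas DataFrame: count the missing values per column
df.dropna()        # drop rows containing missing values; pass axis=1 to drop columns instead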
We use this approach when the dataset is large and the number of missing values in the affected columns or rows is comparatively low.
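A slightly fuller sketch of the drop approach, on a toy DataFrame. The column names and the 50% threshold for discarding a column are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data: column "c" is mostly missing, column "a" has a single gap.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, 6.0, 7.0, 8.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing values per column, to judge whether dropping is affordable.
missing_share = df.isnull().mean()

rows_dropped = df.dropna()        # remove every row that contains a NaN
cols_dropped = df.dropna(axis=1)  # remove every column that contains a NaN

# Middle ground: discard columns that are more than half missing,
# then drop the few remaining incomplete rows.
mostly_complete = df.loc[:, missing_share <= 0.5].dropna()
print(mostly_complete)
```

Dropping rows outright keeps only one record here, while removing the sparse column first keeps three, which is why checking the missing share before dropping is usually worthwhile.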
2. Imputation: Fill each missing value with a substitute value. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than dropping the column or row entirely. Some imputation techniques are:
a) Mean/Median Imputation: As the name suggests, we replace the missing values in a variable with the mean or median of its observed values. We use this approach when the number of missing observations is low.
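Mean and median imputation can be sketched in pandas as follows. The `height` column and its values are hypothetical, and include one large value to show how the two statistics differ.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one gap and one unusually large value.
df = pd.DataFrame({"height": [150.0, 160.0, np.nan, 170.0, 200.0]})

# Both statistics are computed from the observed values only (NaN is skipped).
mean_filled = df["height"].fillna(df["height"].mean())
median_filled = df["height"].fillna(df["height"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

The median is less affected by the large value, so it tends to be the safer choice for skewed variables.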
b) Multivariate Imputation by Chained Equations (MICE): It assumes that the missing data are Missing at Random (MAR). It imputes data variable by variable, specifying an imputation model for each variable, and uses the other variables in the data as predictors.
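One way to try a MICE-style procedure in Python is scikit-learn's `IterativeImputer`, which models each variable with missing values as a function of the others in a round-robin fashion. This is a sketch on simulated data, not a full MICE workflow: by default the imputer returns a single imputed dataset rather than the multiple imputations that MICE normally produces.

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Simulated data (an assumption): the third column depends on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

# Remove roughly 20% of the third column.
X_missing = X.copy()
X_missing[rng.random(100) < 0.2, 2] = np.nan

# Each variable with gaps is regressed on the other variables, iteratively.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())
```

To approximate multiple imputation, the imputer can be run several times with `sample_posterior=True` and different `random_state` values, and the downstream results pooled.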
3. Random Forest: This is a non-parametric imputation method that is reported to work with data that are missing at random as well as not missing at random. It uses multiple decision trees to estimate the missing values and outputs OOB (out-of-bag) imputation error estimates.
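A rough Python approximation of random-forest imputation is to plug a `RandomForestRegressor` into scikit-learn's `IterativeImputer`. This is a swapped-in technique for illustration on simulated data: unlike the method described above, this setup does not report an OOB imputation error estimate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Simulated data (an assumption): a nonlinear relationship that a
# linear imputation model would capture poorly.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
X[:, 2] = np.sin(X[:, 0]) + X[:, 1] ** 2

# Remove roughly 20% of the third column.
X_missing = X.copy()
X_missing[rng.random(80) < 0.2, 2] = np.nan

# Use a random forest as the per-variable imputation model.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = rf_imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())
```

The imputer may emit a convergence warning with so few iterations; the small `max_iter` and `n_estimators` values are chosen here only to keep the example fast.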
There are various other efficient methods for handling missing values, depending on the scenario and the type of data. I have discussed only the most common ones here. Hope it was helpful, thanks for reading! Good luck! Be safe!
Translated from: https://medium.com/analytics-vidhya/different-types-of-missing-values-approaches-to-deal-with-them-1f67c617374c