EDA on Categorical and Ordinal Data

Statistics for Data Science and Machine Learning

Categorical variables take their values from a set of options, which can be predefined or open-ended. A person's gender is one example. Ordinal variables are categorical variables whose options can be ordered by some rule, as in the Likert scale:

  • Like
  • Like Somewhat
  • Neutral
  • Dislike Somewhat
  • Dislike

To keep the examples simple, we will use a single data set: a group of students who each passed or failed two different exams. The results are shown in the following R×C table:

[Figure: the example table used throughout the article, created by the author.]

Statisticians have developed specific techniques to analyze this kind of data. The most important are:

Measures of Agreement

Percent Agreement

Percent agreement is the number of cases in which the ratings fall in a given class, divided by the total number of ratings.

[Figure: the example table with row and column totals added, created by the author.]
  • The percent agreement for passing exam 2 is 25/(25+60) = 0.294, so 29.4%.
  • The percent agreement for passing exam 1 is 30/85 = 0.353, so 35.3%.
  • The percent agreement for passing exam 1 and not passing exam 2 is 10/85 = 0.118, so 11.8%.
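The same proportions can be reproduced with a few lines of Python. This is a minimal sketch: the four cell counts (20, 10, 5, 50) and the row/column orientation are assumptions inferred from the totals quoted in the text (25 and 60 for exam 2, 30 and 55 for exam 1, 10 students passing exam 1 but not exam 2).

```python
# Cell counts inferred from the totals quoted in the text (an assumption):
# rows = exam 1 (pass, fail), columns = exam 2 (pass, fail).
table = [[20, 10],
         [5, 50]]

n = sum(sum(row) for row in table)               # total number of students
row_totals = [sum(row) for row in table]         # exam 1: pass, fail
col_totals = [sum(col) for col in zip(*table)]   # exam 2: pass, fail

pass_exam_2 = col_totals[0] / n      # 25 / 85
pass_exam_1 = row_totals[0] / n      # 30 / 85
pass_1_fail_2 = table[0][1] / n      # 10 / 85

print(f"{pass_exam_2:.1%} {pass_exam_1:.1%} {pass_1_fail_2:.1%}")
```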

The problem with percent agreement is that part of the observed agreement can arise purely by chance.

Cohen's Kappa

[Figure: the example table used throughout the article, created by the author.]

To overcome the problem with percent agreement, we calculate Kappa as:

K = (P0 - Pe) / (1 - Pe)

where P0 is the observed agreement and Pe is the agreement expected by chance, calculated as:

P0 = (number of cases on which both ratings agree) / n
Pe = Σ (row total × column total) / n², summed over the classes

where n is the total number of cases.

In our example:

  • P0 = 70/85 = 0.82
  • Pe = 30 × 25 / 85² + 55 × 60 / 85² = 0.56
  • K = (0.82 - 0.56) / (1 - 0.56) = 0.26 / 0.44 = 0.59
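The calculation above can be sketched as a small function. The function name cohen_kappa and the cell counts (20, 10, 5, 50) are assumptions: the counts are inferred from the totals quoted in the text, with both exams using the same class order (pass, fail). Because the code does not round P0 and Pe before dividing, it returns roughly 0.598 rather than the 0.59 obtained from the rounded intermediate values.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows and columns share the same classes)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: share of cases on the main diagonal.
    p0 = sum(table[i][i] for i in range(len(table))) / n
    # Expected agreement: product of matching marginals, summed over the classes.
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p0 - pe) / (1 - pe)


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); counts are inferred.
table = [[20, 10],
         [5, 50]]
kappa = cohen_kappa(table)
print(round(kappa, 3))
```

For real projects, a tested library implementation is usually preferable to a hand-written one.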

Kappa ranges from -1 to 1. A value of 0 means the observed agreement equals the agreement expected by chance, 1 means all cases are in agreement, and negative values, down to a minimum of -1, mean the ratings agree less often than chance would predict.

The Chi-Squared Distribution

To run hypothesis tests on categorical variables we need suitable distributions. The most common is the chi-squared distribution, a continuous theoretical probability distribution.

This distribution has a single parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution approaches a normal distribution.
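A chi-squared variable with k degrees of freedom is the sum of k squared independent standard normal variables, which gives a simple way to explore the distribution by simulation. The following sketch uses only the standard library; the helper name sample_chi_squared, the seed, the sample size and the choice k = 5 are arbitrary assumptions made for illustration.

```python
import random
import statistics


def sample_chi_squared(k, rng):
    """One draw from a chi-squared distribution with k degrees of freedom."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
k = 5
draws = [sample_chi_squared(k, rng) for _ in range(20_000)]

# In theory the mean is k and the variance is 2k.
mean = statistics.fmean(draws)
variance = statistics.pvariance(draws)
print(round(mean, 2), round(variance, 2))
```

Repeating the experiment with larger values of k and plotting a histogram of the draws shows the shape becoming more symmetric and bell-like.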

Chi-Squared Test

This test checks whether two categorical variables are independent. We will use the same example to show how to calculate it.

First, we define the hypotheses we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:

  • H0: passing exam 1 and passing exam 2 are independent.
  • Ha: passing exam 1 and passing exam 2 are dependent.

The test relies on the difference between observed and expected values. To calculate the expected values (what you would expect to find if the two variables were independent), we use:

E = (row total × column total) / n

To simplify the calculations, we first compute the marginals: the row and column sums already shown in the second table of this post. The expected values are calculated as:

[Figure: expected values calculated for our example, created by the author.]
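As a rough illustration, the expected counts can be computed from the marginals in a few lines. The function name expected_counts and the cell counts (20, 10, 5, 50) are assumptions; the counts are inferred from the totals quoted in the text.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); counts are inferred.
observed = [[20, 10],
            [5, 50]]
expected = expected_counts(observed)

for row in expected:
    print([round(value, 2) for value in row])
```

The expected counts always sum to the same total as the observed counts, which is a quick sanity check on the calculation.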

Now we have everything we need to apply the chi-squared formula:

χ² = Σ (O - E)² / E

where O is an observed count and E is the corresponding expected count.

The summation symbol means that we evaluate the formula for every combination of our variables (four in our case) and add up the results:

[Figure: the result of each term of the sum, created by the author.]

The final value is the sum of all four terms, which for the counts of our example comes to about 31.0. To compare this result with the statistical tables we need the degrees of freedom, calculated as (number of rows - 1) × (number of columns - 1); in our case there is 1 degree of freedom.

According to a chi-squared table (easy to find online, and statistical packages in most languages expose it as a function), the critical value for α = 0.05 with 1 degree of freedom is 3.841. Our result is much larger, so we reject the null hypothesis and conclude that passing exam 1 and passing exam 2 are dependent.
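The whole procedure can be sketched in plain Python. The function name chi_squared_statistic and the cell counts (20, 10, 5, 50) are assumptions, with the counts inferred from the totals quoted in the text; with these counts the statistic comes to roughly 31.0. The p-value line relies on a closed form that holds only for 1 degree of freedom, and no continuity correction is applied.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); counts are inferred.
table = [[20, 10],
         [5, 50]]
statistic, dof = chi_squared_statistic(table)

# For 1 degree of freedom only: P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

CRITICAL_VALUE = 3.841  # alpha = 0.05, 1 degree of freedom (from the text)
reject_h0 = statistic > CRITICAL_VALUE
print(round(statistic, 2), dof, reject_h0)
```

Comparing the statistic with the critical value and comparing the p-value with α are two views of the same decision.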

Correlation Statistics for Categorical Data

Because Pearson correlation requires variables measured on at least an interval scale, binary and ordinal variables need different measures. Let's introduce them:

Binary Variables

Phi is a measure of the degree of association between two binary variables. Based on the table introduced in the Cohen's Kappa section, with a, b, c and d the four cell counts read row by row and n their total, it is calculated as:

Φ = (ad - bc) / ((a + b)(c + d)(a + c)(b + d))^(1/2)
Φ = (χ² / n)^(1/2)

Using the second formula with the chi-squared statistic of our example and n = 85, Φ = (χ²/85)^(1/2) ≈ 0.6.

Note that the first formula can yield negative values, whereas the second can only be positive. Here we do not care about the direction of the association, so we only analyze the absolute value.

If both variables are split 50–50, so the data is evenly distributed, phi can reach the value of 1; otherwise its maximum attainable value is lower. In our case, a value of about 0.6 indicates a fairly strong association between the two exams.
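A short sketch of the first phi formula, together with a check that squaring it and multiplying by n recovers the chi-squared statistic used by the second formula. The function name phi_coefficient and the cell counts (20, 10, 5, 50) are assumptions, with the counts inferred from the totals quoted in the text.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table [[a, b], [c, d]]; the sign shows the direction of association."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Rows = exam 1 (pass, fail), columns = exam 2 (pass, fail); counts are inferred.
a, b, c, d = 20, 10, 5, 50
n = a + b + c + d

phi = phi_coefficient(a, b, c, d)

# For a 2x2 table, n * phi^2 equals the chi-squared statistic,
# so sqrt(chi2 / n) gives the absolute value of phi.
chi2_from_phi = n * phi ** 2
print(round(phi, 3), round(chi2_from_phi, 2))
```

Swapping the two columns of the table flips the sign of phi but leaves its absolute value unchanged.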

The Point-Biserial Correlation

It measures the correlation between a dichotomous variable and a continuous variable. The formula is:

r_pb = ((x̄1 - x̄2) / s_x) × (p(1 - p))^(1/2)

where:

  • x̄1 = mean of the continuous variable for group 1
  • x̄2 = mean of the continuous variable for group 2
  • p = proportion of class 1 in the dichotomous variable
  • s_x = standard deviation of the continuous variable

To continue our example, suppose the following values, obtained by comparing the exam 1 variable with the number of hours studied:

  • x̄ pass = 5.5
  • x̄ not pass = 3.1
  • p = 20/25 = 0.8
  • s_x = 2

With these values, we obtain (5.5 - 3.1) × (0.8 × 0.2)^(1/2) / 2 = 2.4 × 0.4 / 2 = 0.48, indicating that there is some relationship between our variables.
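The arithmetic above can be checked with a small function that works directly from the summary values. The function name point_biserial and its parameter names are assumptions made for this sketch; the inputs are the supposed values from the text.

```python
import math


def point_biserial(mean_1, mean_2, p, s_x):
    """Point-biserial correlation from group means, class proportion and standard deviation."""
    return (mean_1 - mean_2) / s_x * math.sqrt(p * (1 - p))


# Supposed values from the text: mean hours studied for pass / not pass,
# proportion of the "pass" class, and standard deviation of hours studied.
r_pb = point_biserial(mean_1=5.5, mean_2=3.1, p=0.8, s_x=2.0)
print(round(r_pb, 2))
```

Swapping the two groups changes the sign of the result, so it matters which class is treated as group 1.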

Ordinal Variables

The most widely used correlation coefficient for ordinal variables is Spearman's rank-order coefficient, usually called Spearman's r:

r_s = 1 - 6 Σ d_i² / (n(n² - 1))

where d_i is the difference between the ranks of the two variables for individual i and n is the sample size.
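A minimal sketch of the rank-difference formula follows. The helper names ranks and spearman and the sample data (hours studied and exam scores for five students) are hypothetical. The shortcut formula is exact only when there are no tied values, so the sketch assumes distinct values in each list; ordinal data with many ties needs average ranks and the Pearson correlation of the ranks instead.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation via the rank-difference formula (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: hours studied and exam score for five students.
hours = [2, 4, 5, 7, 9]
scores = [55, 50, 70, 65, 90]

r_s = spearman(hours, scores)
print(round(r_s, 2))
```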

Summary

In data science we often draw scatter plots of binary, categorical or ordinal variables, or use them to color other plots. But when we compute correlations it is easy to skip these variables, because the built-in functions of pandas in Python or dplyr in R do not handle them.

In this post, we showed how to analyze the distribution of these variables and their correlation with the other variables.

This is the tenth post of my #100daysofML challenge. I will be publishing my progress on GitHub, Twitter, and Medium (Adrià Serra).

https://twitter.com/CrunchyML

https://github.com/CrunchyPistacho/100DaysOfML

Translated from: https://medium.com/ai-in-plain-english/eda-on-categorical-and-ordinal-data-22f8a4407836
