EDA on Categorical and Ordinal Data

Statistics for Data Science and Machine Learning

Categorical variables take their values from a set of options, which can be predefined or open-ended. An example is a person's gender. Ordinal variables are categorical variables whose options can be ordered by some rule, as in a Likert scale:

  • Like
  • Like Somewhat
  • Neutral
  • Dislike Somewhat
  • Dislike

To keep the following examples simple, we will use a single dataset: a group of students who each passed or failed two distinct exams. The results are shown in the following R×C table:

[Figure: the example table used throughout the article, self-generated.]

Statisticians have developed specific techniques to analyze this kind of data. The most important are:

Measures of Agreement

Percent Agreement

Calculated as the number of cases where the ratings fall in a given class divided by the total number of ratings.

[Figure: the example table with row and column totals added, self-generated.]

  • The percent agreement for passing exam 2 is 25/(25+60) = 25/85 = 0.294, i.e. 29.4%.
  • The percent agreement for passing exam 1 is 30/85 = 0.353, i.e. 35.3%.
  • The percent agreement for passing exam 1 and failing exam 2 is 10/85 = 0.118, i.e. 11.8%.
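
The percentages above can be reproduced with a few lines of Python. This is only a sketch: the original table is an image, so the four cell counts below are an assumption, reconstructed from the figures quoted in the text (30 students passing exam 1, 25 passing exam 2, 10 passing exam 1 but failing exam 2, 85 students in total).

```python
# Cell counts reconstructed from the figures quoted in the text (an assumption,
# since the original table is an image). Keys are (exam 1 result, exam 2 result).
table = {
    ("pass", "pass"): 20,
    ("pass", "fail"): 10,
    ("fail", "pass"): 5,
    ("fail", "fail"): 50,
}
n = sum(table.values())  # total number of students

# Marginal counts: everyone who passed exam 1, everyone who passed exam 2
pass_exam1 = sum(v for (e1, _), v in table.items() if e1 == "pass")
pass_exam2 = sum(v for (_, e2), v in table.items() if e2 == "pass")

print(f"pass exam 2: {pass_exam2 / n:.1%}")
print(f"pass exam 1: {pass_exam1 / n:.1%}")
print(f"pass exam 1, fail exam 2: {table[('pass', 'fail')] / n:.1%}")
```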

The problem with percent agreement is that it does not account for agreement that could occur purely by chance.

Cohen's Kappa

[Figure: the example table used throughout the article, self-generated.]

To correct for chance agreement, we calculate Kappa as:

[Figure: Cohen's Kappa formula, K = (P0 − Pe) / (1 − Pe), self-generated.]

where P0 is the observed agreement and Pe is the agreement expected by chance, calculated as:

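[Figure: P0 and Pe formulas, self-generated. For a 2×2 table with cells a, b (first row) and c, d (second row) and total n: P0 = (a + d) / n and Pe = [(a + b)(a + c) + (c + d)(b + d)] / n².]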

In our example:

  • P0 = 70/85 = 0.82
  • Pe = (30 × 25 + 55 × 60) / 85² = 0.56
  • K = (0.82 − 0.56) / (1 − 0.56) = 0.26 / 0.44 = 0.59
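
As a quick check, the calculation can be sketched in Python. The function name `cohens_kappa` and the cell labels a, b, c, d are illustrative choices, and the counts are an assumption reconstructed from the figures quoted in the text. Without rounding the intermediate values, Kappa comes out at about 0.60 rather than 0.59.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's Kappa for a 2x2 table with rows (a, b) and (c, d).

    a and d are the agreement cells (pass/pass and fail/fail).
    """
    n = a + b + c + d
    p0 = (a + d) / n  # observed agreement
    # expected agreement: product of matching marginals, summed over both classes
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return p0, pe, (p0 - pe) / (1 - pe)


# Counts reconstructed from the text (assumption): 20 pass both, 10 pass only
# exam 1, 5 pass only exam 2, 50 fail both.
p0, pe, kappa = cohens_kappa(20, 10, 5, 50)
print(f"P0 = {p0:.2f}, Pe = {pe:.2f}, K = {kappa:.2f}")
```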

Kappa ranges from −1 to 1: 0 means that the observed agreement equals the agreement expected by chance, 1 means that all cases agree, and negative values mean less agreement than expected by chance, down to −1 for complete disagreement.

The Chi-Squared Distribution

To do hypothesis testing with categorical variables we need an appropriate reference distribution. The most common is the chi-squared distribution, a continuous theoretical probability distribution.

This distribution has a single parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution approaches the normal distribution.

Chi-Squared Test

This test checks whether two categorical variables are independent. We will use the same example to explain how to calculate it.

First, we define the hypotheses we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:

  • H0: passing exam 1 and passing exam 2 are independent.
  • Ha: passing exam 1 and passing exam 2 are dependent.

This test relies on the difference between the observed and expected values. To calculate the expected values (what you would expect to find if the two variables were independent), we use:

[Figure: expected values formula, E_ij = (total of row i × total of column j) / n, self-generated.]

To simplify the calculations, we first compute the marginals: the row and column sums that we already calculated in the second table of this post. The expected values are calculated as:

[Figure: expected values calculation for our example, self-generated.]
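
The expected counts can be reproduced with a short sketch. The observed counts are an assumption reconstructed from the figures quoted in the text, with rows for exam 1 (pass, fail) and columns for exam 2 (pass, fail).

```python
# Observed counts reconstructed from the text (assumption).
# Rows: exam 1 pass / fail. Columns: exam 2 pass / fail.
observed = [[20, 10],
            [5, 50]]

row_totals = [sum(row) for row in observed]        # marginals per row
col_totals = [sum(col) for col in zip(*observed)]  # marginals per column
n = sum(row_totals)

# Under independence, each expected count is (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

Each row and column of the expected table sums to the same marginal as the observed table, which is a handy sanity check.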

Now we have everything we need to compute the chi-squared statistic:

[Figure: the chi-squared formula, χ² = Σ (O − E)² / E, where O is the observed and E the expected count of a cell, self-generated.]

The summation symbol means that we evaluate the formula for every combination of the two variables (4 in our case) and sum the results:

[Figure: the result for each term of the sum, self-generated.]

The final value is the sum of all 4 terms, 26.96. We now compare this result with the tabulated critical value of the chi-squared distribution, for which we need the degrees of freedom. They are calculated as (number of rows − 1) × (number of columns − 1), so in our case we have 1 degree of freedom.

According to a chi-squared table (easy to find online, and statistical packages for most languages expose the same values through a function), the critical value for α = 0.05 with 1 degree of freedom is 3.841. Our result is much larger, so we reject the null hypothesis and conclude that passing exam 1 and passing exam 2 are dependent.
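
The whole test can be sketched with the Python standard library. Two assumptions are worth flagging. First, the observed counts are reconstructed from the figures quoted in the text; with these counts the unrounded statistic comes out at about 31.0, somewhat higher than the hand-computed total quoted in the text, and both are far above the critical value. Second, the critical value and p-value below use a shortcut that is valid only for 1 degree of freedom, where a chi-squared variable is the square of a standard normal one. For other degrees of freedom a statistics package would be the usual route (for example `scipy.stats.chi2`).

```python
import math
from statistics import NormalDist

# Observed counts reconstructed from the text (assumption).
observed = [[20, 10],
            [5, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut valid for dof == 1 only: chi-squared(1) is Z^2 for standard normal Z
alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.2f}, dof = {dof}")
print(f"critical value = {critical:.3f}, p-value = {p_value:.2e}")
print("reject H0" if chi2 > critical else "fail to reject H0")
```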

Correlation Statistics for Categorical Data

Pearson's correlation requires variables measured on at least an interval scale, so we need different measures for binary and ordinal variables. Let's introduce them:

Binary Variables

Phi is a measure of the degree of association between two binary variables. Based on the table introduced in the Cohen's Kappa section, it is calculated as:

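[Figure: formulas to calculate the phi statistic, self-generated. With cells a, b (first row) and c, d (second row): Φ = (ad − bc) / √((a + b)(c + d)(a + c)(b + d)), or, in absolute value, Φ = √(χ² / n).]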

Using the second formula, in our example, Φ = (26.96/85)^(1/2) = 0.56.

Notice that the first formula can yield negative values, whereas the second can only yield positive ones. Here we do not care about the direction of the association, so we analyze only the absolute value.

If the data are evenly distributed (a 50–50 split), phi can reach a value of 1; otherwise the maximum attainable value is lower. In our case, the value indicates a moderately strong association, which is consistent with the result of the chi-squared test.
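
Both forms of phi can be compared in a short sketch. The cell labels a, b, c, d and the counts are assumptions reconstructed from the figures quoted in the text. For these counts, both forms give about 0.60.

```python
import math

# Counts reconstructed from the text (assumption):
# a = pass both, b = pass exam 1 only, c = pass exam 2 only, d = fail both
a, b, c, d = 20, 10, 5, 50
n = a + b + c + d

# First form: signed, computed directly from the cells
phi_signed = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Second form: unsigned, from the chi-squared statistic of the same table
observed = [[a, b], [c, d]]
row_totals = [a + b, c + d]
col_totals = [a + c, b + d]
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
phi_from_chi2 = math.sqrt(chi2 / n)

print(f"phi (signed) = {phi_signed:.3f}")
print(f"phi (from chi2) = {phi_from_chi2:.3f}")
```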

The Point-Biserial Correlation

It measures the correlation between a dichotomous variable and a continuous variable. The formula is:

[Figure: point-biserial correlation formula, r_pb = ((x̄1 − x̄2) / s_x) × √(p(1 − p)), self-generated.]

Where:

  • x̄1 = mean of the continuous variable for group 1
  • x̄2 = mean of the continuous variable for group 2
  • p = proportion of class 1 in the dichotomous variable
  • s_x = standard deviation of the continuous variable

To continue our example, suppose the following values, obtained by comparing the exam 1 variable with the number of hours studied:

  • x̄ pass = 5.5
  • x̄ not pass = 3.1
  • p = 20/25 = 0.8
  • s_x = 2

With these values, we obtain r_pb = (5.5 − 3.1) × √(0.8 × 0.2) / 2 = 2.4 × 0.4 / 2 = 0.48, indicating a moderate relationship between the two variables.
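
The same arithmetic as a minimal Python sketch, working from the summary values given above. The function name `point_biserial` is an illustrative choice.

```python
import math


def point_biserial(mean_1, mean_2, p, s_x):
    """Point-biserial correlation from summary statistics.

    mean_1, mean_2: means of the continuous variable in groups 1 and 2
    p: proportion of observations in group 1
    s_x: standard deviation of the continuous variable over all observations
    """
    return (mean_1 - mean_2) / s_x * math.sqrt(p * (1 - p))


# Values supposed in the text: hours studied by students who passed / failed exam 1
r_pb = point_biserial(mean_1=5.5, mean_2=3.1, p=0.8, s_x=2.0)
print(f"r_pb = {r_pb:.2f}")
```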

Ordinal Variables

The most widely used correlation coefficient for ordinal variables is Spearman's rank-order coefficient, usually called Spearman's r (also written r_s or ρ).

[Figure: Spearman's r correlation coefficient for ordinal variables, r_s = 1 − 6 Σ d_i² / (n(n² − 1)), self-generated.]

where d_i is the difference between the ranks of the two variables for individual i, and n is the sample size.
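
A small sketch of the rank-difference formula follows. The ranks are hypothetical data invented for illustration (five students ranked on two exams), and the function name `spearman_r` is an illustrative choice. This shortcut formula assumes there are no tied ranks. With ties, which are common in Likert-type data, the usual approach is to compute Pearson's correlation on the average ranks, as a routine such as `scipy.stats.spearmanr` does.

```python
def spearman_r(ranks_x, ranks_y):
    """Spearman's rank correlation via the d_i^2 formula (no tied ranks)."""
    n = len(ranks_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Hypothetical ranks of five students on two exams (illustrative data only)
exam1_ranks = [1, 2, 3, 4, 5]
exam2_ranks = [2, 1, 4, 3, 5]

r_s = spearman_r(exam1_ranks, exam2_ranks)
print(f"r_s = {r_s:.2f}")
```

Here the squared rank differences sum to 4, so r_s = 1 − 24/120 = 0.8.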

Summary

In data science, we often draw scatter plots of binary, categorical, or ordinal variables, or use them to color other plots. But when we compute correlations, it is easy to skip these variables, because the built-in functions in pandas (Python) or dplyr (R) do not handle them.

In this post, we showed how to analyze the distribution of these variables and their correlation with other variables.

This is the tenth post of my personal #100daysofML. I will be publishing the progress of this challenge on GitHub, Twitter, and Medium (Adrià Serra).

https://twitter.com/CrunchyML

https://github.com/CrunchyPistacho/100DaysOfML

Translated from: https://medium.com/ai-in-plain-english/eda-on-categorical-and-ordinal-data-22f8a4407836
