EDA Analysis: The EDA Theoretical Guide

Most data analysis problems start with understanding the data. It is the most crucial and complicated step. This step also affects the later decisions we make in a predictive modeling problem, one of which is the choice of algorithm.

In this article, we will walk through a complete guide to this step.

Content

  1. Reading Data
  2. Variable Identification
  3. Univariate analysis
  4. Bivariate analysis
  5. Missing values: types and analysis
  6. Outlier treatment
  7. Variable Transformation

Reading Data and Variable Identification

Reading the data means getting answers to the following questions:

  • What is the shape of my data?
  • How many features does my data contain?
  • What does it look like?
  • What are the types of variables?

Figure (Guide 1): Types of Variables
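The questions above can be answered with a few pandas calls. The following is a minimal sketch, not the author's code: the column names and values are hypothetical toy data standing in for a real file that would normally be loaded with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy data; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [32000.0, 45000.0, 81000.0, None, 52000.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

n_rows, n_cols = df.shape   # shape of the data: (rows, columns)
print(df.head())            # what the data looks like
print(df.dtypes)            # storage type of each variable

# Rough variable identification by dtype: numeric vs. everything else
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

Splitting by dtype is only a first pass. A numeric column such as a zip code may still be categorical in meaning, so the result should be checked by hand.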

Univariate Analysis (UA)

What is UA?

When we explore a single variable at a time from a given list of features, it is called UA. We summarize the variable, which helps us better understand the data.

We look for the following things in UA:

  • Central tendency (mean, median, mode) and dispersion of the variable
  • Distribution of the variable: symmetric, right-skewed, or left-skewed
  • Missing values and outliers
  • Count and count percent: Observing the frequency of each category in a categorical variable helps us to understand and deal with that variable.

Why UA?

We explore each variable and check for anomalies such as outliers and missing values, both of which are covered in later sections.

Methods for UA

For Continuous Variables:

  1. Tabular Method: Used to describe central tendency, dispersion, and missing values.
  2. Graphical Method: Used for studying the distribution and checking for outliers. We can use histograms to understand the distribution and box plots to detect outliers.

A violin plot combines a box plot with a smoothed view of the distribution (a density curve rather than histogram bars).

Figure (Guide 2): Methods of Univariate Analysis for continuous variables
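As an illustration of the tabular method for a continuous variable, here is a small sketch in pandas. The `income` values are made up for the example. The graphical methods (histogram, box plot) need a plotting library and are not shown.

```python
import pandas as pd

# Hypothetical income values (in thousands) with one missing entry and one extreme value
income = pd.Series([32, 45, 81, None, 52, 48, 39, 250], name="income")

summary = {
    "mean": income.mean(),                 # central tendency
    "median": income.median(),
    "std": income.std(),                   # dispersion
    "skew": income.skew(),                 # > 0 suggests right skew, < 0 left skew
    "missing": int(income.isna().sum()),   # number of missing values
}

print(income.describe())   # count, mean, std, min, quartiles, max in one table
print(summary)
```

Here the mean sits well above the median and the skewness is positive, which is the typical signature of a right-skewed variable.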

For Categorical Variables:

  1. Tabular Method: The ".value_counts()" method in Python (pandas) gives a table of frequencies.
  2. Graphical Method: A bar plot is usually the best graph for a categorical variable.

Figure (Guide 3): Methods of Univariate Analysis for categorical variables
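A short sketch of the tabular method for a categorical variable, covering both the count and the count percent mentioned earlier. The `city` values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical variable
city = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"], name="city")

counts = city.value_counts()                        # frequency of each category
percent = city.value_counts(normalize=True) * 100   # share of each category, in percent

# Combine both into one frequency table (aligned on the category labels)
table = pd.DataFrame({"count": counts, "percent": percent.round(1)})
print(table)
```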

Bivariate Analysis (BA)

What is BA?

When we study the empirical relationship between two variables, it is called BA.

Why BA?

It helps to detect anomalies, understand the dependence of two variables on each other, and assess the impact of each variable on the target variable.

Methods for BA

  1. For continuous-continuous types: There are two methods to study the relationship between two continuous variables: a scatter plot and correlation analysis.

Figure (Guide 4): Bivariate analysis for Continuous-Continuous type variables
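Correlation analysis for two continuous variables can be sketched as below. The data frame and its column names are hypothetical. pandas computes the Pearson coefficient by default, which measures linear association only.

```python
import pandas as pd

# Hypothetical house data: area and price rise together almost linearly
df = pd.DataFrame({
    "area_sqft": [600, 850, 1100, 1400, 2000],
    "price_lakh": [30, 42, 55, 71, 98],
})

# Pearson correlation between the two columns (value between -1 and 1)
r = df["area_sqft"].corr(df["price_lakh"])

# Full pairwise correlation matrix of all numeric columns
corr_matrix = df.corr()

print(round(r, 4))
print(corr_matrix)
```

A value close to 1 or -1 indicates a strong linear relationship. A value near 0 does not rule out a non-linear one, which is why the scatter plot is still worth inspecting.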

2. For categorical-continuous types: Here we can use bar plots and t-tests for the analysis.

The t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups/categories (for more than two groups, analysis of variance, ANOVA, is the usual choice). Calculating a t-test requires the difference between the mean values, the standard deviation of each category, and the number of observations in each category.

Figure (Guide 5): Bivariate analysis for categorical-Continuous type variables
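To make the ingredients of the t-test concrete, the sketch below computes the t-statistic by hand with NumPy. It uses the Welch form, which does not assume equal variances. That choice is an assumption of this example, since the article does not say which variant is meant, and the two groups are made-up numbers.

```python
import numpy as np

# Hypothetical continuous measurements for two categories
group_a = np.array([52.0, 48.0, 55.0, 60.0, 50.0])
group_b = np.array([41.0, 45.0, 39.0, 47.0, 43.0])

def welch_t(a, b):
    """t-statistic for the difference of two group means (Welch form)."""
    mean_diff = a.mean() - b.mean()
    # Standard error built from each group's sample variance (ddof=1) and size
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return mean_diff / se

t_stat = welch_t(group_a, group_b)
print(round(t_stat, 4))
```

Turning the statistic into a p-value requires the t-distribution. In practice, `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` reports the same statistic together with a p-value.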

3. For categorical-categorical types: A two-way table and the chi-square test are used to analyze the relationship between two categorical variables.
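The two-way table and the chi-square statistic can be sketched as follows. The `gender` and `bought` columns are hypothetical, and the statistic is computed by hand from observed and expected counts so that each step is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two categorical variables
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F", "M", "F"],
    "bought": ["yes", "yes", "yes", "no", "no", "no", "no", "yes", "yes", "no"],
})

# Two-way (contingency) table of observed counts
observed = pd.crosstab(df["gender"], df["bought"])
obs = observed.to_numpy()

# Expected counts if the two variables were independent:
# (row total * column total) / grand total
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected = np.outer(row_totals, col_totals) / obs.sum()

# Pearson chi-square statistic and its degrees of freedom
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(observed)
print(chi2, dof)
```

A larger statistic means the observed counts deviate more from independence. `scipy.stats.chi2_contingency` additionally returns a p-value. Note that it applies a continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected value computed here.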

Missing Values

Reasons for Missing Values

There can be various reasons for missing values in data, some of which are:

  • No response may have been recorded.
  • There can be an error while recording the data.
  • There can be an error while reading the data, etc.

Types of Missing Values

  1. Missing Completely at Random (MCAR): The missing values have no relation to any other variable or to the variable in which they occur.

  2. Missing at Random (MAR): The missing values have no relation to the variable in which they occur but may show an observable trend with other variables. E.g., income data for people older than 60 can be missing because people of that age are generally retired.

  3. Missing Not at Random (MNAR): The missing values are related to the variable in which they occur. E.g., houses priced above Rs. 2 crores can be missing from the database because there are few buyers at that price.

Methods of Dealing with Missing Values

There are two basic methods to deal with missing values:

  1. Deletion: We delete all rows with missing values from the dataset before training the model.

  2. Imputation: We fill the missing values, and there are various methods for doing so.

Figure (Guide 6): Treating Missing values
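Both approaches can be sketched in pandas. The data is hypothetical, and median and mode imputation are used here only as common examples. They are assumptions of this sketch, not necessarily the methods shown in the original figure.

```python
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25.0, None, 47.0, 51.0, 38.0],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
})

missing_per_column = df.isna().sum()   # how many values are missing in each column

# Deletion: drop every row that contains at least one missing value
dropped = df.dropna()

# Imputation: fill numeric gaps with the median, categorical gaps with the mode
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_per_column)
print(dropped)
print(imputed)
```

Deletion is simple but discards two of the five rows here. Imputation keeps every row at the cost of introducing estimated values.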

Outliers

Types of Outliers and their identification

There are two types of outliers:

  1. Univariate outliers: They can be identified using a box plot.

  2. Bivariate outliers: They can be identified using a scatter plot between the two variables.

Criteria for an outlier

Criteria for X to be an outlier, with the observations sorted in ascending order:
Q1: first quartile (25th percentile), the median of the lower half of the observations
Q2: second quartile, the median of all observations
Q3: third quartile (75th percentile), the median of the upper half of the observations
IQR: interquartile range = Q3 - Q1
X is an outlier if it satisfies: X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
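The IQR criterion translates directly into code. Below is a minimal sketch on made-up values. Note that pandas computes quartiles with linear interpolation by default, so Q1 and Q3 can differ slightly from a hand calculation based on medians of halves.

```python
import pandas as pd

# Hypothetical observations with one suspiciously large value
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# X is flagged when X < Q1 - 1.5*IQR or X > Q3 + 1.5*IQR
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

print(lower, upper)
print(outliers.tolist())
```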

Treatment of outliers

  1. We can delete that observation.
  2. We can impute the value of the outlier using the methods discussed for imputing missing values.
  3. We can apply transformations (discussed next).

Variable Transformation

Normalization is often said to increase the accuracy of a model. But what exactly is normalization? It is one of the techniques of variable transformation.

In variable transformation, we replace a variable by a function of it. For example, we replace the variable x by its log value.

We can try to fix the following things that we observed in the previous EDA steps:

  1. We can change the scale of the variable (redefining the limits of a variable).
  2. We can convert a non-linear relationship into a linear relationship.
  3. Many algorithms perform better on symmetrically distributed variables than on skewed ones, so we can convert a skewed distribution into a symmetric one.

Methods of Variable Transformation

  1. Non-linear transformation: We can replace the variable by its log value, square root, or cube root. These are non-linear transformations and hence help us deal with all the points stated above.

  2. Binning: We can divide the continuous values into bins, converting a continuous variable into a categorical one. This may help us place outliers into categories that our model can deal with.
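Both kinds of transformation can be sketched as below. The `income` values, the bin edges, and the labels are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable with one extreme value
income = pd.Series([20, 25, 30, 35, 40, 60, 120, 900], name="income", dtype=float)

# Non-linear transformations (the log requires strictly positive values)
log_income = np.log(income)
sqrt_income = np.sqrt(income)
print(income.skew(), log_income.skew())   # skewness before and after the log

# Binning: assumed edges give the intervals (0, 30], (30, 100], (100, inf)
bins = [0, 30, 100, np.inf]
labels = ["low", "medium", "high"]
income_band = pd.cut(income, bins=bins, labels=labels)
print(income_band.tolist())
```

The log compresses large values much more than small ones, which pulls in the long right tail. After binning, the extreme value 900 simply falls into the "high" band together with 120.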

Summing up

This is an extensive guide to Exploratory Data Analysis. It covers not only how to detect anomalies but also how to deal with and get rid of them. This is a very basic approach to EDA, so not every topic is covered yet.

Translated from: https://towardsdatascience.com/the-eda-theoretical-guide-b7cef7653f0d
