Exploratory Data Analysis

When we hear about data science or analytics, the first things that come to mind are modelling, tuning, and so on. But one of the most important steps, and one that comes before all of these, is Exploratory Data Analysis (EDA).

Figure: Exploratory data analysis (machine learning process steps)

Why EDA

One of the major problems data scientists and analysts face today is data quality. Because we rely on multiple sources for data, its quality is often compromised. The quality of the data determines the quality of the models we build on it. As the adage goes: garbage in, garbage out. This holds very true in data science.

We cannot build the Empire State Building or the Burj Khalifa on a shaky foundation!

That is why data gathering and data preparation are often said to take 60–80% of a data scientist's time.

When working with data, EDA is the most important step. It is very important to gather as much information and insight from the data as we can before processing it, and EDA is how we do that. EDA also helps us analyse the underlying trends and patterns in the data and formulate our problem statement better.

“Well begun is half done.”

Exploratory Data Analysis helps us understand the data better and hear what the data is saying. This can be done through visual analysis as well as a few other kinds of analysis. EDA also helps distinguish what should be pursued further from what is not worth following up.

Exploratory Data Analysis

Let's explore the steps of exploratory data analysis using a bank loan data set.

Import the libraries:

For the initial analysis we need libraries such as NumPy, pandas, Seaborn, and Matplotlib. NumPy is an array-processing package for numerical computation. pandas is used for data manipulation and analysis. Matplotlib and Seaborn are plotting libraries used for data visualization; Seaborn is built on Matplotlib and focuses on statistical graphics.

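A typical import cell for this kind of analysis might look like the following. The aliases np, pd, plt and sns are the usual community conventions; the original post showed its imports only as an image, so this is a sketch rather than the author's exact cell.

```python
import numpy as np               # array processing and numerical computation
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # base plotting library
import seaborn as sns            # statistical graphics built on Matplotlib

# Print the installed versions so the analysis can be reproduced later.
print("numpy", np.__version__)
print("pandas", pd.__version__)
print("seaborn", sns.__version__)
```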

Import the dataset:

The data is stored as a CSV file, so we import it with pd.read_csv.

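A minimal sketch of the import step. The source does not give the file name or the columns, so the CSV text and the column names (age, income, education, default) below are made up for illustration; with a real file you would pass its path to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the bank loan file,
# whose name and columns are not given in the source.
csv_text = """age,income,education,default
25,32000,graduate,0
32,54000,undergrad,0
47,81000,graduate,1
51,,advanced,0
"""

# pd.read_csv accepts a file path or any file-like object.
bankloan_df = pd.read_csv(io.StringIO(csv_text))

print(bankloan_df.head())  # first rows, as a quick sanity check
```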

The imported data is stored in the bankloan_df DataFrame.

Information about the data set:

.info() displays information about the DataFrame.

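As a rough illustration, the call can be tried on a small stand-in frame. The columns and values here are hypothetical, not the actual bank loan data; the buf= argument is only used so the text can be captured as well as printed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bank loan data.
bankloan_df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [32000.0, 54000.0, 81000.0, None],
    "education": ["graduate", "undergrad", "graduate", "advanced"],
    "default": [0, 0, 1, 0],
})

# .info() prints to stdout by default; passing buf= captures the same text.
buffer = io.StringIO()
bankloan_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

In a notebook, a plain bankloan_df.info() is usually enough.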

The .info() output lists the column names, the number of rows and columns, the data types, and so on, which gives an idea of what kind of data we have. It is important to know whether a column represents a categorical or a numerical variable and, if categorical, whether it is ordinal or nominal. Each of these types needs different treatment, which I will explain in another post. You can use .astype to change the data type of a column.

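One possible way to use .astype for the nominal and ordinal cases is sketched below. The column names and the ordering of the education levels are assumptions made for the example.

```python
import pandas as pd

# Hypothetical columns; the real data set's columns are not shown in the source.
bankloan_df = pd.DataFrame({
    "education": ["graduate", "undergrad", "graduate", "advanced"],
    "default": [0, 0, 1, 0],
})

# Nominal variable: categories with no inherent order.
bankloan_df["default"] = bankloan_df["default"].astype("category")

# Ordinal variable: categories with a meaningful order (the order here is assumed).
education_levels = pd.CategoricalDtype(
    categories=["undergrad", "graduate", "advanced"], ordered=True
)
bankloan_df["education"] = bankloan_df["education"].astype(education_levels)

print(bankloan_df.dtypes)
print(bankloan_df["education"].min())  # lowest level present in the column
```

Declaring the order explicitly lets comparisons, sorting and min/max follow the category order instead of alphabetical order.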

If you only need the number of rows and columns, use .shape.

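For instance, on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in for the bank loan data.
bankloan_df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "default": [0, 0, 1, 0],
})

# .shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = bankloan_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```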

To see the data types, use bankloan_df.dtypes.

To check for null values, use bankloan_df.isnull().sum().

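Both checks can be sketched on a small frame with a few deliberately missing entries. The data below is invented for the example.

```python
import pandas as pd

# Hypothetical stand-in with some missing values.
bankloan_df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [32000.0, None, 81000.0, None],
    "education": ["graduate", "undergrad", None, "advanced"],
})

column_types = bankloan_df.dtypes         # one dtype per column
null_counts = bankloan_df.isnull().sum()  # number of missing values per column

print(column_types)
print(null_counts)
```

isnull() returns a frame of booleans, and summing it counts the True values column by column.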

Descriptive analysis:

.describe() is used for descriptive analysis. It reports the count, mean, standard deviation, minimum, maximum, and quartiles of each numerical column; the interquartile range follows from the quartiles. This summary helps in understanding the skewness of the data, for example by comparing the mean with the median.

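A small sketch of the descriptive step, on invented numbers with one very large income so that the skew is visible. The interquartile range is derived from the quartile rows of the summary, and .skew() is added here as a direct numerical check.

```python
import pandas as pd

# Hypothetical values; one large income makes the column right-skewed.
bankloan_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [32000, 54000, 81000, 45000, 61000, 250000],
})

summary = bankloan_df.describe()                # count, mean, std, min, quartiles, max
iqr = summary.loc["75%"] - summary.loc["25%"]   # interquartile range per column
skewness = bankloan_df.skew()                   # > 0 means a longer right tail

print(summary)
print(iqr)
print(skewness)
```

A mean well above the median (the 50% row), as for income here, is a quick sign of right skew.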

For categorical variables, we use groupby to check how well each group is represented, that is, whether any group is over-represented relative to the others. If the target variable shows such an imbalance, it needs to be treated with a technique such as SMOTE.

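A sketch of the representation check with groupby. The target column name default is an assumption, and the example only measures the imbalance; it does not apply SMOTE.

```python
import pandas as pd

# Hypothetical data with a binary target column named "default".
bankloan_df = pd.DataFrame({
    "education": ["graduate", "undergrad", "graduate",
                  "advanced", "undergrad", "graduate"],
    "default": [0, 0, 1, 0, 0, 1],
})

# Number of rows in each target class, and each class's share of the data.
group_counts = bankloan_df.groupby("default").size()
group_shares = group_counts / group_counts.sum()

print(group_counts)
print(group_shares)
```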

Graphical analysis:

Graphs are a very important tool for understanding the distribution of the data. We use different graphs for univariate, bivariate, and multivariate analysis. Seaborn is a very good library for exploring different graphs. I will explain a few very common ones here and write a detailed post about graphs later.

Univariate analysis: analysis in which we consider only one variable. A few univariate graphs are the count plot, the box plot, and so on.

Countplot: a count plot shows the number of observations in each category using bars.

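A count plot could be drawn roughly as follows. The education column and its values are hypothetical, and the Agg backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical categorical column.
bankloan_df = pd.DataFrame({
    "education": ["graduate", "undergrad", "graduate",
                  "advanced", "undergrad", "graduate"],
})

# One bar per category; the bar height is the number of rows in that category.
ax = sns.countplot(data=bankloan_df, x="education")
ax.set_title("Observations per education level")

bar_heights = sorted(p.get_height() for p in ax.patches if p.get_height() > 0)
print(bar_heights)
```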

Boxplot: a box plot (or box-and-whisker plot) shows the distribution of quantitative data. The box shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution, except for points classed as “outliers” by a rule based on the interquartile range.

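A box plot of a single numerical column might be produced like this. The income values are invented, with one extreme value that should appear as a separate point beyond the whisker.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical incomes; 250000 is far from the rest of the values.
bankloan_df = pd.DataFrame({
    "income": [32000, 54000, 81000, 45000, 61000, 250000],
})

# The box spans the first to third quartile; the line inside is the median.
ax = sns.boxplot(data=bankloan_df, x="income")
ax.set_title("Distribution of income")
```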

Box plots are also used to identify outliers.

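The outlier rule behind a box plot can also be applied directly to the data. The sketch below assumes the common 1.5 × IQR convention, which is also the default whisker length in Seaborn and Matplotlib box plots; the source itself only says the rule is a function of the interquartile range.

```python
import pandas as pd

# Hypothetical incomes; 250000 is far from the rest of the values.
bankloan_df = pd.DataFrame({
    "income": [32000, 54000, 81000, 45000, 61000, 250000],
})

q1 = bankloan_df["income"].quantile(0.25)
q3 = bankloan_df["income"].quantile(0.75)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (bankloan_df["income"] < lower_bound) | (bankloan_df["income"] > upper_bound)
outliers = bankloan_df[is_outlier]
print(outliers)
```

Whether such rows are dropped, capped, or kept depends on the problem; the rule only flags them.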

Bivariate analysis plots the relationship between two variables, while multivariate analysis uses graphs to represent the relationships among several variables.

Pairplot is a bivariate graph used to analyse the pairwise relationships between the variables in a dataset. This is a very important step for model building.


Correlation

Correlation is another important step of EDA. While building a model, it is important to understand whether any correlation exists among the independent variables, and between each independent variable and the dependent variable. This also helps in feature selection and elimination.

Values close to +1 or -1 indicate strongly correlated variables. The values on the diagonal are each variable's correlation with itself and are always +1.

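A correlation matrix can be computed with pandas as sketched here, on invented numerical columns. DataFrame.corr uses the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical numerical columns, including a binary target.
bankloan_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [32000, 54000, 81000, 45000, 61000, 72000],
    "default": [0, 0, 1, 0, 0, 1],
})

# Pairwise Pearson correlation between all columns.
corr_matrix = bankloan_df.corr()
print(corr_matrix.round(2))
```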

The correlation matrix can also be drawn as a graph.

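One common way to draw a correlation matrix is a Seaborn heatmap. The original snippet survives only as an image, so the heatmap, the colour map and the data below are assumptions rather than the author's exact code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numerical columns, including a binary target.
bankloan_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [32000, 54000, 81000, 45000, 61000, 72000],
    "default": [0, 0, 1, 0, 0, 1],
})

corr_matrix = bankloan_df.corr()

# Fix the colour scale to [-1, 1] so that 0 sits at the middle of the colour map.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr_matrix, annot=True, fmt=".2f",
            cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
```

Annotating each cell with its value makes strongly correlated pairs easy to spot when selecting or eliminating features.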

These are the first few steps of exploratory data analysis. Based on the findings of each step, one can take appropriate action to improve data quality, analyse trends, or treat missing values, outliers, and anomalies.

“Information is the oil of the 21st century, and analytics is the combustion engine.” - Peter Sondergaard, Gartner Research

Translated from: https://medium.com/@viveksmenon/exploratory-data-analysis-d464f3adb777
