5 Advanced Visualisation for Exploratory Data Analysis (EDA)

Early one morning, a lady comes to meet Sherlock Holmes and Watson. Even before she opens her mouth to explain the reason for her visit, Sherlock can tell a lot about her through sheer power of observation and deduction. Similarly, we can deduce a lot about the data and the relationships among its features before any complex modelling or feeding the data to algorithms.

Objective

In this article, I discuss five advanced data visualisation options for performing an advanced EDA and becoming the Sherlock Holmes of data science. The goal is to deduce as much as possible about the relationships among data points with minimal coding, using the quickest built-in options available.

Step 1: We will use the seaborn package's built-in datasets and its advanced plotting options to illustrate the visualisations.

import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Seaborn comes with a few built-in datasets for quickly prototyping a visualisation and evaluating its suitability for EDA before applying it to your own data. In this article, we use the seaborn “penguins” dataset. The dataset is loaded from the online seaborn repository with the load_dataset method. We can list all the built-in seaborn datasets with the get_dataset_names() method.

In the code below, the FacetGrid method takes three arguments: the dataset, i.e. “bird”; the feature by which we want to organise the visualisation into panels, i.e. col set to “island”; and the feature by which we want to group the points, i.e. hue set to “species”. The “map” method is then given the plotting function and the X and Y axes of the scatter plot, i.e. “flipper_length_mm” and “body_mass_g”.

bird = sns.load_dataset("penguins")
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()

The visualisation produced by the above code plots the data points organised by “island” and colour-coded by “species”.

At a glance we can infer that the “Gentoo” species is present only on the island “Biscoe”. Gentoo penguins are heavier and have longer flippers than the other species. The “Adelie” species is found on all three islands, and “Chinstrap” is found only on the island “Dream”. With only 5 lines of code we get this much information without any modelling.

I encourage you to post a comment on what else we can deduce from this visualisation.
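
The species-by-island pattern read off the facet plot can also be cross-checked with a simple count table. The following is a minimal sketch, not part of the original article: the helper name species_by_island is hypothetical, and the few rows in toy are made up only so the snippet runs without downloading anything.

```python
import pandas as pd


def species_by_island(df):
    """Count rows for every (island, species) combination."""
    return pd.crosstab(df["island"], df["species"])


# Made-up rows for illustration only; these are not real measurements.
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Gentoo", "Adelie", "Chinstrap", "Adelie", "Adelie"],
})

counts = species_by_island(toy)
print(counts)
```

With the real data, calling species_by_island(bird) should show a zero wherever a species is absent from an island, which is the same information the colours in each panel convey.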


Step 3: We want to quickly get an idea of the range of penguin weights by species and island, and to identify where within that range the weights are concentrated.

With a strip plot, we can plot the weights of the penguins, grouped by species, for each of the islands.

sns.stripplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1")
plt.show()

A strip plot gives useful insight as long as it is not overcrowded with densely packed data points. For the island “Dream” the points are densely packed, and it is a bit difficult to get meaningful information from them.

A swarm plot can help visualise the weight range of the penguins by species on each island using non-overlapping points.

In the code below, we pass the same arguments as in the strip plot above, plus dodge=True to separate the species along the island axis.


sns.swarmplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1", dodge=True)
plt.show()

This greatly improves the readability of densely packed data. We can infer at a glance that “Adelie” weights range from approximately 2500 to 4800 grams and that a typical Gentoo penguin is heavier than a typical Chinstrap penguin. I will leave you to perform further exploratory data analysis based on the swarm plot.
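
Ranges estimated by eye from a swarm plot can be confirmed with a grouped summary. This is a small sketch under stated assumptions: the helper mass_range_by_species is a hypothetical name, and the values in toy are invented purely to keep the example self-contained.

```python
import pandas as pd


def mass_range_by_species(df):
    """Minimum, median and maximum body mass per species, ignoring missing values."""
    return (
        df.dropna(subset=["body_mass_g"])
        .groupby("species")["body_mass_g"]
        .agg(["min", "median", "max"])
    )


# Invented values for illustration only; not taken from the penguins data.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, None, 5600.0],
})

summary = mass_range_by_species(toy)
print(summary)
```

Passing the real bird frame instead of toy gives one row per species, which can be compared against the ranges read off the plot.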


Step 4: Next, we would like to understand the relationship between the body mass and culmen length of the penguins on each island, split by sex.

In the code below, the x and y parameters name the features whose relationship we want to identify. Hue is set to “sex” because we want to see the relationship for male and female penguins separately.

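# Recent copies of the seaborn penguins data name the culmen length column bill_length_mm (older copies used culmen_length_mm)
sns.lmplot(x="body_mass_g", y="bill_length_mm", hue="sex", col="island", markers=["o", "x"], palette="Set1", data=bird)
plt.show()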

We can gather that the relationship between body mass and culmen length in Biscoe island penguins is similar for males and females. In contrast, on the island Dream the trend is quite the opposite for male and female penguins. Is the relationship between body mass and culmen length for the penguins on the island Dream linear or non-linear?

We can visualise a polynomial relationship by specifying the order parameter in the lmplot method.
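
According to the seaborn documentation, an order greater than 1 makes the regression use numpy.polyfit. The sketch below shows what that comparison looks like numerically; the helper fit_rss and the synthetic curved data are assumptions made for illustration and do not come from the penguins dataset.

```python
import numpy as np


def fit_rss(x, y, order):
    """Residual sum of squares of a polynomial fit of the given order."""
    coeffs = np.polyfit(x, y, order)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))


# Synthetic, exactly quadratic data, made up for illustration only.
x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x ** 2 - 2.0 * x + 3.0

rss_linear = fit_rss(x, y, 1)
rss_quadratic = fit_rss(x, y, 2)
print(f"order=1 RSS: {rss_linear:.2f}, order=2 RSS: {rss_quadratic:.2e}")
```

A higher order never fits the same data worse, so a clearly lower residual for order 2 is a hint of curvature rather than proof; it is best read together with the plot.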


With very densely packed data points, we can extend the exploratory data analysis further by visualising the relationship between body mass and culmen length separately for male and female penguins on each island.

The grid of plots is organised by the col and row parameters given in the code.

# bill_length_mm is the culmen length column in recent copies of the penguins data
sns.lmplot(x="body_mass_g", y="bill_length_mm", hue="sex", col="island", row="sex", order=2, markers=["o", "x"], palette="Set1", data=bird)
plt.show()

Step 5: It helps to see a scatter plot and histograms side by side to get a holistic view of the spread of the data points and the frequency of observations at the same time. Joint plots are an efficient way to picture this. In the code below, the x and y parameters are the features whose relationship we are trying to identify. As the data points are densely packed, we plot a hexbin instead of a scatter plot. The “kind” parameter indicates the type of plot. If you want to know more about hexbin plots, please refer to my article 5 Powerful Visualisation with Pandas for Data Preprocessing.

sns.set_palette("gist_rainbow_r")
# bill_length_mm is the culmen length column in recent copies of the penguins data
sns.jointplot(x="body_mass_g", y="bill_length_mm", kind="hex", data=bird)
plt.show()

Based on the histogram, we can infer that most penguins weigh between 3500 and 4500 grams. Can you deduce the most frequent culmen length range of the penguins?
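
The most frequent range in a histogram can also be located numerically. This is a rough sketch, assuming fixed-width bins aligned to multiples of the bin width; the helper most_frequent_range is a hypothetical name, the sample values are made up, and the bins will not necessarily match the ones seaborn chooses for the marginal histograms.

```python
import numpy as np


def most_frequent_range(values, bin_width):
    """Return (low, high, count) for the fullest fixed-width bin, skipping NaNs."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    start = np.floor(values.min() / bin_width) * bin_width
    stop = (np.floor(values.max() / bin_width) + 1) * bin_width
    # Half a bin of padding makes sure the final edge is included.
    edges = np.arange(start, stop + bin_width / 2, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    i = int(np.argmax(counts))
    return float(edges[i]), float(edges[i + 1]), int(counts[i])


# Made-up weights in grams, for illustration only.
sample = [3200, 3600, 3700, 3900, 4100, 4800, float("nan")]
result = most_frequent_range(sample, 500)
print(result)
```

The same helper can be pointed at a real column, for example the body mass or culmen length values of bird, with a bin width that suits the units of that feature.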

We can also plot the individual data points (as shown in the right side plot) inside the hexbin plot with the code shown below.

# bill_length_mm is the culmen length column in recent copies of the penguins data
g = sns.jointplot(x="body_mass_g", y="bill_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in grams)", "Culmen Length (in mm)")
plt.show()

Step 6: Finally, we would like an overview of the spread of, and relationships among, the different features by island. Pair plots are very handy for visualising scatter plots among different features. The feature “island” is passed as the hue because we want to colour-code the plots by it.

sns.pairplot(bird, hue="island") 
plt.show()

We can see from the visualisation that on the island Biscoe most of the penguins have a shallow culmen depth but are heavier than penguins on the other islands. Similarly, flipper length is longer for most penguins on Biscoe island, even though their culmen depth is shallow.


I hope you will use these advanced visualisations for exploratory data analysis and get a sense of the relationships among data points before embarking on any complex modelling exercise. I would love to hear about your favourite visualisation plots for EDA, and to see a list of the conclusions we can draw from the examples illustrated in this article.

If you would like to learn data visualisation using pandas, please read my trending article 5 Powerful Visualisation with Pandas for Data Preprocessing.

If you are interested in learning different Scikit-Learn scalers then please do read my article Feature Scaling — Effect Of Different Scikit-Learn Scalers: Deep Dive

"""Full Code"""
import seaborn as sns
import matplotlib.pyplot as plt

# Recent copies of the seaborn penguins data name the culmen length column
# bill_length_mm (older copies used culmen_length_mm).
bird = sns.load_dataset("penguins")

# Scatter plots per island, coloured by species
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()

# Strip plot and swarm plot of body mass by island and species
sns.stripplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1")
plt.show()
sns.swarmplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1", dodge=True)
plt.show()

# Linear fit per island, then a second-order fit split by sex and island
sns.lmplot(x="body_mass_g", y="bill_length_mm", hue="sex", col="island", markers=["o", "x"], palette="Set1", data=bird)
plt.show()
sns.lmplot(x="body_mass_g", y="bill_length_mm", hue="sex", col="island", row="sex", order=2, markers=["o", "x"], palette="Set1", data=bird)
plt.show()

# Hexbin joint plots, the second with the individual points overlaid
sns.set_palette("gist_rainbow_r")
sns.jointplot(x="body_mass_g", y="bill_length_mm", kind="hex", data=bird)
plt.show()
g = sns.jointplot(x="body_mass_g", y="bill_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in grams)", "Culmen Length (in mm)")
plt.show()

# Pair plot of all numeric features, coloured by island
sns.pairplot(bird, hue="island")
plt.show()

Translated from: https://towardsdatascience.com/5-advanced-visualisation-for-exploratory-data-analysis-eda-c8eafeb0b8cb
