EDA Visualisation
Early one morning, a lady comes to see Sherlock Holmes and Watson. Even before she opens her mouth to explain the reason for her visit, Sherlock can tell a lot about her through sheer power of observation and deduction. Similarly, we can deduce a lot about the data and the relationships among its features before any complex modelling and before feeding the data to algorithms.
Objective
In this article, I will discuss five advanced data visualisation options for performing an advanced EDA and becoming the Sherlock Holmes of data science. The goal is to deduce as much as possible about the relationships among different data points with minimal coding, using the quickest built-in options available.
Step 1: We will use the seaborn package's built-in datasets and advanced options to illustrate advanced data visualisation.
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: The seaborn package comes with a few built-in datasets for quickly prototyping a visualisation and evaluating its suitability for EDA before applying it to your own data points. In this article, we will use the seaborn “penguins” dataset. The dataset is loaded from the online seaborn repository with the load_dataset method. We can get the list of all the built-in seaborn datasets with the get_dataset_names() method.
In the code below, the FacetGrid method takes three parameters: the dataset, i.e. “bird”; the feature by which we want to organise the visualisation, i.e. col set to “island”; and the feature by which we want to group the points, i.e. hue set to “species”. The “map” method then takes the X and Y axes of the scatter plot, i.e. “flipper_length_mm” and “body_mass_g”.
bird = sns.load_dataset("penguins")
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()
The visualisation produced by the above code plots the data points organised by “island” and colour-coded by “species”.
At a glance we can infer that the “Gentoo” species is present only on the island “Biscoe”. The Gentoo species is heavier and has a longer flipper length than the other species. The “Adelie” species is found on all three islands, and “Chinstrap” is found only on the island “Dream”. With only 5 lines of code we get this much information without any modelling.
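As a numeric cross-check on what the facet plot suggests, a cross-tabulation of island against species shows which species occur where. The sketch below is only an illustration: it assumes a tiny hand-made table with the same “species” and “island” column names, and the helper name species_by_island and the sample rows are hypothetical, not taken from the article. With the real data you would pass bird instead of sample.

```python
import pandas as pd

# Hypothetical miniature stand-in for the penguins data (illustrative rows only).
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream"],
})


def species_by_island(df):
    """Count rows for every (island, species) pair."""
    return pd.crosstab(df["island"], df["species"])


counts = species_by_island(sample)
print(counts)

# Species that appear on exactly one island in this table.
single_island = [s for s in counts.columns if (counts[s] > 0).sum() == 1]
print(single_island)
```

A zero in the table means the species was never recorded on that island, which is the same information the empty colour groups in the facets convey.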
I encourage you to post a comment on any other information we can deduce from this visualisation.
Step 3: We want to quickly get an idea of the range of penguin weights by species and island, and also identify where the weights are concentrated.
With a strip plot, we can plot the weight of the penguins organised by species for each of the islands.
sns.stripplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1")
plt.show()
A strip plot gives useful insight as long as the plot is not overcrowded with densely packed data points. For the island “Dream”, the points are densely packed in the plot, and it is a bit difficult to get meaningful information from it.
A swarm plot can help to visualise the range of penguin weights by species on each island, because it draws the points so that they do not overlap.
In the code below, we pass the same information as we did for the strip plot above.
sns.swarmplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1", dodge=True)
plt.show()
This greatly improves the readability of densely packed data. We can infer at a glance that “Adelie” weights range from approximately 2500 to 4800 grams and that a typical Gentoo is heavier than a typical Chinstrap. I will leave you to perform other exploratory data analysis based on the swarm plot.
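Ranges read off a swarm plot can be confirmed with a grouped summary. The following is a minimal sketch, assuming a small hypothetical table with “species” and “body_mass_g” columns; the helper name mass_summary and the values are made up for illustration, and with the real data you would pass bird.

```python
import pandas as pd

# Hypothetical sample rows; the masses are illustrative, not real measurements.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo",
                "Chinstrap", "Chinstrap"],
    "body_mass_g": [2900.0, 3700.0, 4600.0, 4500.0, 5700.0, 3300.0, 3900.0],
})


def mass_summary(df):
    """Minimum, median and maximum body mass for each species."""
    return df.groupby("species")["body_mass_g"].agg(["min", "median", "max"])


summary = mass_summary(sample)
print(summary)
```

Comparing the median column across species is a quick way to check a statement such as “a typical Gentoo is heavier than a typical Chinstrap”.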
Step 4: Next, we would like to understand the relationship between the body mass and culmen length of the penguins on each island, by sex.
In the code below, the x and y parameters name the features whose relationship we are interested in identifying. Hue is set to “sex” because we want to learn the relationship for male and female penguins separately.
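# Note: recent versions of the seaborn penguins dataset name this column "bill_length_mm"; adjust the name if "culmen_length_mm" is not found.
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", markers=["o", "x"], palette="Set1", data=bird)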
plt.show()
We can gather that the relationship between body mass and culmen length in Biscoe island penguins is similar for males and females. In contrast, on the island Dream, the trend is quite the opposite for male and female penguins. Is the relationship between body mass and culmen length for the penguins on the island Dream linear or non-linear?
We can visualise a polynomial relationship by specifying the order parameter in the lmplot method.
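To see what a higher order buys, it can help to compare a straight-line fit with a quadratic fit directly. According to the seaborn documentation, an order greater than 1 makes the regression use numpy.polyfit, so the sketch below calls numpy.polyfit on synthetic data. The data, the helper name fit_rss and the chosen coefficients are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic data with a known quadratic trend plus a little noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.2, size=x.size)


def fit_rss(x, y, order):
    """Fit a polynomial of the given order; return coefficients and residual sum of squares."""
    coeffs = np.polyfit(x, y, deg=order)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(np.sum(residuals**2))


_, rss_linear = fit_rss(x, y, 1)
coeffs_quad, rss_quadratic = fit_rss(x, y, 2)

print("linear RSS:   ", round(rss_linear, 2))
print("quadratic RSS:", round(rss_quadratic, 2))
print("quadratic coefficients (highest power first):", np.round(coeffs_quad, 2))
```

A much smaller residual sum of squares for the second-order fit is a hint that the relationship is not linear; a similar comparison on the real columns can support what the lmplot curves show visually.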
When the data points are very densely packed, we can further extend our exploratory data analysis by visualising the relationship between body mass and culmen length separately for males and females on each island.
The visualisation is organised by the col and row parameters given in the code.
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", row="sex", order=2, markers=["o", "x"], palette="Set1", data=bird)
plt.show()
Step 5: It helps to see a scatter plot and histograms side by side to get a holistic view of the spread of the data points and the frequency of observations at the same time. Joint plots are an efficient way to picture this. In the code below, the x and y parameters are the features whose relationship we are trying to identify. As the data points are densely packed, we will plot a hexbin instead of a scatter plot. The “kind” parameter indicates the type of plot. If you want to know more about hexbin plots, please refer to my article 5 Powerful Visualisation with Pandas for Data Preprocessing.
sns.set_palette("gist_rainbow_r")
sns.jointplot(x="body_mass_g", y="culmen_length_mm", kind="hex", data=bird)
plt.show()
Based on the histogram we can infer that most penguins weigh between 3500 and 4500 grams. Can you deduce the most frequent culmen length range of the penguins?
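The most populated range can also be computed rather than read off the marginal histogram. Here is a minimal sketch using numpy.histogram; the function name most_frequent_range, the sample masses and the bin edges are hypothetical choices for illustration, and the same call would work on a real column such as bird["body_mass_g"].

```python
import numpy as np


def most_frequent_range(values, bins=10):
    """Return (low, high, count) for the histogram bin holding the most values."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing measurements
    counts, edges = np.histogram(values, bins=bins)
    top = int(np.argmax(counts))
    return float(edges[top]), float(edges[top + 1]), int(counts[top])


# Hypothetical body masses in grams, including one missing value.
masses = [2900, 3400, 3600, 3700, 3800, 3900, 4100, 4700, 5200, 5800, float("nan")]
low, high, n = most_frequent_range(masses, bins=[2500, 3500, 4500, 5500, 6500])
print(f"{n} values fall between {low:.0f} and {high:.0f} grams")
```

The result depends on the bin edges chosen, just as the shape of the plotted histogram does, so it is worth trying a couple of bin widths before drawing a conclusion.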
We can also plot the individual data points (as shown in the right-hand plot) inside the hexbin plot with the code shown below.
g = sns.jointplot(x="body_mass_g", y="culmen_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in gram)", "Culmen Length ( in mm)")
plt.show()
Step 6: Finally, we would like to get an overview of the spread of, and the relationships among, the different features by island. Pair plots are very handy for visualising the scatter plots among different features. The feature “island” is set as the hue because we want to colour-code the plot by it.
sns.pairplot(bird, hue="island")
plt.show()
We can see from the visualisation that on the island Biscoe most of the penguins have a shallow culmen depth but are heavier than penguins on the other islands. Similarly, we can conclude that most penguins on Biscoe island have a longer flipper length but a shallow culmen depth.
I hope you will use these advanced visualisations for exploratory data analysis and get a sense of the relationships among data points before embarking on any complex modelling exercise. I would love to hear about your favourite visualisation plots for EDA, and also a list of conclusions we can draw from the examples illustrated in this article.
If you would like to learn data visualisation using pandas, please read my trending article on 5 Powerful Visualisation with Pandas for Data Preprocessing.
If you are interested in learning different Scikit-Learn scalers then please do read my article Feature Scaling — Effect Of Different Scikit-Learn Scalers: Deep Dive
"""Full Code"""
import seaborn as sns
import matplotlib.pyplot as plt

# Note: recent versions of the seaborn penguins dataset name the culmen column "bill_length_mm"; adjust the name if "culmen_length_mm" is not found.
bird = sns.load_dataset("penguins")
g = sns.FacetGrid(bird, col="island", hue="species")
g.map(plt.scatter, "flipper_length_mm", "body_mass_g", alpha=.6)
g.add_legend()
plt.show()
sns.stripplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1")
plt.show()
sns.swarmplot(x="island", y="body_mass_g", hue="species", data=bird, palette="Set1", dodge=True)
plt.show()
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", markers=["o", "x"], palette="Set1", data=bird)
plt.show()
sns.lmplot(x="body_mass_g", y="culmen_length_mm", hue="sex", col="island", row="sex", order=2, markers=["o", "x"], palette="Set1", data=bird)
plt.show()
sns.set_palette("gist_rainbow_r")
sns.jointplot(x="body_mass_g", y="culmen_length_mm", kind="hex", data=bird)
plt.show()
g = sns.jointplot(x="body_mass_g", y="culmen_length_mm", data=bird, kind="hex", color="c")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.set_axis_labels("Body Mass (in gram)", "Culmen Length ( in mm)")
plt.show()
sns.pairplot(bird, hue="island")
plt.show()
Translated from: https://towardsdatascience.com/5-advanced-visualisation-for-exploratory-data-analysis-eda-c8eafeb0b8cb