A Practical Guide for Data Visualization

Exploratory data analysis (EDA) is an essential part of the data science or the machine learning pipeline. In order to create a robust and valuable product using the data, you need to explore the data, understand the relations among variables, and the underlying structure of the data. One of the most effective tools in EDA is data visualization.

Data visualizations tell us much more than plain numbers, and they are also more likely to stick in your head. In this post, we will explore a customer churn dataset using the power of visualizations.

We will create many different visualizations and, with each one, introduce a feature of the Matplotlib or Seaborn library.

We start by importing the required libraries and reading the dataset into a pandas DataFrame.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline

df = pd.read_csv("/content/Churn_Modelling.csv")
df.head()

The dataset contains 10,000 customers (i.e. rows) and 14 columns describing the customers and their products at a bank. The goal here is to predict whether a customer will churn (i.e. Exited = 1) using the provided features.

Let's start with a catplot, which is the categorical plot of the Seaborn library.

sns.catplot(x='Gender', y='Age', data=df, hue='Exited', height=8, aspect=1.2)

Finding: People between the ages of 45 and 60 are more likely to churn (i.e. leave the company) than people of other ages. There is no considerable difference between females and males in terms of churning.

The hue parameter is used to differentiate the data points based on a categorical variable.

The next visualization is the scatter plot, which shows the relationship between two numerical variables. Let's see if the estimated salary and balance of a customer are related.

plt.figure(figsize=(12,8))
plt.title("Estimated Salary vs Balance", fontsize=16)
sns.scatterplot(x='Balance', y='EstimatedSalary', data=df)

We first used the matplotlib.pyplot interface to create a Figure object and set the title. Then, we drew the actual plot on this Figure with Seaborn.
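
The same result can also be reached through Matplotlib's object-oriented interface. The following is a minimal sketch, assuming synthetic data in place of the churn dataset: the column names match the ones used in this post, but the values are randomly generated for illustration only. It creates the Figure and Axes explicitly with plt.subplots and passes the Axes to Seaborn through the ax parameter.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the churn data (assumption: values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Balance": rng.normal(120_000, 30_000, 200),
    "EstimatedSalary": rng.uniform(10_000, 200_000, 200),
})

# Create the Figure and Axes explicitly, then let Seaborn draw on that Axes
fig, ax = plt.subplots(figsize=(12, 8))
sns.scatterplot(x="Balance", y="EstimatedSalary", data=demo, ax=ax)
ax.set_title("Estimated Salary vs Balance", fontsize=16)
```

Holding on to the Axes object makes it clear which plot later calls such as set_title or set_xlabel apply to, which helps once a figure contains more than one plot.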

Finding: There is no meaningful relationship or correlation between estimated salary and balance. Balance appears to be roughly normally distributed once the customers with zero balance are excluded.

The next visualization is the boxplot, which shows the distribution of a variable in terms of its median and quartiles.

plt.figure(figsize=(12,8))
ax = sns.boxplot(x='Geography', y='Age', data=df)
ax.set_xlabel("Country", fontsize=16)
ax.set_ylabel("Age", fontsize=16)

We also set the labels of the x and y axes, along with their font sizes, using set_xlabel and set_ylabel.

A boxplot is built from a few summary statistics.

The median is the middle value when all points are sorted. Q1 (the first or lower quartile) is the median of the lower half of the dataset. Q3 (the third or upper quartile) is the median of the upper half of the dataset.

Thus, boxplots give us an idea about the distribution and outliers. In the boxplot we created, there are many outliers (represented with dots) on top.
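
To make these quantities concrete, here is a small sketch that computes them by hand with the standard library. The ages are hypothetical, and the quartiles function is a helper written for this illustration. It follows the median-of-halves definition of Q1 and Q3; plotting libraries may compute quartiles by interpolation instead, so their values can differ slightly. Points are flagged as outliers when they lie more than 1.5 times the interquartile range (IQR) beyond the box, which is the default whisker rule in Matplotlib and Seaborn boxplots.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the middle position
    upper = data[half + n % 2:]   # values above it (middle value skipped when n is odd)
    return median(lower), median(data), median(upper)

ages = [22, 25, 29, 31, 34, 36, 38, 41, 45, 72, 88]  # hypothetical ages

q1, q2, q3 = quartiles(ages)
iqr = q3 - q1                     # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr      # points below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr      # points above this are drawn as outliers
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(q1, q2, q3, outliers)       # prints: 29 36 45 [72, 88]
```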

Finding: The distribution of the age variable is right-skewed. The mean is greater than the median due to the outliers on the upper side. There is not a considerable difference between countries.
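
The effect of high outliers on the mean can be checked with a few numbers. The sketch below uses hypothetical ages, not values from the dataset: adding a handful of large values pulls the mean upward much more than the median.

```python
from statistics import mean, median

ages = [30, 32, 35, 36, 38, 40, 41]      # hypothetical, roughly symmetric sample
with_outliers = ages + [78, 85, 92]      # the same sample plus a few high values

# In the symmetric sample the mean and the median coincide
print(mean(ages), median(ages))

# With high outliers the mean moves far more than the median
print(mean(with_outliers), median(with_outliers))
```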

Right-skewness can also be observed in the univariate distribution of a variable. Let’s create a distplot to observe the distribution.

plt.figure(figsize=(12,8))
plt.title("Distribution of Age", fontsize=16)
# Note: distplot is deprecated since Seaborn 0.11
sns.distplot(df['Age'], hist=False)

The tail on the right side is heavier than the one on the left because of the outliers we also observed in the boxplot.

By default, distplot also draws a histogram, but we turned it off using the hist parameter.

The Seaborn library also provides different types of pair plots, which give an overview of pairwise relationships among variables. Let's first take a random sample from our dataset to make the plots easier to read. The original dataset has 10,000 observations; we will take a sample with 100 observations and 4 features.

subset = df[['CreditScore','Age','Balance','EstimatedSalary']].sample(n=100)
g = sns.pairplot(subset, height=2.5)

On the diagonal, we can see the histogram of each variable. The off-diagonal cells of the grid show the pairwise relationships.

Another tool to observe pairwise relationships is the heatmap, which takes a matrix and produces a color-encoded plot. Heatmaps are often used to check correlations between the features and the target variable.

Let’s first create a correlation matrix of some features using the corr function of pandas.

corr_matrix = df[['CreditScore','Age','Tenure','Balance',
'EstimatedSalary','Exited']].corr()
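
By default, corr computes the Pearson correlation coefficient for every pair of columns. As a sketch of what a single cell of the matrix contains, the coefficient can be computed by hand with the standard library. The values are hypothetical, and pearson is a helper written for this illustration, not a pandas function.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

age = [25, 32, 41, 47, 53, 60]   # hypothetical ages
exited = [0, 0, 0, 1, 1, 1]      # hypothetical churn flags

r = pearson(age, exited)
print(round(r, 2))               # a value between -1 and 1; positive here
```

A value close to 1 or -1 indicates a strong linear relationship, and a value close to 0 indicates little or none. The diagonal of the matrix is always 1 because each column is perfectly correlated with itself.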

We can now plot this matrix.

plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix, cmap='Blues_r', annot=True)

Finding: The “Age” and “Balance” columns are positively correlated with customer churn (“Exited”).

As the amount of data increases, it gets trickier to analyze and explore it. This is where visualizations come in: used efficiently and appropriately, they are great tools for exploratory data analysis. Visualizations also help to deliver a message to your audience or inform them about your findings.

There is no one-size-fits-all visualization method, so different tasks call for different kinds of visualizations. What all visualizations have in common is that they are great tools for exploratory data analysis and the storytelling part of data science.

Thank you for reading. Please let me know if you have any feedback.

Translated from: https://towardsdatascience.com/a-practical-guide-for-data-visualization-9f1a87c0a4c2
