Visualising data to help clean data

The role of a data scientist involves uncovering hidden relationships in massive amounts of structured or unstructured data, with the aim of meeting or adjusting certain business criteria. In recent times this role’s importance has grown considerably as businesses look to expand their insight into the market and their customers using easily obtainable data.

It is the data scientist’s job to take that data and return a deeper understanding of the business problem or opportunity. This often involves scientific methods such as machine learning (ML) or neural networks (NN). While these models may find meaning in thousands of data points much faster than a human can, they can be unreliable if the data fed into them is messy.

Messy data can have very negative consequences for your models. It takes many forms, including:


Missing data:

Represented as ‘NaN’ (short for Not a Number) or as ‘None’, a Python singleton object.

Sometimes the best way to deal with problems is the simplest.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')
df.info()

A quick inspection of the returned values shows that the non-null count is not 891 for every column (age, cabin and embarked fall short), a clear sign of missing information. We also notice that some fields are of type “object”; we’ll look at that next.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
survived 891 non-null int64
pclass 891 non-null int64
name 891 non-null object
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
ticket 891 non-null object
fare 891 non-null float64
cabin 204 non-null object
embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB

Alternatively, you can plot the missing values on a heatmap using seaborn, but this can be very time-consuming when handling big dataframes.


sns.heatmap(df.isnull(), cbar=False)
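
When a dataframe is too large for a heatmap to be practical, a per-column count carries the same information as a small table. The following is a minimal sketch on a made-up frame (the column names and values are illustrative only, not the Titanic file); it also shows that pandas counts both None and NaN as missing.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a tiny frame with gaps in two columns.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": ["C85", None, None, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isna() flags both NaN and None; summing the flags counts them per column.
missing = sample.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "share": missing / len(sample),
}).sort_values("missing", ascending=False)

print(summary)
```

Sorting by the count puts the worst-affected columns first, which is usually where the cleaning decisions start.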

Inconsistent data:

  • Inconsistent column types: Columns in a dataframe can differ in type, as we saw above: they may hold objects, integers or floats. This is usually expected, but a mismatch between a column’s type and the type of the values it holds can be problematic. One of the most important types is datetime, used for time and date values.
  • Inconsistent value formatting: This type of problem mainly arises in categorical values when misspellings or typos are present. It can be checked by counting the distinct values in a column, which returns the number of times each value is repeated throughout the dataset.
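
Both checks in this list can be sketched with standard pandas calls. The snippet below is an illustration on made-up data, not the article's original code: `value_counts()` produces the per-value counts, and `pd.to_datetime(..., errors="coerce")` is one way to surface text that does not fit a datetime column.

```python
import pandas as pd

# Illustrative data only: one category typed three different ways,
# and a date column stored as text with one unparseable entry.
sample = pd.DataFrame({
    "embarked": ["S", "s", " S", "C", "Q", "C"],
    "boarded": ["1912-04-10", "1912-04-10", "unknown",
                "1912-04-10", "1912-04-11", "1912-04-10"],
})

# Raw counts expose the variants: "S", "s" and " S" appear as separate values.
raw_counts = sample["embarked"].value_counts()

# Normalising whitespace and case merges them into one category.
cleaned = sample["embarked"].str.strip().str.upper()
clean_counts = cleaned.value_counts()

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so the bad rows can be counted and inspected.
boarded = pd.to_datetime(sample["boarded"], errors="coerce")

print(raw_counts)
print(clean_counts)
print(boarded.isna().sum(), "unparseable date(s)")
```

Values that occur only once or twice next to a frequent near-duplicate are the usual sign of a typo.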


Outlier data:

A dataframe column holds information about a specific feature within the data. Hence we can have a basic idea of the range of those values. For age, for example, we know the values should fall between 0 and 100. This does not mean that outliers cannot be present within that range.

A simple illustration of this can be seen by graphing a boxplot:


[Figure: boxplot of a dataframe column]

The values seen as dots on the right-hand side could be considered outliers in this dataframe, as they fall outside the range of commonly witnessed values.
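
The dots on a boxplot can also be listed numerically. The sketch below assumes the common 1.5 × IQR rule, which is the default whisker length in matplotlib and seaborn boxplots; the function name `iqr_outliers` and the ages are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Illustrative ages only: most are typical, two are far from the rest.
ages = pd.Series([22, 25, 27, 29, 30, 31, 33, 35, 38, 40, 95, 140])
print(iqr_outliers(ages))
```

Note that 95 lies inside the plausible 0 to 100 range and is still flagged, because the rule compares each value with the rest of the column rather than with a fixed range.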


Multicollinearity:

Multicollinearity is not in itself messy data; it simply means that columns or features in the dataframe are correlated with one another. For example, if you had a column for “price”, a column for “weight” and a third for “price per weight”, we would expect high multicollinearity between these fields. This can be addressed by dropping some of the highly correlated columns.

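f, ax = plt.subplots(figsize=(10, 8))
# Correlate numeric columns only; text columns such as name or ticket are skipped
corr = df.select_dtypes(include='number').corr()
# The np.bool alias is unavailable in some NumPy releases; the built-in bool works everywhere
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)
[Figure: correlation heatmap of the numeric columns]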

In this case we can see that the values do not exceed 0.7 in either the positive or the negative direction, and hence it can be considered safe to continue.
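
Reading the 0.7 threshold off a heatmap gets harder as the number of columns grows, so the same check can be done as a list. This is a minimal sketch; `correlated_pairs` is a hypothetical helper name and the data is made up for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr()
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        # Only look at columns after `first` so each pair is reported once.
        for second in columns[i + 1:]:
            value = float(corr.loc[first, second])
            if abs(value) > threshold:
                pairs.append((first, second, round(value, 3)))
    return pairs

# Illustrative data only: "b" is a linear function of "a", "c" is unrelated.
sample = pd.DataFrame({
    "a": np.arange(10.0),
    "b": 2 * np.arange(10.0) + 1,
    "c": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
})
print(correlated_pairs(sample))
```

An empty list corresponds to the situation described above, where no pair crosses the threshold.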


Making this process easier:

While data scientists often go through these initial tasks repetitively, the work can be made easier by creating structured functions that allow easy visualisation of this information. Let’s try:

from quickdata import data_viz # File found in repository
----------------------------------------------------------------
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

1- Checking Multicollinearity:


The function below returns a heatmap of collinearity between independent variables as well as with the target variable.


data = independent variable df X

target = dependent variable list y

remove = list of variables not to be included (default is an empty list)

add_target = boolean for whether to view the heatmap with the target included (default is False)

inplace = whether to modify your df to keep the changes made with remove/add_target (default is False)

*If remove is passed a column name, a regplot of that column against the target is also shown to help view changes before proceeding.*


data_viz.multicollinearity_check(data=X, target=y, remove=['Latitude'], add_target=False, inplace=False)

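
The quickdata source is not reproduced here, so the following is only a hypothetical sketch of how the remove, add_target and inplace parameters described above could be handled before plotting. `prepare_for_heatmap` is an invented name and is not part of quickdata; the data is made up.

```python
import pandas as pd

def prepare_for_heatmap(data, target, remove=None, add_target=False, inplace=False):
    """Hypothetical helper: return the correlation matrix a heatmap would show."""
    remove = list(remove or [])
    # Work on a copy unless the caller asked for the changes to persist.
    frame = data if inplace else data.copy()
    if remove:
        frame.drop(columns=remove, inplace=True)
    if add_target:
        frame["target"] = list(target)
    return frame.corr()

# Illustrative data only.
X_demo = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0],
    "latitude": [34.1, 36.5, 33.9, 37.2],
})
y_demo = [1.5, 2.0, 2.4, 3.1]

corr = prepare_for_heatmap(X_demo, y_demo, remove=["latitude"], add_target=True)
print(corr)
print(list(X_demo.columns))  # still both columns, because inplace=False
```

Defaulting to a copy keeps exploratory calls from silently changing the dataframe used later for modelling.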

2- Viewing Outliers:

This function returns a side-by-side view of outliers through a regplot and a boxplot of the input data and target values over a specified split size.


data = independent variable df X

target = dependent variable list y

split_size = adjusts the number of plotted rows, given as a decimal between 0 and 1 or as an integer

data_viz.view_outliers(data=X, target=y, split_size=0.3)

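
How a split size given either as a decimal or as an integer might be turned into a row count can be sketched as follows. This is a guess at the behaviour, not the quickdata implementation: `take_split` is an invented name, and taking the first rows (rather than a random sample) is an assumption.

```python
import pandas as pd

def take_split(data, target, split_size):
    """Hypothetical helper: keep the first rows selected by split_size."""
    n_rows = len(data)
    if isinstance(split_size, float):
        # A decimal is read as a fraction of the available rows.
        if not 0 < split_size <= 1:
            raise ValueError("a decimal split_size must lie between 0 and 1")
        n_keep = max(1, int(n_rows * split_size))
    else:
        # An integer is read as an absolute row count, capped at n_rows.
        if split_size < 1:
            raise ValueError("an integer split_size must be at least 1")
        n_keep = min(int(split_size), n_rows)
    return data.iloc[:n_keep], list(target)[:n_keep]

# Illustrative data only.
X_demo = pd.DataFrame({"rooms": range(10)})
y_demo = list(range(10))

X_part, y_part = take_split(X_demo, y_demo, 0.3)
print(len(X_part), len(y_part))
```

Plotting only part of a large dataset keeps the regplot and boxplot readable and quick to draw.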

It is important that these charts are read by the data scientist and not delegated to the machine. Since not all datasets follow the same rules, a human needs to interpret the visualisations and act accordingly.

I hope this short run-through of data visualisation helps you produce clearer visualisations of your data and better inform your decisions when cleaning it.


The functions used in the examples above are available here:


Feel free to customise these as you see fit!


Translated from: https://medium.com/@rani_64949/visualisations-of-data-for-help-in-data-cleaning-dce15a94b383
