Visualising Data to Help with Data Cleaning


The role of a data scientist involves uncovering hidden relationships in massive amounts of structured or unstructured data, with the aim of meeting or adjusting certain business criteria. In recent times the importance of this role has grown considerably as businesses look to expand their insight into the market and their customers using easily obtainable data.

It is the data scientist's job to take that data and return a deeper understanding of the business problem or opportunity. This often involves scientific methods such as machine learning (ML) or neural networks (NN). While these models may find meaning in thousands of data points much faster than a human can, they can be unreliable if the data fed into them is messy.

Messy data can have very negative consequences for your models. It takes many forms, including:

Missing data:

Missing values are represented as 'NaN' (short for Not a Number) or as 'None', a Python singleton object.

Sometimes the best way to deal with a problem is the simplest.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')
df.info()

A quick inspection of the output shows that the non-null count is not 891 in every column (age, cabin and embarked fall short), a clear sign of missing information. We also notice that some fields are of type "object"; we'll look at that next.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
survived 891 non-null int64
pclass 891 non-null int64
name 891 non-null object
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
ticket 891 non-null object
fare 891 non-null float64
cabin 204 non-null object
embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB

Alternatively, you can plot the missing values on a heatmap using seaborn, but this can be very time-consuming when handling big dataframes.

sns.heatmap(df.isnull(), cbar=False)
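
When the dataframe is too large for a heatmap to be practical, a per-column tally gives the same information as a table. The sketch below is only an illustration: the dataframe is made up (its column names mirror the Titanic data above, but the values are invented), and it counts missing entries with isnull().

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; both NaN and None count as missing.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "cabin": ["C85", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()   # number of missing entries per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the count puts the columns that need the most attention at the top.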

Inconsistent data:

  • Inconsistent column types: Columns in a dataframe can be of different types, such as objects, integers or floats, as we saw above. That is normal in itself, but a mismatch between a column's type and the type of the values it holds can be problematic. One of the most important types to get right is datetime, used for time and date values.
  • Inconsistent value formatting: This type of problem mainly arises in categorical values, where misspellings or typos may be present. It can be checked with the following:

df['age'].value_counts()

This returns the number of times each value occurs in the dataset.
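
Once value_counts() has exposed inconsistent spellings or a column stored under the wrong type, the fix is often a short normalisation step. The following is a minimal sketch on invented data (the "joined" column is hypothetical and not part of the Titanic dataset): it tidies a categorical column and converts a string column to datetime.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "embarked": ["S", "s ", "C", " c", "Q", "S"],
    "joined": ["2020-01-05", "2020-02-10", "not recorded",
               "2020-03-15", "2020-04-01", "2020-05-20"],
})

# Strip whitespace and unify case so 'S' and 's ' collapse into one category.
toy["embarked"] = toy["embarked"].str.strip().str.upper()
print(toy["embarked"].value_counts())

# Convert strings to datetime; unparseable entries become NaT instead of raising.
toy["joined"] = pd.to_datetime(toy["joined"], errors="coerce")
print(toy.dtypes)
```

Using errors="coerce" turns bad entries into missing values, which can then be handled like any other missing data.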

Outlier data:

A dataframe column holds information about a specific feature within the data, so we usually have a basic idea of the range its values should fall in. For age, for example, we expect values between 0 and 100. That does not mean outliers cannot occur within that range.

A simple illustration of this can be seen by plotting a boxplot:

sns.boxplot(x=df['age'])
plt.show()

The values drawn as dots on the right-hand side could be considered outliers in this dataframe, as they fall outside the range of commonly observed values.
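
By default, seaborn's boxplot draws its whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles, and anything further out appears as a dot. The same rule can be applied directly to list those points. This is a sketch on invented ages, not the Titanic column:

```python
import pandas as pd

# Toy ages, invented for illustration.
ages = pd.Series([2, 22, 24, 25, 27, 28, 30, 31, 33, 35, 38, 80], name="age")

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are the ones a boxplot shows as dots.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Whether such points are errors or simply rare but valid observations still has to be judged from the context of the data.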

Multicollinearity:

Multicollinearity is not messy data as such; it means that columns or features in the dataframe are correlated with one another. For example, if you had a column for "price", a column for "weight" and a third for "price per weight", we would expect high multicollinearity between these fields. This can be solved by dropping some of the highly correlated columns.

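f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr(numeric_only=True)  # skip object columns such as name and ticket
# np.bool was removed from recent NumPy releases; the built-in bool works everywhere
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)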

In this case we can see that the correlations do not exceed 0.7 in either the positive or the negative direction, so it can be considered safe to continue.
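
Reading a threshold such as 0.7 off a heatmap gets harder as the number of columns grows, so it can help to list the offending pairs explicitly. The sketch below uses invented data, and the 0.7 cut-off is simply the value mentioned above rather than a universal rule:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: price rises with weight, rating is unrelated.
toy = pd.DataFrame({
    "price": [10.0, 12.0, 15.0, 20.0, 26.0, 30.0],
    "weight": [1.0, 1.1, 1.6, 2.1, 2.4, 3.2],
    "rating": [3.0, 5.0, 2.0, 4.0, 1.0, 5.0],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the threshold.
high = pairs[pairs.abs() > 0.7]
print(high)
```

Each entry in the result names a pair of columns, one of which is a candidate for dropping.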

Making this process easier:

Data scientists go through these initial tasks repeatedly, so the work can be made easier by writing structured functions that make this information easy to visualise. Let's try:

from quickdata import data_viz  # File found in repository
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
print(data['DESCR'][:830])
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

1- Checking Multicollinearity:

The function below returns a heatmap of collinearity between the independent variables and, optionally, with the target variable as well.

  • data = independent variables, dataframe X
  • target = dependent variable, list y
  • remove = list of variables to exclude (default: empty list)
  • add_target = boolean, whether to include the target in the heatmap (default: False)
  • inplace = whether to modify your dataframe to save the changes made with remove/add_target (default: False)

*If remove is passed a column name, a regplot of that column against the target is also shown, to help view the change before proceeding.*

data_viz.multicollinearity_check(data=X, target=y, remove=['Latitude'], add_target=False, inplace=False)

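
To make the role of the remove and add_target parameters concrete, here is a rough sketch of the table such a heatmap could be drawn from. It is not the quickdata implementation: correlation_table is a hypothetical helper written for this illustration, the data is invented, and plotting is left out.

```python
import numpy as np
import pandas as pd

def correlation_table(data, target, remove=None, add_target=False):
    """Hypothetical helper: build the correlation matrix a heatmap would display."""
    remove = remove or []
    # drop() returns a copy, so the caller's dataframe is left untouched.
    frame = data.drop(columns=remove)
    if add_target:
        frame = frame.assign(target=np.asarray(target))
    return frame.corr()

# Toy data, invented for illustration.
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
y_toy = [1.1, 1.9, 3.2, 3.9]

table = correlation_table(X_toy, y_toy, remove=["c"], add_target=True)
print(table.round(2))
```

Working on a copy mirrors the inplace=False default described above: nothing changes in the original dataframe until you decide to keep the result.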

2- Viewing Outliers:

This function returns a side-by-side view of outliers through a regplot and a boxplot of the input data and target values, over a specified split size.

  • data = independent variables, dataframe X
  • target = dependent variable, list y
  • split_size = the number of rows to plot, given either as a decimal between 0 and 1 or as an integer

data_viz.view_outliers(data=X, target=y, split_size=0.3)
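
One way to read a split size that accepts both decimals and integers is as "a fraction of the rows" in the first case and "this many rows" in the second. The sketch below shows that interpretation only; rows_to_plot is a hypothetical helper, and the actual quickdata function may resolve the value differently.

```python
def rows_to_plot(n_rows, split_size):
    """Hypothetical helper: turn a fraction or an integer into a row count."""
    if isinstance(split_size, float):
        if not 0 < split_size <= 1:
            raise ValueError("a fractional split_size must lie in (0, 1]")
        # Treat a decimal as a share of the available rows, plotting at least one.
        return max(1, round(n_rows * split_size))
    # Treat an integer as an absolute count, capped at the rows available.
    return min(n_rows, int(split_size))

print(rows_to_plot(1000, 0.3))
print(rows_to_plot(1000, 250))
print(rows_to_plot(1000, 5000))
```

Plotting only a subset keeps the regplot and boxplot responsive on large datasets, at the cost of possibly missing outliers that sit in the rows left out.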

It is important that these charts are read by the data scientist and not handed off to the machine. Since not all datasets follow the same rules, a human needs to interpret the visualisations and act accordingly.

I hope this short run-through of data visualisation helps you get a clearer view of your data and make better-informed decisions when cleaning it.

The functions used in the example above are available here:

Feel free to customise these as you see fit!


Translated from: https://medium.com/@rani_64949/visualisations-of-data-for-help-in-data-cleaning-dce15a94b383
