Data Visualization in Data Science

Introduction to Data Visualization

Data visualization is the process of creating visuals, often interactive ones, to understand trends and variations and to derive meaningful insights from data. It is used mainly for data checking and cleaning, exploration and discovery, and communicating results to business stakeholders. Many data scientists pay little attention to graphs and focus only on numerical calculations, which can at times be misleading. To understand the importance of visualization, let's take a look at Anscombe's quartet in Figures 1 and 2 below.

Figure 1. Anscombe's quartet, showing how four sets of X and Y pairs can have different values yet share nearly identical central tendency and correlation values. Data credits: Anscombe, Francis J. (1973)

The same data points, when plotted in Figure 2 below, show entirely different trends.

Figure 2. Four datasets that look alike when examined using simple summary statistics vary considerably when graphed. Image credits: Anscombe, Francis J. (1973)

It is important to visualize the data before carrying out any calculations. A visual representation can convey much more information than descriptive statistics alone.
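
The point about summary statistics can be checked numerically. The sketch below uses the quartet's values as they are commonly reproduced from Anscombe (1973); the helper `pearson` and the `summary` dictionary are names introduced here for illustration only. It computes the mean of X, the mean of Y, the sample variance of X, and the Pearson correlation for each of the four datasets.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet, as commonly reproduced from Anscombe (1973).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


# One row of summary statistics per dataset.
summary = {
    name: (mean(xs), mean(ys), variance(xs), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, vx, r) in summary.items():
    print(f"{name}: mean_x={mx:.2f} mean_y={my:.2f} var_x={vx:.2f} r={r:.3f}")
```

The four printed rows come out almost the same, which is exactly why the numbers alone cannot distinguish the datasets and a plot is needed.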

Role of Data Visualization

Several business intelligence (BI) tools currently dominate the market, each with its pros and cons. The concept of self-service dashboards was devised to allow stakeholders with little or no knowledge of data science to work independently on data and derive findings that might assist their day-to-day business decisions. The examples below look at some applications of data visualization using Tableau or Python.

Data Checking and Cleaning

Data visualization can be used to look for obvious errors in a dataset, including nulls, random values, distinct records, date formats, the sensibility of spatial data, and string and character encoding.

Figure 3. Distribution of pedestrian volume in Melbourne captured by different sensors situated in and around the CBD. The idea is to check whether the latitude and longitude information in a given dataset is valid. Image developed by the author using Tableau.
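
Some of these checks can also be run as a quick tabular pass before anything is plotted. The following is a minimal pandas sketch, not the author's workflow: the `readings` table, its column names, and its values are hypothetical. It counts nulls, counts exact duplicate rows, and flags coordinates that fall outside the valid latitude and longitude ranges.

```python
import pandas as pd

# Hypothetical sensor readings; column names and values are illustrative only.
readings = pd.DataFrame({
    "sensor": ["A", "B", "B", "C", "D"],
    "latitude": [-37.81, -37.82, -37.82, 137.80, None],
    "longitude": [144.96, 144.97, 144.97, 144.95, 144.98],
    "count": [120, 85, 85, 40, 60],
})

null_counts = readings.isna().sum()            # missing values per column
duplicate_rows = readings.duplicated().sum()   # rows repeating an earlier row

# Valid ranges: latitude in [-90, 90], longitude in [-180, 180].
valid_lat = readings["latitude"].between(-90, 90)
valid_lon = readings["longitude"].between(-180, 180)

# Rows with a latitude present but coordinates out of range;
# missing coordinates are already reported through null_counts.
bad_coords = readings[readings["latitude"].notna() & ~(valid_lat & valid_lon)]

print(null_counts)
print("duplicate rows:", duplicate_rows)
print(bad_coords)
```

A map such as Figure 3 would reveal the same out-of-range point visually, since it would land far from Melbourne.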

Data Distribution

Data visualization can be used to understand the distribution of the data, look for central tendencies (mean, median, and mode), detect outliers using a box plot, check for skewness, and even understand the impact of winsorization on the data distribution. Figure 4 below illustrates how box plots can be used to reveal the presence of outliers.

Figure 4. Outliers in pedestrian volume across different sensors installed in various parts of Melbourne. The dataset used for this analysis can be found here. Image developed by the author using Jupyter Notebook.
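
A box plot conventionally marks as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule with NumPy and then shows a simple percentile-based form of winsorization. The `volumes` array is made up for illustration, and clipping at the 10th and 90th percentiles is an assumed choice, not something taken from the article.

```python
import numpy as np

# Hypothetical hourly pedestrian counts with two suspicious readings.
volumes = np.array([120, 135, 128, 140, 132, 125, 138, 130, 900, 5], dtype=float)

# Box-plot rule: points beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = np.percentile(volumes, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = volumes[(volumes < lower) | (volumes > upper)]

# Simple winsorization: clip values to the 10th and 90th percentiles
# (the percentile limits are an assumption made for this sketch).
p_low, p_high = np.percentile(volumes, [10, 90])
winsorized = np.clip(volumes, p_low, p_high)

print("outliers:", outliers)
print("mean before:", volumes.mean(), "mean after:", winsorized.mean())
```

Comparing the mean before and after clipping gives a quick sense of how strongly the extreme values were pulling the distribution.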

Model Assumptions

Linear regression and many other models rest on certain underlying assumptions, such as normally distributed error terms, no strong correlation between the independent variables, homoscedasticity of the error terms, and more. Visualizations are therefore key to validating some of these assumptions as well.

Figure 5. Correlation plot of numerical variables drawn as a heat map. The plot is used to identify and drop highly correlated variables while building a classification model that predicts customer satisfaction from flight and facilities data. Image developed by the author using Jupyter Notebook.
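
The idea of dropping highly correlated variables can be sketched with a pandas correlation matrix. Everything here is illustrative: the `flights` table and its values are hypothetical, and the 0.9 cutoff is an assumed threshold rather than one stated in the article. The code keeps the first variable of each highly correlated pair and drops the later one.

```python
import numpy as np
import pandas as pd

# Hypothetical flight data; flight_minutes closely tracks distance_km.
flights = pd.DataFrame({
    "distance_km": [300, 500, 800, 1200, 1500, 2000],
    "flight_minutes": [55, 80, 115, 160, 200, 260],
    "departure_delay": [5, 0, 30, 10, 0, 15],
    "seat_comfort": [3, 4, 2, 5, 1, 4],
})

corr = flights.corr()  # pairwise Pearson correlations

# Keep only the upper triangle so each pair is inspected once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9  # assumed cutoff for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col].abs() > threshold).any()]
reduced = flights.drop(columns=to_drop)

print(corr.round(2))
print("dropped:", to_drop)
```

The `corr` matrix is the same object a heat map such as Figure 5 would display, for example by passing it to `seaborn.heatmap`.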

Human-in-the-Loop Analytics

Data scientists often use human-in-the-loop analytics to get a look and feel for the data, form a hypothesis, run appropriate analytics to validate the hypothesis, and repeat the process until conclusive evidence is found. For example, the popular Python package Seaborn has a function called pairplot. Pair plots are very useful for examining the relationships between dependent and independent variables. The aim of the visualization is to get a directional sense of whether some of the independent variables affect the model results.

Figure 6. Pair plot of a dependent variable (say, customer satisfaction of airline passengers) against independent variables such as flight distance, arrival delay, and departure delay. Image developed by the author using Jupyter Notebook.

Dimension Reduction

When working with multiple variables, it is difficult to visualize the data in an n-dimensional space. For example, in a dataset with many (say, numerical) customer attributes, it is difficult to plot the customers while considering all attributes. In scenarios like this, dimension reduction techniques such as Principal Component Analysis (PCA) or factor analysis can be useful for bringing the attributes down to fewer dimensions. PCA finds linear combinations of variables that best explain the variance in the observations, whereas factor analysis finds linear combinations of variables that best explain the relationships between the variables. The reduced dimensions can then be plotted to analyze the customers in a 2D space.
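
To make the PCA step concrete, here is a minimal sketch that projects a five-attribute dataset onto its first two principal components using a singular value decomposition. The `customers` matrix is synthetic (random numbers with one deliberately redundant attribute), and computing PCA through NumPy's SVD is one common approach chosen for this illustration, not necessarily what the author used.

```python
import numpy as np

# Synthetic data: 50 customers with 5 numerical attributes.
rng = np.random.default_rng(0)
customers = rng.normal(size=(50, 5))
# Make attribute 1 nearly a multiple of attribute 0 so there is shared variance.
customers[:, 1] = customers[:, 0] * 2 + rng.normal(scale=0.1, size=50)

# Standardize each attribute, then decompose.
standardized = (customers - customers.mean(axis=0)) / customers.std(axis=0)
u, s, vt = np.linalg.svd(standardized, full_matrices=False)

components = vt[:2]                         # first two principal directions
projected = standardized @ components.T     # 2D coordinates for each customer
explained = s**2 / np.sum(s**2)             # share of variance per component

print("variance explained by first two components:", explained[:2].round(3))
```

The two columns of `projected` can then be drawn as an ordinary scatter plot to inspect the customers in a 2D space.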

More information on how to recreate these charts in Python can be found here.

Type of Datasets in Analytical Problems

It is important to understand the type of dataset in order to determine which visualizations can be applied. For example, a combination of bar graphs and line charts might suit tabular data, whereas for spatial data a map with a density plot might communicate the result more effectively. Before taking a deeper look at the types of visualization, let's review some of the key data types that are commonly used.

Tabular data

Data organized in tables, with a row for each data item and a column for each of its attributes. E.g. datasets available in Excel, CSV files, pandas DataFrames, etc.

Network data

Nodes in the network are data items, and links between the nodes are the relations between them. E.g. a social network.

Spatial data

Data that is naturally organized and understood in terms of its spatial location or extent. E.g. latitude and longitude of locations, geographic information, suburbs, streets, etc.

Textual data

This kind of dataset consists of sequences of words and punctuation. E.g. a Twitter feed or customer complaints.
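
As a rough illustration of how these four dataset types differ in shape, the sketch below represents each one with plain Python structures. All names and values are invented for the example, and real projects would more likely use dedicated libraries for each type.

```python
from collections import Counter

# Tabular: one record per item, one field per attribute.
tabular = [
    {"customer": "c1", "age": 34, "city": "Melbourne"},
    {"customer": "c2", "age": 27, "city": "Sydney"},
]

# Network: nodes are items, edges are relations (an undirected friendship graph).
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]
degree = Counter(node for edge in edges for node in edge)

# Spatial: items located by (latitude, longitude); coordinates are made up.
sensors = {"sensor_1": (-37.81, 144.96), "sensor_2": (-37.82, 144.97)}

# Textual: a sequence of words and punctuation.
complaint = "The flight was late, and the late notice was unhelpful."
words = complaint.lower().replace(",", "").replace(".", "").split()
word_counts = Counter(words)

print("most connected node:", degree.most_common(1))
print("most frequent words:", word_counts.most_common(3))
```

Each structure suggests its own natural chart: bars or lines for the table, a node-link diagram for the graph, a map for the coordinates, and a frequency chart for the text.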

Visual Vocabulary

The figures below show how different visualizations can be used to depict different scenarios in the data.

Figure 7. Some of the graphs useful for visualizing trends with respect to deviations from reference points. Image credits: Github.io
Figure 8. Some of the graphs useful for visualizing the correlation between multiple data points. Image credits: Github.io
Figure 9. How visualizations can be used to understand the variation of attributes over time. Image credits: Github.io
Figure 10. How different visualizations can be used to understand the ranking or order of different components. Image credits: Github.io

You can find examples of other visualizations here.

Effectiveness of Visualization across Data Types

The table below displays the effectiveness of different visuals across data types. To read the table, we first need to understand how variables (attributes from the data) can be categorized into different data types. Categorical variables are the ones that don't have any ordering, e.g. Gender, Grades, Marital Status, Job Position, etc. Numerical variables are segmented into ordinal and quantitative variables. Ordinal variables are categories that can be ranked, e.g. Satisfaction (Good, Average, and Bad), Potential (High, Medium, and Low), etc. Quantitative variables are the ones that can take any numeric value between -infinity and +infinity, e.g. Age, Salary, Revenue, Sales, etc.

Figure 11. How different graphs can be used to visualize patterns in the data, taking into consideration the data type of the variable. Image credits: developed by the author using PowerPoint.
Figure 12. The types of visualization that can be used for different data types. Image credits: developed by the author using Excel.
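
The distinction between categorical, ordinal, and quantitative variables can be encoded explicitly in pandas, which in turn determines which operations and charts make sense. The sketch below is illustrative only: the `survey` table and its values are hypothetical. It stores gender as an unordered category, satisfaction as an ordered category, and salary as a plain number.

```python
import pandas as pd

# Hypothetical survey data; names and values are illustrative only.
survey = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "satisfaction": ["Good", "Bad", "Average", "Good"],
    "salary": [72000.0, 58000.0, 64000.0, 81000.0],
})

# Categorical: labels with no inherent order.
survey["gender"] = survey["gender"].astype("category")

# Ordinal: labels with an explicit ranking from lowest to highest.
survey["satisfaction"] = pd.Categorical(
    survey["satisfaction"],
    categories=["Bad", "Average", "Good"],
    ordered=True,
)

# Quantitative: arithmetic such as the median is meaningful.
lowest_rating = survey["satisfaction"].min()   # uses the declared order
median_salary = survey["salary"].median()

print(survey.dtypes)
print("lowest rating:", lowest_rating, "| median salary:", median_salary)
```

With the order declared, sorting or ranking by satisfaction follows Bad, Average, Good rather than alphabetical order, which matters when the categories are placed on a chart axis.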

Conclusion

Data visualization forms the backbone of all analytical projects. It not only helps in gaining insights into the data but can also be used as a tool for data pre-processing. Having the right set of visualizations for different data types and business scenarios is the key to communicating results effectively.

About the Author: Advanced analytics professional and management consultant helping companies find solutions for diverse problems through a mix of business, technology, and math on organizational data. A data science enthusiast, here to share, learn, and contribute. You can connect with me on LinkedIn and Twitter.

Translated from: https://towardsdatascience.com/data-visualization-in-data-science-5681cbdde5bf
