Data Visualization in Data Science

Introduction to Data Visualization

Data visualization is the process of creating interactive visuals to understand trends and variations and to derive meaningful insights from data. It is used mainly for data checking and cleaning, exploration and discovery, and communicating results to business stakeholders. Many data scientists pay little attention to graphs and focus only on numerical calculations, which can at times be misleading. To understand the importance of visualization, let's take a look at Anscombe's quartet in Figures 1 and 2 below.

Figure 1. Anscombe's quartet, showing how four sets of X and Y pairs can have different values yet share nearly the same central tendency and correlation values. Data credits: Anscombe, Francis J. (1973).

The same data points, when plotted as in Figure 2 below, depict entirely different trends.

Figure 2. Four datasets that look alike when examined using simple summary statistics but vary considerably when graphed. Image credits: Anscombe, Francis J. (1973).

It is important to visualize the data before any calculations are carried out. A visual representation can convey much more information than descriptive statistics alone.
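The point made by Anscombe's quartet can be checked numerically. The sketch below uses the published quartet values and only the Python standard library; the helper name `pearson` and the `summary` dictionary are our own illustrative choices, not part of the original article. It prints the mean of x, the mean of y, and the Pearson correlation for each of the four sets, which come out nearly the same even though the scatter plots differ.

```python
from statistics import mean

# Anscombe's quartet (Anscombe, 1973). Sets I, II and III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# (mean of x, mean of y, correlation) for each of the four sets
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"Set {name}: mean(x)={mx:.2f}  mean(y)={my:.2f}  r={r:.3f}")
```

Only a plot reveals that one set is roughly linear, one is curved, and two are dominated by a single extreme point.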

Role of Data Visualization

Multiple Business Intelligence (BI) tools currently dominate the market, each with its pros and cons. The concept of self-service dashboards was devised to allow stakeholders with little or no knowledge of data science to work independently on data and derive findings that might assist their day-to-day business decisions. We will look at some applications of data visualization using Tableau or Python in the examples below.

Data Checking and Cleaning

Data visualization can be used to look for obvious errors in a dataset, including nulls, random values, distinct records, the format of dates, the plausibility of spatial data, and string and character encoding.

Figure 3. Distribution of pedestrian volume in Melbourne captured by different sensors situated in and around the CBD. The idea is to check whether the latitude and longitude information in a given dataset is valid. Image developed by the author using Tableau.
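Before plotting points on a map, a quick programmatic check can catch the same kind of problem the map would reveal. The following is a minimal sketch, assuming each record is a dictionary with `latitude` and `longitude` keys; the function name, the key names, and the sample sensor rows are hypothetical and are not taken from the Melbourne dataset.

```python
def find_invalid_coordinates(records, lat_key="latitude", lon_key="longitude"):
    """Return (row index, reason) pairs for rows with missing or impossible coordinates."""
    problems = []
    for i, rec in enumerate(records):
        lat, lon = rec.get(lat_key), rec.get(lon_key)
        if lat is None or lon is None:
            problems.append((i, "missing coordinate"))
        elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
            # Latitude must lie in [-90, 90] and longitude in [-180, 180].
            problems.append((i, "out of range"))
    return problems


# Made-up sensor rows for illustration only.
sensors = [
    {"sensor": "A", "latitude": -37.8136, "longitude": 144.9631},
    {"sensor": "B", "latitude": None, "longitude": 144.9700},
    {"sensor": "C", "latitude": 144.9652, "longitude": -37.8183},  # values swapped
]

invalid = find_invalid_coordinates(sensors)
print(invalid)
```

A range check like this only catches impossible values. A point that is valid but lands in the wrong city still needs the map.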

Data Distribution

Data visualization can be used to understand the distribution of the data, look for central tendencies (mean, median, and mode), detect outliers using a box plot, check for skewness, and even understand the impact of winsorization on the data distribution. Figure 4 below illustrates how box plots can be used to reveal the presence of outliers.

Figure 4. Outliers in pedestrian volume across different sensors installed in various parts of Melbourne. The dataset used for this analysis can be found here. Image developed by the author using Jupyter Notebook.
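The points a box plot draws beyond its whiskers are commonly defined by the 1.5 × IQR rule. The sketch below applies that rule with the Python standard library; the function name `iqr_outliers` and the sample counts are hypothetical, and the quartile method chosen here (`inclusive`) is an assumption, so plotting libraries may place the fences slightly differently.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up hourly pedestrian counts for one sensor.
hourly_counts = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(hourly_counts)
print(outliers)
```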

Model Assumptions

Linear regression and many classification models rest on certain underlying assumptions, for example that the residuals are approximately normally distributed, that the independent variables are not strongly correlated with one another, and that the error terms are homoscedastic. Visualizations are therefore key to validating some of these assumptions as well.

Figure 5. Correlation plot of numerical variables drawn as a heat map. The correlation plot is used to drop highly correlated variables while building a classification model to predict customer satisfaction from flight and facilities data. Image developed by the author using Jupyter Notebook.
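The decision a correlation heat map supports, namely which variables are correlated strongly enough to drop, can also be sketched in code. The example below is an illustration under stated assumptions: the column names, the sample values, and the 0.8 threshold are hypothetical, and the author's actual notebook may have used a different procedure.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def highly_correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for every pair of columns with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Made-up flight data for illustration only.
flight_columns = {
    "flight_distance": [300, 1200, 800, 2500, 600, 1500],
    "departure_delay": [5, 0, 30, 10, 60, 0],
    "arrival_delay": [7, 2, 28, 12, 65, 1],
}

candidates = highly_correlated_pairs(flight_columns)
print(candidates)
```

In this toy data only the two delay columns exceed the threshold, so one of them would be a candidate for removal before fitting the model.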

Human-in-the-Loop Analytics

Data scientists often use human-in-the-loop analytics to get a look and feel of the data, form a hypothesis, run appropriate analytics to validate it, and repeat the process until conclusive evidence is found. For example, Seaborn, a very popular Python package, has a function called pairplot. Pair plots are very useful for examining the relationships between dependent and independent variables. The idea of the visualization is to get a directional sense of whether some of the independent variables affect the model results.

Figure 6. Pair plot of a dependent variable (say, customer satisfaction of airline passengers) against independent variables such as flight distance, arrival delay, and departure delay. Image developed by the author using Jupyter Notebook.

Dimension Reduction

When working with many variables, it is difficult to visualize the data in an n-dimensional space. For example, in a dataset with many (say, numerical) customer attributes, it is difficult to plot the customers while taking all attributes into account. In scenarios like this, dimension reduction techniques such as Principal Component Analysis (PCA) or Factor Analysis can be used to bring the attributes down to fewer dimensions. PCA finds linear combinations of variables that best explain the variance in the observations, whereas Factor Analysis finds linear combinations of variables that best explain the relationships between the variables. The reduced dimensions can then be plotted to analyze the customers in a 2D space.
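The PCA step can be sketched with NumPy's singular value decomposition. This is a minimal illustration, not a replacement for a library implementation: the function name `pca_project`, the customer attributes, and the choice to standardize each column first are assumptions, and the code assumes no column is constant.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the first principal components.

    Returns the projected scores and the share of variance each kept component explains.
    """
    X = np.asarray(X, dtype=float)
    # Standardize so attributes on large scales (e.g. income) do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


# Made-up customers: age, annual income, visits per month, average spend.
customers = [
    [25, 40000, 2, 35.0],
    [32, 52000, 4, 48.5],
    [47, 88000, 1, 120.0],
    [51, 91000, 2, 110.0],
    [38, 61000, 6, 55.0],
    [29, 45000, 8, 30.0],
]

scores, explained = pca_project(customers)
print(scores.shape, explained)
```

The two columns of `scores` are the coordinates to plot in a 2D scatter, and `explained` indicates how much of the original variation that picture retains.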

More information on how to recreate these charts in Python can be found here.

Type of Datasets in Analytical Problems

It is important to understand the type of dataset in order to determine which visualizations can be applied. For example, with tabular data a combination of bar graphs and line charts might be useful, whereas with spatial data a map with a density plot might communicate the result more effectively. Before we take a deeper look at the types of visualization, let's understand some of the key dataset types that are commonly used.

Tabular data

Data organized in tables, with a row for each data item and a column for each of its attributes. Examples include datasets available in Excel, CSV files, Pandas data frames, etc.

Network data

Nodes in the network are data items, and links between the nodes are the relations between them. A social network is an example.

Spatial data

Data that is naturally organized and understood in terms of its spatial location or extent. Examples include the latitude and longitude of locations, geographic information, suburbs, streets, etc.

Textual data

This kind of dataset consists of sequences of words and punctuation. Examples include a Twitter feed or customer complaints.
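To make the four dataset types concrete, the sketch below shows one simple way each might be held in plain Python. All names and values are made up for illustration; real projects would typically load such data into dedicated libraries.

```python
import re

# Tabular: one row per data item, one key (column) per attribute.
tabular = [
    {"customer": "C1", "age": 34, "city": "Melbourne"},
    {"customer": "C2", "age": 51, "city": "Sydney"},
]

# Network: nodes are data items, links are relations (here, an adjacency mapping).
network = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice"},
}
degree = {node: len(neighbours) for node, neighbours in network.items()}

# Spatial: items located by latitude and longitude.
spatial = [
    ("Sensor A", -37.8136, 144.9631),
    ("Sensor B", -37.8183, 144.9671),
]

# Textual: a sequence of words and punctuation marks.
textual = "My flight was delayed, again!"
tokens = re.findall(r"\w+|[^\w\s]", textual)

print(degree)
print(tokens)
```

The shape of each structure hints at the natural chart: rows and columns suggest bars and lines, an adjacency mapping suggests a node-link diagram, coordinates suggest a map, and token sequences suggest frequency-based views.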

Visual Vocabulary

The figures below show how different visualizations can be used to depict different scenarios in the data.

Figure 7. Some of the graphs useful for visualizing trends with respect to deviations from reference points. Image credits: Github.io
Figure 8. Some of the graphs useful for visualizing the correlation between multiple data points. Image credits: Github.io
Figure 9. How visualizations can be used to understand the variation of attributes over time. Image credits: Github.io
Figure 10. How different visualizations can be used to understand the ranking or order of different components. Image credits: Github.io

You can find examples of other visualizations here.

Effectiveness of Visualization across Data Types

The table below displays the effectiveness of different visuals across data types. To read the table, we first need to understand how variables (attributes of the data) can be categorized into different data types. Categorical variables are those that don't have any ordering, e.g. gender, grades, marital status, job position, etc. Numerical variables are segmented into ordinal and quantitative variables. Ordinal variables are categories that can be ranked, e.g. satisfaction (good, average, and bad), potential (high, medium, and low), etc. Quantitative variables are those that can take any numeric value between -infinity and +infinity, e.g. age, salary, revenue, sales, etc.

Figure 11. How different graphs can be used to visualize patterns in the data, taking into consideration the data type of the variable. Image credits: developed by the author using PowerPoint.
Figure 12. The types of visualization that can be used for different data types. Image credits: developed by the author using Excel.

Conclusion

Data visualization forms the backbone of all analytical projects. It not only helps in gaining insights into the data but can also be used as a tool for data pre-processing. Having the right set of visualizations for different data types and business scenarios is the key to communicating results effectively.

About the Author: Advanced analytics professional and management consultant helping companies find solutions to diverse problems through a mix of business, technology, and math applied to organizational data. A data science enthusiast, here to share, learn, and contribute. You can connect with me on LinkedIn and Twitter.

Translated from: https://towardsdatascience.com/data-visualization-in-data-science-5681cbdde5bf
