6 Lesser-Known Pandas Plotting Tools

Pandas is the go-to Python library for data analysis and manipulation. It provides numerous functions and methods that expedite the data analysis process.

When it comes to data visualization, pandas is not the most prominent choice, because there are great dedicated visualization libraries such as matplotlib, seaborn, and plotly.

That said, we should not ignore the plotting tools of pandas. They help to discover relations within dataframes or series, and the syntax is pretty simple. Very informative plots can be created with just one line of code.

In this post, we will cover 6 plotting tools of pandas which definitely add value to the exploratory data analysis process.

The first step to create a great machine learning model is to explore and understand the structure and relations within the data.

These 6 plotting tools will help you understand the data better:

  • Scatter matrix plot
  • Density plot
  • Andrews curves
  • Parallel coordinates
  • Lag plots
  • Autocorrelation plot

I will use a diabetes dataset available on Kaggle. Let’s first read the dataset into a pandas dataframe.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("/content/diabetes.csv")
print(df.shape)
df.head()

The dataset contains 8 numerical features and a target variable indicating if the person has diabetes.

1. Scatter matrix plot

Scatter plots are typically used to explore the correlation between two variables (or features). Each data point is positioned using Cartesian coordinates, with one variable on each axis.

The scatter_matrix function produces a grid of scatter plots, one for each pair of variables, with just one line of code.

from pandas.plotting import scatter_matrix

subset = df[['Glucose','BloodPressure','Insulin','Age']]
scatter_matrix(subset, figsize=(10,10), diagonal='hist')

I’ve selected a subset of the dataframe with 4 features for demonstration purposes. The diagonal shows the histogram of each variable, but we can change it to show a kde plot by setting the diagonal parameter to ‘kde’.
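
For readers who do not have the Kaggle file at hand, here is a minimal sketch of the same call on synthetic data. The dataframe name demo and all of its values are made up for illustration; only the column names are borrowed from the subset above. It also shows that scatter_matrix returns a grid of matplotlib axes, which can be adjusted afterwards. I keep diagonal='hist' here because the 'kde' option needs SciPy to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for the diabetes subset (values are invented, not real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BloodPressure": rng.normal(70, 12, 200),
    "Insulin": rng.normal(80, 40, 200),
    "Age": rng.integers(21, 80, 200),
})

# scatter_matrix returns a 2-D array of matplotlib Axes, one per pair of columns
axes = scatter_matrix(demo, figsize=(10, 10), diagonal="hist")
grid_shape = axes.shape  # one row and one column of panels per feature

# Individual panels can be customised afterwards, e.g. the top-left histogram
axes[0, 0].set_facecolor("whitesmoke")
```

With 4 columns the grid has 4 x 4 panels, so the number of panels grows quickly; this is one reason to plot a subset of features rather than the whole dataframe.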

2. Density plot

We can produce density plots by calling the plot.kde() method on a series or dataframe.

subset = df[['Glucose','BloodPressure','BMI']]
subset.plot.kde(figsize=(12,6), alpha=1)

We are able to see the distribution of features with one line of code. The alpha parameter adjusts the opacity of the lines (1 is fully opaque).
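
To give an idea of what a density plot computes, here is a NumPy-only sketch of a Gaussian kernel density estimate. This is not the pandas implementation: as far as I know, pandas delegates the work to SciPy and uses Scott's rule for the bandwidth by default, which is the rule I assume below. The function name gaussian_kde_1d and the sample values are mine, chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=120.0, scale=30.0, size=500)  # invented values

# Scott's rule of thumb for one-dimensional data (assumed default, see above)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 4 * bandwidth,
                   samples.max() + 4 * bandwidth, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1 (trapezoidal rule)
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
```

A smaller bandwidth gives a wigglier curve and a larger one a smoother curve, which is worth keeping in mind when comparing the shapes of different features.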

3. Andrews curves

Andrews curves, named after the statistician David F. Andrews, are a tool for plotting multivariate data as a large number of curves. The curves are created using the attributes (features) of samples as coefficients of a Fourier series.

We get an overview of clustering of different classes by coloring the curves that belong to each class differently.

from pandas.plotting import andrews_curves

plt.figure(figsize=(12,8))
subset = df[['Glucose','BloodPressure','BMI', 'Outcome']]
andrews_curves(subset, 'Outcome', colormap='Paired')

We need to pass a dataframe and the name of the variable that holds the class information. The colormap parameter is optional. There seems to be a clear distinction (with some exceptions) between the 2 classes based on the features in the subset.
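
To make the Fourier series idea concrete, here is a small sketch of how a single row can be turned into a curve. It follows the usual textbook definition, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., for t between -pi and pi. The function name andrews_curve and the sample row are hypothetical, and I am not claiming this matches the pandas internals line by line.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one sample (a 1-D row of features) at t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature is the constant term
    curve = np.full_like(t, row[0] / np.sqrt(2))
    # Remaining features alternate between sine and cosine terms
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            curve += coeff * np.sin(harmonic * t)
        else:
            curve += coeff * np.cos(harmonic * t)
    return curve

t = np.linspace(-np.pi, np.pi, 200)
sample = [110.0, 70.0, 30.0]  # invented feature values for one row
curve = andrews_curve(sample, t)
```

Rows with similar feature values produce similar curves, which is why curves of the same class tend to bundle together in the plot.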

4. Parallel coordinates

Parallel coordinates is another tool for plotting multivariate data. Let’s first create the plot and then talk about what it tells us.

from pandas.plotting import parallel_coordinates

cols = ['Glucose','BloodPressure','BMI', 'Age']
plt.figure(figsize=(12,8))
parallel_coordinates(df,'Outcome',color=['Blue','Gray'],cols=cols)

We first import parallel_coordinates from the pandas plotting tools and create a list of columns to use. Then a matplotlib figure is created. The last line creates the parallel coordinates plot. We pass a dataframe and the name of the class variable. The color parameter is optional and determines the color for each class. Finally, the cols parameter selects the columns to be used in the plot. If it is not specified, all columns are used.

Each column is represented by a vertical line. The lines running horizontally across the plot represent data points (rows in the dataframe). We get an overview of how the classes are separated according to the features. The “Glucose” variable seems to be a good predictor for separating these two classes. On the other hand, lines of different classes overlap on “BloodPressure”, which indicates that it does not separate the classes well.
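
The reading of the plot can be illustrated with a toy example where we know the answer in advance. In the sketch below, the dataframe toy and its columns separating, overlapping and label are invented: one feature differs strongly between the two classes and the other does not. The gap between class means is only a rough numeric companion to what the plot shows, not a replacement for it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(1)
n = 50  # rows per class

# Invented data: "separating" differs between classes, "overlapping" does not
toy = pd.DataFrame({
    "separating": np.concatenate([rng.normal(0, 1, n), rng.normal(5, 1, n)]),
    "overlapping": rng.normal(0, 1, 2 * n),
    "label": [0] * n + [1] * n,
})

plt.figure(figsize=(8, 5))
ax = parallel_coordinates(toy, "label", color=["blue", "gray"])
n_lines = len(ax.get_lines())  # at least one line per row

# Absolute difference between the class means of each feature
gap = toy.groupby("label").mean().diff().iloc[-1].abs()
```

In the resulting plot the two colors split apart on the "separating" axis and mix on the "overlapping" axis, which is the same pattern described above for “Glucose” and “BloodPressure”.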

5. Lag plot

Lag plots are used to check the randomness in a data set or time series. If the lag plot displays a structure, we can conclude that the data is not random.

from pandas.plotting import lag_plot

plt.figure(figsize=(10,6))
lag_plot(df)

The lag plot shows no structure, which indicates that our data set is random.

Let’s see an example of non-random data. I will use the synthetic sample from the pandas documentation.

spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(spacing))
plt.figure(figsize=(10,6))
lag_plot(data)

We can clearly see a structure in the lag plot, so the data is not random.
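
Under the hood, a lag plot is simply a scatter plot of each value against the value that comes a fixed number of steps later. The sketch below builds those pairs by hand and summarizes them with a correlation coefficient. The helper names lag_pairs and lag_correlation are mine, and the correlation is only a rough numeric proxy: it captures linear structure, while the plot itself can also reveal non-linear patterns.

```python
import numpy as np

def lag_pairs(values, lag=1):
    """Return the (y(t), y(t + lag)) pairs that a lag plot puts on its axes."""
    values = np.asarray(values, dtype=float)
    return values[:-lag], values[lag:]

def lag_correlation(values, lag=1):
    """Pearson correlation between a series and its lagged copy."""
    x, y = lag_pairs(values, lag)
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(0)

# Same construction as the synthetic sample above, with a seeded generator
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
structured = 0.1 * rng.random(1000) + 0.9 * np.sin(spacing)

# Pure noise for comparison
random_like = rng.standard_normal(1000)

corr_structured = lag_correlation(structured)
corr_random = lag_correlation(random_like)
```

The sine-based series gives a lag-1 correlation far from zero, while the noise series stays close to zero, in line with what the two lag plots show.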

6. Autocorrelation plot

Autocorrelation plots are used to check the randomness in time series. They are produced by calculating the autocorrelations for data values at varying time lags.

The lag is the time difference between the two observations being compared. If the autocorrelations are very close to zero for all time lags, the time series is random.

If we observe one or more significantly non-zero autocorrelations, then we can conclude that the time series is not random.
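
For reference, here is a sketch of the sample autocorrelation at each lag, written with plain NumPy. It uses the common estimator r(h) = c(h) / c(0), where c(h) is the average product of the centered series with its copy shifted by h steps. I believe this is close to what the pandas plot draws, but treat that as an assumption. The function name autocorrelation is mine, and the 1.96 / sqrt(n) band is the usual approximate 95% threshold for deciding what counts as "significantly non-zero".

```python
import numpy as np

def autocorrelation(values):
    """Sample autocorrelation r(h) for every lag h = 0, 1, ..., n - 1."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    centered = x - x.mean()
    c0 = np.sum(centered ** 2) / n  # variance, i.e. autocovariance at lag 0
    return np.array([
        np.sum(centered[: n - h] * centered[h:]) / n / c0
        for h in range(n)
    ])

rng = np.random.default_rng(0)
noise_acf = autocorrelation(rng.standard_normal(250))  # random series
trend_acf = autocorrelation(np.arange(100))            # simple upward trend

# Approximate 95% band for a random series of length 250
band_95 = 1.96 / np.sqrt(250)
share_outside = np.mean(np.abs(noise_acf[1:]) > band_95)
```

For the noise series only a small share of lags falls outside the band, while the upward trend has an autocorrelation close to 1 at small lags.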

Let’s first create a random time series and see the autocorrelation plot.

noise = pd.Series(np.random.randn(250)*100)
noise.plot(figsize=(12,6))

This time series is clearly random. The autocorrelation plot of this time series:

from pandas.plotting import autocorrelation_plot

plt.figure(figsize=(12,6))
autocorrelation_plot(noise)

As expected, all autocorrelation values are very close to zero.

Let’s look at an example of a non-random time series. The plot below shows a very simple upward trend.

upward = pd.Series(np.arange(100))
upward.plot(figsize=(10,6))
plt.grid()

The autocorrelation plot for this time series:

plt.figure(figsize=(12,6))
autocorrelation_plot(upward)

This autocorrelation plot clearly indicates a non-random time series, as there are many significantly non-zero values.

It is very easy to visually check the non-randomness of simple upward and downward trends. However, in real-life data sets, we are likely to see highly complex time series. We may not be able to see the trends or seasonality in those series. In such cases, autocorrelation plots are very helpful for time series analysis.

Pandas provides two more plotting tools: the bootstrap plot and RadViz. They can also be used in the exploratory data analysis process.

Thank you for reading. Please let me know if you have any feedback.

Translated from: https://towardsdatascience.com/6-lesser-known-pandas-plotting-tools-fda5adb232ef
