6 Lesser-Known Pandas Plotting Tools
Pandas is the go-to Python library for data analysis and manipulation. It provides numerous functions and methods that expedite the data analysis process.
When it comes to data visualization, pandas is not the most prominent choice because there are great visualization libraries such as matplotlib, seaborn, and plotly.
That said, we cannot just ignore the plotting tools of pandas. They help to discover relations within dataframes or series, and the syntax is pretty simple. Very informative plots can be created with just one line of code.
In this post, we will cover 6 plotting tools of pandas which definitely add value to the exploratory data analysis process.
The first step to create a great machine learning model is to explore and understand the structure and relations within the data.
These 6 plotting tools will help you understand the data better:
Scatter matrix plot
Density plot
Andrews curves
Parallel coordinates
Lag plot
Autocorrelation plot
I will use a diabetes dataset available on Kaggle. Let’s first read the dataset into a pandas dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("/content/diabetes.csv")
print(df.shape)
df.head()
The dataset contains 8 numerical features and a target variable indicating whether the person has diabetes.
1. Scatter matrix plot
Scatter plots are typically used to explore the correlation between two variables (or features). The values of data points are shown using Cartesian coordinates.
A scatter matrix plot produces a grid of scatter plots with just one line of code.
from pandas.plotting import scatter_matrix
subset = df[['Glucose','BloodPressure','Insulin','Age']]
scatter_matrix(subset, figsize=(10,10), diagonal='hist')
I’ve selected a subset of the dataframe with 4 features for demonstration purposes. The diagonal shows the histogram of each variable, but we can change it to show a kde plot by setting the diagonal parameter to ‘kde’.
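For readers without the CSV at hand, here is a minimal sketch of the same call on synthetic stand-in data. The column names a, b and c are made up for illustration and are not part of the diabetes dataset. Note that diagonal='kde' additionally requires SciPy to be installed, so the sketch sticks to histograms.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in data (hypothetical columns, not the diabetes dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200)})
demo["b"] = 2 * demo["a"] + rng.normal(scale=0.5, size=200)  # correlated with a
demo["c"] = rng.normal(size=200)  # independent of a and b

# scatter_matrix returns a 2-D array holding one Axes per pair of columns
axes = scatter_matrix(demo, figsize=(8, 8), diagonal="hist")
print(axes.shape)
```

The a/b panels should show a clear linear band, while the panels involving c should look like shapeless clouds.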
2. Density plot
We can produce density plots using the plot.kde() method on a series or dataframe.
subset = df[['Glucose','BloodPressure','BMI']]
subset.plot.kde(figsize=(12,6), alpha=1)
We are able to see the distribution of features with one line of code. The alpha parameter adjusts the opacity of the lines.
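To make clearer what a density plot draws, the sketch below computes a one-dimensional Gaussian kernel density estimate by hand with NumPy. It is only an illustration of the idea: pandas delegates the real computation to SciPy, and the function name gaussian_kde_1d, the synthetic samples, and the choice of Scott's rule for the bandwidth are assumptions made for this example.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Gaussian KDE evaluated on a grid, with a Scott's-rule bandwidth."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # The density estimate is the average of all bumps
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(loc=120, scale=30, size=500)  # synthetic values
grid = np.linspace(samples.min() - 100, samples.max() + 100, 1000)
density = gaussian_kde_1d(samples, grid)

# A density should integrate to roughly 1 (simple Riemann sum here)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Plotting density against grid gives the kind of smooth curve that plot.kde() draws for each column.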
3. Andrews curves
Andrews curves, named after the statistician David F. Andrews, are a tool for plotting multivariate data as a large number of curves. The curves are created using the attributes (features) of samples as coefficients of a Fourier series.
We get an overview of clustering of different classes by coloring the curves that belong to each class differently.
from pandas.plotting import andrews_curves
plt.figure(figsize=(12,8))
subset = df[['Glucose','BloodPressure','BMI', 'Outcome']]
andrews_curves(subset, 'Outcome', colormap='Paired')
We need to pass a dataframe and the name of the variable that holds the class information. The colormap parameter is optional. There seems to be a clear distinction (with some exceptions) between the 2 classes based on the features in the subset.
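The Fourier series behind each curve can be sketched in a few lines. For a sample with features x1, x2, x3, ..., the usual Andrews curve is f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... for t between -pi and pi. The helper name andrews_curve below is hypothetical and the code is an illustration of that formula, not the library's source.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one sample (row of features) at times t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature is the constant term
    curve = np.full_like(t, row[0] / np.sqrt(2.0))
    # Remaining features alternate between sine and cosine coefficients
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # features 2,3 -> 1; features 4,5 -> 2; ...
        if i % 2 == 1:
            curve += coeff * np.sin(harmonic * t)
        else:
            curve += coeff * np.cos(harmonic * t)
    return curve

t = np.linspace(-np.pi, np.pi, 200)
sample = [1.0, 2.0, 3.0]  # made-up feature values
values = andrews_curve(sample, t)
print(values.shape)
```

Samples with similar feature values produce similar curves, which is why curves of the same class tend to bundle together in the plot.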
4. Parallel coordinates
Parallel coordinates is another tool for plotting multivariate data. Let’s first create the plot and then talk about what it tells us.
from pandas.plotting import parallel_coordinates
cols = ['Glucose','BloodPressure','BMI', 'Age']
plt.figure(figsize=(12,8))
parallel_coordinates(df,'Outcome',color=['Blue','Gray'],cols=cols)
We first import parallel_coordinates from the pandas plotting tools, then create a list of columns to use and a matplotlib figure. The last line creates the parallel coordinates plot. We pass a dataframe and the name of the class variable. The color parameter is optional and sets the color for each class. Finally, the cols parameter selects the columns to be used in the plot. If it is not specified, all columns are used.
Each column is represented by a vertical line. The lines running across the plot represent data points (rows in the dataframe). We get an overview of how the classes are separated according to the features. The “Glucose” variable seems to be a good predictor for separating these two classes. On the other hand, lines of different classes overlap on “BloodPressure”, which indicates that it does not separate the classes well.
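The reading described above can be reproduced on synthetic data where the answer is known in advance. In this sketch the columns separating, overlapping and label are invented for illustration: one feature is shifted between the two classes and the other is not. The per-class means are printed as a rough numeric counterpart to what the plot shows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)
n = 50
label = np.repeat([0, 1], n)
demo = pd.DataFrame({
    # Shifted by 5 for class 1, so the classes separate on this axis
    "separating": np.where(label == 1, 5.0, 0.0) + rng.normal(size=2 * n),
    # Same distribution for both classes, so the lines overlap here
    "overlapping": rng.normal(size=2 * n),
    "label": label,
})

plt.figure(figsize=(8, 5))
ax = parallel_coordinates(demo, "label", color=["blue", "gray"])

class_means = demo.groupby("label").mean()
print(class_means)
```

On the separating axis the two colors should occupy different heights, while on the overlapping axis they should be mixed.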
5. Lag plot
Lag plots are used to check the randomness in a data set or time series. If a structure is displayed in lag plot, we can conclude that the data is not random.
from pandas.plotting import lag_plot
plt.figure(figsize=(10,6))
# lag_plot expects a Series, so a single column is plotted here
lag_plot(df['Glucose'])
The plot shows no structure, which indicates that the data is random.
Let’s see an example of non-random data. I will use the synthetic sample from the pandas documentation.
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(spacing))
plt.figure(figsize=(10,6))
lag_plot(data)
We can clearly see a structure in the lag plot, so the data is not random.
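A lag plot is simply a scatter plot of each value against the value that follows it by a given lag. The sketch below builds those pairs by hand (the helper name lag_pairs is hypothetical) and uses their correlation as a rough numeric summary. Correlation only captures linear structure, so it complements the plot rather than replacing it.

```python
import numpy as np

def lag_pairs(values, lag=1):
    """Return (y(t), y(t + lag)) pairs, the two axes of a lag plot."""
    values = np.asarray(values, dtype=float)
    return values[:-lag], values[lag:]

rng = np.random.default_rng(0)
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
structured = 0.1 * rng.random(1000) + 0.9 * np.sin(spacing)  # as in the sample above
noise = rng.normal(size=1000)  # purely random comparison series

corr_structured = np.corrcoef(*lag_pairs(structured))[0, 1]
corr_noise = np.corrcoef(*lag_pairs(noise))[0, 1]
print(round(corr_structured, 2), round(corr_noise, 2))
```

The sine-based series should give a correlation well above zero, while the random series should stay close to zero.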
6. Autocorrelation plot
Autocorrelation plots are used to check the randomness in time series. They are produced by calculating the autocorrelations for data values at varying time lags.
The lag is the time difference between the observations being compared. If the autocorrelations are very close to zero for all time lags, the time series is random.
If we observe one or more significantly non-zero autocorrelations, then we can conclude that the time series is not random.
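The quantity on the vertical axis of the plot can be sketched directly. The helper autocorrelation below is a hypothetical name, and the formula is the common sample autocorrelation, which I assume is close to what the plot draws. The value 1.96 / sqrt(n) is the usual approximate 95% band for a random series and serves as a rough threshold for "significantly non-zero".

```python
import numpy as np

def autocorrelation(values, lag):
    """Sample autocorrelation of a series at a given lag."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    c0 = np.sum((x - mean) ** 2) / n  # variance of the series
    # Covariance between the series and itself shifted by `lag`
    return np.sum((x[: n - lag] - mean) * (x[lag:] - mean)) / n / c0

rng = np.random.default_rng(0)
noise = rng.normal(size=250) * 100  # random series
upward = np.arange(100)             # simple upward trend

band_95 = 1.96 / np.sqrt(noise.size)  # approximate 95% band for the noise series
r_noise = autocorrelation(noise, 1)
r_upward = autocorrelation(upward, 1)
print(round(r_noise, 3), round(r_upward, 3), round(band_95, 3))
```

For the random series the lag-1 value should be small, while the trending series should give a value close to 1.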
Let’s first create a random time series and see the autocorrelation plot.
noise = pd.Series(np.random.randn(250)*100)
noise.plot(figsize=(12,6))
This time series is clearly random. The autocorrelation plot of this time series:
from pandas.plotting import autocorrelation_plot
plt.figure(figsize=(12,6))
autocorrelation_plot(noise)
As expected, all autocorrelation values are very close to zero.
Let’s look at an example of a non-random time series. The plot below shows a very simple upward trend.
upward = pd.Series(np.arange(100))
upward.plot(figsize=(10,6))
plt.grid()
The autocorrelation plot for this time series:
plt.figure(figsize=(12,6))
autocorrelation_plot(upward)
This autocorrelation clearly indicates a non-random time series as there are many significantly non-zero values.
It is very easy to visually check the non-randomness of simple upward and downward trends. However, in real-life data sets, we are likely to see highly complex time series. We may not be able to see the trends or seasonality in those series. In such cases, autocorrelation plots are very helpful for time series analysis.
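As a small illustration of seasonality that is harder to see by eye, the sketch below hides a 12-step cycle inside noise and checks the autocorrelation at selected lags with Series.autocorr. The period, the noise level and the series itself are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(600)
# A 12-step seasonal cycle buried in noise of comparable size
seasonal = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=t.size))

acf_full_period = seasonal.autocorr(lag=12)  # one full cycle apart
acf_half_period = seasonal.autocorr(lag=6)   # half a cycle apart
print(round(acf_full_period, 2), round(acf_half_period, 2))
```

The positive value at lag 12 and the negative value at lag 6 are the kind of repeating pattern an autocorrelation plot would reveal as a wave across the lags.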
Pandas provides two more plotting tools: the bootstrap plot and RadViz. They can also be used in the exploratory data analysis process.
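A minimal sketch of calling these two tools is shown below on synthetic data, with invented column names f1, f2, f3 and label. RadViz places the features as anchors on a circle and positions each sample according to its feature values, while the bootstrap plot resamples a series repeatedly to show the uncertainty of statistics such as the mean and median. The size and samples values here are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import bootstrap_plot, radviz

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 3)), columns=["f1", "f2", "f3"])
demo["label"] = rng.integers(0, 2, size=100)  # hypothetical class column

# RadViz: pass the dataframe and the name of the class column
plt.figure(figsize=(6, 6))
radviz_ax = radviz(demo, "label")

# Bootstrap plot: pass a series; size is the length of each random subset
bootstrap_fig = bootstrap_plot(demo["f1"], size=50, samples=200)
```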
Thank you for reading. Please let me know if you have any feedback.
Translated from: https://towardsdatascience.com/6-lesser-known-pandas-plotting-tools-fda5adb232ef