4 Pandas Plotting Functions You Should Know
Pandas is a powerful package for data scientists. We use it for many reasons, such as data wrangling, data cleaning, and data manipulation. One capability of the Pandas package that is rarely talked about, however, is data plotting.
Data plotting, as the name implies, is the process of plotting data into a graph or chart in order to visualise it. While there are much fancier visualisation packages out there, some methods are available only in the pandas plotting API.
Let's look at a few methods I have selected.
1. radviz
RadViz is a method for visualising an N-dimensional data set in a 2D plot. The problem with data that has more than three dimensions (features) is that we cannot visualise it directly, but RadViz makes it possible.
According to Pandas, radviz allows us to project an N-dimensional data set into a 2D space where the influence of each dimension can be interpreted as a balance between the importance of all dimensions. In simpler terms, it means we can project multi-dimensional data into a 2D space in a primitive way.
Let's try the function on a sample dataset.
#RadViz example
import pandas as pd
import seaborn as sns

#To use pd.plotting.radviz, you need a multidimensional data set in which every column is numerical except one, the class column (which should be categorical).
mpg = sns.load_dataset('mpg')
pd.plotting.radviz(mpg.drop(['name'], axis=1), 'origin')
Above is the result of the RadViz function, but how would you interpret the plot?
Each Series in the DataFrame is represented as an evenly distributed slice on a circle. In the example above, you can see a circle labelled with the Series names.
Each data point is then plotted inside the circle according to its value in each Series. Highly correlated Series in the DataFrame are placed closer on the unit circle. In the example, we can see that the japan and europe car data sit closer to model_year, while the usa car data sit closer to displacement. This suggests that japan and europe cars are most likely associated with model_year, while usa cars are associated with displacement.
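To make the placement rule more concrete, here is a minimal sketch of the arithmetic behind a RadViz projection. It is not the pandas source code; the function name radviz_positions and the toy input are made up for illustration, and the sketch assumes that every column has some spread and every row has at least one non-zero scaled value. Each feature gets an anchor on the unit circle, each column is scaled to the range 0 to 1, and a point lands at the weighted average of the anchors.

```python
import numpy as np

def radviz_positions(values):
    """Project rows of a 2D array (rows = points, columns = features) to 2D.

    Hypothetical helper for illustration; assumes no constant columns
    and no rows whose scaled values are all zero.
    """
    values = np.asarray(values, dtype=float)

    # Min-max scale every feature (column) to the range [0, 1].
    lo = values.min(axis=0)
    hi = values.max(axis=0)
    scaled = (values - lo) / (hi - lo)

    # One anchor per feature, evenly spaced on the unit circle.
    n_features = values.shape[1]
    angles = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each point is the weighted average of the anchors,
    # weighted by its scaled feature values.
    weights = scaled / scaled.sum(axis=1, keepdims=True)
    return weights @ anchors

# Toy data: the first row is dominated by feature 0,
# the last row has equal values on all three features.
toy = [[1, 0, 0],
       [0, 1, 0],
       [1, 1, 1]]
positions = radviz_positions(toy)
print(positions)
```

A row dominated by one feature is pulled to that feature's anchor, while a row with balanced values ends up near the centre of the circle.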
If you want to know more about RadViz, you can check the paper here.
2. bootstrap_plot
According to Pandas, the bootstrap plot is used to estimate the uncertainty of a statistic by relying on random sampling with replacement. In simpler words, it tries to determine the uncertainty in fundamental statistics such as the mean and median by resampling the data with replacement (the same data point can be sampled multiple times). You can read more about the bootstrap here.
The bootstrap_plot function generates bootstrapping plots for the mean, median and mid-range statistics for the given number of samples of the given size. Let's try the function with an example dataset.
For example, I have the mpg dataset and already have summary information about the mpg feature.
mpg['mpg'].describe()
We can see that the mpg mean is 23.51 and the median is 23. However, this is just a snapshot of real-world data. The actual values in the population are unknown, which is why we measure the uncertainty with the bootstrap method.
#bootstrap_plot example
pd.plotting.bootstrap_plot(mpg['mpg'], size=50, samples=500)
Above is an example result of the bootstrap_plot function. Mind that your result could differ from the example, because it relies on random resampling.
The first set of plots (first row) shows the sampling result, where the x-axis is the repetition and the y-axis is the statistic. The second set shows the distribution of each statistic (mean, median and mid-range).
Take the mean as an example: most of the results are around 23, but they range between 22.5 and 25 (more or less). This expresses the real-world uncertainty: the population mean could be anywhere between 22.5 and 25. Note that one way to quantify the uncertainty is to take the values at the 2.5% and 97.5% quantiles (a 95% confidence interval), although the final call is still up to your judgement.
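The quantile approach mentioned above can be sketched with plain NumPy. This is an illustration of the percentile bootstrap, not a description of how pandas implements bootstrap_plot internally. The helper name bootstrap_mean_interval is hypothetical, and the data is a synthetic stand-in (normally distributed values roughly in the range of mpg), not the real dataset. The size and samples arguments mirror the ones used in the plotting call.

```python
import numpy as np

def bootstrap_mean_interval(values, size=50, samples=500, seed=0):
    """Resample with replacement and return the bootstrap means
    together with their 2.5% and 97.5% quantiles.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)

    # Draw `samples` resamples of length `size`, with replacement,
    # and record the mean of each one.
    means = np.array([
        rng.choice(values, size=size, replace=True).mean()
        for _ in range(samples)
    ])

    # The middle 95% of the bootstrap means gives a percentile interval.
    low, high = np.quantile(means, [0.025, 0.975])
    return means, low, high

# Synthetic stand-in data (NOT the real mpg column).
fake_mpg = np.random.default_rng(42).normal(loc=23.5, scale=7.8, size=400)

means, low, high = bootstrap_mean_interval(fake_mpg, size=50, samples=500)
print(f"95% percentile interval for the mean: {low:.2f} to {high:.2f}")
```

With the real data you would pass mpg['mpg'].dropna() instead of the synthetic array. The interval changes with the seed, which is the same randomness you see when re-running the plot.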
3. lag_plot
A lag plot is a scatter plot of a time series against the same data lagged. The lag itself is a fixed amount of elapsed time; for example, with lag 1 the value on day 1 (Y1) is paired with the value one day later (Y1+1, or Y2).
A lag plot is used to check whether time series data is random or not, and whether the data is correlated with itself. Random data should not show any identifiable pattern, such as a linear one. But why bother with randomness or correlation? Because many time series models are based on linear regression, and one of its assumptions is no correlation (specifically, no autocorrelation).
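As a rough illustration of what a lag plot compares, the sketch below pairs each value y(t) with y(t + lag) and also prints the lag autocorrelation using pandas' Series.autocorr. The two series are synthetic (pure noise and a random walk built from it), and the helper lag_pairs is a made-up name, so treat this as a sketch under those assumptions rather than real stock data.

```python
import numpy as np
import pandas as pd

# Synthetic series for illustration: pure noise and a random walk.
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=1000))
walk = noise.cumsum()

def lag_pairs(series, lag=1):
    """Return the (y(t), y(t + lag)) pairs that a lag plot scatters.

    Hypothetical helper for illustration only.
    """
    values = series.to_numpy()
    return values[:-lag], values[lag:]

x, y = lag_pairs(walk, lag=1)
print(len(x), len(y))

# Lag-1 autocorrelation: close to 1 for the random walk,
# close to 0 for pure noise.
print("random walk:", walk.autocorr(lag=1))
print("noise:      ", noise.autocorr(lag=1))
```

A near-linear cloud in the lag plot corresponds to an autocorrelation close to 1, while a shapeless cloud corresponds to a value near 0.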
Let's try it with some example data. In this case, I will use a package called yahoo_historical to scrape stock data from Yahoo Finance.
pip install yahoo_historical
With this package, we can scrape the price history of a specific stock. Let's try it.
from yahoo_historical import Fetcher

#We will scrape the Apple stock data, taking the data between 1 January 2007 and 1 January 2017
data = Fetcher("AAPL", [2007,1,1], [2017,1,1])
apple_df = data.getHistorical()

#Set the date as the index
apple_df['Date'] = pd.to_datetime(apple_df['Date'])
apple_df = apple_df.set_index('Date')
Above is our Apple stock dataset with the date as the index. We can plot the data with a simple method to see the pattern over time.
apple_df['Adj Close'].plot()
We can see that Adj Close increases over time, but does the data show any pattern with its lag? To find out, we use lag_plot.
#Try lag 1 day
pd.plotting.lag_plot(apple_df['Adj Close'], lag = 1)
As we can see in the plot above, the pattern is nearly linear. It means there is a correlation between consecutive daily Adj Close values. This is expected, as the price of a stock does not vary much from one day to the next.
How about on a weekly basis? Let's try to plot it.
#The data only consists of work days, so one week is 5 days
pd.plotting.lag_plot(apple_df['Adj Close'], lag = 5)
We can see the pattern is similar to the lag 1 plot. How about 365 days? Would there be any difference? (Keep in mind that, because the data contains only work days, a lag of 365 rows spans more than one calendar year.)
pd.plotting.lag_plot(apple_df['Adj Close'], lag = 365)
We can see that the pattern now becomes more random, although a non-linear pattern still exists.
4. scatter_matrix
The scatter_matrix function does just what the name implies: it creates a matrix of scatter plots. Let's try it with an example right away.
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
pd.plotting.scatter_matrix(tips, figsize = (8,8))
plt.show()
We can see that the scatter_matrix function automatically detects the numerical features within the DataFrame we passed to it and creates a matrix of scatter plots.
In the example above, each pair of numerical features is plotted together to create a scatter plot (total_bill and size, total_bill and tip, and tip and size), while the diagonal shows the histogram of each numerical feature.
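The numeric-column detection can be checked on a small frame without downloading anything. The sketch below uses a made-up DataFrame that only borrows the column names of the tips data (the values are random, not real), and it selects a non-interactive matplotlib backend so it can run without a display. The function returns a grid of axes with one row and one column per numerical feature, so the categorical day column does not appear.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd

# Made-up data with tips-like column names (NOT the real tips dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, 100),
    "tip": rng.uniform(1, 10, 100),
    "size": rng.integers(1, 7, 100),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], 100),
})

# diagonal="hist" is the default; "kde" draws density curves instead
# (the kde option needs SciPy to be installed).
axes = pd.plotting.scatter_matrix(df, figsize=(8, 8), diagonal="hist")

# One row and one column of subplots per numerical feature.
print(axes.shape)
```

If you only want some of the features, you can pass a column subset such as df[["total_bill", "tip"]] to get a smaller grid.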
This is a simple but powerful function, as we can get a lot of information with a single line of code.
Conclusion
Here I have shown you 4 different pandas plotting functions that you should know:
- radviz
- bootstrap_plot
- lag_plot
- scatter_matrix
I hope it helps!
Translated from: https://towardsdatascience.com/4-pandas-plotting-function-you-should-know-5a788d848963