4 Pandas Plotting Functions You Should Know

Pandas is a powerful package for data scientists. We use it for many reasons, e.g. data wrangling, data cleaning, and data manipulation. One capability of the Pandas package that is rarely talked about, however, is data plotting.

Data plotting, as the name implies, is the process of plotting data into a graph or chart in order to visualise it. While there are much fancier visualisation packages out there, some handy methods are available directly in the pandas plotting API.

Let’s look at a few methods I have selected.

1. radviz

RadViz is a method for visualising an N-dimensional data set in a 2D plot. When the data has more than three dimensions (features), we cannot visualise it directly, but RadViz makes it possible.

According to the Pandas documentation, radviz allows us to project an N-dimensional data set into a 2D space where the influence of each dimension can be interpreted as a balance between the importance of all dimensions. In simpler terms, it means we can project multi-dimensional data into a 2D space in a simple way.

Let’s try the function on a sample dataset.

#RadViz example
import pandas as pd
import seaborn as sns

#To use pd.plotting.radviz, you need a multidimensional data set with all numerical columns but one as the class column (should be categorical).
mpg = sns.load_dataset('mpg')
pd.plotting.radviz(mpg.drop(['name'], axis=1), 'origin')
Figure: RadViz result

Above is the result of the RadViz function, but how would you interpret the plot?

Each Series in the DataFrame is represented as an evenly distributed slice on a circle. In the example above, the Series names are placed around the circle.

Each data point is then plotted inside the circle according to its value on each Series. According to the Pandas documentation, highly correlated Series in the DataFrame are placed closer on the unit circle. In the example, we can see that the japan and europe car data lie closer to model_year, while the usa car data lie closer to displacement. A point is pulled towards the Series on which its (normalised) value is relatively high, so this suggests that japan and europe cars are most associated with model_year, while usa cars are most associated with displacement.
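To make the placement rule concrete, here is a minimal NumPy sketch of the general RadViz idea. It is an illustration under stated assumptions, not the pandas source code: the function name radviz_points and the toy array are hypothetical, and the sketch assumes no column is constant and no row is all minimum values.

```python
import numpy as np

def radviz_points(values):
    """Sketch: project the rows of a 2D numeric array into the RadViz circle."""
    values = np.asarray(values, dtype=float)

    # Min-max normalise every column to the range [0, 1]
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    normalised = (values - col_min) / col_range

    # Place one anchor per column, evenly spaced on the unit circle
    n_cols = values.shape[1]
    angles = 2 * np.pi * np.arange(n_cols) / n_cols
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Each point is the average of the anchors, weighted by its normalised values
    weights = normalised / normalised.sum(axis=1, keepdims=True)
    return weights @ anchors

# Hypothetical toy data: three rows, three features
toy = np.array([
    [0.0, 0.0, 1.0],  # high only on the third feature
    [1.0, 0.0, 0.0],  # high only on the first feature
    [1.0, 1.0, 1.0],  # equally high on every feature
])
points = radviz_points(toy)
print(points)
```

A row that is high on a single feature lands on that feature's anchor, while a row that is equally high on all features lands in the centre of the circle, which is the "balance" described above.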

If you want to know more about RadViz, you can check the paper here.

2. bootstrap_plot

According to the Pandas documentation, the bootstrap plot is used to estimate the uncertainty of a statistic by relying on random sampling with replacement. In simpler words, it tries to determine the uncertainty in a fundamental statistic such as the mean or median by resampling the data with replacement (the same data point can be sampled multiple times). You can read more about bootstrap here.

The bootstrap_plot function generates bootstrapping plots for the mean, median and mid-range statistics for the given number of samples of the given size. Let’s try the function with an example dataset.

For example, I have the mpg dataset and already have the summary information for the mpg feature.

mpg['mpg'].describe()

We can see that the mpg mean is 23.51 and the median is 23. However, this is just a snapshot of real-world data. The actual values in the population are unknown, which is why we measure the uncertainty with bootstrap methods.

#bootstrap_plot example
pd.plotting.bootstrap_plot(mpg['mpg'], size=50, samples=500)

Above is an example result of the bootstrap_plot function. Note that your result may differ from the example because it relies on random resampling.

The first set of plots (first row) shows the sampling results, where the x-axis is the repetition and the y-axis is the statistic. The second set shows the distribution of each statistic (mean, median and midrange).

Take the mean as an example: most of the results are around 23, but they range roughly between 22.5 and 25. This gives a sense of the real-world uncertainty: the mean in the population could be between 22.5 and 25. Note that one way to estimate the uncertainty is to take the values at the 2.5% and 97.5% quantiles (a 95% confidence interval), although it is still up to your judgement.
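The quantile approach can be sketched with plain NumPy. This is a minimal illustration rather than what bootstrap_plot does internally: the helper name bootstrap_mean_interval is hypothetical, and the data is synthetic (normally distributed values chosen only to resemble the mpg column), so the numbers will not match the figures in this article.

```python
import numpy as np

def bootstrap_mean_interval(data, size=50, samples=500, seed=0):
    """Sketch: 95% percentile interval for the mean from bootstrap resamples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    # Draw `samples` resamples of length `size` with replacement and keep each mean
    means = np.array([
        rng.choice(data, size=size, replace=True).mean()
        for _ in range(samples)
    ])

    # The 2.5% and 97.5% quantiles of the resampled means bound the interval
    lower, upper = np.quantile(means, [0.025, 0.975])
    return lower, upper

# Synthetic stand-in for the mpg column (an assumption, not the real data)
values = np.random.default_rng(42).normal(loc=23.5, scale=7.8, size=398)
low, high = bootstrap_mean_interval(values)
print(f"sample mean: {values.mean():.2f}")
print(f"95% bootstrap interval for the mean: {low:.2f} to {high:.2f}")
```

With the real data, you would pass mpg['mpg'].dropna() instead of the synthetic values.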

3. lag_plot

A lag plot is a scatter plot of a time series against the same data lagged. The lag itself is a fixed amount of passing time; for example, with lag 1, the value on day 1 (Y1) is paired with the value one day later (Y1+1, i.e. Y2).

A lag plot is used to check whether time series data is random and whether the data is correlated with itself. Random data should not show any identifiable pattern, such as a linear one. But why do we bother with randomness or correlation? Because many time series models are based on linear regression, and one of its assumptions is that there is no correlation (specifically, no autocorrelation).
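To see what a lag plot measures, the sketch below builds the lagged pairs by hand and compares a purely random series with a strongly autocorrelated one. Everything here is synthetic and illustrative: the helper lag_pairs and the two series are hypothetical; only Series.autocorr is an existing pandas method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=1000))  # random data: no pattern expected
walk = noise.cumsum()                     # random walk: each value depends on the previous one

def lag_pairs(series, lag=1):
    """Pair every value y(t) with the value `lag` steps later, y(t + lag)."""
    return pd.DataFrame({
        "y(t)": series.iloc[:-lag].to_numpy(),
        f"y(t + {lag})": series.iloc[lag:].to_numpy(),
    })

# These pairs are the points a lag plot draws as a scatter plot
print(lag_pairs(walk, lag=1).head())

# The lag-1 autocorrelation summarises the same relationship in one number
print("noise:", noise.autocorr(lag=1))
print("walk: ", walk.autocorr(lag=1))
```

The random series gives an autocorrelation near 0 (a shapeless cloud in the lag plot), while the random walk gives a value near 1 (an almost straight line).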

Let’s try it with example data. In this case, I will use a package called yahoo_historical to scrape stock data from Yahoo Finance.

pip install yahoo_historical

With this package, we can scrape the price history of a specific stock. Let’s try it.

from yahoo_historical import Fetcher

#We will scrape the Apple stock data between 1 January 2007 and 1 January 2017
data = Fetcher("AAPL", [2007,1,1], [2017,1,1])
apple_df = data.getHistorical()

#Set the date as the index
apple_df['Date'] = pd.to_datetime(apple_df['Date'])
apple_df = apple_df.set_index('Date')

Above is our Apple stock dataset with the date as the index. We can plot the data with a simple method to see the pattern over time.

apple_df['Adj Close'].plot()

We can see that Adj Close increases over time, but does the data itself show any pattern with its lag? To find out, we use lag_plot.

#Try lag 1 day
pd.plotting.lag_plot(apple_df['Adj Close'], lag = 1)

As we can see in the plot above, it is nearly linear. This means there is a correlation between consecutive daily Adj Close values. This is expected, as the price of the stock does not vary much from one day to the next.

How about a weekly basis? Let’s plot it.

#The data only consists of work days, so one week is 5 days
pd.plotting.lag_plot(apple_df['Adj Close'], lag=5)

We can see the pattern is similar to the lag 1 plot. How about a lag of 365? Would there be any difference? (Because the data contains only work days, 365 rows actually span more than one calendar year.)

pd.plotting.lag_plot(apple_df['Adj Close'], lag = 365)

We can now see that the pattern becomes more random, although a non-linear pattern still exists.

4. scatter_matrix

The scatter_matrix function does just what the name implies: it creates a matrix of scatter plots. Let’s try it with an example right away.

import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
pd.plotting.scatter_matrix(tips, figsize=(8,8))
plt.show()

We can see that the scatter_matrix function automatically detects the numerical features in the DataFrame we passed to it and creates a matrix of scatter plots.

In the example above, each pair of numerical features is plotted together as a scatter plot (total_bill and size, total_bill and tip, and tip and size), while the diagonal shows the histogram of each numerical feature.
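The automatic selection of numerical columns can be checked on a small synthetic DataFrame. This is a sketch under assumptions: the column names and random values are made up for illustration, and the Agg backend is set only so the example also runs without a display (drop that line to see the figure interactively).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: three numerical columns and one text column
toy = pd.DataFrame({
    "bill": rng.uniform(5, 50, size=100),
    "tip": rng.uniform(1, 10, size=100),
    "party_size": rng.integers(1, 7, size=100),
    "day": rng.choice(["Sat", "Sun"], size=100),
})

# scatter_matrix returns a 2D array of matplotlib axes, one per pair of numerical columns
axes = pd.plotting.scatter_matrix(toy, alpha=0.5, diagonal="hist", figsize=(8, 8))
print(axes.shape)  # (3, 3): the text column "day" is left out
```

The diagonal parameter also accepts "kde" for a density estimate instead of a histogram, which requires SciPy to be installed.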

This is a simple function, but it is powerful, as we can get a lot of information with a single line of code.

Conclusion

Here I have shown you 4 pandas plotting functions that you should know:

  1. radviz
  2. bootstrap_plot
  3. lag_plot
  4. scatter_matrix

I hope it helps!

Translated from: https://towardsdatascience.com/4-pandas-plotting-function-you-should-know-5a788d848963
