Pandas for Newbies: An Introduction Part II
This article is a continuation of a previous article which kick-started the journey to learning Python for data analysis. You can check out the previous article here: Pandas for Newbies: An Introduction Part I.
For those just starting out in data science, the Python programming language is a prerequisite, so if you aren’t familiar with Python, go make yourself familiar and then come back here to start on Pandas.
You can start learning Python with a series of articles I just started called Minimal Python Required for Data Science.
As a reminder, what I’m doing here is a brief tour of just some of the things you can do with Pandas. It’s the deep-dive before the actual deep-dive.
Both the data and the inspiration for this series come from Ted Petrou’s excellent courses on Dunder Data.
Prerequisites
- Python
- pandas
- Jupyter
You’ll be ready to begin once you have these three things in order.
Aggregation
We left off last time with the pandas query method as an alternative to regular filtering via boolean conditional logic. While it does have its limits, query is a much more readable method.
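As a quick refresher, here’s a minimal sketch of the two styles side by side (the DataFrame and its column names are invented for illustration):

import pandas as pd

df = pd.DataFrame({'score': [55, 72, 90], 'group': ['A', 'B', 'A']})

# Regular filtering with boolean conditional logic
df[(df['score'] > 60) & (df['group'] == 'A')]

# The same filter expressed with query
df.query("score > 60 and group == 'A'")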
Today we continue with aggregation, which is the act of summarizing data with a single number. Examples include sum, mean, median, min, and max.
Let’s try this on a different dataset.
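I’ll load it with read_csv; the filename here is an assumption, but the columns match the ones used throughout this section:

import pandas as pd

# Assumed local filename for the exam-scores dataset
students = pd.read_csv('students.csv')
students.head(3)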
Get the mean by calling the mean method.
students.mean()

math score       66.089
reading score    69.169
writing score    68.054
dtype: float64
Use the axis parameter to calculate the sum of all the scores (math, reading, and writing) across rows:
scores = students[['math score', 'reading score', 'writing score']]
scores.sum(axis=1).head(3)

0    218
1    247
2    278
dtype: int64
Non-aggregating methods
We can also perform calculations on the data that don’t necessarily aggregate it, e.g. the round method:
scores.round(-1).head(3)

   math score  reading score  writing score
0          70             70             70
1          70             90             90
2          90            100             90
Aggregating within groups
Let’s get the frequency of unique values in a single column.
students['parental level of education'].value_counts()

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64
Use the groupby method to create groups and then apply an aggregation. Here we get the mean math score for each gender:
students.groupby('gender').agg(
    mean_math_score=('math score', 'mean')
)

        mean_math_score
gender
female        63.633205
male          68.728216
Multiple aggregations
Here we do multiple aggregations at the same time.
students.groupby('gender').agg(
    mean_math_score=('math score', 'mean'),
    max_math_score=('math score', 'max'),
    count_math_score=('math score', 'count')
)
We can create groups from more than one column.
students.groupby(['gender', 'test preparation course']).agg(
    mean_math_score=('math score', 'mean'),
    max_math_score=('math score', 'max'),
    count_math_score=('math score', 'count')
)
It looks like students of both sexes who prepped for the test scored higher than those who didn’t.
Pivot Table
A better way to present information to its consumers is the pivot_table function, which does the same thing as groupby but uses one of the grouping columns as the new columns.
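Here’s a sketch of what that call might look like for the exam-scores data; the pivot_scores name is my own, chosen to match the plotting code at the end of this article:

pivot_scores = students.pivot_table(
    index='gender',
    columns='test preparation course',
    values='math score',
    aggfunc='mean'
)
pivot_scores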
Again, it’s the same information presented in a more readable and intuitive format.
Data Wrangling
Let’s bring in a new dataset to examine missing values.
Providing the na_values argument will mark the NULL values in a dataset as NaN (Not a Number).
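For example, assuming a file movies.csv that uses the string 'NULL' for missing entries:

import pandas as pd

# Every value listed in na_values is read in as NaN
movies = pd.read_csv('movies.csv', na_values=['NULL'])
movies.isna().sum()  # count of missing values per column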
You might also be confronted with a dataset where all the columns should be part of one column.
We can use the melt method to stack columns one after another.
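Here’s a minimal sketch using the exam-scores columns from earlier, just to show the mechanics: the three score columns are stacked into a single score column, with a subject column recording where each value came from.

melted = students.melt(
    id_vars='gender',
    value_vars=['math score', 'reading score', 'writing score'],
    var_name='subject',
    value_name='score'
)
melted.head(3)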
Merging Datasets
Knowing a little SQL will come in handy when studying this part of the pandas library.
There are multiple ways to join data in pandas, but the one method you should definitely get comfortable with is merge, which connects rows in DataFrames based on one or more keys. It’s basically an implementation of SQL joins.
Let’s say I had the following data from a movie rental database:
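The original tables aren’t reproduced here, so here are two toy stand-ins (all names and values invented) that are enough to follow along with:

import pandas as pd

customer = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'first_name': ['Mary', 'Patricia', 'Linda']
})
payment = pd.DataFrame({
    'payment_id': [101, 102, 103, 104],
    'customer_id': [1, 1, 2, 4],
    'amount': [2.99, 0.99, 5.99, 4.99]
})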
To perform an “INNER” join, we use merge.
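With the stand-in tables above, the call might look like this (mirroring the SQL below; rows without a match on customer_id, like the payment from customer 4, are dropped):

(customer
    .merge(payment, on='customer_id', how='inner')
    .sort_values('first_name')
    .head(5))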
The SQL (PostgreSQL) equivalent would be something like:
SELECT * FROM customer
INNER JOIN payment
ON payment.customer_id = customer.customer_id
ORDER BY customer.first_name ASC
LIMIT 5;
Time Series Analysis
The name pandas is actually derived from panel data analysis, which combines cross-sectional and time-series data and is used most widely in medical research and economics.
Let’s say I had the following data, which I knew was time-series data but which lacked a DatetimeIndex marking it as such:
       p      a
0  0.749  28.96
1  1.093  67.81
2  0.920  55.15
3  0.960  78.62
4  0.912  60.15
I can simply set the index to a DatetimeIndex.
Which results in:
                p      a
1986-12-31  0.749  28.96
1987-12-31  1.093  67.81
1988-12-31  0.920  55.15
1989-12-31  0.960  78.62
1990-12-31  0.912  60.15
1991-12-31  1.054  45.54
1992-12-31  1.079  33.62
1993-12-31  1.525  44.58
1994-12-31  1.310  41.94
Here we have a dataset where p is the dependent variable and a is the independent variable. Before running an econometric model called AR(1), we’d have to lag the dependent variable to deal with autocorrelation.
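One way to do that (a sketch, not necessarily the original code) is with shift, which pushes each value down one row and leaves NaN where there is no predecessor:

bangla['p_lagged'] = bangla['p'].shift(1)
bangla.head(5)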
                p      a  p_lagged
1986-12-31  0.749  28.96       NaN
1987-12-31  1.093  67.81     0.749
1988-12-31  0.920  55.15     1.093
1989-12-31  0.960  78.62     0.920
1990-12-31  0.912  60.15     0.960
Visualization
The combination of matplotlib and pandas allows us to make rudimentary plots in the blink of an eye:
# Using the previous dataset
bangla.plot();
pivot_scores.plot(kind='bar');
That concludes our brief bus tour of the pandas toolbox for data analysis. There’s a lot more that we’ll dive into for my next series of articles. So stay tuned!
What I do
I help people find mentors, code in Python, and write about life. If you’re thinking about switching careers into the tech industry or just want to talk you can sign up for my Slack Channel via VegasBlu.
Originally published at https://towardsdatascience.com/pandas-for-newbies-an-introduction-part-ii-9f69a045dd95