

This article is a continuation of a previous article which kick-started the journey to learning Python for data analysis. You can check out the previous article here: Pandas for Newbies: An Introduction Part I.


For those just starting out in data science, Python is a prerequisite. If you aren’t familiar with it yet, get comfortable with the basics first and then come back here to start on pandas.


You can start learning Python with a series of articles I just started called Minimal Python Required for Data Science.


As a reminder, what I’m doing here is a brief tour of just some of the things you can do with Pandas. It’s the deep-dive before the actual deep-dive.


Both the data and the inspiration for this series come from Ted Petrou’s excellent courses on Dunder Data.


Prerequisites

  1. Python

  2. pandas

  3. Jupyter


You’ll be ready to begin once you have these three things in order.


Aggregation

We left off last time with the pandas query method as an alternative to regular filtering via boolean conditional logic. While it does have its limits, query is often the more readable option.


Today we continue with aggregation which is the act of summarizing data with a single number. Examples include sum, mean, median, min and max.


Let’s try this on a different dataset.



Get the mean by calling the mean method.


# numeric_only=True skips text columns such as gender (required in pandas 2.0+)
students.mean(numeric_only=True)

math score       66.089
reading score    69.169
writing score    68.054
dtype: float64

Use the axis parameter to calculate the sum of all the scores (math, reading, and writing) across each row:


scores = students[['math score', 'reading score', 'writing score']]
scores.sum(axis=1).head(3)

0    218
1    247
2    278
dtype: int64
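To make the axis argument a bit more concrete, here is a minimal, self-contained sketch. The three rows are illustrative values I picked so that the row totals agree with the output above; they stand in for the real students dataset.

```python
import pandas as pd

# Illustrative scores for three students (stand-in for the real dataset).
scores = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

# axis=0 (the default) collapses the rows: one value per column.
column_means = scores.mean(axis=0)

# axis=1 collapses the columns: one value per row.
row_totals = scores.sum(axis=1)

print(column_means)
print(row_totals)
```

A handy way to remember it: the axis you pass is the one that disappears from the result.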

Non-aggregating methods

Some methods perform calculations on the data without aggregating it. One example is the round method, shown here rounding to the nearest ten:


scores.round(-1).head(3)

   math score  reading score  writing score
0          70             70             70
1          70             90             90
2          90            100             90

Aggregating within groups

Let’s get the frequency of unique values in a single column.


students['parental level of education'].value_counts()

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental level of education, dtype: int64

Use the groupby method to create groups and then apply an aggregation. Here we get the mean math score for each gender:


students.groupby('gender').agg(
    mean_math_score=('math score', 'mean')
)

        mean_math_score
gender
female        63.633205
male          68.728216
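If the keyword syntax inside agg looks unfamiliar, here is a small sketch of how it reads: each keyword becomes an output column, and its value is a (source column, function) pair. The four-row DataFrame below is made up purely for illustration.

```python
import pandas as pd

# Made-up data: two students per gender.
students = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [60, 70, 80, 50],
})

# Named aggregation: new_column_name=(source_column, aggregation_function)
summary = students.groupby("gender").agg(
    mean_math_score=("math score", "mean"),
    max_math_score=("math score", "max"),
)

print(summary)
```

The group labels (female, male) end up as the index of the result, which is why gender appears on its own line in the printed output.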

Multiple aggregation

Here we do multiple aggregations at the same time.


students.groupby('gender').agg(
    mean_math_score=('math score', 'mean'),
    max_math_score=('math score', 'max'),
    count_math_score=('math score', 'count')
)

We can create groups from more than one column.


students.groupby(['gender', 'test preparation course']).agg(
    mean_math_score=('math score', 'mean'),
    max_math_score=('math score', 'max'),
    count_math_score=('math score', 'count')
)

It looks like students of both sexes who prepared for the test scored higher than those who didn’t.


Pivot Table

A more reader-friendly way to present this information is the pivot_table function, which performs the same aggregation as groupby but spreads the values of one of the grouping columns across the columns of the result.



Again, it’s the same information presented in a more readable and intuitive format.
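As a rough sketch of what such a call might look like, here is pivot_table on a small made-up DataFrame, next to the groupby version for comparison. The data and the variable names pivoted and grouped are my own assumptions, not the original example.

```python
import pandas as pd

# Made-up data using the same column names as the students dataset.
students = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "test preparation course": [
        "none", "completed", "none", "completed", "none", "completed",
    ],
    "math score": [60, 80, 65, 75, 70, 85],
})

# One grouping column becomes the rows, the other becomes the columns.
pivoted = students.pivot_table(
    index="gender",
    columns="test preparation course",
    values="math score",
    aggfunc="mean",
)

# The groupby route: aggregate first, then move one index level to the columns.
grouped = (
    students.groupby(["gender", "test preparation course"])["math score"]
    .mean()
    .unstack()
)

print(pivoted)
print(grouped)
```

Both tables hold the same numbers; pivot_table simply gets you to the wide layout in one step.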


Data Wrangling

Let’s bring in a new dataset to see how missing values are handled.



Providing the na_values argument when reading the file (for example with read_csv) tells pandas which placeholder values to treat as missing. Those values are then marked as NaN (Not a Number).
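Here is a minimal sketch of na_values in action. The CSV text, its column names, and the placeholder strings "?" and "missing" are all invented for the example; I read from an in-memory string so the snippet runs without a file.

```python
import io

import pandas as pd

# Invented CSV where missing entries are written as "?" or "missing".
csv_text = """name,age,city
Ann,34,Reno
Bob,?,Las Vegas
Cy,41,missing
"""

# na_values lists the extra strings pandas should treat as missing.
people = pd.read_csv(io.StringIO(csv_text), na_values=["?", "missing"])

# Count the missing values per column.
print(people.isna().sum())
```

Without na_values, the "?" would force the whole age column to be read as text instead of numbers.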


You might also be confronted with a dataset whose columns should really be the values of a single column.



We can use the melt method to stack columns one after another.
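A small sketch of the idea, using a made-up wide table (the names wide, long_form, subject, and score are my own choices for illustration):

```python
import pandas as pd

# Made-up wide table: one column per subject.
wide = pd.DataFrame({
    "student": ["Ann", "Bob"],
    "math": [72, 69],
    "reading": [72, 90],
})

# Keep "student" as an identifier and stack the remaining columns:
# the old column names go into "subject", the cell values into "score".
long_form = wide.melt(id_vars="student", var_name="subject", value_name="score")

print(long_form)
```

Each original row turns into one row per melted column, so two students with two subjects give four rows.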



Merging Datasets

Knowing a little SQL will come in handy when studying this part of the pandas library.


There are multiple ways to join data in pandas, but the one method you should definitely get comfortable with is the merge method which connects rows in DataFrames based on one or more keys. It’s basically an implementation of SQL JOINS.


Let’s say I had the following data from a movie rental database:



To perform an “INNER” join using merge:
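A minimal sketch with two small, made-up tables. Only the table names and customer_id come from this article; the other columns and all the values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for the customer and payment tables.
customer = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["Mary", "Aaron", "Linda"],
})
payment = pd.DataFrame({
    "payment_id": [101, 102, 103],
    "customer_id": [1, 1, 2],
    "amount": [2.99, 4.99, 0.99],
})

# how="inner" keeps only the customer_id values present in both tables.
joined = (
    customer.merge(payment, on="customer_id", how="inner")
    .sort_values("first_name")
    .reset_index(drop=True)
)

print(joined)
```

Linda has no payments, so she drops out of the result, while Mary appears twice because she has two matching payment rows.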


The SQL (PostgreSQL) equivalent would be something like:


SELECT * FROM customer
INNER JOIN payment
ON payment.customer_id = customer.customer_id
ORDER BY customer.first_name ASC

Time Series Analysis

The name pandas is actually derived from “panel data”, a term for datasets that combine cross-sectional data with time series. Such data is used most widely in medical research and economics.


Let’s say I had the following data where I knew it was time-series data, but without a DatetimeIndex specifying it as a time-series:


       p      a
0  0.749  28.96
1  1.093  67.81
2  0.920  55.15
3  0.960  78.62
4  0.912  60.15

I can simply set the index as a DatetimeIndex with:
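One way this could be done is sketched below. I’m assuming one observation per year, dated 31 December and starting in 1986, and a DataFrame named bangla (the name used in the plotting snippet later on). Only the first five rows are recreated here.

```python
import pandas as pd

# First five rows of the data shown above.
bangla = pd.DataFrame({
    "p": [0.749, 1.093, 0.920, 0.960, 0.912],
    "a": [28.96, 67.81, 55.15, 78.62, 60.15],
})

# Assumption: annual observations dated at year end, starting in 1986.
year_ends = [f"{1986 + i}-12-31" for i in range(len(bangla))]

# pd.to_datetime on a list of date strings returns a DatetimeIndex.
bangla.index = pd.to_datetime(year_ends)

print(bangla)
```

pd.date_range with a yearly frequency would work too; I build the dates from strings here because the yearly frequency alias differs between pandas versions.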


Which results in:


                p      a
1986-12-31  0.749  28.96
1987-12-31  1.093  67.81
1988-12-31  0.920  55.15
1989-12-31  0.960  78.62
1990-12-31  0.912  60.15
1991-12-31  1.054  45.54
1992-12-31  1.079  33.62
1993-12-31  1.525  44.58
1994-12-31  1.310  41.94

Here we have a dataset where p is the dependent variable and a is the independent variable. Before running an econometric model called AR(1), we’d have to lag the dependent variable to deal with autocorrelation, which we could do using:
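A sketch of the lagging step with the shift method, rebuilding the first three rows so the snippet runs on its own. The column name p_lagged matches the output shown in this section; the DataFrame name bangla is borrowed from the plotting snippet.

```python
import pandas as pd

# First three rows of the time series, rebuilt so the snippet is self-contained.
bangla = pd.DataFrame(
    {"p": [0.749, 1.093, 0.920], "a": [28.96, 67.81, 55.15]},
    index=pd.to_datetime(["1986-12-31", "1987-12-31", "1988-12-31"]),
)

# shift(1) moves every value down one row, so each date
# sees the previous period's p; the first row has nothing to look back to.
bangla["p_lagged"] = bangla["p"].shift(1)

print(bangla)
```

The first row of p_lagged is NaN, so that row usually has to be dropped before fitting the model.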


                p      a  p_lagged
1986-12-31  0.749  28.96       NaN
1987-12-31  1.093  67.81     0.749
1988-12-31  0.920  55.15     1.093
1989-12-31  0.960  78.62     0.920
1990-12-31  0.912  60.15     0.960

Visualization

The combination of matplotlib and pandas lets us make simple plots in the blink of an eye:


# Using the previous dataset
bangla.plot();

That concludes our brief bus tour of the pandas toolbox for data analysis. There’s a lot more that we’ll dive into in my next series of articles. So stay tuned!


What I do

I help people find mentors, code in Python, and write about life. If you’re thinking about switching careers into the tech industry or just want to talk, you can sign up for my Slack Channel via VegasBlu.


Translated from: https://towardsdatascience.com/pandas-for-newbies-an-introduction-part-ii-9f69a045dd95





