13 Grouping and aggregation
《Python数据分析技术栈》, Chapter 06: Preparing data with Pandas
Aggregation is the process of summarizing a group of values into a single value.
Hadley Wickham, a statistician, laid down the “Split-Apply-Combine” methodology (the paper can be accessed here: https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf), which has three steps:
- Split the data into smaller groups that are manageable and independent of each other. This is done using the groupby method in Pandas.
- Apply functions on each of these groups. We can apply any of the aggregation functions, including minimum, maximum, median, mean, sum, count, standard deviation, variance, and size. Each of these aggregate functions calculates the aggregate value of the entire group. Note that we can also write a customized aggregation function.
- Combine the results after applying functions to each group into a single combined object.
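The three steps can be spelled out by hand and compared with a single groupby call. This is a minimal sketch on a made-up toy DataFrame (the values and the variable names toy, pieces, totals, manual, and builtin are assumptions for illustration, not part of the COVID-19 dataset):

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "cases": [10, 20, 30, 40, 5],
})

# Split: collect one sub-DataFrame per continent.
pieces = {name: group for name, group in toy.groupby("continent")}

# Apply: reduce each piece to a single value.
totals = {name: int(group["cases"].sum()) for name, group in pieces.items()}

# Combine: assemble the per-group results into one object.
manual = pd.Series(totals, name="cases").sort_index()

# The same three steps expressed as a single groupby call.
builtin = toy.groupby("continent")["cases"].sum()

print(manual)
print(builtin)
```

Both objects should hold the same per-continent totals; the groupby call simply hides the loop.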
In the following section, we look at the groupby method, aggregation functions, the transform, filter, and apply methods, and the properties of the groupby object.
Here, we again use the same COVID-19 dataset, which shows the number of cases and deaths for all countries on 12th April 2020.
import pandas as pd
import numpy as np
df = pd.read_csv('subset-covid-data.csv')
df.head()
As we can see, there are several countries belonging to the same continent. Let us find the total number of cases and deaths for each continent. For this, we need to do grouping using the ‘continent’ column.
df.groupby('continent')[['cases','deaths']].sum()
Here, we are grouping by the column “continent”, which becomes the grouping column. We are aggregating the values of the number of cases and deaths, which makes the columns named “cases” and “deaths” the aggregating columns. The sum method, which becomes our aggregating function, calculates the total of cases and deaths for all countries belonging to a given continent. Whenever you perform a groupby operation, it is recommended that these three elements (grouping column, aggregating column, and aggregating function) be identified at the outset.
The following thirteen aggregate functions can be applied to groups: sum(), max(), min(), std(), var(), mean(), count(), size(), sem(), first(), last(), describe(), nth().
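Two of the functions listed above, count and size, are easy to confuse. The following sketch, on a made-up toy DataFrame with one missing value (all values are assumptions for illustration), suggests how they differ: count tallies only non-missing values, while size reports the number of rows in each group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the Asia group.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "deaths": [2.0, np.nan, 10.0, 20.0, 30.0],
})

grouped = toy.groupby("continent")["deaths"]

counts = grouped.count()  # non-missing values per group
sizes = grouped.size()    # rows per group, missing values included
means = grouped.mean()    # missing values are skipped by default

print(pd.DataFrame({"count": counts, "size": sizes, "mean": means}))
```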
We can also use the agg method, with np.sum as an argument, which produces the same output as the previous statement:
df.groupby('continent')[['cases','deaths']].agg(np.sum)
The agg method can accept any aggregating function, such as mean, sum, or max. These can be passed either as NumPy functions (np.mean, np.sum, np.max) or as string names ('mean', 'sum', 'max').
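As a rough illustration, agg can also take a list that mixes built-in aggregation names with a customized aggregation function. The toy data and the helper value_range below are hypothetical and exist only for this sketch:

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "cases": [10, 30, 5, 50],
})

def value_range(series):
    """Hypothetical custom aggregation: largest value minus smallest value."""
    return series.max() - series.min()

# One output column per entry in the list; the custom function
# contributes a column named after the function itself.
summary = toy.groupby("continent")["cases"].agg(["sum", "mean", value_range])

print(summary)
```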
We can also pass the aggregating column and the aggregating method as a dictionary to the agg method, as follows, which would again produce the same output.
df.groupby('continent').agg({'cases':np.sum,'deaths':np.sum})
If there is more than one grouping column, use a list to save the column names as strings and pass this list as an argument to the groupby method.
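A short sketch of grouping by more than one column follows. The "region" column is hypothetical (it is not stated to exist in the COVID-19 dataset) and the values are made up:

```python
import pandas as pd

# Hypothetical toy data; "region" is an invented second grouping column.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "region": ["East", "East", "South", "West", "West"],
    "cases": [1, 2, 3, 4, 5],
})

# Save the grouping column names in a list and pass the list to groupby.
group_cols = ["continent", "region"]
totals = toy.groupby(group_cols)["cases"].sum()

# The result is indexed by one level per grouping column.
print(totals)
```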
Further reading on aggregate functions: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation
Examining the properties of the groupby object
The result of applying the groupby method is a groupby object. This groupby object has several properties that are explained in this section.
Data type of groupby object
The data type of a groupby object can be accessed using the type function.
grouped_continents=df.groupby('continent')
type(grouped_continents)
Obtaining the names of the groups
The groupby object has an attribute called groups. Using this attribute on the groupby object would return a dictionary, with the keys of this dictionary being the names of the groups.
grouped_continents.groups.keys()
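To make the structure of this dictionary concrete, here is a small sketch on made-up data (the country labels are placeholders). The keys are the group names, and each value holds the index labels of the rows that fall into that group:

```python
import pandas as pd

# Hypothetical toy data; country names are placeholders.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "country": ["A1", "E1", "A2", "F1"],
})

grouped = toy.groupby("continent")

# Keys of the dictionary: the group names.
names = sorted(grouped.groups.keys())

# Value for one key: index labels of the rows in that group.
asia_rows = list(grouped.groups["Asia"])

print(names)
print(asia_rows)
```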
Returning records with the same position in each group using the nth method
Let us say that you want to see the details of the fourth country belonging to each continent. Using the nth method, we can retrieve this data by using a positional index value of 3 for the fourth position.
grouped_continents.nth(3)
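A smaller sketch of the same idea, on made-up data, picks the second row of each group with a positional index of 1. Groups that have too few rows (Africa here) contribute nothing. Note that, as far as I am aware, the index of the result differs between pandas versions (group names in older releases, original row labels from pandas 2.0 onward), so the example inspects a column rather than the index:

```python
import pandas as pd

# Hypothetical toy data; country names and case counts are placeholders.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "country": ["A1", "A2", "E1", "E2", "F1"],
    "cases": [10, 20, 30, 40, 5],
})

# Positional index 1 selects the second row within each group.
second = toy.groupby("continent").nth(1)

# Africa has only one row, so it has no second record to return.
second_countries = sorted(second["country"])

print(second_countries)
```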
Get all the data for a particular group using the get_group method
Use the get_group method with the name of the group as an argument to this method. In this example, we retrieve all data for the group named ‘Europe’.
grouped_continents.get_group('Europe')
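As a sanity check on what get_group returns, the sketch below (made-up data, placeholder country names) compares it with an ordinary boolean filter on the grouping column. Both should select the same rows:

```python
import pandas as pd

# Hypothetical toy data; country names are placeholders.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Europe"],
    "country": ["A1", "E1", "A2", "E2"],
})

# All rows belonging to the group named 'Europe'.
europe = toy.groupby("continent").get_group("Europe")

# The same rows selected with a boolean mask, for comparison.
europe_mask = toy[toy["continent"] == "Europe"]

print(europe)
```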
We have seen how to apply aggregate functions to the groupby object. Now let us look at some other functions, like filter, apply, and transform, that can also be used with a groupby object.
Filtering groups
The filter method removes or filters out groups based on a particular condition. While the agg (aggregate) method returns one value for each group, the filter method returns records from each group depending on whether the condition is satisfied.
Let us consider an example to understand this. We want to return all the rows for the continents where the average number of deaths per country is greater than 40. The filter method is called on a groupby object and the argument to the filter method is a lambda function or a predefined function. The lambda function here calculates the average number of deaths for every group, represented by the argument “x”. This argument is a DataFrame representing each group (which is the continent in our example). If the condition is satisfied for the group, all its rows are returned. Otherwise, all the rows of the group are excluded.
grouped_continents = df.groupby('continent')
grouped_continents.filter(lambda x: x['deaths'].mean() > 40)
In the output, we see that only the rows for the groups (continents) ‘America’ and ‘Europe’ are returned since these are the only groups that satisfy the condition (group mean number of deaths greater than 40).
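The all-or-nothing behaviour of filter is easier to see on a tiny example. In this sketch (made-up values, placeholder country names) only the Europe group has a mean above 40, so both of its rows are kept and every row of the other groups is dropped:

```python
import pandas as pd

# Hypothetical toy data; group means are Asia 15, Europe 75, Africa 3.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "country": ["A1", "A2", "E1", "E2", "F1"],
    "deaths": [10, 20, 100, 50, 3],
})

# The function receives each group as a DataFrame and returns True or False.
kept = toy.groupby("continent").filter(lambda g: g["deaths"].mean() > 40)

# Rows keep their original index labels and column layout.
print(kept)
```

Note that the row with 50 deaths would survive a row-level comparison against 40 on its own merits, but here it is kept because of its group's mean, not its own value.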
Transform method and groupby
The transform method is another method that can be used with the groupby object. It applies a function to the values of each group and returns an object that has the same number of rows as the original DataFrame or Series and is indexed in the same way.
Let us use the transform method on the population column to obtain the population in millions by dividing each value in the column by 1000000.
grouped_continents['population'].transform(lambda x:x/1000000)
Notice that while the filter method may return fewer records than its input object, the transform method returns the same number of records as the input object.
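Because the result lines up row by row with the input, transform is also handy for broadcasting a group-level value back onto every row. The sketch below is one possible use, on made-up data: it computes each row's share of its continent's total. The column name share_of_continent is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "cases": [10, 30, 5, 15],
})

# Each row receives the total of its own group, so the result
# has exactly as many rows as the input.
continent_total = toy.groupby("continent")["cases"].transform("sum")

# Row-wise arithmetic works because the two Series share an index.
toy["share_of_continent"] = toy["cases"] / continent_total

print(toy)
```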
In the preceding example, we have applied the transform method on a Series. We can also use it on an entire DataFrame. A common application of the transform method is filling null values. Let us fill the missing values in our DataFrame with the value 0. In the output, notice that the values for the country code and population for the country ‘Anguilla’ (which were missing earlier) are now replaced with the value 0.
grouped_continents.transform(lambda x:x.fillna(0))
The transform method can be used with any Series or a DataFrame and not just with groupby objects. Creating a new column from an existing column is a common application of the transform method.
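A minimal sketch of that use, without any grouping, might look as follows. The data and the new column name population_millions are assumptions made for the example:

```python
import pandas as pd

# Hypothetical toy data; population figures are invented.
toy = pd.DataFrame({
    "country": ["A1", "E1"],
    "population": [2_500_000, 10_000_000],
})

# transform called directly on a Series; the result is stored as a new column.
toy["population_millions"] = toy["population"].transform(lambda x: x / 1_000_000)

print(toy)
```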
Apply method and groupby
The apply method “applies” a function to each group of the groupby object. The difference between the apply and transform method is that the apply method is more flexible in that it can return an object of any shape while the transform method needs to return an object of the same shape.
The apply method can return a single (scalar) value, Series or DataFrame, and the output need not be in the same structure as the input. Also, while the transform method applies the function on each column of a group, the apply method applies the function on the entire group.
Let us use the apply method to calculate the total missing values in each group (continent).
grouped_continents.apply(lambda x:x.isna().sum())
The apply method, similar to the transform method, can be used with Series and DataFrame objects in addition to the groupby object.
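The flexibility of apply with a groupby object can be sketched on made-up data. The same groupby object below yields a DataFrame when the function returns a Series per group, and a Series when the function returns a scalar per group. The data columns are selected explicitly so that the function only sees "cases" and "deaths"; that selection, the values, and the names missing and fatality are choices made for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values in the Asia group.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "cases": [100.0, np.nan, 200.0, 50.0],
    "deaths": [np.nan, np.nan, 10.0, 5.0],
})

grouped = toy.groupby("continent")[["cases", "deaths"]]

# Function returns a Series per group -> result is a DataFrame.
missing = grouped.apply(lambda g: g.isna().sum())

# Function returns a scalar per group -> result is a Series.
fatality = grouped.apply(lambda g: g["deaths"].sum() / g["cases"].sum())

print(missing)
print(fatality)
```

The second call uses two columns of the group at once, which is something a column-by-column transform could not express.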