How to Group By on a MultiIndex in Pandas?

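In this article, we show how to use groupby on a multi-index DataFrame in Pandas. In exploratory data analysis, we often use groupby to group the data in one column by the values of another column, which lets us analyze how one column's values vary across the groups defined by the other. As an alternative to groupby, we can also use a pivot table.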

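A groupby operation involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and to compute operations on those groups. Any groupby operation involves one or more of the following steps on the original DataFrame:

Splitting the object.
Applying a function.
Combining the results.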
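
The three steps can be sketched by hand on a small frame and compared with the built-in one-liner. The toy DataFrame and its values below are made up for illustration and are not taken from the article's CSV file.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "individuals": [100, 300, 200, 400],
    }
)

# 1. Split: iterating over a GroupBy object yields (key, sub-frame) pairs.
pieces = {key: part for key, part in toy.groupby("region")}

# 2. Apply: compute one value per piece.
sums = {key: part["individuals"].sum() for key, part in pieces.items()}

# 3. Combine: assemble the per-group values into a single Series.
manual = pd.Series(sums, name="individuals").rename_axis("region")

# The usual one-liner performs all three steps at once.
direct = toy.groupby("region")["individuals"].sum()

print(manual)
print(direct)
```

Both Series should hold the same per-region totals; in practice only the one-liner is needed, and the manual version is just a way to see what it does.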

Syntax (as of pandas 2.0; the older squeeze parameter was removed in that release): DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)

Parameters :

by: mapping, function, label, or list of labels
axis: {0 or 'index', 1 or 'columns'}, default 0
level: int, level name, or sequence of such, default None
sort: bool, default True

Returns : DataFrameGroupBy
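
As a rough illustration of the by and level parameters, the sketch below groups the same made-up data once through a column label and once through an index level. The toy frame is hypothetical and only reuses the column names that appear later in this article.

```python
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West"],
        "state": ["A", "B", "C"],
        "individuals": [100, 300, 200],
    }
)

# `by` groups on a column label and returns a DataFrameGroupBy object.
by_column = toy.groupby(by="region")
print(type(by_column).__name__)

# `level` groups on an index level, so the columns are moved
# into the index first.
indexed = toy.set_index(["region", "state"])
by_level = indexed.groupby(level="region")

# Both routes define the same groups and give the same totals.
col_totals = by_column["individuals"].sum()
lvl_totals = by_level["individuals"].sum()
print(col_totals)
print(lvl_totals)
```

Nothing is computed until an aggregation such as sum() is called on the returned object.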

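To group on index levels, we pass the index names in a list to the level parameter of the groupby function. The 'region' index is the level 0 index and the 'state' index is the level 1 index. In this article, we will use the CSV file homelessness.csv.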

Let's look at the CSV file:

Python3

# importing pandas library 
# as alias pd
import pandas as pd
  
# storing the data in the df dataframe
# using pandas 'read_csv()'.
df = pd.read_csv('homelessness.csv')
  
print(df.head())

Output:

Columns in the DataFrame

We can list the columns of a DataFrame with the Pandas columns attribute.

Python3

# using pandas columns attribute.
col = df.columns
  
print(col)

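Output:

So far this DataFrame has no meaningful index, only the default integer index. First, we have to turn it into a multi-index (hierarchical index) DataFrame.

Multi-index

A DataFrame with more than one index level is called a multi-index DataFrame. To learn more about multi-index DataFrames, how to build one, and how to use one for data exploration, you can refer to this article.

To give the DataFrame a multi-index, we will use the Pandas set_index() function, making the 'region' and 'state' columns of the DataFrame the index.

Python3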

# using the pandas set_index().
# passing the name of the columns in the list.
  
df = df.set_index(['region' , 'state'])
  
# sort the data using sort_index(); it returns a new
# DataFrame, so the result is assigned back to df
df = df.sort_index()
  
print(df.head())

Output:

Now the DataFrame is a multi-index DataFrame with the 'region' and 'state' columns as its index.
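
The resulting index can be inspected directly. The sketch below uses a small made-up frame (not the article's CSV) to show the level count, the level names, and how a single row is addressed once both columns are in the index.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order.
raw = pd.DataFrame(
    {
        "region": ["West", "East", "East"],
        "state": ["C", "A", "B"],
        "individuals": [200, 100, 300],
    }
)

unsorted = raw.set_index(["region", "state"])

# sort_index() returns a new DataFrame and leaves `unsorted` untouched.
toy = unsorted.sort_index()

print(toy.index.nlevels)       # number of index levels
print(list(toy.index.names))   # level names, in level order

# Each level can be read back by position or by name.
regions = toy.index.get_level_values(0)
states = toy.index.get_level_values("state")

# A row is addressed by a (region, state) tuple.
row_val = toy.loc[("East", "B"), "individuals"]
print(row_val)
```

Level 0 corresponds to 'region' and level 1 to 'state' because that is the order in which the columns were passed to set_index().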

Using groupby on a multi-index DataFrame:

Here we refer to the index levels by their integer positions, starting from 0.

Python3

# passing the level of indexes in 
# the list to the level argument. 
df.groupby(level=[0,1]).sum()

Output:

Instead of level numbers, we can also pass the names of the index levels.

Python3

# passing name of the index in 
# the level argument.
y = df.groupby(level=['region'])['individuals'].mean()
  
print(y)

Output:
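
That level numbers and level names select the same groups can be checked on a small frame. The data below is hypothetical and only mirrors the index layout used in this article.

```python
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "state": ["A", "B", "C", "D"],
        "individuals": [100, 300, 200, 600],
    }
).set_index(["region", "state"])

# Level 0 is 'region', so these two calls group identically.
by_number = toy.groupby(level=0)["individuals"].mean()
by_name = toy.groupby(level="region")["individuals"].mean()
print(by_number.equals(by_name))

# Grouping on both levels keeps one group per (region, state) pair.
both = toy.groupby(level=[0, 1])["individuals"].sum()
print(both)
```

Names tend to be easier to read and keep working if the order of the index levels changes, while numbers are handy when the levels are unnamed.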

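We can explore further with some of the groupby methods.

1. apply() in groupby:

Suppose we want to know, for each region, how many states have more than 1000 'family_members'. For this kind of question we can use apply(). We pass apply() a function written for the specific task; here we use a lambda function, which is a convenient way to write a function in one line.

Python3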

# import numpy library as alias np
import numpy as np
  
# calling .apply() with a lambda function that
# counts, for each region, the states with
# more than 1000 family_members.
fam_1000 = df.groupby(
  level=["region"])["family_members"].apply(lambda x : np.sum(x>1000))
  
print(fam_1000)

Output:
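
The same count can also be obtained without a lambda, by comparing first and then summing the resulting booleans per group. This is an alternative sketch on made-up data, not the article's own method or its CSV file.

```python
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "state": ["A", "B", "C", "D"],
        "family_members": [500, 1500, 2500, 3000],
    }
).set_index(["region", "state"])

# apply() with a lambda, in the style used in this article.
via_apply = toy.groupby(level="region")["family_members"].apply(
    lambda x: (x > 1000).sum()
)

# Alternative: build the boolean mask once, then sum it per region.
over_1000 = toy["family_members"] > 1000
via_sum = over_1000.groupby(level="region").sum()

print(via_apply)
print(via_sum)
```

The second form avoids calling a Python function once per group, which may matter on larger data; for a handful of groups either version is fine.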

2. agg() in groupby:

The agg() function can be used to perform statistical operations such as min(), max(), and mean(). To perform several of them at once, we pass them in a list.

Python3

# performing max() and min() operation, 
# on the 'state_pop' column.
df_agg = df.groupby(
  level=["region", "state"])["state_pop"].agg(["max", "min"])
  
print(df_agg)

Output:
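
If each (region, state) pair labels a single row, which is an assumption about the CSV file, then max and min per pair are the same value, and grouping by region alone gives a more informative summary. The sketch below shows this on hypothetical data, along with keyword-style named aggregation for choosing the output column names.

```python
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "state": ["A", "B", "C", "D"],
        "state_pop": [1000, 3000, 2000, 8000],
    }
).set_index(["region", "state"])

# A list of function names gives one output column per function.
per_region = toy.groupby(level="region")["state_pop"].agg(["max", "min"])
print(per_region)

# Named aggregation: keyword = output column name, value = function.
named = toy.groupby(level="region")["state_pop"].agg(
    largest="max", smallest="min"
)
print(named)
```

The column names largest and smallest are arbitrary labels chosen for this example.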

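3. transform() in groupby:

transform() applies a function to each group and returns a result with the same shape and index as the input. We pass transform() the function that performs the desired computation; here we use a lambda function.

Python3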

# defining the lambda function as 'score'
score  = (lambda x : (x / x.mean()))
  
# applying transform() on all the 
# columns of DataFrame inside the
# transform(), passing the score 
df_tra = df.groupby(level=["region"]).transform(score)
print(df_tra.head(10))

Output:
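
Unlike an aggregation, transform() keeps one output row per input row. The sketch below checks this on a single made-up column and also shows an equivalent form that uses the built-in "mean" transform; the data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "state": ["A", "B", "C", "D"],
        "individuals": [100, 300, 200, 600],
    }
).set_index(["region", "state"])

grouped = toy.groupby(level="region")["individuals"]

# Each value divided by the mean of its own region.
ratio = grouped.transform(lambda x: x / x.mean())

# Equivalent form: broadcast the group mean, then divide.
ratio_alt = toy["individuals"] / grouped.transform("mean")

print(ratio)
print(ratio.index.equals(toy.index))
```

Because the result is aligned with the original index, it can be assigned straight back as a new column of the same DataFrame.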

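Note: An alternative to the groupby operation is pivot_table, which also groups one column by the values of other columns. A pivot table may be more useful when we want to analyze the groups statistically.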
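
As a rough comparison, the sketch below computes the same per-region mean with groupby and with pivot_table. The toy frame is hypothetical; the index is reset first so that 'region' is an ordinary column for pivot_table.

```python
import pandas as pd

# Hypothetical toy data with made-up values.
toy = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "state": ["A", "B", "C", "D"],
        "individuals": [100, 300, 200, 600],
    }
).set_index(["region", "state"])

# groupby on the index level returns a Series.
grouped = toy.groupby(level="region")["individuals"].mean()

# pivot_table on the flattened frame returns a DataFrame.
flat = toy.reset_index()
pivoted = flat.pivot_table(
    index="region", values="individuals", aggfunc="mean"
)

print(grouped)
print(pivoted)
```

The numbers agree; the main difference is the shape of the result, since pivot_table can also spread a second key across the columns.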