Pivot Tables in pandas - Python (1)

📅 Last modified: 2023-12-03 15:40:02.210000 🧑 Author: Mango

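pandas is a powerful data processing library for Python that provides a wide range of tools for working with tabular data. The pivot table is one of its most useful features: by summarizing and aggregating data, it lets us analyze and explore a dataset quickly.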

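Basic Concepts

Rows: each row of a pivot table usually corresponds to one category in the original data, such as a product, city, or date.
Columns: each column usually corresponds to one category of a second field in the original data, such as a city or year; when several measures are aggregated, each measure forms its own group of columns.
Values: the values in a pivot table are the results of aggregating the original data, such as total sales or average number of customers.
Aggregation function: the function used to compute the aggregated results from the original data, such as sum or mean.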
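To make these four terms concrete, here is a minimal sketch that builds a small summary by hand with groupby and unstack and then compares it with the result of pd.pivot_table. The orders table and its column names are made up for illustration and are not part of the article's examples.

```python
import pandas as pd

# Hypothetical data for illustration: three orders, two products, two cities.
orders = pd.DataFrame({
    'Product': ['A', 'A', 'B'],
    'City': ['Boston', 'Chicago', 'Boston'],
    'Sales': [10, 20, 30],
})

# Rows = Product, columns = City, values = Sales, aggregation function = sum.
by_hand = orders.groupby(['Product', 'City'])['Sales'].sum().unstack('City')

# The same four choices expressed through pivot_table.
pivoted = pd.pivot_table(orders, index='Product', columns='City',
                         values='Sales', aggfunc='sum')

print(by_hand)
print(pivoted)
```

In this sketch the four concepts map onto the index, columns, values, and aggfunc arguments of pivot_table(). Product B has no Chicago order, so that cell is NaN in both tables.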

Creating a Pivot Table

In pandas, pivot tables are created with the pivot_table() function.

import pandas as pd

# Create sample data
data = {'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
        'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'New York', 'Boston', 'Chicago', 'New York'],
        'Sales': [100, 200, 150, 250, 120, 180, 300, 160]}
df = pd.DataFrame(data)

# Create the pivot table (the string 'sum' is used because recent pandas
# versions warn when the builtin sum is passed as aggfunc)
pivot_table = pd.pivot_table(df, index=['Product'], columns=['City'], values=['Sales'], aggfunc='sum')
print(pivot_table)

The output is:

          Sales
City     Boston  Chicago  New York
Product
A           NaN    450.0     380.0
B         380.0    250.0       NaN

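In the code above, we first create a sample dataset containing product names, cities, and sales figures. We then call pivot_table() to build the pivot table: the index parameter names the column whose values become the rows, the columns parameter names the column whose values become the column labels, the values parameter names the column to aggregate, and the aggfunc parameter specifies the aggregation function. Product/City combinations that do not occur in the data appear as NaN.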
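Two further pivot_table() parameters are often useful alongside the four described above: fill_value and margins. The following is a minimal sketch using the same sample data; the variable name table is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago',
             'New York', 'Boston', 'Chicago', 'New York'],
    'Sales': [100, 200, 150, 250, 120, 180, 300, 160],
})

# fill_value=0 replaces missing Product/City combinations with 0;
# margins=True appends an "All" row and an "All" column holding totals.
table = pd.pivot_table(df, index='Product', columns='City', values='Sales',
                       aggfunc='sum', fill_value=0, margins=True)
print(table)
```

With this data, the bottom-right "All" cell holds 1460, the sum of all eight sales figures.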

Advanced Pivot Table Usage

Grouping by Multiple Columns

import pandas as pd

# Create sample data
data = {'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
        'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'New York', 'Boston', 'Chicago', 'New York'],
        'Year': ['2019', '2019', '2020', '2020', '2019', '2019', '2020', '2020'],
        'Sales': [100, 200, 150, 250, 120, 180, 300, 160]}
df = pd.DataFrame(data)

# Create the pivot table, grouping rows by Product and Year
pivot_table = pd.pivot_table(df, index=['Product', 'Year'], columns=['City'], values=['Sales'], aggfunc='sum')
print(pivot_table)

The output is:

               Sales
City          Boston  Chicago  New York
Product Year
A       2019     NaN      NaN     220.0
        2020     NaN    450.0     160.0
B       2019   380.0      NaN       NaN
        2020     NaN    250.0       NaN

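In the code above, we pass two column names to the index parameter, so the data is grouped by both columns. The resulting pivot table has a multi-level row index (MultiIndex): the first level corresponds to the first grouping key and the second level to the second grouping key.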
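Once the rows carry a multi-level index, individual levels can be selected directly. The sketch below shows a few common ways to do that on the same sample data; the variable name table is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago',
             'New York', 'Boston', 'Chicago', 'New York'],
    'Year': ['2019', '2019', '2020', '2020', '2019', '2019', '2020', '2020'],
    'Sales': [100, 200, 150, 250, 120, 180, 300, 160],
})
table = pd.pivot_table(df, index=['Product', 'Year'], columns='City',
                       values='Sales', aggfunc='sum')

# All years for product A (selection on the first index level).
print(table.loc['A'])

# One specific (Product, Year) row; Year is stored as a string here.
print(table.loc[('A', '2020')])

# Every product for 2020, selected on the second index level.
print(table.xs('2020', level='Year'))

# Turn the index levels back into ordinary columns.
print(table.reset_index())
```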

Aggregating Multiple Value Columns

import pandas as pd

# Create sample data
data = {'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
        'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'New York', 'Boston', 'Chicago', 'New York'],
        'Year': ['2019', '2019', '2020', '2020', '2019', '2019', '2020', '2020'],
        'Sales': [100, 200, 150, 250, 120, 180, 300, 160],
        'Profit': [10, 20, 15, 25, 12, 18, 30, 16]}
df = pd.DataFrame(data)

# Create the pivot table, aggregating both Sales and Profit
pivot_table = pd.pivot_table(df, index=['Product', 'Year'], columns=['City'], values=['Sales', 'Profit'], aggfunc='sum')
print(pivot_table)

The output is:

              Profit                      Sales
City          Boston  Chicago  New York  Boston  Chicago  New York
Product Year
A       2019     NaN      NaN      22.0     NaN      NaN     220.0
        2020     NaN     45.0      16.0     NaN    450.0     160.0
B       2019    38.0      NaN       NaN   380.0      NaN       NaN
        2020     NaN     25.0       NaN     NaN    250.0       NaN

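In the code above, we pass two column names to the values parameter, so both columns are aggregated. The resulting pivot table contains two groups of value columns, one for Profit and one for Sales, each broken down by city.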
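When several value columns are aggregated, each one can also be given its own aggregation function by passing a dictionary to aggfunc. The sketch below sums Sales but averages Profit; the choice of mean for Profit is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago',
             'New York', 'Boston', 'Chicago', 'New York'],
    'Sales': [100, 200, 150, 250, 120, 180, 300, 160],
    'Profit': [10, 20, 15, 25, 12, 18, 30, 16],
})

# A dict maps each value column to its own aggregation function.
table = pd.pivot_table(df, index='Product', columns='City',
                       values=['Sales', 'Profit'],
                       aggfunc={'Sales': 'sum', 'Profit': 'mean'})
print(table)

# Selecting the top column level returns the sub-table for one measure.
print(table['Sales'])
```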

Custom Aggregation

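import pandas as pd

# Create sample data
data = {'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
        'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'New York', 'Boston', 'Chicago', 'New York'],
        'Year': ['2019', '2019', '2020', '2020', '2019', '2019', '2020', '2020'],
        'Sales': [100, 200, 150, 250, 120, 180, 300, 160],
        'Profit': [10, 20, 15, 25, 12, 18, 30, 16]}
df = pd.DataFrame(data)

# Custom aggregation function. pivot_table calls aggfunc with one value column
# of one group at a time (a Series), so the Sales weights are looked up in df
# through the index labels that the Series keeps.
def weighted_mean(profit):
    weights = df.loc[profit.index, 'Sales']
    return (profit * weights).sum() / weights.sum()

# Create the pivot table: Sales-weighted mean of Profit
pivot_table = pd.pivot_table(df, index=['Product'], columns=['City'], values='Profit', aggfunc=weighted_mean)
print(pivot_table)

The output is:

City        Boston  Chicago   New York
Product
A              NaN     25.0  13.157895
B        19.052632     25.0        NaN

In the code above, we define a custom aggregation function weighted_mean() that computes the Sales-weighted mean of Profit. Because pivot_table() passes aggfunc a single value column of one group at a time, the function cannot read a second column from its argument (an expression such as x['Sales'] would raise a KeyError); instead it looks up the matching Sales values in df through the index of the Series it receives. Passing this function as the aggfunc parameter produces a pivot table of weighted means.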
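A weighted mean needs two columns at once, whereas an aggfunc is called with one column at a time. One possible alternative, sketched here under that assumption, is to compute the statistic with groupby and apply, where the function receives a DataFrame per group, and then reshape the result with unstack. The names weighted_mean_of_group and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'A'],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago',
             'New York', 'Boston', 'Chicago', 'New York'],
    'Sales': [100, 200, 150, 250, 120, 180, 300, 160],
    'Profit': [10, 20, 15, 25, 12, 18, 30, 16],
})

def weighted_mean_of_group(group):
    # group is a DataFrame with the Sales and Profit rows of one Product/City pair.
    return (group['Sales'] * group['Profit']).sum() / group['Sales'].sum()

# Select the two needed columns explicitly so the function sees only them,
# then move City from the row index to the columns.
result = (df.groupby(['Product', 'City'])[['Sales', 'Profit']]
            .apply(weighted_mean_of_group)
            .unstack('City'))
print(result)
```

The result has the same shape as a pivot table with Product as rows and City as columns, and combinations that do not occur in the data are again NaN.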