📜  pyspark groupby sum - Python

📅  Last modified: 2023-12-03 14:45:52.544000             🧑  Author: Mango

Introduction

In PySpark, the groupBy transformation is used to group the data based on one or more columns. After grouping, you can apply aggregation functions like sum, count, avg, etc. on the grouped data.

This tutorial will explain how to use groupBy with the sum function in PySpark. We will provide code examples and explain the steps involved in the process.

Prerequisites

To follow along, make sure you have the following installed:

  • PySpark (Python library for Apache Spark)
  • Apache Spark (a distributed big data computing framework)

Code Examples

Let's assume we have a dataset of sales transactions with the following columns: customer_name, product_name, and amount.

Step 1: Create a PySpark DataFrame

First, we need to create a PySpark DataFrame from our dataset. You can use various methods to create a DataFrame, such as reading data from a file or converting a Pandas DataFrame to a PySpark DataFrame.

Here is an example of creating a PySpark DataFrame from a Python list:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Define the schema for the DataFrame
schema = StructType([
    StructField("customer_name", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("amount", DoubleType(), True)
])

# Sample sales data
data = [("John", "Apple", 10.0),
        ("Mary", "Orange", 15.0),
        ("John", "Banana", 5.0),
        ("Mary", "Apple", 12.0)]

# Create the DataFrame
df = spark.createDataFrame(data, schema)
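
As mentioned above, the same DataFrame could also come from a file or from a Pandas DataFrame. The snippet below is a minimal sketch of those two alternatives; the file name sales.csv and the pandas dependency are assumptions for illustration only:

# Alternative: read the data from a CSV file (assumes a local sales.csv with a header row)
df_from_csv = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Alternative: convert an existing Pandas DataFrame (assumes pandas is installed)
import pandas as pd
pdf = pd.DataFrame(data, columns=["customer_name", "product_name", "amount"])
df_from_pandas = spark.createDataFrame(pdf)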

Step 2: Group by the desired columns

Next, we will use the groupBy transformation to group the DataFrame by one or more columns. In our case, we will group by the customer_name column.

grouped_df = df.groupBy("customer_name")
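
Note that groupBy by itself returns a GroupedData object rather than a DataFrame, so nothing is computed yet. It also accepts multiple columns if you need finer-grained groups:

# Grouping by customer and product together (purely illustrative)
grouped_by_customer_and_product = df.groupBy("customer_name", "product_name")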

Step 3: Apply the sum aggregation

After grouping, we can apply the sum aggregation function to calculate the total amount for each customer.

result_df = grouped_df.sum("amount")
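
The same result can be produced with agg and pyspark.sql.functions.sum, which also lets you rename the output column via alias; the variable name result_with_alias below is just for illustration:

from pyspark.sql import functions as F

# Equivalent aggregation with an explicit column name
result_with_alias = grouped_df.agg(F.sum("amount").alias("total_amount"))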

Step 4: Show the result

Finally, we can use the show action to display the grouped and aggregated result.

result_df.show()

Output:

+-------------+-----------+
|customer_name|sum(amount)|
+-------------+-----------+
|         John|       15.0|
|         Mary|       27.0|
+-------------+-----------+
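
For completeness, a single groupBy can also compute several aggregates at once, including the count and avg functions mentioned in the introduction; this minimal sketch reuses the df defined in Step 1:

from pyspark.sql import functions as F

# Total, number of transactions, and average amount per customer
summary_df = df.groupBy("customer_name").agg(
    F.sum("amount").alias("total_amount"),
    F.count("amount").alias("num_transactions"),
    F.avg("amount").alias("avg_amount")
)
summary_df.show()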

Conclusion

Using groupBy with the sum function in PySpark allows us to group the data by a specific column and calculate the sum of another column within each group. This is a powerful feature for analyzing and summarizing large datasets.