使用 withColumn 向现有 PySpark DataFrame 添加两列

在本文中，我们将了解如何使用 WithColumns 向现有 Pyspark 数据框添加两列。

WithColumns 用于更改值、转换现有列的数据类型、创建新列等等。

Syntax: df.withColumn(colName, col)

Returns: A new :class:`DataFrame` by adding a column or replacing the existing column that has the same name.

示例 1：创建 Dataframe，然后添加两列。

在这里，我们将从给定数据集的列表中创建一个数据框。

Python3

# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()
  
# Create a spark dataframe
columns = ["Name", "Course_Name",
           "Months",
           "Course_Fees", "Discount",
           "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3, 10000, 1000,
     "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills", 2,
     8000, 800, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6,
     15000, 1500, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12,
     60000, 900, "02-12-2021", False),
]
  
df = spark.createDataFrame(data).toDF(*columns)
  
# View the dataframe
df.show()

Python3

new_df = df.withColumn('After_discount',
                       df.Course_Fees - df.Discount).withColumn('Before_discount',
                                                                df.Course_Fees)
new_df.show()

Python3

# import pandas to read json file
import pandas as pd
  
# importing module
import pyspark
  
# importing sparksession from pyspark.sql
# module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
  
# create Dataframe
df = spark.read.option("header",True).csv("Cricket_data_set_odi.csv")
  
# Display Schema
df.printSchema()
  
# Show Dataframe
df.show()

Python3

new_df = df.withColumn(
    'Hundred_run', df.Hundreds*100).withColumn(
    'Avg_run', df.Runs / df.Matches)
  
new_df.show()

输出：

现在添加列：

在这里，我们根据现有列创建两列。

蟒蛇3

new_df = df.withColumn('After_discount',
                       df.Course_Fees - df.Discount).withColumn('Before_discount',
                                                                df.Course_Fees)
new_df.show()

输出：

示例 2：从 csv 创建数据框，然后添加列。

在这里，我们将使用 cricket_data_set_odi.csv 文件作为数据集并从该文件创建数据帧。

创建用于演示的数据框：

蟒蛇3

# import pandas to read json file
import pandas as pd
  
# importing module
import pyspark
  
# importing sparksession from pyspark.sql
# module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
  
# create Dataframe
df = spark.read.option("header",True).csv("Cricket_data_set_odi.csv")
  
# Display Schema
df.printSchema()
  
# Show Dataframe
df.show()

输出：

然后，在现有数据框中添加列：

蟒蛇3

new_df = df.withColumn(
    'Hundred_run', df.Hundreds*100).withColumn(
    'Avg_run', df.Runs / df.Matches)
  
new_df.show()

输出：