📜  PySpark DataFrame - Mapping Strings to Numbers

📅  Last modified: 2022-05-13 01:55:16.521000             🧑  Author: Mango

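In this article, we will see how to map string values in a PySpark dataframe column to numbers.

Creating a dataframe for demonstration:

Here we create rows of college names, pass them to the createDataFrame() method, and then display the resulting dataframe.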

Python3
# importing module
import pyspark
 
# importing sparksession from pyspark.sql module and Row module
from pyspark.sql import SparkSession,Row
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of college data
dataframe = spark.createDataFrame([Row("vignan"),
                                   Row("rvrjc"),
                                   Row("klu"),
                                   Row("rvrjc"),
                                   Row("klu"),
                                   Row("vignan"),
                                   Row("iit")],
                                  ["college"])
 
# display dataframe
dataframe.show()


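Output of dataframe.show():

+-------+
|college|
+-------+
| vignan|
|  rvrjc|
|    klu|
|  rvrjc|
|    klu|
| vignan|
|    iit|
+-------+
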
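Method 1: Using the map() function

Here we write a function that converts a string to a number and apply it to every row through a lambda expression.

Starting from the college dataframe built with Row, we use a lambda function to map each college name to a numeric value and name the resulting column college_number. The helper function checks the college name and returns 1 for iit, 2 for vignan, 3 for rvrjc, and 4 for any other college.

Python3
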
# function that converts a college name (string) to a number
def string_to_numeric(x):
    # return 1 if college is iit
    if x == "iit":
        return 1
    # return 2 if college is vignan
    elif x == "vignan":
        return 2
    # return 3 if college is rvrjc
    elif x == "rvrjc":
        return 3
    # return 4 if college is none of the above three
    else:
        return 4

# map each college name to its numeric value with a lambda,
# wrap each result in a Row (imported earlier from pyspark.sql)
# and name the new column college_number; the outer parentheses
# let the method chain span several lines
(dataframe.select("college")
 .rdd.map(lambda x: string_to_numeric(x[0]))
 .map(lambda x: Row(x))
 .toDF(["college_number"])
 .show())

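Output:

+--------------+
|college_number|
+--------------+
|             2|
|             3|
|             4|
|             3|
|             4|
|             2|
|             1|
+--------------+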
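The if/elif chain in string_to_numeric() can also be written as a dictionary lookup, which tends to be easier to extend when more colleges are added. The following is a minimal sketch in plain Python, not part of the original article; the names COLLEGE_CODES, DEFAULT_CODE and college_to_number are hypothetical and chosen only for illustration.

```python
# Hypothetical lookup table: college name -> numeric code
# (same values as the if/elif chain in string_to_numeric)
COLLEGE_CODES = {"iit": 1, "vignan": 2, "rvrjc": 3}

# Code assumed for any college that is not in the table
DEFAULT_CODE = 4


def college_to_number(name):
    """Return the code for a known college, or DEFAULT_CODE otherwise."""
    return COLLEGE_CODES.get(name, DEFAULT_CODE)


# Same college names as in the demonstration dataframe
colleges = ["vignan", "rvrjc", "klu", "rvrjc", "klu", "vignan", "iit"]
numbers = [college_to_number(c) for c in colleges]
print(numbers)  # [2, 3, 4, 3, 4, 2, 1]
```

Under these assumptions, college_to_number could be passed to rdd.map() in place of string_to_numeric, since both take a single string and return an integer.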

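Method 2: Using the withColumn() method

Here we use the withColumn() method to add a new column whose value is derived from an existing column.

Example: starting from the college dataframe built with Row, we map each college name to a college number using withColumn() together with when().

Python3
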
# import the col and when functions
from pyspark.sql.functions import col, when

# map college name to college number by adding a
# college_number column with withColumn() and a when() chain;
# otherwise() supplies the value for all remaining colleges
dataframe.withColumn("college_number",
                     when(col("college") == 'iit', 1)
                     .when(col("college") == 'vignan', 2)
                     .when(col("college") == 'rvrjc', 3)
                     .otherwise(4)).show()

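Output:

+-------+--------------+
|college|college_number|
+-------+--------------+
| vignan|             2|
|  rvrjc|             3|
|    klu|             4|
|  rvrjc|             3|
|    klu|             4|
| vignan|             2|
|    iit|             1|
+-------+--------------+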