PySpark DataFrame – 删除带有 NULL 或 None 值的行

有时在处理数据帧内的数据时，我们可能会得到空值。为了清理数据集，我们必须删除数据框中的所有空值。因此，在本文中，我们将学习如何在 PySpark DataFrame 中删除具有 NULL 或 None 值的行。

函数

在 pyspark 中，可以使用drop()函数从数据框中删除空值。它采用以下参数：-

Syntax: dataframe_name.na.drop(how=”any/all”,thresh=threshold_value,subset=[“column_name_1″,”column_name_2”])

how – This takes either of the two values ‘any’ or ‘all’. ‘any’, drop a row if it contains NULLs on any columns and ‘all’, drop a row only if all columns have NULL values. By default it is set to ‘any’
thresh – This takes an integer value and drops rows that have less than that thresh hold non-null values. By default it is set to ‘None’.
subset – This parameter is used to select a specific column to target the NULL values in it. By default it’s ‘None

注意： DataFrame 有一个变量 na，它代表类 DataFrameNaFunctions 的一个实例。因此，我们在 DataFrame 上使用 na 变量来使用 drop()函数。

我们正在使用findspark.init()函数指定我们的 spark 目录路径，以使我们的程序能够在我们的本地机器中找到 apache spark 的位置。如果您在云上运行程序，请忽略此行。假设我们在 c 驱动器中有我们的 spark 文件夹，名称为 spark，所以函数看起来像：- findspark.init('c:/spark')。在本地运行程序时，不指定路径有时可能会导致 py4j.protocol.Py4JError 错误。

例子

示例 1：删除具有任何空值的所有行

在此示例中，我们将创建自己的自定义数据集并使用 drop()函数消除具有空值的行。我们将删除数据框中具有 Null 值的所有行。由于我们正在创建我们自己的数据，因此我们需要连同它一起指定我们的模式以创建数据集。

Python3

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
import findspark
  
  
# spark location
# add the respective path to your spark
findspark.init('_path-to-spark_')
  
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
  
# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
  
# drop None Values
df.na.drop(how="any").show(truncate=False)
  
# stop spark session
spark.stop()

Python3

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
import findspark
  
  
# spark location
# add the respective path to your spark
findspark.init('_path-to-spark_')
  
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
  
# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
  
# drop None Values
df.na.drop(subset=["Class ID"]).show(truncate=False)
  
# stop spark session
spark.stop()

Python3

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
import findspark
  
  
# spark location
# add the respective path to your spark
findspark.init('_path-to-spark_')
  
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
  
# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
  
# drop None Values
df.dropna().show(truncate=False)
  
# stop spark session
spark.stop()

输出：

示例 2：删除特定列中具有任何空值的所有行

我们还可以使用子集字段选择要检查的特定列。在这个例子中，我们使用我们定制的数据集，将仅删除在类 ID 列中具有空值的行的数据。由于我们正在创建我们自己的数据，因此我们需要连同它一起指定我们的模式以创建数据集。我们可以通过以下方式执行操作：-

蟒蛇3

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
import findspark
  
  
# spark location
# add the respective path to your spark
findspark.init('_path-to-spark_')
  
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
  
# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
  
# drop None Values
df.na.drop(subset=["Class ID"]).show(truncate=False)
  
# stop spark session
spark.stop()

输出：

示例 3：使用 dropna() 方法删除具有任何 Null 值的所有行

删除空值行的第三种方法是使用dropna()函数。 dropna()函数的执行方式与 na.drop() 类似。这里我们不需要指定任何变量，因为它会检测空值并自行删除行。由于我们正在创建我们自己的数据，因此我们需要连同它一起指定我们的模式以创建数据集。我们可以通过以下方式在 pyspark 中使用它：-

蟒蛇3

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.sql import SparkSession
import findspark
  
  
# spark location
# add the respective path to your spark
findspark.init('_path-to-spark_')
  
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
  
# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
  
# drop None Values
df.dropna().show(truncate=False)
  
# stop spark session
spark.stop()

输出：