Cleaning data with dropna in PySpark

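Large DataFrames with many rows and columns often contain NULL (None) values in some rows or columns, and some rows may be entirely NULL. If we run operations on such a DataFrame as it is, we may not get the correct or expected output. To get correct output, we first have to clean the DataFrame, which here means removing the rows that hold NULL values.

In this article, we will learn how to clean a DataFrame with the dropna() function. This function drops rows containing NULL values from the DataFrame according to the parameters it is given.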

Syntax: df.dropna(how="any", thresh=None, subset=None)

where df is the DataFrame.

Parameters:

how: Determines whether a row is dropped. dropna() in PySpark only ever drops rows, never columns.
- 'any' – Drop the row if any of its values is NULL.
- 'all' – Drop the row only if all of its values are NULL.
thresh: Drop the row if it has fewer than thresh non-NULL values. When thresh is given, it overrides how.
subset: A column name or a list of column names. Only these columns are checked for NULL values when deciding whether to drop a row.
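To make the interaction of the three parameters concrete, here is a plain-Python sketch of the decision made for each row. It is only an illustrative model, not Spark's implementation: the function name keep_row and the use of a dict per row are assumptions made for this example, and only None is treated as NULL.

```python
def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if a dropna-style rule would keep this row.

    row is a dict mapping column name -> value (hypothetical representation).
    """
    # Only the columns named in subset are inspected; default is all columns.
    cols = list(row) if subset is None else list(subset)
    non_null = sum(row[c] is not None for c in cols)

    # When thresh is given it takes precedence over how.
    if thresh is not None:
        return non_null >= thresh
    if how == "any":
        # Drop on any NULL, so every inspected value must be present.
        return non_null == len(cols)
    if how == "all":
        # Drop only if everything is NULL, so one present value is enough.
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


row = {"Id": 2, "Name": None, "Job Profile": "Software Developer", "City": None}
print(keep_row(row))                 # how="any"
print(keep_row(row, how="all"))
print(keep_row(row, thresh=2))
print(keep_row(row, subset=["Id"]))
```

The sample row has two NULL values, so it is dropped under how="any" but kept under how="all", under thresh=2, and when only the Id column is checked.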


To drop NULL values with the dropna method, we first create a PySpark DataFrame and then apply the method to it.

Python

# importing necessary libraries
from pyspark.sql import SparkSession
 
# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Employee_detail.com") \
        .getOrCreate()
    return spk
 
 
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1
 
 
if __name__ == "__main__":
 
    # calling function to create SparkSession
    spark = create_session()
 
    input_data = [(1, "Shivansh", "Data Scientist", "Noida"),
                  (2, None, "Software Developer", None),
                  (3, "Swati", "Data Analyst", "Hyderabad"),
                  (4, None, None, "Noida"),
                  (5, "Arpit", "Android Developer", "Banglore"),
                  (6, "Ritik", None, None),
                  (None, None, None, None)]
    schema = ["Id", "Name", "Job Profile", "City"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
    df.show()
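Before choosing dropna parameters, it can help to see how the NULL values are spread over the columns. The following is a minimal plain-Python sketch that counts them on the same sample rows, without Spark; the variable name null_counts is introduced only for this illustration.

```python
# Same sample rows and column names as in the script above.
input_data = [(1, "Shivansh", "Data Scientist", "Noida"),
              (2, None, "Software Developer", None),
              (3, "Swati", "Data Analyst", "Hyderabad"),
              (4, None, None, "Noida"),
              (5, "Arpit", "Android Developer", "Banglore"),
              (6, "Ritik", None, None),
              (None, None, None, None)]
schema = ["Id", "Name", "Job Profile", "City"]

# Count the None values in each column position.
null_counts = {
    name: sum(row[i] is None for row in input_data)
    for i, name in enumerate(schema)
}
print(null_counts)
# {'Id': 1, 'Name': 3, 'Job Profile': 3, 'City': 3}
```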


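Example 1: Cleaning data with dropna using how="any" in PySpark.

In the code below, we pass how="any" to dropna(), which means that any row containing at least one NULL value is dropped from the DataFrame.

Python

# Drop every row that has at least one NULL value.
# The result gets its own name so that df stays unchanged
# for the other examples.
df_any = df.dropna(how="any")
df_any.show()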
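As a quick cross-check of the how="any" rule, the same filter can be written in plain Python over the sample rows. This is a sketch of the rule only, not a Spark call; the names rows and kept_ids are made up for the illustration.

```python
# Sample rows from the article: (Id, Name, Job Profile, City).
rows = [(1, "Shivansh", "Data Scientist", "Noida"),
        (2, None, "Software Developer", None),
        (3, "Swati", "Data Analyst", "Hyderabad"),
        (4, None, None, "Noida"),
        (5, "Arpit", "Android Developer", "Banglore"),
        (6, "Ritik", None, None),
        (None, None, None, None)]

# how="any": a row survives only if none of its values is None.
kept_ids = [row[0] for row in rows if all(v is not None for v in row)]
print(kept_ids)  # [1, 3, 5]
```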


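Example 2: Cleaning data with dropna using how="all" in PySpark.

In the code below, we pass how="all" to dropna(), which means that a row is dropped from the DataFrame only if every value in it is NULL.

Python

# Drop only the rows in which every value is NULL.
# The result gets its own name so that df stays unchanged
# for the other examples.
df_all = df.dropna(how="all")
df_all.show()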
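The how="all" rule is the mirror image of how="any": one non-NULL value is enough to keep a row. A plain-Python sketch on the sample rows (the names rows and kept_ids are illustrative, and no Spark is involved):

```python
# Sample rows from the article: (Id, Name, Job Profile, City).
rows = [(1, "Shivansh", "Data Scientist", "Noida"),
        (2, None, "Software Developer", None),
        (3, "Swati", "Data Analyst", "Hyderabad"),
        (4, None, None, "Noida"),
        (5, "Arpit", "Android Developer", "Banglore"),
        (6, "Ritik", None, None),
        (None, None, None, None)]

# how="all": a row survives if at least one of its values is not None.
kept_ids = [row[0] for row in rows if any(v is not None for v in row)]
print(kept_ids)  # [1, 2, 3, 4, 5, 6]
```

Only the last row, which is entirely NULL, is removed.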


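Example 3: Cleaning data with dropna using the thresh parameter in PySpark.

In the code below, we pass thresh=2 to dropna(), which means that any row with fewer than 2 non-NULL values is dropped from the DataFrame.

Python

# Drop every row that has fewer than 2 non-NULL values.
# The result gets its own name so that df stays unchanged
# for the other examples.
df_thresh = df.dropna(thresh=2)
df_thresh.show()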
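To see how the threshold changes the result, the sketch below counts the non-NULL values per sample row in plain Python and applies two different thresholds. The helper ids_kept is a hypothetical name for this illustration and only models the rule; it does not call Spark.

```python
# Sample rows from the article: (Id, Name, Job Profile, City).
rows = [(1, "Shivansh", "Data Scientist", "Noida"),
        (2, None, "Software Developer", None),
        (3, "Swati", "Data Analyst", "Hyderabad"),
        (4, None, None, "Noida"),
        (5, "Arpit", "Android Developer", "Banglore"),
        (6, "Ritik", None, None),
        (None, None, None, None)]


def ids_kept(thresh):
    """Ids of the rows that have at least thresh non-None values."""
    return [row[0] for row in rows
            if sum(v is not None for v in row) >= thresh]


print(ids_kept(2))  # [1, 2, 3, 4, 5, 6]
print(ids_kept(3))  # [1, 3, 5]
```

With thresh=2 only the fully empty row is dropped; raising the threshold to 3 also removes the rows that have just two values filled in.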


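Example 4: Cleaning data with dropna using the subset parameter in PySpark.

In the code below, we pass subset="City" to dropna(). City is the name of the column to check: any row with a NULL value in that column is dropped from the DataFrame, while NULL values in other columns are ignored.

Python

# Drop every row whose City value is NULL.
# The result gets its own name so that df stays unchanged
# for the other examples.
df_city = df.dropna(subset="City")
df_city.show()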
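The effect of restricting the check to one column can be sketched in plain Python on the sample rows. The names schema, city_idx and kept_ids are illustrative assumptions; the snippet models the rule and does not use Spark.

```python
# Sample rows from the article and their column names.
schema = ["Id", "Name", "Job Profile", "City"]
rows = [(1, "Shivansh", "Data Scientist", "Noida"),
        (2, None, "Software Developer", None),
        (3, "Swati", "Data Analyst", "Hyderabad"),
        (4, None, None, "Noida"),
        (5, "Arpit", "Android Developer", "Banglore"),
        (6, "Ritik", None, None),
        (None, None, None, None)]

# subset="City": only the City value decides whether a row survives.
city_idx = schema.index("City")
kept_ids = [row[0] for row in rows if row[city_idx] is not None]
print(kept_ids)  # [1, 3, 4, 5]
```

Row 4 is kept even though its Name and Job Profile are NULL, because only City is inspected.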


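Example 5: Cleaning data with dropna using the thresh and subset parameters in PySpark.

In the code below, we pass thresh=2 and subset=("Id","Name","City") to dropna(). The two parameters work together: only the three listed columns are examined, and a row is dropped if fewer than 2 of those columns hold a non-NULL value.

Python

# Look only at Id, Name and City, and drop every row that has
# fewer than 2 non-NULL values among those three columns.
# The result gets its own name so that df stays unchanged.
df_thresh_subset = df.dropna(thresh=2, subset=("Id", "Name", "City"))
df_thresh_subset.show()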
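Combining the two parameters can be traced by hand with a short plain-Python sketch over the sample rows. The names subset_idx and kept_ids are assumptions for this illustration, and the code only models the rule rather than calling Spark.

```python
# Sample rows from the article and their column names.
schema = ["Id", "Name", "Job Profile", "City"]
rows = [(1, "Shivansh", "Data Scientist", "Noida"),
        (2, None, "Software Developer", None),
        (3, "Swati", "Data Analyst", "Hyderabad"),
        (4, None, None, "Noida"),
        (5, "Arpit", "Android Developer", "Banglore"),
        (6, "Ritik", None, None),
        (None, None, None, None)]

# Positions of the columns named in subset.
subset_idx = [schema.index(c) for c in ("Id", "Name", "City")]

# thresh=2 within the subset: keep a row if at least 2 of the
# three inspected values are not None.
kept_ids = [row[0] for row in rows
            if sum(row[i] is not None for i in subset_idx) >= 2]
print(kept_ids)  # [1, 3, 4, 5, 6]
```

Row 2 has two non-NULL values overall, but only one of them (Id) lies in the subset, so it is dropped here.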
