Removing duplicate rows based on specific columns in a PySpark DataFrame

In this article, we will remove duplicate rows from a DataFrame based on specific columns, using pyspark in Python. Duplicate data here means rows that are identical with respect to some condition (column values). For this, we use the dropDuplicates() method.

Let's create the DataFrame.

Python3
# importing module
import pyspark
  
# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student data
data = [["1", "sravan", "vignan"], ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"], ["4", "sridevi", "vignan"], 
        ["1", "sravan", "vignan"], ["5", "gnanesh", "iit"]]
  
# specify column names
columns = ['student ID', 'student NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
print('Actual data in dataframe')
dataframe.show()


Output:
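With the sample rows above, show() should print a text table along these lines, duplicates included (exact row order may vary on a real cluster):

+----------+------------+-------+
|student ID|student NAME|college|
+----------+------------+-------+
|         1|      sravan| vignan|
|         2|      ojaswi|   vvit|
|         3|      rohith|   vvit|
|         4|     sridevi| vignan|
|         1|      sravan| vignan|
|         5|     gnanesh|    iit|
+----------+------------+-------+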

Deleting duplicates based on one column

Python3

# remove duplicate rows based on the college column
dataframe.dropDuplicates(['college']).show()

Output:
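Three rows remain, one per distinct college (vignan, vvit, iit). Note that dropDuplicates() keeps an arbitrary row within each group, so which of the duplicates survives is not guaranteed. If a deterministic choice is needed, rows can be ranked with a window function first; a minimal sketch, assuming we keep the smallest student ID per college (this ordering rule is our own assumption, not part of the original example):

Python3
# rank rows within each college and keep the first
# (assumed ordering: smallest student ID wins)
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

win = Window.partitionBy('college').orderBy(col('student ID'))

(dataframe
 .withColumn('rn', row_number().over(win))
 .filter(col('rn') == 1)   # keep the top-ranked row per college
 .drop('rn')
 .show())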

Deleting duplicates based on multiple columns

Python3

# remove duplicate rows based on the college and student ID columns
dataframe.dropDuplicates(['college', 'student ID']).show()

Output:
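Here a row counts as a duplicate only when both college and student ID match, so just the repeated ["1", "sravan", "vignan"] row is dropped and five rows remain. As a related sketch, calling dropDuplicates() with no argument compares all columns, which is equivalent to distinct():

Python3
# with no column list, dropDuplicates() compares all
# columns; distinct() behaves the same way
dataframe.dropDuplicates().show()
dataframe.distinct().show()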