Show distinct column values in a PySpark dataframe

Last modified: 2022-05-13 01:54:45.039000    Author: Mango

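In this article, we display the distinct column values of a dataframe using PySpark in Python. To do this, we use the distinct() and dropDuplicates() functions together with the select() function.

Let's create a sample dataframe.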

Python3
# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data with 7 rows (two of them are duplicates)
data = [["1", "sravan", "company 1"],
       ["3", "bobby", "company 3"],
       ["2", "ojaswi", "company 2"],
       ["1", "sravan", "company 1"],
       ["3", "bobby", "company 3"],
       ["4", "rohith", "company 2"],
       ["5", "gnanesh", "company 1"]]
  
# specify column names
columns = ['Employee ID','Employee NAME','Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
  
dataframe.show()


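Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          2|       ojaswi|   company 2|
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+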

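Method 1: Using distinct()

distinct() returns a new dataframe containing only the distinct rows. Applied after select(), it returns the distinct values of the selected column(s).

Example 1: For a single column.

Python3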

# select first column to get
# unique data using distinct function()
dataframe.select("Employee ID").distinct().show()

Output:

+-----------+
|Employee ID|
+-----------+
|          3|
|          5|
|          1|
|          4|
|          2|
+-----------+

Example 2: For multiple columns.

Python code to display the unique data from two columns using the distinct() function.

Code:

Python3

# select first and second column
# to get unique data using distinct function()
dataframe.select(["Employee ID",
                  "Employee NAME"]).distinct().show()

Output:

+-----------+-------------+
|Employee ID|Employee NAME|
+-----------+-------------+
|          5|      gnanesh|
|          4|       rohith|
|          1|       sravan|
|          2|       ojaswi|
|          3|        bobby|
+-----------+-------------+

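Method 2: Using dropDuplicates()

dropDuplicates() returns a new dataframe with duplicate rows removed. Applied after select(), it displays the unique values of the selected column(s).

Example 1: For a single column.

Python3

# select first column to get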
# unique data using dropDuplicates function()
dataframe.select("Employee ID").dropDuplicates().show()

Output:

+-----------+
|Employee ID|
+-----------+
|          3|
|          5|
|          1|
|          4|
|          2|
+-----------+

Example 2: For multiple columns.

Python code to display the unique data from two columns using the dropDuplicates() function.

Python3

# select first and second column
# to get unique data using dropDuplicates function()
dataframe.select(["Employee ID",
                  "Employee NAME"]).dropDuplicates().show()

Output:

+-----------+-------------+
|Employee ID|Employee NAME|
+-----------+-------------+
|          5|      gnanesh|
|          4|       rohith|
|          1|       sravan|
|          2|       ojaswi|
|          3|        bobby|
+-----------+-------------+