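Find the Minimum, Maximum, and Average Value of a PySpark DataFrame Column
In this article, we find the maximum, minimum, and average of particular columns in a PySpark DataFrame.
To do this, we use the agg() function. It computes aggregates and returns the result as a new DataFrame.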
Syntax: dataframe.agg({'column_name': 'avg/max/min'})
Where,
- dataframe is the input dataframe
- column_name is the column in the dataframe
Creating a DataFrame for demonstration:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql
# module
from pyspark.sql import SparkSession
# creating sparksession and giving an app
# name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]
# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject 1', 'subject 2']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# display dataframe
dataframe.show()
Finding the average
Example 1: Python program to find the average of a DataFrame column
Python3
# find average of subjects column
dataframe.agg({'subject 1': 'avg'}).show()
Example 2: Get the average of multiple columns
Python3
# find average of multiple columns
dataframe.agg({'subject 1': 'avg',
               'student ID': 'avg',
               'subject 2': 'avg'}).show()
Finding the minimum
Example 1: Python program to find the minimum value of a DataFrame column
Python3
# minimum value from student ID column
dataframe.agg({'student ID': 'min'}).show()
Example 2: Get the minimum value of multiple columns
Python3
# minimum value from multiple columns
dataframe.agg({'college': 'min',
               'student NAME': 'min',
               'student ID': 'min'}).show()
Finding the maximum
Example 1: Python program to find the maximum value of a DataFrame column
Python3
# maximum value from student ID column
dataframe.agg({'student ID': 'max'}).show()
Example 2: Get the maximum value of multiple columns
Python3
# maximum value from multiple columns
dataframe.agg({'college': 'max',
               'student NAME': 'max',
               'student ID': 'max'}).show()