
How to get a random row from a PySpark DataFrame?

In this article, we will learn how to get random rows from a PySpark DataFrame using the Python programming language.

Method 1: Using the PySpark sample() method

PySpark provides various sampling methods that return a sample of rows from a given PySpark DataFrame.

The sample() method takes the following parameters:

- withReplacement (bool, optional): whether a row may be selected more than once
- fraction (float): the fraction of rows to return, expected to be in the range [0.0, 1.0]
- seed (int, optional): a seed for reproducible sampling
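As a minimal sketch (assuming the standard DataFrame.sample signature of recent PySpark releases), the method can be called positionally or with keyword arguments; df here stands for any existing PySpark DataFrame:

Python
# Sketch of the two common sample() call styles;
# `df` is assumed to be an existing PySpark DataFrame.

# Positional form: withReplacement, fraction
sampled = df.sample(False, 0.5)

# Keyword form, with a seed so the same rows
# are returned on every run
sampled = df.sample(fraction=0.5, seed=42)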

Example

In this example, we pass fraction, a float in the range [0.0, 1.0], as the second argument to sample(). The expected sample size follows the formula: number of rows in the sample ≈ fraction × total number of rows.

Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
  
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()
  
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
  
# Creating a DataFrame
df = random_row_session.createDataFrame(data,
                                        columns)
  
# Printing the DataFrame
df.show()
  
# Taking a sample of df and storing it in df2.
# Note that the second argument is a fraction (a float);
# expected number of rows = fraction * total number of rows.
# df.count() avoids collecting the whole DataFrame to the driver.
df2 = df.sample(False, 1.0 / df.count())
  
# printing the sample row which is a DataFrame
df2.show()


Output

+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+
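Because fraction is applied to each row independently, sample() does not guarantee exactly fraction * count rows, so the result above may occasionally be empty or contain more than one row. When exactly one random row is required, one alternative (a sketch using pyspark.sql.functions.rand, which the original example does not use) is to sort by a random key and keep the first row:

Python
from pyspark.sql.functions import rand

# Shuffle the rows with a random sort key and keep exactly one.
# orderBy(rand()) performs a full shuffle, so reserve it for small data.
df_one = df.orderBy(rand()).limit(1)
df_one.show()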

Method 2: Using the takeSample() method

We first convert the PySpark DataFrame to an RDD. Resilient Distributed Datasets (RDDs) are the most fundamental data structure in PySpark: immutable, distributed collections of data of any type.

We can get the RDD underlying a DataFrame with DataFrame.rdd and then call the takeSample() method on it.

Example: In this example, we call takeSample() on the RDD with num=1 to get a single Row object; num is the number of samples to take.

Python

# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
  
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()
  
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
  
# Creating a DataFrame
df = random_row_session.createDataFrame(data,
                                        columns)
  
# Printing the DataFrame
df.show()
  
# Getting RDD object from the DataFrame
rdd = df.rdd
  
# Taking a single sample from the RDD
# by passing num=1 to the takeSample() method
rdd_sample = rdd.takeSample(withReplacement=False,
                            num=1)
print(rdd_sample)

Output

+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

[Row(Letters='c', Position=3)]
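Note that takeSample() returns a plain Python list of Row objects rather than a DataFrame. If a DataFrame is needed downstream, the list can be passed back to createDataFrame() (a sketch reusing the session and variables from the example above):

Python
# rdd_sample is a list such as [Row(Letters='c', Position=3)];
# createDataFrame() accepts a list of Row objects directly.
df_sample = random_row_session.createDataFrame(rdd_sample)
df_sample.show()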

Method 3: Converting the PySpark DataFrame to a pandas DataFrame and using the sample() method

We can convert a PySpark DataFrame to a pandas DataFrame with the toPandas() function. This method should only be used when the resulting pandas DataFrame is expected to be small, because all of the data is loaded into the driver's memory.

We then use the pandas sample() method, which returns a random sample of items from an axis of the pandas DataFrame.
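For reference, pandas.DataFrame.sample also accepts n (an exact number of rows to return) and random_state (a seed), among other parameters. A minimal sketch, where pdf is a placeholder name for any existing pandas DataFrame:

Python
# `pdf` is a placeholder for an existing pandas DataFrame.
one_row = pdf.sample()                       # one random row (n defaults to 1)
two_rows = pdf.sample(n=2, random_state=42)  # exactly two rows, reproducible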

Example

In this example, we convert the PySpark DataFrame to a pandas DataFrame and call the pandas sample() function on it.

Python

# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
  
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()
  
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
  
# Creating a DataFrame
df = random_row_session.createDataFrame(data,
                                        columns)
  
# Printing the DataFrame
df.show()
  
# Converting the DataFrame to
# a Pandas DataFrame and taking a sample row
pandas_random = df.toPandas().sample()
  
# Converting the sample into
# a PySpark DataFrame
df_random = random_row_session.createDataFrame(pandas_random)
  
# Showing our randomly selected row
df_random.show()

Output

+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+
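If more than one random row is needed with this method, the desired count can be passed straight to the pandas sample() call before converting back (a sketch continuing the example above; n=2 and random_state=0 are illustrative values):

Python
# Take two random rows via pandas, then convert back to PySpark.
pandas_two = df.toPandas().sample(n=2, random_state=0)
df_two = random_row_session.createDataFrame(pandas_two)
df_two.show()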