📜  Simple Random Sampling and Stratified Sampling in PySpark

📅  Last modified: 2022-05-13 01:55:40.526000             🧑  Author: Mango


In this article, we will discuss simple random sampling and stratified sampling in PySpark.

Simple random sampling:

In simple random sampling, elements are not drawn in any particular order; they are drawn at random, so every element is equally likely to be selected. In simple terms, random sampling is the process of selecting a random subset from a larger dataset. Simple random sampling in PySpark is done with the sample() function. It comes in two forms: with replacement and without replacement. Both types are discussed in detail below.
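Before turning to PySpark, the difference between the two forms can be illustrated with Python's standard library (a plain-Python sketch, not the Spark API): random.choices() draws with replacement, so the same element may appear more than once, while random.sample() draws without replacement.

```python
import random

# A small population of mobile brands
population = ["Redmi", "Samsung", "Nokia", "Motorola", "Apple"]

random.seed(42)

# With replacement: the same brand can be drawn more than once
with_replacement = random.choices(population, k=5)

# Without replacement: every drawn brand is unique
without_replacement = random.sample(population, k=3)

print(with_replacement)
print(without_replacement)
```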

Method 1: Random sampling with replacement

Random sampling with replacement is a form of random sampling in which a previously selected element is returned to the population, so it may be drawn again in a later pick.

Example:

Python3
# Python program to demonstrate random
# sampling in PySpark with replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession


# Create a session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() with replacement:
# sample(withReplacement=True, fraction=0.5, seed=42)
df_mobile_brands = df.sample(True, 0.5, 42)

# Print to the console
df_mobile_brands.show()



Output:

Method 2: Random sampling without replacement

Random sampling without replacement is a form of random sampling in which each element has only one chance of being drawn into the sample.

Syntax:

sample(withReplacement, fraction, seed=None)

Example:

Python3

# Python program to demonstrate random
# sampling in PySpark without replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() without replacement:
# sample(withReplacement=False, fraction=0.5, seed=42)
df_mobile_brands = df.sample(False, 0.5, 42)

# Print to the console
df_mobile_brands.show()


Output:

Method 3: Stratified sampling in PySpark

In stratified sampling, the population is divided into homogeneous groups with the same structure, called strata, and a sample is drawn from each such subgroup. Stratified sampling in PySpark is done with the sampleBy() function. The syntax is as follows,

Syntax:

sampleBy(col, fractions, seed=None)

Example:

In this example, we have three strata, 1000000, 2000000 and 400000, which are sampled according to the fractions 0.2, 0.4 and 0.2 respectively.
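Conceptually, sampleBy() keeps each row with the probability assigned to its stratum and drops rows whose stratum is absent from the fractions dict. A simplified plain-Python sketch of this per-stratum Bernoulli selection (not Spark's actual implementation):

```python
import random

# Rows as (Brand, Units) pairs; Units is the stratum key
rows = [
    ("Redmi", 1000000), ("Samsung", 1000000),
    ("Nokia", 400000), ("Motorola", 400000),
    ("OPPO", 400000), ("Apple", 2000000),
]

# Sampling fraction for each stratum
fractions = {1000000: 0.2, 2000000: 0.4, 400000: 0.2}

random.seed(0)

# Keep each row with the probability of its stratum;
# strata missing from the dict default to 0.0 (dropped)
sampled = [(brand, units) for brand, units in rows
           if random.random() < fractions.get(units, 0.0)]

print(sampled)
```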

Python3

# Python program to demonstrate stratified sampling in PySpark

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=400000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="OPPO", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sampleBy(): sample each stratum of the
# "Units" column by its assigned fraction
mobile_brands = df.sampleBy("Units", fractions={
    1000000: 0.2, 2000000: 0.4, 400000: 0.2}, seed=0)

# Print to the console
mobile_brands.show()

Output: