Python | Binning method for data smoothing

Last modified: 2022-05-13 01:55:15.861000 | Author: Mango

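Prerequisite: ML | Binning or Discretization

The binning method is used to smooth data or to handle noisy data. The data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of each value, they perform local smoothing. There are three ways to perform the smoothing:

  1. Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
  2. Smoothing by bin medians: each value in a bin is replaced by the median of the bin.
  3. Smoothing by bin boundaries: the minimum and maximum of each bin are its boundaries, and each value is replaced by the closer boundary.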

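Approach:

  1. Sort the array of the given data set.
  2. Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
  3. Store one bin per row, replacing the values in each row with the bin mean, the bin median, or the nearest bin boundary.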
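
Steps 1 and 2 can be sketched with NumPy as follows. This is a minimal illustration rather than part of the original article: the helper name `equal_frequency_bins` and the sample values are hypothetical, and using `np.array_split` is an assumption about how to handle a sample count that is not a multiple of N.

```python
import numpy as np

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of near-equal size."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # np.array_split accepts lengths that are not a multiple of n_bins:
    # the first (len % n_bins) bins each receive one extra element.
    return np.array_split(ordered, n_bins)

# Hypothetical sample data: 7 values split into 3 bins
bins = equal_frequency_bins([7, 3, 10, 1, 5, 8, 2], n_bins=3)
for idx, values in enumerate(bins, start=1):
    print(f"Bin {idx}: {values}")
```

With 7 values and 3 bins, the first bin takes the extra element, so the bin sizes are 3, 2 and 2.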

Example:

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition using equal frequency approach:
      - Bin 1 : 4, 8, 9, 15
      - Bin 2 : 21, 21, 24, 25
      - Bin 3 : 26, 28, 29, 34

Smoothing by bin means (rounded to the nearest integer):
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
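
The bin means listed here are rounded. A short sketch, assuming the 12 sorted prices are split evenly into 3 bins of 4 (the variable names are illustrative), shows both the exact and the rounded values:

```python
import numpy as np

# Sorted prices from the example, arranged as 3 equal-frequency bins of 4
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = prices.reshape(3, 4)

bin_means = bins.mean(axis=1)            # exact means: 9.0, 22.75, 29.25
smoothed = np.repeat(bin_means, 4).reshape(3, 4)

print(bin_means)
print(np.round(smoothed))                # rounded, as in the listing
```

The exact means of bins 2 and 3 are 22.75 and 29.25, which round to 23 and 29.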

Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34
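
The boundary rule can be checked with a vectorized sketch. The variable names are illustrative, and sending ties to the upper boundary is an assumption chosen to match the strict `<` comparison in the implementation further down; this data set contains no ties, so the choice does not affect the result here.

```python
import numpy as np

# Sorted prices from the example, arranged as 3 equal-frequency bins of 4
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = prices.reshape(3, 4)

lower = bins[:, [0]]     # minimum of each bin, shape (3, 1)
upper = bins[:, [-1]]    # maximum of each bin, shape (3, 1)

# Pick the lower boundary when a value is strictly closer to it,
# otherwise the upper boundary (ties go to the upper boundary).
smoothed = np.where(bins - lower < upper - bins, lower, upper)
print(smoothed)
```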

Smoothing by bin median (rounded half up to the nearest integer):
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
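
With 4 values per bin, each median is the average of the two middle values, so the exact medians are 8.5, 22.5 and 28.5. The listed values 9, 23, 29 appear to be rounded half up; that is an inference from the numbers, not something the example states. The sketch below (illustrative names) makes the distinction explicit, since `np.round` rounds halves to the nearest even integer instead.

```python
import numpy as np

# Sorted prices from the example, arranged as 3 equal-frequency bins of 4
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
bins = prices.reshape(3, 4)

medians = np.median(bins, axis=1)            # exact medians: 8.5, 22.5, 28.5
rounded_half_up = np.floor(medians + 0.5)    # 9, 23, 29 (matches the listing)
rounded_half_even = np.round(medians)        # 8, 22, 28 (NumPy's default rule)

print(medians)
print(rounded_half_up)
print(rounded_half_even)
```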

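Below is a Python implementation of the above algorithm, applied to one column of the iris data set with 30 bins of 5 values each: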

Python3
import numpy as np
from sklearn.datasets import load_iris
 
# Load the iris data set (150 samples, 4 feature columns)
dataset = load_iris()
a = dataset.data

# Take the second of the 4 columns (index 1, sepal width) and sort it
b = np.sort(a[:, 1])
 
# Create one output array per method: 30 bins (rows) of 5 values each
bin1 = np.zeros((30, 5))
bin2 = np.zeros((30, 5))
bin3 = np.zeros((30, 5))
 
# Bin mean: replace every value in a bin with the mean of that bin
for i in range(0, 150, 5):
    k = i // 5  # row index of the current bin
    mean = b[i:i + 5].mean()
    bin1[k, :] = mean
print("Bin Mean: \n", bin1)
     
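# Bin boundaries: replace every value with the closer of the bin minimum
# b[i] and the bin maximum b[i+4]; ties go to the maximum
for i in range(0, 150, 5):
    k = i // 5  # row index of the current bin
    for j in range(5):
        if (b[i + j] - b[i]) < (b[i + 4] - b[i + j]):
            bin2[k, j] = b[i]
        else:
            bin2[k, j] = b[i + 4]
print("Bin Boundaries: \n", bin2)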
 
# Bin median: with 5 sorted values per bin, the median is the middle one, b[i+2]
for i in range(0, 150, 5):
    k = i // 5  # row index of the current bin
    bin3[k, :] = b[i + 2]
print("Bin Median: \n", bin3)
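
The three loops can also be folded into one reusable function. The following is a sketch under stated assumptions, not the article's code: the name `smooth_by_bins`, its parameters and the sample values are hypothetical, the input length is assumed to be a multiple of the bin size, and ties in the boundary rule go to the upper boundary, as in the listing above.

```python
import numpy as np

def smooth_by_bins(values, bin_size, method="mean"):
    """Sort values, split them into bins of bin_size, and smooth each bin."""
    ordered = np.sort(np.asarray(values, dtype=float))
    if ordered.size % bin_size != 0:
        raise ValueError("number of values must be a multiple of bin_size")
    bins = ordered.reshape(-1, bin_size)     # one bin per row

    if method == "mean":
        fill = bins.mean(axis=1, keepdims=True)
    elif method == "median":
        fill = np.median(bins, axis=1, keepdims=True)
    elif method == "boundaries":
        lower, upper = bins[:, :1], bins[:, -1:]
        # Strictly closer to the minimum -> minimum, otherwise maximum
        return np.where(bins - lower < upper - bins, lower, upper)
    else:
        raise ValueError(f"unknown method: {method!r}")

    # Repeat the per-bin value across the whole row
    return np.repeat(fill, bin_size, axis=1)

# Hypothetical sample: 10 unsorted values, 2 bins of 5
sample = [30, 1, 12, 3, 14, 2, 11, 4, 13, 10]
for method in ("mean", "median", "boundaries"):
    print(method)
    print(smooth_by_bins(sample, bin_size=5, method=method))
```

For an even `bin_size`, `np.median` averages the two middle values, whereas the listing above relies on an odd bin size and simply picks the middle element.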