ML | Binning or Discretization

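Real-world data tends to be noisy. Noisy data is data that contains a large amount of additional, meaningless information, called noise. Data cleaning (or data cleansing) routines attempt to smooth out noise while identifying outliers in the data.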

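There are three data smoothing techniques:

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it.
Regression: Regression smooths data by fitting the values to a function. Linear regression finds the "best" line through two attributes (or variables) so that one attribute can be used to predict the other.
Outlier analysis: Outliers can be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside the set of clusters may be considered outliers.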
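As an illustration of the regression item above, here is a minimal sketch of smoothing by simple linear regression: it fits a least-squares line through two attributes and replaces each observed value with the value on the line. The function name `linear_regression_smooth` and the sample data are hypothetical, chosen only for this example.

```python
def linear_regression_smooth(xs, ys):
    """Fit y = intercept + slope * x by least squares and return the fitted y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Each noisy y is replaced by the value the fitted line predicts for its x
    return [intercept + slope * x for x in xs]


# Hypothetical data: y is roughly 2 * x, with small hand-made deviations
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed = linear_regression_smooth(xs, ys)
print([round(v, 2) for v in smoothed])
```

The sketch assumes the x values are not all identical; otherwise the variance term is zero and the slope is undefined.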

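Binning method for data smoothing:
Here we focus on the binning method for data smoothing. In this method, the data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.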

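There are basically two approaches to binning:

Equal-width (or distance) binning: The simplest binning approach is to partition the range of the variable into k equal-width intervals. The interval width is simply the range [A, B] of the variable divided by k,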
```
w = (B-A) / k
```
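Thus, the i-th interval is [A + (i-1)w, A + iw], where i = 1, 2, 3, ..., k.
This approach does not handle skewed data well.
Equal-depth (or frequency) binning: In equal-frequency binning, we divide the range [A, B] of the variable into intervals that contain (approximately) equal numbers of points; exactly equal frequencies may not be possible because of repeated values.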
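The two partitioning approaches can be sketched side by side as follows. This is a minimal illustration, not a reference implementation: the function names and the sample data are hypothetical, intervals are treated as half-open with the maximum B assigned to the last interval, and the equal-frequency version fills bins of size ceil(n / k), so the last bin may be smaller and fewer than k bins may result.

```python
import math


def equal_width_bins(values, k):
    """Split values into k intervals of width w = (B - A) / k."""
    a, b = min(values), max(values)
    w = (b - a) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        if w == 0:
            i = 0  # all values identical: everything goes in the first bin
        else:
            # Interval index for v; the maximum B is placed in the last interval
            i = min(int((v - a) / w), k - 1)
        bins[i].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into consecutive bins of (about) equal size."""
    ordered = sorted(values)
    size = math.ceil(len(ordered) / k)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical skewed data: one large value stretches the range
data = [1, 2, 2, 3, 4, 5, 6, 7, 100]
print(equal_width_bins(data, 3))      # [[1, 2, 2, 3, 4, 5, 6, 7], [], [100]]
print(equal_frequency_bins(data, 3))  # [[1, 2, 2], [3, 4, 5], [6, 7, 100]]
```

On this skewed sample, equal-width binning puts almost every value into the first interval and leaves one interval empty, while equal-frequency binning keeps three values in each bin.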

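How to perform smoothing on the data?

There are three approaches to perform smoothing:

Smoothing by bin means: Each value in a bin is replaced by the mean value of the bin.
Smoothing by bin medians: Each bin value is replaced by its bin median.
Smoothing by bin boundaries: The minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.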

Sorted data for price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30

Partition using equal frequency approach:
Bin 1 : 2, 6, 7
Bin 2 : 9, 13, 20
Bin 3 : 21, 24, 30

Smoothing by bin mean :
Bin 1 : 5, 5, 5
Bin 2 : 14, 14, 14
Bin 3 : 25, 25, 25

Smoothing by bin median :
Bin 1 : 6, 6, 6
Bin 2 : 13, 13, 13
Bin 3 : 24, 24, 24

Smoothing by bin boundary :
Bin 1 : 2, 7, 7
Bin 2 : 9, 9, 20
Bin 3 : 21, 21, 30
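
The worked example above can be checked with a short sketch. The three helper functions are hypothetical names introduced for illustration. A value that is equally far from both boundaries goes to the upper boundary here, which is an arbitrary choice; no such tie occurs in this data.

```python
import statistics


def smooth_by_mean(bin_values):
    """Replace every value in the bin with the bin mean."""
    m = statistics.mean(bin_values)
    return [m] * len(bin_values)


def smooth_by_median(bin_values):
    """Replace every value in the bin with the bin median."""
    m = statistics.median(bin_values)
    return [m] * len(bin_values)


def smooth_by_boundary(bin_values):
    """Replace every value with the closer of the bin minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # Assumption: a tie (equal distance to both boundaries) goes to the upper one
    return [high if v - low >= high - v else low for v in bin_values]


# Equal-frequency bins from the price example
bins = [[2, 6, 7], [9, 13, 20], [21, 24, 30]]
for name, smooth in [("mean", smooth_by_mean),
                     ("median", smooth_by_median),
                     ("boundary", smooth_by_boundary)]:
    print(name, [smooth(b) for b in bins])
```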

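Binning can also be used as a discretization technique. Here, discretization refers to the process of converting or partitioning continuous attributes, features, or variables into discretized or nominal attributes, features, variables, or intervals.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. Each continuous value is then converted to a nominal or discretized value, which is the same as the value of its corresponding bin.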
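
As a small illustration of discretization, the sketch below maps continuous prices to nominal labels. The function `discretize`, the cut points, and the labels "low", "medium", and "high" are all hypothetical choices made for this example. A value equal to a cut point is assigned to the interval above it, because `bisect_right` is used.

```python
from bisect import bisect_right


def discretize(values, edges, labels):
    """Map each value to the label of the interval it falls in.

    edges are the sorted interior cut points, so len(labels) == len(edges) + 1.
    """
    return [labels[bisect_right(edges, v)] for v in values]


prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
# Hypothetical cut points placed between the three equal-frequency bins
edges = [8, 20.5]
labels = ["low", "medium", "high"]
print(discretize(prices, edges, labels))
# ['low', 'low', 'low', 'medium', 'medium', 'medium', 'high', 'high', 'high']
```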

Below is the Python implementation:

bin_mean

import math

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# x_old stores the original data, keyed by input position
x_old = dict(enumerate(x))
# x_new will store the data after binning
x_new = {}

# x_dict holds (index, value) pairs sorted by value
x_dict = sorted(x_old.items(), key=lambda item: item[1])

# list of bin means, one entry per bin
binn = []

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# performing binning
# Everything from this loop onward is an assumed completion: consecutive runs
# of the sorted values form equal-frequency bins; the last bin may be smaller.
for start in range(0, len(x_dict), num_of_data_in_each_bin):
    current_bin = x_dict[start:start + num_of_data_in_each_bin]
    avrg = sum(h for g, h in current_bin) / len(current_bin)
    binn.append(round(avrg, 3))
    # store the new value of each data point in this bin
    for g, h in current_bin:
        x_new[g] = binn[-1]

print("number of data in each bin")
print(num_of_data_in_each_bin)
for i in range(len(x)):
    print('index {2} old value  {0} new value  {1}'.format(x_old[i], x_new[i], i))


bin_median
import math
import statistics

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# x_old stores the original data, keyed by input position
x_old = dict(enumerate(x))
# x_new will store the data after binning
x_new = {}

# x_dict holds (index, value) pairs sorted by value
x_dict = sorted(x_old.items(), key=lambda item: item[1])

# list of bin medians, one entry per bin
binn = []

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# performing binning
# Everything from this loop onward is an assumed completion: consecutive runs
# of the sorted values form equal-frequency bins; the last bin may be smaller.
for start in range(0, len(x_dict), num_of_data_in_each_bin):
    current_bin = x_dict[start:start + num_of_data_in_each_bin]
    binn.append(statistics.median(h for g, h in current_bin))
    # store the new value of each data point in this bin
    for g, h in current_bin:
        x_new[g] = binn[-1]

print("number of data in each bin")
print(num_of_data_in_each_bin)
for i in range(len(x)):
    print('index {2} old value  {0} new value  {1}'.format(x_old[i], x_new[i], i))

bin_boundary
import math

print("enter the data")
x = list(map(float, input().split()))

print("enter the number of bins")
bi = int(input())

# x_old stores the original data, keyed by input position
x_old = dict(enumerate(x))
# x_new will store the data after binning
x_new = {}

# x_dict holds (index, value) pairs sorted by value
x_dict = sorted(x_old.items(), key=lambda item: item[1])

# list of [lower boundary, upper boundary] pairs, one per bin
binn = []

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# The construction of the bins is an assumed completion: consecutive runs of
# the sorted values form equal-frequency bins; the last bin may be smaller.
for j, start in enumerate(range(0, len(x_dict), num_of_data_in_each_bin)):
    current_bin = x_dict[start:start + num_of_data_in_each_bin]
    # the bin is sorted, so its first and last values are the boundaries
    binn.append([current_bin[0][1], current_bin[-1][1]])
    for g, h in current_bin:
        # replace each value with the closer boundary; ties go to the upper one
        if abs(h - binn[j][0]) >= abs(h - binn[j][1]):
            x_new[g] = binn[j][1]
        else:
            x_new[g] = binn[j][0]

print("number of data in each bin")
print(num_of_data_in_each_bin)
for i in range(len(x)):
    print('index {2} old value  {0} new value  {1}'.format(x_old[i], x_new[i], i))

Reference: https://en.wikipedia.org/wiki/Data_binning