📜  Winsorization

📅  Last modified: 2022-05-13 01:57:14.305000             🧑  Author: Mango

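Winsorization is the process of replacing the extreme values of statistical data in order to limit the effect of outliers on calculations or on results obtained from that data. The mean computed after the extreme values have been replaced is called the winsorized mean.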

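For example, a 90% winsorization means replacing the top 5% and the bottom 5% of the data. The top 5% of the values are replaced by the value at the 95th percentile, and the bottom 5% are replaced by the value at the 5th percentile.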
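The percentile description above can be sketched directly with NumPy. This is only an illustration of the idea: the helper name `winsorize_by_percentile` is hypothetical, and clipping at interpolated percentiles is an assumption that can give slightly different cut-off values than `scipy.stats.mstats.winsorize`, which works on counts of sorted values.

```python
import numpy as np

def winsorize_by_percentile(values, lower_pct=5, upper_pct=95):
    """Hypothetical helper: clip values to the [lower_pct, upper_pct] percentile range."""
    values = np.asarray(values, dtype=float)
    # Cut-off values at the requested percentiles (linear interpolation by default)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    # Everything below `low` becomes `low`, everything above `high` becomes `high`
    return np.clip(values, low, high)

data = np.arange(1, 101)                 # the integers 1..100
clipped = winsorize_by_percentile(data)  # a 90% winsorization
print(clipped.min(), clipped.max())
```

Values between the two cut-offs are left unchanged; only the two tails are pulled in.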

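Input:

  • A numeric array whose upper-end and lower-end values are to be winsorized.
  • The first element of the tuple is the fraction of values to winsorize at the lower end (for example, 0.05 for 5%).
  • The second element of the tuple is the fraction of values to winsorize at the upper end.

Output:

A numeric array whose upper-end and lower-end values have been winsorized as specified by the user.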
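As a small illustration of this input and output, the sketch below calls `scipy.stats.mstats.winsorize` on a ten-element array with a `(0.1, 0.1)` tuple. The sample values are made up for the example.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten values with one obvious outlier at the upper end (illustrative data)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Winsorize the lowest 10% and the highest 10% (one value at each end here)
result = winsorize(data, limits=(0.1, 0.1))

print(np.asarray(result).tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(data.mean())                  # 14.5
print(result.mean())                # 5.5
```

The lowest value is replaced by its nearest remaining neighbour (2) and the highest by its nearest remaining neighbour (9). The input array itself is not modified, because `winsorize` returns a copy by default.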



Example #1:

Python3
# Libraries to be imported
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.mstats import winsorize



Let's look at an example in which outliers are present at both the upper and lower ends of the data.

Python3

# Creating an array with 100 random integer values in [0, 100)
array = np.random.randint(100, size=100).astype(float)
mean = np.mean(array)  # Mean of the clean data, used to size the outliers

# Creating outliers
# Base values already used for an outlier are recorded
# so that the same outlier is not created again.
AlreadySelected = []

# Creating 5 outliers on the lower end
i = 0
while i < 5:
    x = np.random.choice(array[:100])  # Randomly selecting one of the original values
    if x not in AlreadySelected:
        AlreadySelected.append(x)
        array = np.append(array, x - mean * 3)
        i += 1

# Creating 5 outliers on the upper end
i = 0
while i < 5:
    x = np.random.choice(array[:100])  # Randomly selecting one of the original values
    if x not in AlreadySelected:
        AlreadySelected.append(x)
        array = np.append(array, x + mean * 4)
        i += 1

std = np.std(array)  # Storing the standard deviation of the array
mean = np.mean(array)  # Storing the mean of the array, outliers included

plt.boxplot(array)
plt.title('Array with Outliers')
plt.show()

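Output:

[box plot: 'Array with Outliers']

Python3

print(mean)  # mean of the numeric array with outliers

Output:

[printed mean; the value varies from run to run because the data is random]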

Now we apply a 90% winsorization to the array, that is, we winsorize the highest 5% and the lowest 5% of its values:
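To make the effect of the `(0.05, 0.05)` tuple concrete, here is a deterministic sketch. It assumes an array of 110 values, 100 ordinary values plus 5 low and 5 high outliers; the outlier values -150 and 300 are made up for illustration. With 110 values, 5% at each end corresponds to 5 values.

```python
import numpy as np
from scipy.stats.mstats import winsorize

core = np.arange(100, dtype=float)        # ordinary values 0..99
low = np.full(5, -150.0)                  # 5 illustrative low outliers
high = np.full(5, 300.0)                  # 5 illustrative high outliers
data = np.concatenate([core, low, high])  # 110 values in total

n_each = int(0.05 * len(data))            # number of values replaced at each end
w = winsorize(data, limits=(0.05, 0.05))

changed = np.count_nonzero(np.asarray(w) != data)
print(n_each, changed)                    # 5 10
print(w.min(), w.max())                   # 0.0 99.0
```

Only the ten outliers change: the low ones are raised to the smallest ordinary value and the high ones are lowered to the largest ordinary value.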

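Python3

WinsorizedArray = winsorize(array, (0.05, 0.05))

plt.boxplot(WinsorizedArray)
plt.title('Winsorized array')
plt.show()

Output:

[box plot: 'Winsorized array']

Python3

WinsorizedMean = np.mean(WinsorizedArray)
print(WinsorizedMean)

Output:

[printed winsorized mean; the value varies from run to run]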

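In this case, the mean of the data changes only slightly.

Now, let's look at an example in which the outliers are present at only one end of the data.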

Python3

# Creating another array with 100 random integer values in [0, 100)
array2 = np.random.randint(100, size=100).astype(float)
std = np.std(array2)
mean = np.mean(array2)

# Base values already used for an outlier are recorded here
AlreadySelected = []

# Creating 5 outliers on the upper end
i = 0
while i < 5:
    x = np.random.choice(array2[:100])  # Randomly selecting one of the original values
    if x not in AlreadySelected:
        AlreadySelected.append(x)
        array2 = np.append(array2, x + mean * 4)
        i += 1

plt.boxplot(array2)
plt.title('Array with outliers')
plt.show()

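Output:

[box plot: 'Array with outliers']

Python3

OutlierArray2Mean = np.mean(array2)
print(OutlierArray2Mean)

Output:

[printed mean; the value varies from run to run because the data is random]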

Python3

WinsorizedArray2 = winsorize(array2, (0.1, 0.1))
# In this case, the lowest 10% of the values are set equal to the
# data point at the 10th percentile, and the highest 10% are set
# equal to the data point at the 90th percentile.

plt.boxplot(WinsorizedArray2)
plt.show()

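Output:

[box plot of the winsorized array]

Python3

WinsorizedArray2Mean = np.mean(WinsorizedArray2)
print(WinsorizedArray2Mean)

Output:

[printed winsorized mean; the value varies from run to run]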

In this case, there is a significant difference in the mean.
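Because the two elements of the tuple are independent, data with outliers at only one end can also be winsorized at that end alone. The sketch below is one possible way to compare the two choices. The fixed seed, the use of `np.random.default_rng`, and the way the outliers are built from the first five values are assumptions made so that the run is reproducible.

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(0)                       # fixed seed (assumption)
base = rng.integers(0, 100, size=100).astype(float)  # 100 ordinary values
outliers = base[:5] + 4 * base.mean()                # 5 outliers on the upper end
data = np.concatenate([base, outliers])              # 105 values in total

# Winsorize only the highest 5%; the lower end is left untouched
upper_only = winsorize(data, limits=(0, 0.05))
# Winsorize 10% at both ends, as in the example above
both_ends = winsorize(data, limits=(0.1, 0.1))

print(data.mean(), upper_only.mean(), both_ends.mean())
```

With `limits=(0, 0.05)` the five outliers are pulled down to the largest ordinary value, while the smallest values stay exactly as they were. Both winsorized means come out lower than the mean of the raw data.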