📜  Anomaly Detection with Machine Learning

📅  Last modified: 2022-05-13 01:54:21.414000             🧑  Author: Mango

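Anomaly detection is a technique for identifying rare events or observations that differ statistically from the rest of the data and may therefore raise suspicion. Such "anomalous" behavior usually points to some kind of problem, such as credit card fraud, a failing machine in a server, or a cyber attack.
Anomalies can be broadly divided into three categories: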

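  1. Point anomaly: a single tuple in the dataset is a point anomaly if it lies far from the rest of the data.
  2. Contextual anomaly: an observation is a contextual anomaly if it is anomalous only because of the context in which it is observed.
  3. Collective anomaly: a set of data instances is anomalous as a group, even if the individual instances would not stand out on their own.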
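The difference between a point anomaly and a contextual anomaly can be sketched in a few lines. The following is a minimal illustration, not part of the original walkthrough: the temperature readings, the function names, and the z-score threshold of 2 are all assumptions chosen for the example. A winter reading of 30 does not stand out against the whole sample, but it does once each value is scored only against readings from the same season.

```python
from collections import defaultdict
from statistics import mean, pstdev


def point_anomalies(values, z_threshold=2.0):
    """Return indices whose z-score against the whole sample exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def contextual_anomalies(records, z_threshold=2.0):
    """records is a list of (context, value); each value is scored within its own context."""
    by_context = defaultdict(list)
    for idx, (context, value) in enumerate(records):
        by_context[context].append((idx, value))
    flagged = []
    for items in by_context.values():
        local = point_anomalies([v for _, v in items], z_threshold)
        flagged.extend(items[j][0] for j in local)
    return sorted(flagged)


# Hypothetical temperature readings (degrees Celsius) tagged with a season.
readings = (
    [("summer", t) for t in (29, 30, 31, 30, 29, 31, 30, 30)]
    + [("winter", t) for t in (1, 2, 0, 1, 2, 1, 30, 1)]
)

# Scored against all readings together, nothing is flagged.
global_flags = point_anomalies([t for _, t in readings])

# Scored per season, the winter reading of 30 (index 14) is flagged.
context_flags = contextual_anomalies(readings)
```

A collective anomaly would need a different treatment again, for example scoring windows of consecutive readings rather than single values.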

Anomaly detection can be carried out with machine learning in two ways:

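  1. Supervised anomaly detection: this method requires a labeled dataset containing both normal and anomalous samples, from which a predictive model is built to classify future data points. Commonly used algorithms for this purpose include supervised neural networks, support vector machines, and K-Nearest Neighbors classifiers.
  2. Unsupervised anomaly detection: this method does not require any labeled training data. Instead it assumes two things about the data: only a small fraction of it is anomalous, and any anomaly is statistically different from the normal samples. Based on these assumptions, the data is clustered using a similarity measure, and data points that lie far from the clusters are treated as anomalies.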
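The two assumptions behind the unsupervised approach translate fairly directly into code. The sketch below is a simplified illustration and not how any particular library implements it: each point is scored by the distance to its k-th nearest neighbour, and the highest-scoring fraction (the assumed share of anomalies) is flagged. The function names, the toy points, and the value of k are hypothetical.

```python
import math


def kth_neighbor_distance(points, k=3):
    """Anomaly score of each point: Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_outliers(scores, contamination=0.1):
    """Label the highest-scoring fraction of points as outliers (1), the rest as inliers (0)."""
    n_outliers = max(1, round(contamination * len(scores)))
    cutoff = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= cutoff else 0 for s in scores]


# Hypothetical toy data: a tight cluster plus one distant point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0),
          (0.2, 0.3), (0.1, 0.1), (0.3, 0.2), (0.2, 0.2), (5.0, 5.0)]

scores = kth_neighbor_distance(points, k=3)
labels = flag_outliers(scores, contamination=0.1)  # only the last point is flagged
```

No labels are used at any point; the only prior knowledge is the assumed fraction of anomalies.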

We now demonstrate anomaly detection on a synthetic dataset using the K-Nearest Neighbors algorithm included in the pyod module.
Step 1: Import the required libraries

Python3
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers


Step 2: Create the synthetic data

Python3

# Set the fraction of outliers
outlier_fraction = 0.1

# Generate a random dataset with two features.
# contamination is passed explicitly so the generated data
# matches outlier_fraction.
X_train, y_train = generate_data(n_train=300, train_only=True,
                                 n_features=2,
                                 contamination=outlier_fraction)

# Store the outliers and inliers in separate numpy arrays
X_outliers, X_inliers = get_outliers_inliers(X_train, y_train)
n_inliers = len(X_inliers)
n_outliers = len(X_outliers)

# Separate the two features
f1 = X_train[:, 0]
f2 = X_train[:, 1]
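To make the shape of such a dataset concrete, here is a rough standard-library sketch of the same idea: Gaussian inliers plus uniformly scattered outliers, with labels 0 and 1. It is an illustration only and does not reproduce the internals of `generate_data`; the function name `make_toy_data` and the chosen spread and range values are assumptions.

```python
import random


def make_toy_data(n_samples=300, contamination=0.1, n_features=2, seed=0):
    """Return (X, y): Gaussian inliers labelled 0, uniform outliers labelled 1."""
    rng = random.Random(seed)
    n_outliers = round(n_samples * contamination)
    n_inliers = n_samples - n_outliers
    # Inliers cluster around the origin (assumed spread of 1.0).
    inliers = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
               for _ in range(n_inliers)]
    # Outliers are scattered over a wider box (assumed range of -8 to 8).
    outliers = [[rng.uniform(-8.0, 8.0) for _ in range(n_features)]
                for _ in range(n_outliers)]
    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers
    return X, y


X_toy, y_toy = make_toy_data()
```

With the defaults this yields 270 inliers and 30 outliers, the same 10 % split used in the walkthrough.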

Step 3: Visualize the data

Python3

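# Visualise the dataset
# Create a meshgrid that covers the data range with a small margin,
# rather than hard-coding the limits
margin = 1.0
xx, yy = np.meshgrid(
    np.linspace(X_train[:, 0].min() - margin, X_train[:, 0].max() + margin, 200),
    np.linspace(X_train[:, 1].min() - margin, X_train[:, 1].max() + margin, 200))

# Scatter plot of the two features
plt.scatter(f1, f2)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()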

Step 4: Train and evaluate the model

Python3

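# Train the detector. KNN is unsupervised here, so the labels
# are not needed for fitting.
clf = KNN(contamination=outlier_fraction)
clf.fit(X_train)

# Negated anomaly scores: after negation, lower values mean
# more anomalous. Print this to see all the prediction scores.
scores_pred = clf.decision_function(X_train) * -1

# Predict binary labels (0 = inlier, 1 = outlier)
y_pred = clf.predict(X_train)

# Count the number of errors against the true labels
n_errors = (y_pred != y_train).sum()

print('The number of prediction errors is ' + str(n_errors))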
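A single error count hides whether the mistakes are false alarms or missed outliers, which matters when only a small share of the data is anomalous. The following sketch, which is an addition and not part of the original walkthrough, breaks the errors down and reports precision and recall for the outlier class. The function name `detection_report` and the toy labels are hypothetical.

```python
def detection_report(y_true, y_pred):
    """Break prediction errors down by type, treating label 1 as 'outlier'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # outliers caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # outliers missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "errors": fp + fn,
        "false_alarms": fp,
        "missed": fn,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical labels: two true outliers, one caught, one missed, one false alarm.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
report = detection_report(y_true, y_pred)
```

The same function can be applied to `y_train` and `y_pred` from the step above, since it only relies on comparing labels with 0 and 1.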

Step 5: Visualize the predictions

Python3

# Threshold on the negated scores: points that score
# below it are treated as outliers
threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

# decision_function gives the raw anomaly score for every
# grid point; negate it to match scores_pred
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
Z = Z.reshape(xx.shape)

# Blue shading from the lowest negated score up to the
# threshold (the region predicted as outliers)
subplot = plt.subplot(1, 2, 1)
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                 cmap=plt.cm.Blues_r)

# Red contour line where the negated score equals the threshold
a = subplot.contour(xx, yy, Z, levels=[threshold],
                    linewidths=2, colors='red')

# Orange fill from the threshold up to the highest negated
# score (the region predicted as inliers)
subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')

# Scatter plot of the true inliers as white dots
b = subplot.scatter(X_inliers[:, 0], X_inliers[:, 1],
                    c='white', s=20, edgecolor='k')

# Scatter plot of the true outliers as black dots
c = subplot.scatter(X_outliers[:, 0], X_outliers[:, 1],
                    c='black', s=20, edgecolor='k')
subplot.axis('tight')

# a.collections is deprecated in recent matplotlib releases;
# legend_elements() provides the contour line handle instead
subplot.legend(
    [a.legend_elements()[0][0], b, c],
    ['learned decision function', 'true inliers', 'true outliers'],
    prop=matplotlib.font_manager.FontProperties(size=10),
    loc='lower right')

subplot.set_title('K-Nearest Neighbours')
subplot.set_xlim((xx.min(), xx.max()))
subplot.set_ylim((yy.min(), yy.max()))
plt.show()
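The threshold in this step is taken at the lower `100 * outlier_fraction` percentile because the scores were negated first. A small standard-library sketch, added here for illustration, shows that this is the same cut as the upper percentile of the raw scores with the sign flipped. The helper `score_at_percentile` is a hypothetical stand-in that uses linear interpolation between order statistics, and the score values are made up.

```python
def score_at_percentile(values, per):
    """Percentile with linear interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    pos = per / 100 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical raw anomaly scores: higher means more anomalous.
raw_scores = [1.0, 2.0, 1.5, 3.0, 2.5, 1.2, 1.8, 2.2, 2.8, 9.0, 12.0]
negated = [-s for s in raw_scores]
outlier_fraction = 0.1

# Lower percentile of the negated scores ...
threshold = score_at_percentile(negated, 100 * outlier_fraction)
# ... mirrors the upper percentile of the raw scores.
raw_cutoff = score_at_percentile(raw_scores, 100 * (1 - outlier_fraction))

# Points whose negated score falls strictly below the threshold.
flagged = [s for s in raw_scores if -s < threshold]
```

Whether points that sit exactly on the threshold count as outliers is a convention; the strict comparison used here is one possible choice.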


Reference: https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/