📜  Scikit Learn - Anomaly Detection

📅  Last modified: 2020-12-10 05:52:40             🧑  Author: Mango


Here we look at what anomaly detection is in Sklearn and how it is used to identify unusual data points.

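Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into the following three categories: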

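  • Point anomalies − A single data instance is anomalous with respect to the rest of the data.

  • Contextual anomalies − A data instance is anomalous only in a specific context.

  • Collective anomalies − A collection of related data instances is anomalous with respect to the entire data set, even though the individual values are not.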

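Methods

Two methods can be used for anomaly detection: outlier detection and novelty detection. It is important to see the difference between them.

Outlier detection

The training data contains outliers, defined as observations that lie far from the rest of the data. For this reason, outlier detection estimators try to fit the region where the training data is most concentrated and ignore the deviant observations. This is also known as unsupervised anomaly detection.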

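Novelty detection

It is concerned with detecting a previously unobserved pattern in new observations that were not part of the training data. Here, the training data is not polluted by outliers. This is also known as semi-supervised anomaly detection.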
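The difference between the two settings can be sketched with LocalOutlierFactor, which supports both through its novelty flag. This is only an illustrative sketch: the synthetic data, the planted point [8, 8], and the variable names are assumptions made for the example and are not part of the original tutorial.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_clean = rng.normal(size=(100, 2))                # hypothetical "normal" data
X_polluted = np.vstack([X_clean, [[8.0, 8.0]]])    # same data plus one planted outlier

# Outlier detection: the training set itself is polluted, so the same
# samples are fitted and labeled in one step with fit_predict().
outlier_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_polluted)
print(outlier_labels[-1])    # label of the planted point

# Novelty detection: fit on clean data only, then label unseen observations.
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_clean)
novelty_labels = novelty_lof.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
print(novelty_labels)
```

In outlier-detection mode the labels refer to the training samples themselves, whereas in novelty mode predict() is meant to be called only on observations that were not used for fitting.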

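scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection. These estimators first learn from the data in an unsupervised way through the fit() method: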

estimator.fit(X_train)

Now, new observations can be classified as inliers (labeled 1) or outliers (labeled -1) with the predict() method, as follows:

estimator.predict(X_test)

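The estimator first computes a raw scoring function; the predict method then applies a threshold to that raw score. The raw scoring function is available through the score_samples method, and the threshold can be controlled with the contamination parameter.

The decision_function method is also derived from the scoring function, such that negative values are outliers and non-negative values are inliers: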

estimator.decision_function(X_test)
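As a rough illustration of how these methods relate, the sketch below fits an IsolationForest (one of the estimators covered later) on synthetic data. It checks that decision_function equals score_samples shifted by the fitted offset_, and that predict follows the sign of decision_function. The data, the contamination value of 0.1, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(size=(200, 2))            # synthetic training data (assumed)
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])    # one typical point, one far away

estimator = IsolationForest(contamination=0.1, random_state=0).fit(X_train)

raw_scores = estimator.score_samples(X_test)     # raw scoring function
decision = estimator.decision_function(X_test)   # raw score shifted by the threshold
labels = estimator.predict(X_test)               # 1 = inlier, -1 = outlier

# The threshold is stored as offset_ and is derived from `contamination`.
print(np.allclose(decision, raw_scores - estimator.offset_))
print(np.array_equal(labels, np.where(decision < 0, -1, 1)))
print(labels)
```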

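Sklearn algorithms for Outlier Detection

Let us begin by understanding what an elliptic envelope is.

Fitting an elliptic envelope

This algorithm assumes that the regular data comes from a known distribution such as the Gaussian distribution. For outlier detection, scikit-learn provides an object named covariance.EllipticEnvelope.

This object fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points. It ignores points outside the central mode.

Parameters

The following table lists the parameters used by the sklearn.covariance.EllipticEnvelope method: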

Sr.No Parameter & Description
1

store_precision − Boolean, optional, default = True

We can specify it if the estimated precision is stored.

2

assume_centered − Boolean, optional, default = False

If False, the robust location and covariance are computed directly with the FastMCD algorithm. If True, the support of the robust location and covariance estimates is computed, and the covariance is re-estimated from it without centering the data.

3

support_fraction − float in (0., 1.), optional, default = None

This parameter sets the proportion of points to be included in the support of the raw MCD estimates.

4

contamination − float in (0., 0.5], optional, default = 0.1

It provides the proportion of the outliers in the data set.

5

random_state − int, RandomState instance or None, optional, default = None

This parameter sets the seed of the pseudo-random number generator used while shuffling the data. The options are as follows −

  • int − In this case, random_state is the seed used by the random number generator.

  • RandomState instance − In this case, random_state is the random number generator.

  • None − In this case, the random number generator is the RandomState instance used by np.random.

Attributes

The following table lists the attributes used by the sklearn.covariance.EllipticEnvelope method:

Sr.No Attributes & Description
1

support_ − array-like, shape(n_samples,)

It represents the mask of the observations used to compute robust estimates of location and shape.

2

location_ − array-like, shape (n_features)

It returns the estimated robust location.

3

covariance_ − array-like, shape (n_features, n_features)

It returns the estimated robust covariance matrix.

4

precision_ − array-like, shape (n_features, n_features)

It returns the estimated pseudo inverse matrix.

5

offset_ − float

It is used to define the decision function from the raw scores: decision_function = score_samples - offset_

Implementation Example

import numpy as np
from sklearn.covariance import EllipticEnvelope

# The covariance matrix must be symmetric positive semi-definite.
true_cov = np.array([[.8, .3], [.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)
cov = EllipticEnvelope(random_state=0).fit(X)
# predict returns 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0], [2, 2]])

Output

array([ 1, -1])
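To connect the fitted attributes listed above with the prediction, the following sketch fits the same kind of model and inspects it. It assumes a valid (positive-definite) covariance matrix chosen for illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Illustrative covariance matrix (assumed); it must be symmetric positive semi-definite.
true_cov = np.array([[.8, .3], [.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)

envelope = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)

print(envelope.location_.shape)     # robust estimate of the center
print(envelope.covariance_.shape)   # robust covariance matrix
print(envelope.support_.sum())      # observations kept for the robust estimates

points = np.array([[0.0, 0.0], [2.0, 2.0]])
# The raw score is the negative squared Mahalanobis distance to the robust center.
print(np.allclose(envelope.score_samples(points), -envelope.mahalanobis(points)))
# Subtracting offset_ gives the decision function; negative values are outliers.
print(np.allclose(envelope.decision_function(points),
                  envelope.score_samples(points) - envelope.offset_))

# Roughly a `contamination` share of the training data falls outside the envelope.
outlier_share = (envelope.predict(X) == -1).mean()
print(outlier_share)
```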

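Isolation Forest

For high-dimensional datasets, one efficient way to detect outliers is to use random forests. scikit-learn provides the ensemble.IsolationForest method, which isolates observations by randomly selecting a feature. It then randomly selects a split value between the maximum and minimum values of the selected feature.

Here, the number of splits needed to isolate a sample is equal to the path length from the root node to the terminating node.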
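The idea that anomalies need fewer random splits can be illustrated with a toy re-implementation of a single isolation path. This is a simplified sketch written for this explanation, not scikit-learn's actual code; the helper isolation_path_length and the sample data are hypothetical.

```python
import numpy as np

def isolation_path_length(X, index, rng, max_attempts=100):
    """Count the random splits needed to isolate X[index] (toy illustration)."""
    rows = np.arange(len(X))    # samples still in the same partition as X[index]
    n_splits = 0
    for _ in range(max_attempts):
        if len(rows) <= 1:
            break               # the sample is alone: it has been isolated
        feature = rng.randint(X.shape[1])          # pick a random feature
        values = X[rows, feature]
        low, high = values.min(), values.max()
        if low == high:
            continue            # this feature cannot separate the remaining rows
        split = rng.uniform(low, high)             # pick a random split value
        keep = (values < split) == (X[index, feature] < split)
        rows = rows[keep]       # keep only the side containing X[index]
        n_splits += 1
    return n_splits

X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]], dtype=float)
rng = np.random.RandomState(0)

def mean_path(index, n_trials=500):
    """Average the path length of one sample over many random trees."""
    return np.mean([isolation_path_length(X, index, rng) for _ in range(n_trials)])

outlier_mean = mean_path(4)    # the distant point [-50, 60]
inlier_mean = mean_path(1)     # the point [-3, -3]
print(outlier_mean, inlier_mean)
```

On average the distant point is separated after far fewer splits than a point inside the cluster, which is the signal an isolation forest turns into an anomaly score.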

Parameters

The following table lists the parameters used by the sklearn.ensemble.IsolationForest method:

Sr.No Parameter & Description
1

n_estimators − int, optional, default = 100

It represents the number of base estimators in the ensemble.

2

max_samples − int or float, optional, default = 'auto'

It represents the number of samples to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_samples samples. If we choose float as its value, it will draw max_samples * X.shape[0] samples. And, if we choose 'auto' as its value, it will draw max_samples = min(256, n_samples).

3

contamination − 'auto' or float, optional, default = 'auto'

It provides the proportion of outliers in the data set. If left at the default, 'auto', the threshold is determined as in the original paper. If set to a float, the value should be in the range (0, 0.5].

4

random_state − int, RandomState instance or None, optional, default = None

This parameter controls the pseudo-random number generator used to select the features and split values for each tree. The options are as follows −

  • int − In this case, random_state is the seed used by the random number generator.

  • RandomState instance − In this case, random_state is the random number generator.

  • None − In this case, the random number generator is the RandomState instance used by np.random.

5

max_features − int or float, optional (default = 1.0)

It represents the number of features to be drawn from X to train each base estimator. If we choose int as its value, it will draw max_features features. If we choose float as its value, it will draw max_features * X.shape[1] features.

6

bootstrap − Boolean, optional (default = False)

The default, False, means that sampling is performed without replacement. If set to True, individual trees are fit on random subsets of the training data sampled with replacement.

7

n_jobs − int or None, optional (default = None)

It represents the number of jobs to run in parallel for both the fit() and predict() methods.

8

verbose − int, optional (default = 0)

This parameter controls the verbosity of the tree building process.

9

warm_start − Bool, optional (default = False)

If warm_start = True, the solution of the previous call to fit is reused and more estimators are added to the ensemble. If set to False, a whole new forest is fitted.

Attributes

The following table lists the attributes used by the sklearn.ensemble.IsolationForest method:

Sr.No Attributes & Description
1

estimators_ − list of ExtraTreeRegressor instances

The collection of all fitted sub-estimators.

2

max_samples_ − integer

It provides the actual number of samples used.

3

offset_ − float

It is used to define the decision function from the raw scores: decision_function = score_samples - offset_

Implementation Example

The Python script below uses the sklearn.ensemble.IsolationForest method to fit 10 trees on the given data:

from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
# Use one consistent variable name for the estimator.
OUTDclf = IsolationForest(n_estimators = 10)
OUTDclf.fit(X)

Output

IsolationForest(
   behaviour = 'old', bootstrap = False, contamination='legacy',
   max_features = 1.0, max_samples = 'auto', n_estimators = 10, n_jobs=None,
   random_state = None, verbose = 0
)
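Once fitted, the forest can label and score the samples. The sketch below is a hedged continuation of the example above; random_state = 0 is an assumption added so that the run is repeatable. Note that the printed representation of the estimator differs between scikit-learn versions, and recent releases show only the non-default parameters.

```python
from sklearn.ensemble import IsolationForest
import numpy as np

X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
# random_state is fixed only for reproducibility (an addition to the original script).
forest = IsolationForest(n_estimators=10, random_state=0).fit(X)

scores = forest.score_samples(X)   # lower means more abnormal
labels = forest.predict(X)         # 1 = inlier, -1 = outlier

print(scores.argmin())             # index of the most abnormal sample
print(labels)
```

The distant point [-50, 60] is expected to receive the lowest score, because almost any random split separates it from the other four samples.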

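Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is another efficient algorithm for outlier detection on high-dimensional data. scikit-learn provides the neighbors.LocalOutlierFactor method, which computes a score, called the local outlier factor, that reflects the degree of abnormality of each observation. The main logic of the algorithm is to detect samples whose density is substantially lower than that of their neighbors. That is why it measures the local density deviation of a given data point with respect to its neighbors.

Parameters

The following table lists the parameters used by the sklearn.neighbors.LocalOutlierFactor method: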

Sr.No Parameter & Description
1

n_neighbors − int, optional, default = 20

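It represents the number of neighbors used by default for kneighbors queries. All samples are used if n_neighbors is larger than the number of samples provided.

2

algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional, default = 'auto'

The algorithm used to compute the nearest neighbors.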

  • If you choose ball_tree, it will use BallTree algorithm.

  • If you choose kd_tree, it will use KDTree algorithm.

  • If you choose brute, it will use brute-force search algorithm.

  • If you choose auto, it will decide the most appropriate algorithm on the basis of the value we passed to fit() method.

3

leaf_size − int, optional, default = 30

The value of this parameter can affect the speed of the construction and query. It also affects the memory required to store the tree. This parameter is passed to the BallTree or KDTree algorithms.

4

contamination − 'auto' or float, optional, default = 'auto'

It provides the proportion of outliers in the data set. If left at the default, 'auto', the threshold is determined as in the original paper. If set to a float, the value should be in the range (0, 0.5].

5

metric − string or callable, default = 'minkowski'

It represents the metric used for distance computation.

6

p − int, optional (default = 2)

It is the parameter for the Minkowski metric. p = 1 is equivalent to using manhattan_distance, i.e. L1, whereas p = 2 is equivalent to using euclidean_distance, i.e. L2.

7

novelty − Boolean, (default = False)

By default, the LOF algorithm is used for outlier detection, but it can be used for novelty detection if we set novelty = True.

8

n_jobs − int or None, optional (default = None)

It represents the number of parallel jobs to run for the neighbors search.

Attributes

The following table lists the attributes used by the sklearn.neighbors.LocalOutlierFactor method:

Sr.No Attributes & Description
1

negative_outlier_factor_ − numpy array, shape(n_samples,)

The opposite (negated) LOF of the training samples. The lower the value, the more abnormal the sample.

2

n_neighbors_ − integer

It provides the actual number of neighbors used for neighbors queries.

3

offset_ − float

It is used to define the binary labels from the raw scores.

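Implementation Example

The Python script below uses the sklearn.neighbors.NearestNeighbors class to build, from an array representing our data set, the kind of nearest-neighbor index that LOF queries: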

from sklearn.neighbors import NearestNeighbors
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
LOFneigh = NearestNeighbors(n_neighbors = 1, algorithm = "ball_tree",p=1)
LOFneigh.fit(samples)

Output

NearestNeighbors(
   algorithm = 'ball_tree', leaf_size = 30, metric='minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 1, p = 1, radius = 1.0
)

Now, we can ask this fitted object for the closest point to [0.5, 1., 1.5] with the following Python script:

print(LOFneigh.kneighbors([[.5, 1., 1.5]]))

Output

(array([[1.5]]), array([[2]]))
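The script above only performs the neighbor query. To compute the outlier factor itself, LocalOutlierFactor can be applied directly. The following minimal sketch uses a small one-dimensional data set chosen for illustration; the data and n_neighbors = 2 are assumptions, not part of the original example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[-1.1], [0.2], [101.1], [0.3]])   # illustrative data with one distant value

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)               # 1 = inlier, -1 = outlier
factors = lof.negative_outlier_factor_    # opposite of the LOF; lower means more abnormal

print(labels)
print(factors.argmin())                   # index of the sample with the largest LOF
```

Inliers have a negative_outlier_factor_ close to -1, while the distant value gets a much lower one and is therefore labeled -1.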

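One-Class SVM

The One-Class SVM, introduced by Schölkopf et al., is an unsupervised outlier detection method. It is also very efficient on high-dimensional data and estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module as the sklearn.svm.OneClassSVM object. To define a frontier, it requires a kernel (RBF is the most commonly used) and a scalar parameter.

For a better understanding, let us fit our data with the svm.OneClassSVM object: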

from sklearn.svm import OneClassSVM
X = [[0], [0.89], [0.90], [0.91], [1]]
OSVMclf = OneClassSVM(gamma = 'scale').fit(X)

Now, we can get the score_samples for the input data as follows:

OSVMclf.score_samples(X)

Output

array([1.12218594, 1.58645126, 1.58673086, 1.58645127, 1.55713767])
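In OneClassSVM the scalar parameter is exposed as nu. The following sketch, on synthetic data assumed for illustration, shows how a fitted model labels new points and how score_samples relates to decision_function through offset_. The value nu = 0.1 and the variable names are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))             # synthetic "normal" data (assumed)
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])    # a typical point and a distant one

# nu bounds the fraction of training points treated as outliers (0.1 is arbitrary).
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)

labels = model.predict(X_new)                # 1 = inlier, -1 = outlier
decision = model.decision_function(X_new)    # signed distance to the frontier
scores = model.score_samples(X_new)          # unshifted scores

print(labels)
print(np.allclose(scores, decision + model.offset_))
```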