📜  Scikit-Learn - KNN Learning

📅  Last modified: 2020-12-10 05:54:19             🧑  Author: Mango


k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that there is no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that, for the purpose of model generation, it does not require any training data points, and the whole training data is used in the testing phase.

The k-NN algorithm consists of the following two steps −

Step 1

In this step, it computes and stores the k nearest neighbors of each sample in the training set.

Step 2

In this step, for an unlabeled sample, it retrieves the k nearest neighbors from the dataset. Then, among these k nearest neighbors, it predicts the class through voting (the class with the majority of votes wins).
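These two steps can be sketched in a few lines of NumPy. The helper below is a minimal illustration of the idea (not scikit-learn's implementation), assuming Euclidean distance and integer class labels:

import numpy as np

def knn_predict(X_train, y_train, X_query, k = 3):
   """Predict the class of each query point by a majority vote
   among its k nearest training samples (Euclidean distance)."""
   predictions = []
   for x in X_query:
      # Step 1: distances from the query point to every training sample
      dists = np.linalg.norm(X_train - x, axis = 1)
      # Step 2: take the k closest samples and vote on their labels
      nearest = np.argsort(dists)[:k]
      votes = np.bincount(y_train[nearest])
      predictions.append(np.argmax(votes))
   return np.array(predictions)

X_train = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[1.5, 1.5], [8.5, 8.5]])))  # [0 1]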

The module sklearn.neighbors, which implements the k-nearest neighbor algorithm, provides the functionality for unsupervised as well as supervised neighbors-based learning methods.

The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or brute force) to find the nearest neighbor(s) of each sample. This unsupervised version is basically only Step 1, discussed above, and it is the foundation of many algorithms (KNN and K-Means being the famous ones) which require a neighbor search. In simple words, it is an unsupervised learner for implementing neighbor searches.

On the other hand, supervised neighbors-based learning is used for classification as well as regression.

Unsupervised KNN Learning

As discussed, there exist many algorithms like KNN and K-Means which require nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own "learner". The reason behind making the neighbor search a separate learner is that computing all pairwise distances to find a nearest neighbor is obviously not very efficient. Let's look at the module used by Sklearn to implement unsupervised nearest neighbor learning, along with an example.
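To make the inefficiency concrete, here is what the naive approach looks like in plain NumPy (an illustrative sketch on an arbitrary random dataset): it materializes the full n × n distance matrix, whose cost grows quadratically with the number of samples, which is exactly what the tree structures avoid.

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)   # 1,000 random points in 3 dimensions

# Naive neighbor search: compute every pairwise distance ...
diff = X[:, None, :] - X[None, :, :]        # shape (1000, 1000, 3)
dist_matrix = np.sqrt((diff ** 2).sum(-1))  # 1,000,000 distances
# ... then sort each row; column 0 is the point itself at distance 0
nearest_3 = np.argsort(dist_matrix, axis = 1)[:, :3]
print(nearest_3.shape)   # (1000, 3)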

Scikit-learn Module

sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised nearest neighbor learning. It uses the specific nearest neighbor algorithms named BallTree, KDTree or brute force. In other words, it acts as a uniform interface to these three algorithms.
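A quick way to see the "uniform interface" claim in action is to fit the same data with each value of the algorithm parameter and check that the neighbor indices agree; a small sketch (the random data and k = 3 are arbitrary choices):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(42)
X = rng.rand(50, 2)

# The same estimator drives all three backends; only 'algorithm' changes
indices = {}
for algo in ('ball_tree', 'kd_tree', 'brute'):
   nn = NearestNeighbors(n_neighbors = 3, algorithm = algo).fit(X)
   _, indices[algo] = nn.kneighbors(X)

print(np.array_equal(indices['ball_tree'], indices['kd_tree']))  # True
print(np.array_equal(indices['kd_tree'], indices['brute']))      # True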

Parameters

The following table lists the parameters used by the NearestNeighbors module −

Sr.No Parameter & Description
1. n_neighbors − int, optional

The number of neighbors to get. The default value is 5.

2. radius − float, optional

It limits the distance of neighbors to return. The default value is 1.0.

3. algorithm − {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

This parameter takes the algorithm (BallTree, KDTree or brute force) you want to use to compute the nearest neighbors. If you provide ‘auto’, it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.

4. leaf_size − int, optional

It can affect the speed of construction and query, as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30.

5. metric − string or callable

The metric to use for distance computation between points. We can pass it as a string or as a callable function. In the case of a callable function, the metric is called on each pair of rows and the resulting value is recorded. This is less efficient than passing the metric name as a string.

We can choose a metric from scikit-learn or scipy.spatial.distance. The valid values are as follows −

Scikit-learn − [‘cosine’, ‘manhattan’, ‘euclidean’, ‘l1’, ‘l2’, ‘cityblock’]

Scipy.spatial.distance − [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘dice’, ‘hamming’, ‘jaccard’, ‘correlation’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘sokalmichener’, ‘sokalsneath’, ‘seuclidean’, ‘sqeuclidean’, ‘yule’]

The default metric is ‘minkowski’.

6. p − integer, optional

The parameter for the Minkowski metric. The default value is 2, which is equivalent to using the Euclidean distance (l2).

7. metric_params − dict, optional

Additional keyword arguments for the metric function. The default value is None.

8. n_jobs − int or None, optional

It represents the number of parallel jobs to run for the neighbor search. The default value is None.
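To see a couple of these parameters in action, the short sketch below bounds the search by radius instead of by neighbor count and switches the metric to ‘manhattan’; the specific values (radius = 2.0 and the query point [0, 0]) are illustrative choices, not part of the original example.

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])

# radius bounds the search by distance rather than by neighbor count,
# and metric = 'manhattan' replaces the default minkowski (p = 2) metric
nn = NearestNeighbors(radius = 2.0, metric = 'manhattan').fit(X)
distances, indices = nn.radius_neighbors([[0, 0]])
print(indices[0])    # all training points within L1 distance 2.0 of the origin
print(distances[0])  # and their manhattan distances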

Implementation Example

The example below will find the nearest neighbors within a set of data points by using the sklearn.neighbors.NearestNeighbors module.

First, we need to import the required module and packages −

from sklearn.neighbors import NearestNeighbors
import numpy as np

Now, after importing the packages, define the set of data points among which we want to find the nearest neighbors −

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])

Next, apply the unsupervised learning algorithm as follows −

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')

Next, fit the model with the input dataset.

nrst_neigh.fit(Input_data)

Now, find the k-neighbors of the dataset. It will return the indices of, and distances to, the neighbors of each point.

distances, indices = nrst_neigh.kneighbors(Input_data)
indices

Output

array(
   [
      [0, 1, 3],
      [1, 2, 0],
      [2, 1, 0],
      [3, 4, 0],
      [4, 5, 3],
      [5, 6, 4],
      [6, 5, 4]
   ], dtype = int64
)
distances

Output

array(
   [
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712],
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712]
   ]
)

The above output shows that the nearest neighbor of each point is the point itself, i.e. at zero distance. This is because the query set matches the training set.

We can also show the connections between neighboring points by producing a sparse graph, as follows −

nrst_neigh.kneighbors_graph(Input_data).toarray()

Output

array(
   [
      [1., 1., 0., 1., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 0., 0., 1., 1., 0., 0.],
      [0., 0., 0., 1., 1., 1., 0.],
      [0., 0., 0., 0., 1., 1., 1.],
      [0., 0., 0., 0., 1., 1., 1.]
   ]
)
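In this connectivity matrix, row i contains ones in the columns of the k = 3 nearest neighbors of point i (the point itself included), so each row sums to 3 and mirrors the indices array shown above.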

Once we fit the unsupervised NearestNeighbors model, the data is stored in a data structure based on the value set for the argument ‘algorithm’. After that, we can use this unsupervised learner's kneighbors in a model which requires neighbor searches.
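For instance, the fitted learner can also answer queries for points that were never in the training data. A small self-contained sketch (the query point [0, 0] is an illustrative choice):

from sklearn.neighbors import NearestNeighbors
import numpy as np

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree').fit(Input_data)

# Query the neighbors of a brand-new point that is not in Input_data
distances, indices = nrst_neigh.kneighbors([[0, 0]])
print(indices)    # the 3 training points closest to the origin
print(distances)  # and their Euclidean distances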

Complete working/executable program

from sklearn.neighbors import NearestNeighbors
import numpy as np
Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm='ball_tree')
nrst_neigh.fit(Input_data)
distances, indices = nrst_neigh.kneighbors(Input_data)
print(indices)
print(distances)
print(nrst_neigh.kneighbors_graph(Input_data).toarray())

Supervised KNN Learning

Supervised neighbors-based learning is used for −

  • Classification, for data with discrete labels
  • Regression, for data with continuous labels

Nearest Neighbor Classifier

We can understand neighbors-based classification with the help of the following two characteristics −

  • It is computed from a simple majority vote of the nearest neighbors of each point.
  • It simply stores instances of the training data, which is why it is a type of non-generalizing learning.

Scikit-learn Modules

The following are the two different types of nearest neighbor classifiers used by scikit-learn −

S.No. Classifiers & Description
1. KNeighborsClassifier

The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors. The choice of the value of k is dependent on the data.

2. RadiusNeighborsClassifier

The Radius in the name of this classifier represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this classifier implements learning based on the number of neighbors within a fixed radius r of each training point.
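Neither classifier gets a worked example in this chapter, so here is a minimal sketch of both on the Iris dataset; the 80/20 split, k = 5, radius = 1.0 and the outlier_label setting are all illustrative assumptions, mirroring the regressor examples below.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
   iris.data, iris.target, test_size = 0.20, random_state = 0)

# k-based classifier: majority vote among the 5 nearest training points
knn_clf = KNeighborsClassifier(n_neighbors = 5).fit(X_train, y_train)
print("KNeighborsClassifier accuracy:", knn_clf.score(X_test, y_test))

# radius-based classifier: vote among all training points within radius 1.0;
# outlier_label handles any test point whose neighborhood is empty
rad_clf = RadiusNeighborsClassifier(radius = 1.0, outlier_label = 'most_frequent')
rad_clf.fit(X_train, y_train)
print("RadiusNeighborsClassifier accuracy:", rad_clf.score(X_test, y_test))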

Nearest Neighbor Regressor

It is used in cases where the data labels are continuous in nature. The assigned data label is computed on the basis of the mean of the labels of its nearest neighbors.

The following are the two different types of nearest neighbor regressors used by scikit-learn −

KNeighborsRegressor

The K in the name of this regressor represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this regressor implements learning based on the k nearest neighbors. The choice of the value of k is dependent on the data. Let's understand it more with the help of an implementation example.

Implementation Example

In this example, we will be implementing KNN on the Iris Flower dataset by using scikit-learn's KNeighborsRegressor.

First, import the iris dataset as follows −

from sklearn.datasets import load_iris
iris = load_iris()

Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80 (training data) to 20 (testing data) −

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

Next, we will be doing data scaling with the help of the Sklearn preprocessing module −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, import the KNeighborsRegressor class from Sklearn and provide the number of neighbors as follows.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 8)
knnr.fit(X_train, y_train)

Output

KNeighborsRegressor(
   algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 8, p = 2,
   weights = 'uniform'
)

Now, we can find the MSE (Mean Squared Error) on the scaled test set as follows −

print ("The MSE is:",format(np.power(y-knnr.predict(X),4).mean()))

Output

The MSE is: (the exact value varies between runs, because train_test_split is not seeded with a random_state)

Now, on a small one-dimensional example, we can fit a regressor and use it to predict a value, as follows −

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))

Output

[0.66666667]
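This result is easy to verify by hand: the three nearest neighbors of 2.5 are the points 2, 3 and 1, whose labels are 1, 1 and 0, so the prediction is their mean, (1 + 1 + 0) / 3 ≈ 0.66666667.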

Complete working/executable program

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=8)
knnr.fit(X_train, y_train)

print ("The MSE is:",format(np.power(y-knnr.predict(X),4).mean()))

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))

RadiusNeighborsRegressor

The Radius in the name of this regressor represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this regressor implements learning based on the neighbors within a fixed radius r of each training point. Let's understand it more with the help of an implementation example −
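Note that the choice of r matters here: if a query point has no training neighbors within the given radius, there is nothing to average, and recent scikit-learn versions warn and return NaN for such points. Hence r should be chosen with the scale of the features in mind.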

Implementation Example

In this example, we will be implementing KNN on the Iris Flower dataset by using scikit-learn's RadiusNeighborsRegressor −

First, import the iris dataset as follows −

from sklearn.datasets import load_iris
iris = load_iris()

Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80 (training data) to 20 (testing data) −

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Next, we will be doing data scaling with the help of the Sklearn preprocessing module −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, import the RadiusNeighborsRegressor class from Sklearn and provide the value of the radius as follows −

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X_train, y_train)

Now, we can find the MSE (Mean Squared Error) on the scaled test set as follows −

print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean()))

Output

The MSE is: (the exact value varies between runs, because train_test_split is not seeded with a random_state)

Now, on a small one-dimensional example, we can fit a regressor and use it to predict a value, as follows −

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))

Output

[1.]
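Again, this can be verified by hand: the only training points within radius 1 of 2.5 are 2 and 3, both labeled 1, so the prediction is their mean, 1.0.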

Complete working/executable program

from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X_train, y_train)
print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean()))
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))