ML | K-means++ Algorithm



Prerequisite: K-means Clustering - Introduction

Drawbacks of the standard K-means algorithm:

One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids (mean points). If a centroid is initialized to be a point "far away" from the data, it may end up with no points associated with it, while more than one cluster may end up attached to a single centroid. Similarly, more than one centroid may be initialized inside the same cluster, resulting in poor clustering. For example, consider the image shown below.
[Figure: poor clustering resulting from poor initialization of the centroids]

The clustering should have looked like this:



K-means++:

To overcome this drawback, we use K-means++. This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering. Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm. That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.
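
For reference (this is not part of the article's own code), scikit-learn exposes exactly this choice through the init parameter of its KMeans estimator. The minimal sketch below, which assumes scikit-learn is installed, contrasts the two initializations on toy data:

Python3
# a minimal sketch contrasting the two initializations
# (assumes scikit-learn is installed)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# generate toy data with 4 well-separated blobs
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)

# init='k-means++' (the default) uses the smarter seeding described above,
# init='random' picks the initial centroids uniformly at random
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X)

print('inertia (k-means++):', km_pp.inertia_)
print('inertia (random)   :', km_rand.inertia_)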

Initialization algorithm:

The steps involved are:

1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point is directly proportional to its distance from the nearest previously chosen centroid (the implementation below simplifies this to deterministically picking the point with the maximum distance; see the sketch after this list).
4. Repeat steps 2 and 3 until k centroids have been sampled.
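
The canonical K-means++ algorithm performs step 3 probabilistically, sampling each point with probability proportional to its squared distance from the nearest chosen centroid, rather than taking the argmax. A minimal sketch of that probabilistic rule, assuming data is an (n, 2) NumPy array and dist holds each point's squared distance to its nearest already-chosen centroid:

Python3
# a minimal sketch of the probabilistic selection rule in step 3
# (assumes 'data' is an (n, 2) array and 'dist' holds the squared
# distance of each point to its nearest already-chosen centroid)
import numpy as np

def sample_next_centroid(data, dist):
    probs = dist / dist.sum()   # P(x) proportional to D(x)^2
    idx = np.random.choice(data.shape[0], p=probs)
    return data[idx, :]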

Intuition:

Following the above initialization procedure, we pick up centroids that are far away from one another. This increases the chances of initially picking up centroids that lie in different clusters. Also, since the centroids are picked from the data points themselves, each centroid has some data points associated with it at the end.

Implementation:

Consider a dataset with the following distribution:

Code: Python code for the K-means++ initialization algorithm

Python3
# importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
  
# creating data
mean_01 = np.array([0.0, 0.0])
cov_01 = np.array([[1, 0.3], [0.3, 1]])
dist_01 = np.random.multivariate_normal(mean_01, cov_01, 100)
  
mean_02 = np.array([6.0, 7.0])
cov_02 = np.array([[1.5, 0.3], [0.3, 1]])
dist_02 = np.random.multivariate_normal(mean_02, cov_02, 100)
  
mean_03 = np.array([7.0, -5.0])
cov_03 = np.array([[1.2, 0.5], [0.5, 1.3]])
dist_03 = np.random.multivariate_normal(mean_03, cov_03, 100)
  
mean_04 = np.array([2.0, -7.0])
cov_04 = np.array([[1.2, 0.5], [0.5, 1.3]])
dist_04 = np.random.multivariate_normal(mean_04, cov_04, 100)
  
data = np.vstack((dist_01, dist_02, dist_03, dist_04))
np.random.shuffle(data)
  
# function to plot the selected centroids
def plot(data, centroids):
    plt.scatter(data[:, 0], data[:, 1], marker = '.',
                color = 'gray', label = 'data points')
    plt.scatter(centroids[:-1, 0], centroids[:-1, 1],
                color = 'black', label = 'previously selected centroids')
    plt.scatter(centroids[-1, 0], centroids[-1, 1],
                color = 'red', label = 'next centroid')
    plt.title('Select %d-th centroid' % (centroids.shape[0]))
     
    plt.legend()
    plt.xlim(-5, 12)
    plt.ylim(-10, 15)
    plt.show()
          
# function to compute the squared euclidean distance
def distance(p1, p2):
    return np.sum((p1 - p2)**2)
  
# initialization algorithm
def initialize(data, k):
    '''
    initializes the centroids for K-means++
    inputs:
        data - numpy array of data points having shape (400, 2)
        k - number of clusters
    '''
    ## initialize the centroids list and add
    ## a randomly selected data point to the list
    centroids = []
    centroids.append(data[np.random.randint(
            data.shape[0]), :])
    plot(data, np.array(centroids))
  
    ## compute remaining k - 1 centroids
    for c_id in range(k - 1):
         
        ## initialize a list to store distances of data
        ## points from nearest centroid
        dist = []
        for i in range(data.shape[0]):
            point = data[i, :]
            d = sys.maxsize
             
            ## compute distance of 'point' from each of the previously
            ## selected centroid and store the minimum distance
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist.append(d)
             
        ## select the data point with the maximum distance from
        ## its nearest centroid as the next centroid
        dist = np.array(dist)
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)
        plot(data, np.array(centroids))
    return centroids
  
# call the initialize function to get the centroids
centroids = initialize(data, k = 4)


Output:

[Figures: one plot per selected centroid, showing the data points in gray, previously selected centroids in black, and the next centroid in red]

Note: Although the initialization in K-means++ is computationally more expensive than that of the standard K-means algorithm, the runtime for K-means++ to converge to the optimum is drastically reduced. This is because the initially chosen centroids are likely to lie in different clusters already.
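
One way to observe this (again, not part of the article's original code) is to feed the centroids returned by initialize() into scikit-learn's KMeans as a fixed starting point and compare the number of Lloyd iterations against a purely random initialization. A minimal sketch, assuming scikit-learn is installed and that data and centroids come from the code above:

Python3
# a minimal sketch comparing convergence speed (assumes scikit-learn is
# installed and that 'data' and 'centroids' come from the code above)
import numpy as np
from sklearn.cluster import KMeans

# fixed smart initialization: pass our centroids directly
# (n_init=1 because the starting point is deterministic)
km_smart = KMeans(n_clusters=4, init=np.array(centroids), n_init=1).fit(data)

# purely random initialization for comparison
km_rand = KMeans(n_clusters=4, init='random', n_init=1, random_state=0).fit(data)

print('iterations with k-means++-style init:', km_smart.n_iter_)
print('iterations with random init         :', km_rand.n_iter_)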