Prerequisite: K-Nearest Neighbors


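Suppose we are given a set of data items, each with numerical features (such as height, weight, and age). If there are n features, each item can be represented as a point in an n-dimensional grid. Given a new item, we compute its distance to every item in the set. We pick the k closest neighbors and see which class most of them belong to. The new item is assigned to that class.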



Given a new item:
    1. Find distances between new item and all other items
    2. Pick the k shortest distances
    3. Find the most common class among these k neighbors
    4. That class is where we will classify the new item
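
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation developed in the rest of the article: the function name `knn_classify` and the two-feature toy points are made up for the example.

```python
import math
from collections import Counter

def knn_classify(new_item, items, k):
    """Classify new_item by majority vote among its k nearest items.

    items is a list of (features, label) pairs, where features is a
    tuple of numbers with the same length as new_item.
    """
    # 1. Distance from the new item to every known item
    distances = []
    for features, label in items:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_item, features)))
        distances.append((d, label))
    # 2. Keep the k shortest distances
    nearest = sorted(distances)[:k]
    # 3. Count the classes among those k neighbors
    votes = Counter(label for _, label in nearest)
    # 4. The most common class is the prediction
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters in a 2-D feature space
toy_items = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
             ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify((1.1, 1.0), toy_items, 3))
```

Sorting the full list of distances is the simplest way to get the k nearest items; the article's own code keeps a running list of at most k neighbors instead.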



Height, Weight, Age, Class
1.70, 65, 20, Programmer
1.90, 85, 33, Builder
1.78, 76, 31, Builder
1.73, 74, 24, Programmer
1.81, 75, 35, Builder
1.73, 70, 75, Scientist
1.80, 71, 63, Scientist
1.75, 69, 25, Programmer



We will read the data from a file (named "data.txt") and split the input into lines:

f = open('data.txt', 'r')
lines = f.read().splitlines()

The first line of the file holds the feature names, with the keyword "Class" at the end. We want to store the feature names in a list:

# Split the first line by commas,
# drop the last element ("Class")
# and save the rest into a list.
# The list now holds the feature
# names of the data set.
features = lines[0].split(', ')[:-1]


items = []
for i in range(1, len(lines)):
    line = lines[i].split(', ')
    itemFeatures = {"Class" : line[-1]}
    # Iterate through the features
    for j in range(len(features)):
        # Get the feature at index j
        f = features[j]
        # The class is the last item in
        # the line, so index j is a feature
        v = float(line[j])
        # Add feature to dict
        itemFeatures[f] = v
    # Append temp dict to items
    items.append(itemFeatures)

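With the data stored in items, we can start building the classifier. For the classifier we create a new function, Classify. It takes as input the item we want to classify, k (the number of nearest neighbors), and the list of items.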

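If k is greater than the length of the data set, we do not go on with the classification, since we cannot have more nearest neighbors than there are items in the data set. (Alternatively, we could set k to the length of the items list instead of returning an error message.)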

if k > len(Items):
    # k is larger than list
    # length, abort
    return "k larger than list length"
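
The alternative mentioned above, falling back to the size of the data set when k is too large, could look like the following sketch. `clamp_k` is a hypothetical helper name and is not part of the article's code.

```python
def clamp_k(k, items):
    """Return k, reduced to len(items) when the data set has fewer than k items."""
    return min(k, len(items))

print(clamp_k(5, ["a", "b", "c"]))  # k is too large, so all 3 items are used
print(clamp_k(2, ["a", "b", "c"]))  # k already fits, so it is unchanged
```

Returning an error message keeps the caller aware that the requested k was not honored; clamping trades that for a result that is always a classification.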

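We want to calculate the distance between the item to be classified and every item in the training set, keeping in the end the k shortest distances. To hold the current nearest neighbors we use a list called neighbors. Each element of the list holds two values: the distance from the item to be classified, and the class of the neighbor. We calculate the distance with the generalized Euclidean formula (for n dimensions). Then we pick the class that appears most often among the neighbors, and that is our pick. In code: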

def Classify(nItem, k, Items):
    if k > len(Items):
        # k is larger than list
        # length, abort
        return "k larger than list length"
    # Hold nearest neighbors.
    # First item is distance, 
    # second class
    neighbors = []
    for item in Items:
        # Find Euclidean Distance
        distance = EuclideanDistance(nItem, item)
        # Update neighbors, either adding
        # the current item in neighbors 
        # or not.
        neighbors = UpdateNeighbors(neighbors, item, distance, k)
    # Count the number of each
    # class in neighbors
    count = CalculateNeighborsClass(neighbors, k)
    # Find the max in count, aka the
    # class with the most appearances.
    return FindMax(count)




$$\text{distance} = \sqrt{(x_{1}-y_{1})^2 + (x_{2}-y_{2})^2 + \dots + (x_{n}-y_{n})^2}$$


import math

def EuclideanDistance(x, y):
    # The sum of the squared 
    # differences of the elements.
    # Only the keys of x are used, so
    # the "Class" key of y is ignored.
    S = 0
    for key in x.keys():
        S += math.pow(x[key]-y[key], 2)
    # The square root of the sum
    return math.sqrt(S)
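
As a worked example of the distance formula, the sketch below measures the distance between the item {Height: 1.74, Weight: 67, Age: 22} and the first row of the sample data. The function is rewritten compactly under the hypothetical name `euclidean_distance` so that the block runs on its own.

```python
import math

def euclidean_distance(x, y):
    # Iterate over the keys of x only, so extra keys in y
    # (such as "Class") do not take part in the sum
    return math.sqrt(sum((x[key] - y[key]) ** 2 for key in x))

new_item = {"Height": 1.74, "Weight": 67, "Age": 22}
first_row = {"Height": 1.70, "Weight": 65, "Age": 20, "Class": "Programmer"}

# (0.04)^2 + 2^2 + 2^2 = 8.0016, and its square root is about 2.83
print(round(euclidean_distance(new_item, first_row), 2))
```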


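We have our neighbors list (which should hold at most k elements) and we want to add an item to it with a given distance. First we check whether neighbors already has k elements. If it has fewer, we add the item regardless of its distance (we need to fill the list up to k before we start rejecting items). Otherwise, we check whether the item's distance is shorter than that of the item with the greatest distance in the list. If so, the new item replaces the item with the greatest distance. Because the list is kept sorted in ascending order, that item is always the last element.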



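def UpdateNeighbors(neighbors, item, distance, k):
    if len(neighbors) < k:
        # List is not full, add
        # new item and sort
        neighbors.append([distance, item["Class"]])
        neighbors = sorted(neighbors)
    else:
        # List is full. Check if new
        # item should be entered
        if neighbors[-1][0] > distance:
            # If yes, replace the last
            # element with new item
            neighbors[-1] = [distance, item["Class"]]
            neighbors = sorted(neighbors)
    return neighbors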
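
If all distances are computed up front, the standard library's `heapq.nsmallest` gives the same k nearest entries without maintaining the list by hand. This is an alternative sketch, not the article's approach; `k_nearest` is a hypothetical name and the distances are illustrative values.

```python
import heapq

def k_nearest(distances_and_classes, k):
    """Return the k [distance, class] pairs with the smallest distance, in ascending order."""
    return heapq.nsmallest(k, distances_and_classes, key=lambda pair: pair[0])

# Illustrative [distance, class] pairs in arbitrary order
pairs = [[7.28, "Programmer"], [21.10, "Builder"], [2.83, "Programmer"],
         [12.73, "Builder"], [3.61, "Programmer"]]
print(k_nearest(pairs, 3))
```

The incremental version in the article never stores more than k pairs, which matters when the data set is large; `nsmallest` is shorter to write when the full list already exists.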



def CalculateNeighborsClass(neighbors, k):
    count = {}
    for i in range(k):
        if neighbors[i][1] not in count:
            # The class at the ith index
            # is not in the count dict.
            # Initialize it to 1.
            count[neighbors[i][1]] = 1
        else:
            # Found another item of class 
            # c[i]. Increment its counter.
            count[neighbors[i][1]] += 1
    return count



def FindMax(countList):
    # Hold the max
    maximum = -1
    # Hold the classification
    classification = ""
    for key in countList.keys():
        if countList[key] > maximum:
            maximum = countList[key]
            classification = key
    return classification, maximum
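
Counting the classes and picking the largest count can also be done in one step with `collections.Counter`. The sketch below is an alternative to the two functions above, with the hypothetical name `majority_class` and illustrative neighbor values.

```python
from collections import Counter

def majority_class(neighbors):
    """neighbors is a list of [distance, class] pairs; return (class, votes)."""
    votes = Counter(cls for _, cls in neighbors)
    # most_common(1) returns a one-element list of (class, count)
    return votes.most_common(1)[0]

neighbors = [[2.83, "Programmer"], [3.61, "Programmer"], [12.73, "Builder"]]
print(majority_class(neighbors))
```

Like FindMax, this returns both the class and the number of votes it received. Neither version does anything special about ties, so an odd k is a sensible choice when there are two classes.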




newItem = {'Height' : 1.74, 'Weight' : 67, 'Age' : 22}
print(Classify(newItem, 3, items))
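
To see what this call works with, the sketch below recomputes the distances from the new item to the eight rows of the sample data and lists the three nearest. It is a standalone check written with plain tuples, not the article's functions.

```python
import math

# The sample data set: (Height, Weight, Age, Class)
rows = [
    (1.70, 65, 20, "Programmer"),
    (1.90, 85, 33, "Builder"),
    (1.78, 76, 31, "Builder"),
    (1.73, 74, 24, "Programmer"),
    (1.81, 75, 35, "Builder"),
    (1.73, 70, 75, "Scientist"),
    (1.80, 71, 63, "Scientist"),
    (1.75, 69, 25, "Programmer"),
]
new_item = (1.74, 67, 22)

# Pair every row's distance with its class, then sort by distance
ranked = sorted(
    (math.sqrt(sum((a - b) ** 2 for a, b in zip(new_item, row[:3]))), row[3])
    for row in rows
)

# Show the three nearest neighbors
for distance, cls in ranked[:3]:
    print(round(distance, 2), cls)
```

All three nearest neighbors are of class Programmer, so a majority vote with k = 3 picks Programmer for the new item.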


# Python Program to illustrate
# KNN algorithm
# For pow and sqrt
import math 
from random import shuffle
###_Reading_###
def ReadData(fileName):
    # Read the file, splitting by lines
    f = open(fileName, 'r')
    lines = f.read().splitlines()
    # Split the first line by commas, 
    # drop the last element ("Class") 
    # and save the rest into a list. The
    # list holds the feature names of 
    # the data set.
    features = lines[0].split(', ')[:-1]
    items = []
    for i in range(1, len(lines)):
        line = lines[i].split(', ')
        itemFeatures = {'Class': line[-1]}
        for j in range(len(features)):
            # Get the feature at index j
            f = features[j]  
            # Convert feature value to float
            v = float(line[j]) 
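            # Add feature value to dict
            itemFeatures[f] = v
        # Append temp dict to items
        items.append(itemFeatures)
    # Randomize the order of the items
    shuffle(items)
    return items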
###_Auxiliary Function_###
def EuclideanDistance(x, y):
    # The sum of the squared differences
    # of the elements
    S = 0  
    for key in x.keys():
        S += math.pow(x[key] - y[key], 2)
    # The square root of the sum
    return math.sqrt(S)
def CalculateNeighborsClass(neighbors, k):
    count = {}
    for i in range(k):
        if neighbors[i][1] not in count:
            # The class at the ith index is
            # not in the count dict. 
            # Initialize it to 1.
            count[neighbors[i][1]] = 1
        else:
            # Found another item of class 
            # c[i]. Increment its counter.
            count[neighbors[i][1]] += 1
    return count
def FindMax(Dict):
    # Find max in dictionary, return 
    # max value and max index
    maximum = -1
    classification = ''
    for key in Dict.keys():
        if Dict[key] > maximum:
            maximum = Dict[key]
            classification = key
    return (classification, maximum)
###_Core Functions_###
def Classify(nItem, k, Items):
    # Hold nearest neighbours. First item
    # is distance, second class
    neighbors = []
    for item in Items:
        # Find Euclidean Distance
        distance = EuclideanDistance(nItem, item)
        # Update neighbors, either adding the
        # current item in neighbors or not.
        neighbors = UpdateNeighbors(neighbors, item, distance, k)
    # Count the number of each class 
    # in neighbors
    count = CalculateNeighborsClass(neighbors, k)
    # Find the max in count, aka the
    # class with the most appearances
    return FindMax(count)
def UpdateNeighbors(neighbors, item, distance, k):
    if len(neighbors) < k:
        # List is not full, add 
        # new item and sort
        neighbors.append([distance, item['Class']])
        neighbors = sorted(neighbors)
    else:
        # List is full. Check if new 
        # item should be entered
        if neighbors[-1][0] > distance:
            # If yes, replace the 
            # last element with new item
            neighbors[-1] = [distance, item['Class']]
            neighbors = sorted(neighbors)
    return neighbors
###_Evaluation Functions_###
def K_FoldValidation(K, k, Items):
    if K > len(Items):
        return -1
    # The number of correct classifications
    correct = 0  
    # The total number of classifications
    total = len(Items) * (K - 1)  
    # The length of a fold
    l = int(len(Items) / K)  
    for i in range(K):
        # Split data into training set
        # and test set
        trainingSet = Items[i * l:(i + 1) * l]
        testSet = Items[:i * l] + Items[(i + 1) * l:]
        for item in testSet:
            itemClass = item['Class']
            itemFeatures = {}
            # Get feature values
            for key in item:
                if key != 'Class':
                    # If key isn't "Class", add 
                    # it to itemFeatures
                    itemFeatures[key] = item[key]
            # Categorize item based on
            # its feature values
            guess = Classify(itemFeatures, k, trainingSet)[0]
            if guess == itemClass:
                # Guessed correctly
                correct += 1
    accuracy = correct / float(total)
    return accuracy
def Evaluate(K, k, items, iterations):
    # Run algorithm the number of
    # iterations, pick average
    accuracy = 0
    for i in range(iterations):
        accuracy += K_FoldValidation(K, k, items)
    print(accuracy / float(iterations))
###_Main_###
def main():
    items = ReadData('data.txt')
    Evaluate(5, 5, items, 100)
if __name__ == '__main__':
    main()



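The output can vary from machine to machine. The code includes a K-fold validation function; it is not part of the algorithm itself and is there only to measure the algorithm's accuracy.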
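
Note that K_FoldValidation in the listing trains on a single fold and tests on all the remaining items. A more common arrangement does the reverse: each fold is held out once as the test set while the other folds form the training set. The sketch below shows only that split; `k_fold_splits` is a hypothetical helper, and when the number of items is not divisible by K the leftover items at the end stay in the training set and are never tested.

```python
def k_fold_splits(items, K):
    """Yield (training, test) pairs; each fold serves as the test set once."""
    fold_len = len(items) // K
    for i in range(K):
        # Fold i is held out for testing
        test = items[i * fold_len:(i + 1) * fold_len]
        # Everything before and after fold i is used for training
        training = items[:i * fold_len] + items[(i + 1) * fold_len:]
        yield training, test

# Eight placeholder items split into four folds of two
for training, test in k_fold_splits(list(range(8)), 4):
    print(training, test)
```

With this split, the total number of classifications is `K * fold_len` rather than `len(Items) * (K - 1)`, so the accuracy denominator would need to change accordingly.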