ML | T-distributed Stochastic Neighbor Embedding (t-SNE) Algorithm

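T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique that is well suited to embedding high-dimensional data in a low-dimensional space of two or three dimensions for visualization.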

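What is Dimensionality Reduction?
Dimensionality reduction is the technique of representing n-dimensional data (multidimensional data with many features) in fewer dimensions, typically 2 or 3 when the goal is visualization.
As an example, consider a classification problem: whether a student will play football or not, which depends on both temperature and humidity. Because these two features are highly correlated, they can be collapsed into a single underlying feature, which reduces the number of features in the problem. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2-dimensional plane and a 1-D one to a simple line.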
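
The temperature and humidity example can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the data is synthetic (humidity is generated so that it tracks temperature), and PCA is used as a simple linear stand-in for "collapsing" two correlated features into one, since the article does not name a method for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: humidity is built to track temperature closely.
temperature = rng.uniform(10, 35, size=200)
humidity = 2.0 * temperature + rng.normal(0, 1.0, size=200)
X = np.column_stack([temperature, humidity])

# Correlation between the two features (high by construction).
corr = np.corrcoef(temperature, humidity)[0, 1]

# Project the 2-D data onto a single principal component.
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)

print("correlation:", round(corr, 3))
print("shape before:", X.shape, "after:", X_1d.shape)
print("variance kept:", round(pca.explained_variance_ratio_[0], 3))
```

Because the two columns carry almost the same information, the single component keeps nearly all of the variance. t-SNE addresses the harder case where the structure is not linear.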

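How does t-SNE work?
t-SNE is a nonlinear dimensionality reduction algorithm that finds patterns in the data based on the similarity of data points in terms of their features. The similarity of two points is calculated as the conditional probability that point A would choose point B as its neighbor.
It then tries to minimize the difference between these conditional probabilities (or similarities) in the high-dimensional and low-dimensional spaces, so that the data points are represented as faithfully as possible in the low-dimensional space.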
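
To make the two steps above concrete, here is a minimal NumPy sketch of the quantities involved. It is a simplification, not the scikit-learn implementation: it assumes one fixed Gaussian bandwidth `sigma` for every point (the real algorithm tunes a bandwidth per point to match the chosen perplexity), it only evaluates the mismatch for a random 2-D layout without optimizing it, and the function names are made up for this illustration.

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by rounding

def conditional_p(X, sigma=1.0):
    """p_{j|i}: probability that point i picks point j as its neighbor.

    Assumption: a single fixed sigma is shared by all points.
    """
    D = squared_distances(X)
    P = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)

def joint_q(Y):
    """Low-dimensional similarities from a Student-t kernel (1 degree of freedom)."""
    D = squared_distances(Y)
    Q = 1.0 / (1.0 + D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the mismatch that t-SNE tries to minimize."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))  # 6 hypothetical points in 10-D
Y = rng.normal(size=(6, 2))   # a random 2-D layout, not yet optimized

P_cond = conditional_p(X)
n = X.shape[0]
P = (P_cond + P_cond.T) / (2.0 * n)  # symmetrized joint probabilities
Q = joint_q(Y)

print("rows of p_{j|i} sum to:", P_cond.sum(axis=1))
print("KL(P || Q) for the random layout:", kl_divergence(P, Q))
```

The optimizer's job is to move the rows of `Y` so that this KL value shrinks; the heavy-tailed Student-t kernel in the low-dimensional space is where the "t" in t-SNE comes from.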

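Space and Time Complexity:
The algorithm computes pairwise conditional probabilities and tries to minimize the sum of the differences between the probabilities in the high and low dimensions. This involves a large number of calculations, so the algorithm needs a lot of time and space. t-SNE has time and space complexity that is quadratic in the number of data points.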
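
A quick back-of-the-envelope calculation shows what quadratic growth means in practice. The sketch below assumes the pairwise similarities are stored as a dense n x n matrix of 8-byte floats; the helper name is made up for this illustration.

```python
def pairwise_matrix_bytes(n_points, bytes_per_value=8):
    """Memory needed for a dense n x n matrix of float64 similarities."""
    return n_points * n_points * bytes_per_value

for n in (1_000, 15_000, 60_000):
    gib = pairwise_matrix_bytes(n) / 2**30
    print(f"{n:>6} points -> {n * n:>13,} pairs, about {gib:.2f} GiB")
```

Doubling the number of points quadruples the number of pairs, which is why the code later in this article works on a 1000-point subset. Note that scikit-learn's `TSNE` uses the Barnes-Hut approximation by default (`method='barnes_hut'`), which avoids building the full dense matrix, so the figures above describe the exact algorithm rather than that default.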

Code: Python code to implement t-SNE on the MNIST dataset

Python3

# Importing necessary modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn  # used below as sn.FacetGrid for the final plot
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

Code #1: Reading the data

Python3

# Reading the data using pandas
df = pd.read_csv('mnist_train.csv')

# Print the first five rows of df
print(df.head(5))

# Save the labels into the variable labels
# (this name is used again in the t-SNE step)
labels = df['label']

# Drop the label column and store the pixel data in data
# (this name is used again in the preprocessing step)
data = df.drop("label", axis=1)

Output: (the first rows of df are printed)

Code #2: Data preprocessing

Python3

# Data-preprocessing: Standardizing the data
from sklearn.preprocessing import StandardScaler
 
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)

Output:

(15000, 784)
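
For readers unfamiliar with standardization, the toy example below shows what `StandardScaler` does to each column: it subtracts the column mean and divides by the column standard deviation. The small matrix is hypothetical; its constant last column mimics MNIST border pixels that have the same value in every image.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 3 features; the last column is constant.
X = np.array([[0.0, 10.0, 0.0],
              [2.0, 20.0, 0.0],
              [4.0, 30.0, 0.0],
              [6.0, 40.0, 0.0]])

scaled = StandardScaler().fit_transform(X)

# Same result computed by hand for the two non-constant columns.
manual = (X[:, :2] - X[:, :2].mean(axis=0)) / X[:, :2].std(axis=0)

print(scaled)
print("matches manual computation:", np.allclose(scaled[:, :2], manual))
print("constant column stays zero:", np.allclose(scaled[:, 2], 0.0))
```

Standardizing changes the values but not the shape of the data, which is why the output above still reports 784 columns.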

Code #3: Applying t-SNE and plotting the result

Python3

# TSNE
# Picking the first 1000 points, as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = labels[0:1000]

# Configuring the parameters:
# the number of components = 2
# default perplexity = 30
# default learning rate = 200 in older scikit-learn
# releases ('auto' from version 1.2 onward)
# default maximum number of iterations
# for the optimization = 1000
model = TSNE(n_components=2, random_state=0)

tsne_data = model.fit_transform(data_1000)

# Creating a new data frame which
# helps us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data=tsne_data,
     columns=("Dim_1", "Dim_2", "label"))

# Plotting the result of t-SNE
# (sn refers to seaborn: import seaborn as sn;
# height replaces the size argument, which seaborn
# renamed in version 0.9)
sn.FacetGrid(tsne_df, hue="label", height=6).map(
       plt.scatter, 'Dim_1', 'Dim_2').add_legend()

plt.show()

Output: (scatter plot of Dim_1 versus Dim_2, with points colored by label)
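
If `mnist_train.csv` is not at hand, the same workflow can be tried on the small 8x8 digits dataset that ships with scikit-learn. This is a stand-in chosen for convenience, not the dataset used in this article, and the 300-point subset size is an arbitrary choice to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Bundled 8x8 digit images (64 pixel features per sample).
digits = load_digits()

# Standardize a 300-point subset, mirroring the preprocessing step.
X = StandardScaler().fit_transform(digits.data[:300])
y = digits.target[:300]

# Embed into two dimensions with the default perplexity of 30.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X.shape, "->", embedding.shape)
```

The two columns of `embedding` play the role of Dim_1 and Dim_2, and `y` can be used to color the points in the same way as the label column.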