Implement your own word2vec (skip-gram) model in Python

Last modified: 2021-04-16 06:20:36 · Author: Mango

Prerequisite: Introduction to word2vec

Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.
In NLP techniques, we map words and phrases (from a vocabulary or corpus) to vectors of numbers to make processing easier. These kinds of language modelling techniques are called word embeddings.

In 2013, Google introduced word2vec, a group of related models used to produce word embeddings.

Let's implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.

In the skip-gram architecture of word2vec, the input is the center word and the predictions are the context words. Consider an array of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1) and W(i+2) are the context words if the sliding window size is 2.
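For example, with a window size of 2, the following minimal sketch enumerates the (center word, context words) pairs for a toy sentence (the sentence and the helper name generate_training_pairs are assumptions made only for illustration, not part of the implementation below):

# Minimal sketch: enumerate (center, context) pairs for a toy sentence.
def generate_training_pairs(tokens, window_size=2):
    pairs = []
    for i, center in enumerate(tokens):
        # every word within window_size positions of the center (excluding the center itself)
        context = [tokens[j] for j in range(i - window_size, i + window_size + 1)
                   if j != i and 0 <= j < len(tokens)]
        pairs.append((center, context))
    return pairs

print(generate_training_pairs(["the", "earth", "revolves", "around", "the", "sun"]))
# e.g. ("revolves", ["the", "earth", "around", "the"]) is one (center, context) pair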

Let's define some variables :

V    Number of unique words in our corpus of text ( Vocabulary )
x    Input layer (One hot encoding of our input word ). 
N    Number of neurons in the hidden layer of neural network
W    Weights between input layer and hidden layer
W'   Weights between hidden layer and output layer
y    A softmax output layer having probabilities of every word in our vocabulary
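For example, with the variables above and a toy five-word vocabulary, the input x is a V-dimensional vector with a single 1 at the index of the input word. A minimal sketch (the five-word vocabulary is only an illustrative assumption):

import numpy as np

# Illustrative assumption: a tiny 5-word vocabulary
vocab = {"around": 0, "earth": 1, "moon": 2, "revolves": 3, "sun": 4}
V = len(vocab)

def one_hot(word):
    # x has shape (V, 1): all zeros except a 1 at the word's index
    x = np.zeros((V, 1))
    x[vocab[word]] = 1
    return x

print(one_hot("earth").ravel())   # [0. 1. 0. 0. 0.]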
[Figure: Skip-gram architecture]

Having defined our neural network architecture, let's now do some math to derive the equations needed for gradient descent.

Forward propagation:

Multiply the one-hot encoding of the center word (denoted by x) with the first weight matrix W to get the hidden layer vector h (of size N x 1).

 h = W^T . x
(N x 1) = (N x V)(V x 1)

Now we multiply the hidden layer vector h with the second weight matrix W' to get a new vector u.

 u = W'^T . h
(V x 1) = (V x N)(N x 1)
Note that we have to apply a softmax to layer u to get our output layer y.

Let u_j be the j-th neuron of layer u
Let w_j be the j-th word in our vocabulary, where j is any index
Let V_{w_j} be the j-th column of matrix W' (the column corresponding to the word w_j)

 u_j = V_{w_j}^T . h
(1 x 1) = (1 x N)(N x 1)

y = softmax(u)
y_j is the j-th component of softmax(u)
y_j denotes the probability that w_j is a context word

 P(w_{j} | w_{i}) = y_{j} = \frac{e^{u_{j}}}{\sum_{j'=1}^{V} e^{u_{j'}}}

P(w_j | w_i) is the probability that w_j is a context word, given that w_i is the input word.
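To make the shapes concrete, here is a minimal numpy sketch of the forward pass described above (the vocabulary size, hidden layer size and random weights are illustrative assumptions, not the trained model):

import numpy as np

V, N = 5, 3                                   # illustrative vocabulary and hidden sizes
W  = np.random.uniform(-0.8, 0.8, (V, N))     # weights between input and hidden layer
W1 = np.random.uniform(-0.8, 0.8, (N, V))     # weights between hidden and output layer

x = np.zeros((V, 1)); x[2] = 1                # one-hot encoding of the center word
h = W.T @ x                                   # hidden layer, shape (N, 1)
u = W1.T @ h                                  # scores, shape (V, 1)
y = np.exp(u - u.max()); y /= y.sum()         # softmax: P(w_j | w_i) for every j

print(h.shape, u.shape, y.shape, round(float(y.sum()), 6))   # (3, 1) (5, 1) (5, 1) 1.0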

Hence, our goal is to maximize P(w_{j*} | w_i), where j* represents the indices of the context words.

Clearly, we want to maximize
 \prod_{c=1}^{C} \frac{e^{u_{j_c^*}}}{\sum_{j'=1}^{V} e^{u_{j'}}}
where j_c^* are the vocabulary indices of the context words. The context words range over c = 1, 2, 3, ..., C.
Let's take the negative log of this function to get our loss function, which we want to minimize:
 E = -\log \left\{ \prod_{c=1}^{C} \frac{e^{u_{j_c^*}}}{\sum_{j'=1}^{V} e^{u_{j'}}} \right\}, \quad E \text{ being our loss function}

Let t be the actual output vector from our training data, for a particular center word. It has 1's at the positions of the context words and 0's everywhere else. t_{j_c^*} are the 1's of the context words.
We can multiply u_{j_c^*} with t_{j_c^*}:

 E = -\log \left( \prod_{c=1}^{C} e^{u_{j_c^*}} \right) + \log \left( \sum_{j'=1}^{V} e^{u_{j'}} \right)^{C}

Solving this equation, we get our loss function as:

 E = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \left( \sum_{j'=1}^{V} e^{u_{j'}} \right)
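As a quick numerical sanity check (a sketch with made-up scores, not part of the derivation), the closed-form loss above can be compared against the negative log of the product of softmax probabilities it was derived from:

import numpy as np

u = np.array([1.2, -0.3, 0.7, 0.1, -1.0])     # made-up scores for a 5-word vocabulary
context_idx = [0, 2]                          # assumed indices of the C = 2 context words
C = len(context_idx)

y = np.exp(u) / np.exp(u).sum()               # softmax probabilities

loss_from_probs  = -np.log(np.prod(y[context_idx]))                    # -log prod P(w_{j*} | w_i)
loss_closed_form = -u[context_idx].sum() + C * np.log(np.exp(u).sum()) # the formula above

print(np.isclose(loss_from_probs, loss_closed_form))   # True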

Backpropagation:

The parameters to be adjusted are in the matrices W and W', hence we have to find the partial derivatives of our loss function with respect to W and W' in order to apply the gradient descent algorithm.
We have to find \frac{\partial E}{\partial W'} \text{ and } \frac{\partial E}{\partial W}

 \frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j} . \frac{\partial u_j}{\partial w'_{ij}}

 \frac{\partial E}{\partial u_j} = -\sum_{c=1}^{C} \frac{\partial u_{j_c^*}}{\partial u_j} + C . \frac{1}{\sum_{j'=1}^{V} e^{u_{j'}}} . \frac{\partial}{\partial u_j} \sum_{j'=1}^{V} e^{u_{j'}}

 \frac{\partial E}{\partial u_j} = -t_j + C . y_j

Treating the prediction for each of the C context-word positions separately (the simplification used in the implementation below), the error of the j-th output neuron is

 \frac{\partial E}{\partial u_j} = y_j - t_j = e_j

 \frac{\partial E}{\partial w'_{ij}} = e_j . \frac{\partial u_j}{\partial w'_{ij}} = e_j . \frac{\partial (w'_{ij} . h_i)}{\partial w'_{ij}}

 \frac{\partial E}{\partial w'_{ij}} = e_j . h_i

Now, finding \frac{\partial E}{\partial w_{ij}}:

 \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial u_j} . \frac{\partial u_j}{\partial w_{ij}}

 \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial u_j} . \frac{\partial u_j}{\partial h_i} . \frac{\partial h_i}{\partial w_{ij}}

 \frac{\partial E}{\partial w_{ij}} = e_j . w'_{ij} . \frac{\partial (w_{ij} . x_i)}{\partial w_{ij}}

 \frac{\partial E}{\partial w_{ij}} = e_j . w'_{ij} . x_i
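In matrix form these gradients are dE/dW' = h.e^T and dE/dW = x.(W'e)^T, which is what the backpropagate method below computes. As an optional sanity check (a sketch; the tiny sizes, the random seed and the use of a single context word, so that e = y - t is the exact gradient, are all assumptions), the analytic gradient of W' can be compared with a finite-difference estimate:

import numpy as np

def loss(W, W1, x, t):
    # E = -sum_{j*} u_{j*} + C * log(sum_j exp(u_j)), the loss derived above
    h = W.T @ x
    u = W1.T @ h
    C = int(t.sum())
    return float(-(u * t).sum() + C * np.log(np.exp(u).sum()))

V, N = 5, 3
rng = np.random.default_rng(0)
W  = rng.uniform(-0.8, 0.8, (V, N))
W1 = rng.uniform(-0.8, 0.8, (N, V))
x = np.zeros((V, 1)); x[1] = 1     # one-hot center word
t = np.zeros((V, 1)); t[3] = 1     # a single context word, so e = y - t is exact

# analytic gradients: dE/dW' = h.e^T and dE/dW = x.(W'e)^T
h = W.T @ x
u = W1.T @ h
y = np.exp(u) / np.exp(u).sum()
e = y - t
dW1 = h @ e.T
dW  = x @ (W1 @ e).T

# finite-difference estimate for one entry of W'
i, j, eps = 0, 3, 1e-5
W1p, W1m = W1.copy(), W1.copy()
W1p[i, j] += eps; W1m[i, j] -= eps
numeric = (loss(W, W1p, x, t) - loss(W, W1m, x, t)) / (2 * eps)
print(np.isclose(dW1[i, j], numeric))   # True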

Below is the implementation:

import numpy as np
import string
from nltk.corpus import stopwords 
   
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
   
class word2vec(object):
    def __init__(self):
        self.N = 10
        self.X_train = []
        self.y_train = []
        self.window_size = 2
        self.alpha = 0.001
        self.words = []
        self.word_index = {}
   
    def initialize(self,V,data):
        self.V = V
        self.W = np.random.uniform(-0.8, 0.8, (self.V, self.N))
        self.W1 = np.random.uniform(-0.8, 0.8, (self.N, self.V))
           
        self.words = data
        for i in range(len(data)):
            self.word_index[data[i]] = i
   
       
    def feed_forward(self,X):
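        # forward pass: h = W^T.x, u = W'^T.h, y = softmax(u)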
        self.h = np.dot(self.W.T,X).reshape(self.N,1)
        self.u = np.dot(self.W1.T,self.h)
        #print(self.u)
        self.y = softmax(self.u)  
        return self.y
           
    def backpropagate(self,x,t):
        e = self.y - np.asarray(t).reshape(self.V,1)
        # e.shape is V x 1
        dLdW1 = np.dot(self.h,e.T)
        X = np.array(x).reshape(self.V,1)
        dLdW = np.dot(X, np.dot(self.W1,e).T)
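        # dLdW1 and dLdW are the gradients dE/dW' = h.e^T and dE/dW = x.(W'e)^T derived above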
        self.W1 = self.W1 - self.alpha*dLdW1
        self.W = self.W - self.alpha*dLdW
           
    def train(self,epochs):
        for x in range(1, epochs + 1):
            self.loss = 0
            for j in range(len(self.X_train)):
                self.feed_forward(self.X_train[j])
                self.backpropagate(self.X_train[j],self.y_train[j])
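                # accumulate the loss E = -sum(u_{j*}) + C*log(sum(exp(u))) for this example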
                C = 0
                for m in range(self.V):
                    if(self.y_train[j][m]):
                        self.loss += -1*self.u[m][0]
                        C += 1
                self.loss += C*np.log(np.sum(np.exp(self.u)))
            print("epoch ",x, " loss = ",self.loss)
            self.alpha *= 1/( (1+self.alpha*x) )
              
    def predict(self,word,number_of_predictions):
        if word in self.words:
            index = self.word_index[word]
            X = [0 for i in range(self.V)]
            X[index] = 1
            prediction = self.feed_forward(X)
            output = {}
            for i in range(self.V):
                output[prediction[i][0]] = i
               
            top_context_words = []
            for k in sorted(output,reverse=True):
                top_context_words.append(self.words[output[k]])
                if(len(top_context_words)>=number_of_predictions):
                    break
       
            return top_context_words
        else:
            print("Word not found in dicitonary")
def preprocessing(corpus):
    stop_words = set(stopwords.words('english'))    
    training_data = []
    sentences = corpus.split(".")
    for i in range(len(sentences)):
        sentences[i] = sentences[i].strip()
        sentence = sentences[i].split()
        x = [word.strip(string.punctuation) for word in sentence
                                     if word not in stop_words]
        x = [word.lower() for word in x]
        training_data.append(x)
    return training_data
       
   
def prepare_data_for_training(sentences,w2v):
    data = {}
    for sentence in sentences:
        for word in sentence:
            if word not in data:
                data[word] = 1
            else:
                data[word] += 1
    V = len(data)
    data = sorted(list(data.keys()))
    vocab = {}
    for i in range(len(data)):
        vocab[data[i]] = i
       
    #for i in range(len(words)):
    for sentence in sentences:
        for i in range(len(sentence)):
            center_word = [0 for x in range(V)]
            center_word[vocab[sentence[i]]] = 1
            context = [0 for x in range(V)]
              
            # count every word inside the window (except the center word itself) as context
            for j in range(i-w2v.window_size,i+w2v.window_size+1):
                if i!=j and j>=0 and j<len(sentence):
                    context[vocab[sentence[j]]] += 1
            w2v.X_train.append(center_word)
            w2v.y_train.append(context)
    w2v.initialize(V,data)

    return w2v.X_train,w2v.y_train

corpus = ""
corpus += "The earth revolves around the sun. The moon revolves around the earth"
epochs = 1000
  
training_data = preprocessing(corpus)
w2v = word2vec()
  
prepare_data_for_training(training_data,w2v)
w2v.train(epochs) 
  
print(w2v.predict("around",3))    

Output: