TF-IDF for Bigrams and Trigrams

Last modified: 2023-12-03 14:47:57.893000 | Author: Mango

TF-IDF (term frequency-inverse document frequency) is a weighting scheme that quantifies how important a term is to a document relative to the rest of a corpus. It is commonly used in Natural Language Processing to rank the importance of words in a text.

TF-IDF takes into account two factors:

  • Term Frequency (TF): The number of times a term appears in a document.
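  • Inverse Document Frequency (IDF): A measure of how rare a term is across the corpus, commonly the logarithm of the total number of documents divided by the number of documents that contain the term.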
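
As a minimal sketch of how the two factors combine, assuming raw counts for TF and the common unsmoothed form ln(N / df) for IDF (the function name `tf_idf_score` and the numbers are made up for illustration):

```python
import math

def tf_idf_score(term_count, num_docs, docs_with_term):
    """TF-IDF with a raw-count TF and an unsmoothed natural-log IDF (one common variant)."""
    idf = math.log(num_docs / docs_with_term)
    return term_count * idf

# A term that occurs twice in a document and appears in 1 of 4 documents
rare = tf_idf_score(2, 4, 1)
# A term that occurs twice but appears in all 4 documents
common = tf_idf_score(2, 4, 4)
print(rare, common)
```

With this variant, a term found in every document scores 0 however often it occurs, while the rarer term keeps a positive weight.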

Commonly, TF-IDF is calculated for individual words. However, word order and context also carry meaning, so it can be useful to calculate TF-IDF for Bigrams (sequences of two consecutive words) and Trigrams (sequences of three consecutive words) as well.
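
To illustrate why context matters, here is a small sketch with two invented sentences. The helper `ngrams` is a hypothetical name that assumes simple lowercasing and whitespace tokenization:

```python
def ngrams(text, n):
    """Return the n-grams of a lowercased, whitespace-tokenized string."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

a = "the movie was good"
b = "the movie was not good"

# Every word of sentence a also occurs in sentence b
shared_unigrams = set(ngrams(a, 1)) & set(ngrams(b, 1))
# The bigrams expose the negation that single words hide
bigrams_only_in_b = set(ngrams(b, 2)) - set(ngrams(a, 2))

print(sorted(shared_unigrams))
print(sorted(bigrams_only_in_b))
```

Looking at single words alone, sentence b differs from sentence a only by the token "not"; the bigrams "was not" and "not good" make the change in meaning explicit.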

Calculating TF-IDF for Bigrams and Trigrams

To calculate TF-IDF for Bigrams and Trigrams, we need three quantities: document frequency, term frequency, and inverse document frequency.

Document Frequency (DF) Calculation for Bigrams and Trigrams

Document frequency (DF) is the number of documents that contain a certain term. When calculating the DF for Bigrams and Trigrams, we need to consider the following:

  • For Bigrams, each sequence of two consecutive words in a document is considered a term.
  • For Trigrams, each sequence of three consecutive words in a document is considered a term.
  • A document counts only once toward a term's DF, no matter how many times the term appears in it.
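
The once-per-document rule can be sketched as follows. The toy corpus `docs` and the helper `bigrams` are invented for this illustration:

```python
from collections import Counter

docs = ["new york new york", "i love new york", "york is old"]  # toy corpus

def bigrams(text):
    """Return the bigrams of a lowercased, whitespace-tokenized string."""
    words = text.lower().split()
    return [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]

doc_freq = Counter()
for doc in docs:
    # set() collapses repeats, so each document adds at most 1 per bigram
    doc_freq.update(set(bigrams(doc)))

print(doc_freq["new york"])
```

Here "new york" occurs twice in the first document and once in the second, yet its DF is 2 because only two documents contain it.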

We can calculate the DF for Bigrams and Trigrams using the following code:

from collections import defaultdict

# Define documents
documents = ['This is a test document', 'This document is another test', 'And this is yet another test document']

# Define function to return list of n-grams
def get_ngrams(text, n):
    words = text.lower().split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

# Calculate DF for Bigrams and Trigrams
df = defaultdict(int)
for document in documents:
    for n in range(2,4):
        for term in set(get_ngrams(document, n)):
            df[term] += 1

print(df)

Term Frequency (TF) Calculation for Bigrams and Trigrams

Term frequency (TF) is the number of times a term appears in a document. When calculating the TF for Bigrams and Trigrams, we need to consider the following:

  • For Bigrams, we need to count the number of times the two consecutive words appear in the document.
  • For Trigrams, we need to count the number of times the three consecutive words appear in the document.
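
As a quick sketch of counting repeated n-grams inside one document, the snippet below uses `collections.Counter`. The helper name `ngram_counts` and the sample sentence are assumptions for illustration. The relative-frequency step at the end is an optional variant that this article's code does not use:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count how often each n-gram occurs in a lowercased, whitespace-tokenized string."""
    words = text.lower().split()
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

counts = ngram_counts("to be or not to be", 2)
print(counts["to be"])   # this bigram occurs twice
print(counts["or not"])  # this bigram occurs once

# Optional variant: divide by the number of bigrams so long documents are not favored
total = sum(counts.values())
relative = {term: count / total for term, count in counts.items()}
print(relative["to be"])
```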

We can calculate the TF for Bigrams and Trigrams using the following code:

# Define function to return dictionary of TF for n-grams
def get_tf(text, n):
    tf = defaultdict(int)
    # get_ngrams already lowercases and splits the text
    for ngram in get_ngrams(text, n):
        tf[ngram] += 1
    return tf

# Calculate TF for Bigrams and Trigrams
tf = []
for document in documents:
    tf_doc = {}
    for n in range(2,4):
        tf_doc.update(get_tf(document, n))
    tf.append(tf_doc)

print(tf)

Inverse Document Frequency (IDF) Calculation for Bigrams and Trigrams

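Inverse document frequency (IDF) measures how rare a term is across the corpus. Here it is computed as the natural logarithm of the total number of documents divided by the number of documents that contain the term, so a term that appears in every document gets an IDF of 0. When calculating IDF for Bigrams and Trigrams, we need to consider the following: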

  • For Bigrams, a combination of two consecutive words is considered a term.
  • For Trigrams, a combination of three consecutive words is considered a term.
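
For a corpus of three documents, the IDF values can be worked out by hand. The sketch below compares the plain ln(N / df) form with a smoothed variant, ln((1 + N) / (1 + df)) + 1, which is assumed here to match the form scikit-learn documents for `smooth_idf=True`. The function names `idf_plain` and `idf_smoothed` are hypothetical:

```python
import math

def idf_plain(num_docs, doc_freq):
    """Unsmoothed IDF: 0 for a term that appears in every document."""
    return math.log(num_docs / doc_freq)

def idf_smoothed(num_docs, doc_freq):
    """Smoothed IDF variant: never 0, and safe even if doc_freq were 0."""
    return math.log((1 + num_docs) / (1 + doc_freq)) + 1

# A term found in 1, 2, or all 3 documents of a 3-document corpus
for doc_freq in (1, 2, 3):
    print(doc_freq, round(idf_plain(3, doc_freq), 3), round(idf_smoothed(3, doc_freq), 3))
```

Which form to use is a design choice: the plain form discards terms that occur everywhere, while the smoothed form keeps a small weight for them.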

We can calculate the IDF for Bigrams and Trigrams using the following code:

import math

# Calculate IDF for Bigrams and Trigrams
# df holds only bigrams and trigrams, so no filtering by term length is needed
idf = {term: math.log(len(documents) / count) for term, count in df.items()}

print(idf)

Calculating TF-IDF for Bigrams and Trigrams

Now that we have calculated the DF, TF and IDF for Bigrams and Trigrams, we can multiply each term's TF by its IDF to get the TF-IDF of every term in every document, using the following code:

# Calculate TF-IDF for Bigrams and Trigrams
tf_idf = []
for t in tf:
    tf_idf_doc = {}
    for term in t.keys():
        tf_idf_doc[term] = t[term] * idf[term]
    tf_idf.append(tf_idf_doc)

print(tf_idf)

Conclusion

In this article, we have discussed how to calculate TF-IDF for Bigrams and Trigrams. We have seen how to calculate DF, TF and IDF for Bigrams and Trigrams, and finally how to combine them to calculate TF-IDF.
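
For reference, the steps covered in this article can be condensed into one function. This is a sketch under the same assumptions as the code above (lowercasing, whitespace tokenization, raw-count TF, unsmoothed natural-log IDF). The names `ngrams` and `tf_idf_ngrams` are made up here:

```python
import math
from collections import Counter

def ngrams(text, n):
    """Return the n-grams of a lowercased, whitespace-tokenized string."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf_ngrams(documents, sizes=(2, 3)):
    """Return one {n-gram: TF-IDF score} dict per document."""
    # TF: raw counts of every bigram and trigram in each document
    tf = [Counter(g for n in sizes for g in ngrams(doc, n)) for doc in documents]
    # DF: the keys of each Counter are unique, so each document counts once per term
    df = Counter(term for counts in tf for term in counts)
    num_docs = len(documents)
    return [{term: count * math.log(num_docs / df[term])
             for term, count in counts.items()}
            for counts in tf]

documents = ['This is a test document',
             'This document is another test',
             'And this is yet another test document']

scores = tf_idf_ngrams(documents)
for doc_scores in scores:
    # Show the three highest-scoring n-grams, breaking ties alphabetically
    top = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print(top)
```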

Happy Coding!