TF-IDF document algorithm in C#: Code Example

Last modified: 2023-12-03 15:35:19.426000 · Author: Mango

Introduction

TF-IDF (Term Frequency-Inverse Document Frequency) is a simple yet widely used weighting scheme in information retrieval and text mining. It measures how important a term is to one document relative to the rest of a corpus. The idea is that a term that appears frequently in a document but rarely elsewhere in the corpus is likely to be a key term for that document.

How does TF-IDF work?

TF-IDF consists of two parts, TF (Term Frequency) and IDF (Inverse Document Frequency).

Term Frequency (TF)

Term Frequency is a measure of how often a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in the document.
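
The definition above can be written compactly as a formula. The symbols are notation assumed here rather than taken from the article: f(t, d) is the number of times term t occurs in document d, and the denominator sums the counts of every term in d, which is the document length in words.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
```

For example, a term that occurs 3 times in a 100-word document has a term frequency of 3 / 100 = 0.03.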

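// Requires: using System;
double ComputeTermFrequency(string term, string document)
{
    // Split on whitespace only; punctuation stays attached to the words.
    var words = document.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    if (words.Length == 0)
    {
        return 0.0; // An empty document contains no terms; avoids 0 / 0.
    }

    string target = term.Trim();
    int count = 0;
    foreach (var word in words)
    {
        // Case-insensitive comparison without allocating lowercase copies.
        if (string.Equals(word, target, StringComparison.OrdinalIgnoreCase))
        {
            count++;
        }
    }

    return (double)count / words.Length;
}
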
Inverse Document Frequency (IDF)

Inverse Document Frequency is a measure of how common or rare a term is across all the documents in a corpus. It is calculated by taking the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term. The rarer the term, the higher its IDF; a term that appears in every document has an IDF of zero.
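
Written as a formula, under notation assumed here for illustration: N is the total number of documents in the corpus D, and n_t is the number of documents that contain term t. The natural logarithm is used because that is what C#'s Math.Log computes when called with one argument; other bases only rescale the result.

```latex
\mathrm{idf}(t, D) = \ln \frac{N}{n_t}
```

For example, in a corpus of 1000 documents, a term found in 10 of them has an IDF of ln(1000 / 10) = ln(100), roughly 4.605, while a term found in all 1000 has an IDF of ln(1) = 0. The formula is undefined when n_t = 0, so an implementation has to decide how to handle terms that occur in no document.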

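// Requires: using System; using System.Collections.Generic; using System.Linq;
double ComputeInverseDocumentFrequency(string term, List<string> documents)
{
    string target = term.Trim();
    int count = 0;
    foreach (var document in documents)
    {
        // Match whole words, as ComputeTermFrequency does; a substring test
        // would count "cat" inside "concatenate".
        var words = document.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Any(w => string.Equals(w, target, StringComparison.OrdinalIgnoreCase)))
        {
            count++;
        }
    }

    // No document contains the term: return 0 rather than dividing by zero.
    if (count == 0)
    {
        return 0.0;
    }

    return Math.Log((double)documents.Count / count);
}
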
TF-IDF Calculation

The TF-IDF value of a term is the product of its term frequency and inverse document frequency:

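double ComputeTFIDF(string term, string document, List<string> documents)
{
    double tf = ComputeTermFrequency(term, document);
    if (tf == 0)
    {
        // The term is absent from this document, so its score is 0 whatever the IDF.
        return 0.0;
    }

    // TF is specific to this document; IDF depends only on the corpus.
    double idf = ComputeInverseDocumentFrequency(term, documents);

    return tf * idf;
}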
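
As a rough end-to-end sketch of how the three steps fit together, the following self-contained program scores two terms against a tiny made-up corpus. The class name TfIdfDemo, the Tokenize helper, and the sample sentences are assumptions for illustration and do not come from the article. Returning 0 for a term that occurs in no document is one possible convention, not part of the TF-IDF definition.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfDemo
{
    // Hypothetical helper: lowercases the text and splits it on any whitespace.
    // Punctuation is not stripped, so "mat." and "mat" would be different tokens.
    static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Occurrences of the term divided by the number of words in the document.
    public static double TermFrequency(string term, string document)
    {
        var words = Tokenize(document);
        if (words.Length == 0) return 0.0;
        string t = term.Trim().ToLowerInvariant();
        return (double)words.Count(w => w == t) / words.Length;
    }

    // Natural log of (number of documents / number of documents containing the term).
    public static double InverseDocumentFrequency(string term, IReadOnlyList<string> documents)
    {
        string t = term.Trim().ToLowerInvariant();
        int containing = documents.Count(d => Tokenize(d).Contains(t));
        // Assumed convention: a term found in no document scores 0.
        return containing == 0 ? 0.0 : Math.Log((double)documents.Count / containing);
    }

    public static double TfIdf(string term, string document, IReadOnlyList<string> documents) =>
        TermFrequency(term, document) * InverseDocumentFrequency(term, documents);

    public static void Main()
    {
        var corpus = new List<string>
        {
            "the cat sat on the mat",
            "the dog chased the cat",
            "the bird sang"
        };

        // "the" occurs in all three documents, so idf = ln(3/3) = 0 and the score is 0.
        Console.WriteLine(TfIdf("the", corpus[0], corpus));

        // "mat" is 1 of 6 words and occurs in 1 of 3 documents: (1/6) * ln(3), about 0.183.
        Console.WriteLine(TfIdf("mat", corpus[0], corpus));
    }
}
```

The common word "the" scores zero even though it is the most frequent word in the first document, while the rarer "mat" gets a positive score. This is the behaviour the TF and IDF factors are meant to produce together.
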

Conclusion

TF-IDF is a simple and effective way to measure the importance of a term in a document relative to a corpus. By multiplying term frequency by inverse document frequency, it gives high scores to terms that are frequent in one document but rare across the others, which makes it a useful tool for information retrieval and text mining.