Find the k most frequent words from a file

Last modified: 2021-04-26 18:17:46             Author: Mango

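Given a book of words, assume you have enough main memory to hold all of the words. Design a data structure that finds the K most frequently occurring words. The data structure should be dynamic so that new words can be added.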

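A simple solution is to use hashing. Hash all words one by one into a hash table. If a word is already present, increment its count. Finally, traverse the hash table and return the k words with the highest counts.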
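The hashing approach can be sketched as follows. This is a minimal illustration, not part of the original program: the function name topKByHashing is hypothetical, words are taken to be whitespace-separated tokens, and ties are broken alphabetically only to make the result deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: count every whitespace-separated word in `text`,
// then return the k words with the highest counts, most frequent first.
std::vector<std::pair<std::string, unsigned>>
topKByHashing(const std::string& text, std::size_t k)
{
    std::unordered_map<std::string, unsigned> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        ++counts[word];  // a new word starts at 0 and is incremented to 1

    // Copy the table into a vector so the entries can be ordered by count.
    std::vector<std::pair<std::string, unsigned>> all(counts.begin(),
                                                      counts.end());
    k = std::min(k, all.size());

    // Only the first k positions need to be in sorted order.
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
        [](const std::pair<std::string, unsigned>& a,
           const std::pair<std::string, unsigned>& b) {
            if (a.second != b.second)
                return a.second > b.second;  // higher count first
            return a.first < b.first;        // assumed tie-break: alphabetical
        });
    all.resize(k);
    return all;
}
```

For the text "to be or not to be that is to" and k = 2, this returns ("to", 3) followed by ("be", 2). The final pass still has to visit every distinct word each time the top k is requested, which is the cost the Trie and Min Heap approach tries to avoid.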

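We can use a Trie and a Min Heap to get the k most frequent words efficiently. The idea is to use the Trie to search for existing words and to add new words efficiently. The Trie also stores the number of occurrences of each word. A Min Heap of size k keeps track of the k most frequent words at any point in time (the Min Heap is used the same way as when finding the k largest elements of a collection).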
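The size-k Min Heap idea can be seen in isolation on plain integers. The sketch below is an illustration only: the name kLargest is hypothetical, and it uses std::priority_queue instead of the hand-written heap that the program in this article builds.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper: return the k largest values of `values`,
// ordered from smallest to largest.
std::vector<int> kLargest(const std::vector<int>& values, std::size_t k)
{
    // std::greater turns std::priority_queue into a min heap, so top()
    // is the smallest of the (at most k) values kept so far.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;

    for (int v : values)
    {
        if (heap.size() < k)
            heap.push(v);        // heap not full yet: always keep the value
        else if (k > 0 && v > heap.top())
        {
            heap.pop();          // evict the current minimum
            heap.push(v);        // and keep the larger value instead
        }
    }

    // Drain the heap; values come out in increasing order.
    std::vector<int> result;
    while (!heap.empty())
    {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}
```

Calling kLargest({5, 1, 9, 3, 7, 8}, 3) returns 7, 8, 9. The root of the heap is always the weakest of the current top k, so a single comparison against it decides whether a new value belongs in the top k.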
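The Trie and the Min Heap are linked to each other by storing an additional field 'indexMinHeap' in the Trie and a pointer 'trNode' in the Min Heap. The value of 'indexMinHeap' is kept at -1 for words that are currently not in the Min Heap (that is, not among the top k frequent words). For words that are in the Min Heap, 'indexMinHeap' holds the index of the word in the Min Heap. The pointer 'trNode' in the Min Heap points to the leaf node that corresponds to the word in the Trie. In the program below, this pointer is the 'root' member of MinHeapNode.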
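The two-way link described above has to stay consistent whenever heap entries move. The following sketch shows that bookkeeping on simplified stand-in types. WordNode, HeapSlot, swapSlots and linksConsistent are hypothetical names used only for this illustration; they are not the structures of the full program.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in (assumption) for a Trie leaf node.
struct WordNode
{
    unsigned frequency = 0;
    int indexMinHeap = -1;     // -1 means the word is not in the heap
};

// Simplified stand-in (assumption) for a Min Heap node.
struct HeapSlot
{
    WordNode* node = nullptr;  // back-pointer to the Trie leaf
    unsigned frequency = 0;
};

// Swap two heap slots, then repair the indices stored on the Trie side.
void swapSlots(std::vector<HeapSlot>& heap, std::size_t i, std::size_t j)
{
    std::swap(heap[i], heap[j]);
    heap[i].node->indexMinHeap = static_cast<int>(i);
    heap[j].node->indexMinHeap = static_cast<int>(j);
}

// True when every heap slot and its Trie node agree on the slot's position.
bool linksConsistent(const std::vector<HeapSlot>& heap)
{
    for (std::size_t i = 0; i < heap.size(); ++i)
        if (heap[i].node == nullptr ||
            heap[i].node->indexMinHeap != static_cast<int>(i))
            return false;
    return true;
}
```

A check like linksConsistent is useful as a debugging assertion after every heap operation: if it ever fails, a swap or replacement forgot to update 'indexMinHeap' on one side of the link.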

The complete process to print the k most frequent words from a file is as follows.

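Read all words one by one. For every word, insert it into the Trie, or increment its counter if it already exists. The word then also has to be inserted into the Min Heap. For insertion into the Min Heap, 3 cases arise: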

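1. The word is already present in the Min Heap. We just increase the corresponding frequency value in the Min Heap and call minHeapify() for the index obtained from the 'indexMinHeap' field in the Trie. When Min Heap nodes are swapped, we update the corresponding 'indexMinHeap' values in the Trie. Remember that each Min Heap node also holds a pointer to the Trie leaf node.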

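2. The Min Heap is not full. We insert the new word into the Min Heap, then update the Trie pointer ('root') in the Min Heap node and the Min Heap index in the Trie leaf node. Then we call buildMinHeap().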

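3. The Min Heap is full. Two sub-cases arise.
    3.1 The frequency of the new word is less than or equal to the frequency of the word stored at the head of the Min Heap. Do nothing.

    3.2 The frequency of the new word is greater than the frequency of the word stored at the head of the Min Heap. Replace the head and update the fields. Make sure to set the Min Heap index of the word being replaced to -1 in the Trie, because that word is no longer in the Min Heap.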

4. Finally, the Min Heap holds the k most frequent words among all words in the given file, so we only need to print all words present in the Min Heap.

// A program to find k most frequent words in a file
#include <stdio.h>
#include <string.h>
#include <ctype.h>
  
# define MAX_CHARS 26
# define MAX_WORD_SIZE 30
  
// A Trie node
struct TrieNode
{
    bool isEnd; // indicates end of word
    unsigned frequency;  // the number of occurrences of a word
    int indexMinHeap; // the index of the word in minHeap
    TrieNode* child[MAX_CHARS]; // represents 26 slots each for 'a' to 'z'.
};
  
// A Min Heap node
struct MinHeapNode
{
    TrieNode* root; // points to the Trie node where this word ends
    unsigned frequency; //  number of occurrences
    char* word; // the actual word stored
};
  
// A Min Heap
struct MinHeap
{
    unsigned capacity; // the maximum number of nodes in the min heap
    int count; // indicates the number of slots filled.
    MinHeapNode* array; //  represents the collection of minHeapNodes
};
  
// A utility function to create a new Trie node
TrieNode* newTrieNode()
{
    // Allocate memory for Trie Node
    TrieNode* trieNode = new TrieNode;
  
    // Initialize values for new node
    trieNode->isEnd = 0;
    trieNode->frequency = 0;
    trieNode->indexMinHeap = -1;
    for( int i = 0; i < MAX_CHARS; ++i )
        trieNode->child[i] = NULL;
  
    return trieNode;
}
  
// A utility function to create a Min Heap of given capacity
MinHeap* createMinHeap( int capacity )
{
    MinHeap* minHeap = new MinHeap;
  
    minHeap->capacity = capacity;
    minHeap->count  = 0;
  
    // Allocate memory for array of min heap nodes
    minHeap->array = new MinHeapNode [ minHeap->capacity ];
  
    return minHeap;
}
  
// A utility function to swap two min heap nodes. This function
// is needed in minHeapify
void swapMinHeapNodes ( MinHeapNode* a, MinHeapNode* b )
{
    MinHeapNode temp = *a;
    *a = *b;
    *b = temp;
}
  
// This is the standard minHeapify function. It does one thing extra.
// It updates the indexMinHeap in Trie when two nodes are swapped
// in min heap
void minHeapify( MinHeap* minHeap, int idx )
{
    int left, right, smallest;
  
    left = 2 * idx + 1;
    right = 2 * idx + 2;
    smallest = idx;
    if ( left < minHeap->count &&
         minHeap->array[ left ]. frequency <
         minHeap->array[ smallest ]. frequency
       )
        smallest = left;
  
    if ( right < minHeap->count &&
         minHeap->array[ right ]. frequency <
         minHeap->array[ smallest ]. frequency
       )
        smallest = right;
  
    if( smallest != idx )
    {
        // Update the corresponding index in Trie node.
        minHeap->array[ smallest ]. root->indexMinHeap = idx;
        minHeap->array[ idx ]. root->indexMinHeap = smallest;
  
        // Swap nodes in min heap
        swapMinHeapNodes (&minHeap->array[ smallest ], &minHeap->array[ idx ]);
  
        minHeapify( minHeap, smallest );
    }
}
  
// A standard function to build a heap
void buildMinHeap( MinHeap* minHeap )
{
    int n, i;
    n = minHeap->count - 1;
  
    for( i = ( n - 1 ) / 2; i >= 0; --i )
        minHeapify( minHeap, i );
}
  
// Inserts a word to heap, the function handles the 3 cases explained above
void insertInMinHeap( MinHeap* minHeap, TrieNode** root, const char* word )
{
    // Case 1: the word is already present in minHeap
    if( (*root)->indexMinHeap != -1 )
    {
        ++( minHeap->array[ (*root)->indexMinHeap ]. frequency );
  
        // percolate down
        minHeapify( minHeap, (*root)->indexMinHeap );
    }
  
    // Case 2: Word is not present and heap is not full
    else if( minHeap->count < minHeap->capacity )
    {
        int count = minHeap->count;
        minHeap->array[ count ]. frequency = (*root)->frequency;
        minHeap->array[ count ]. word = new char [strlen( word ) + 1];
        strcpy( minHeap->array[ count ]. word, word );
  
        minHeap->array[ count ]. root = *root;
        (*root)->indexMinHeap = minHeap->count;
  
        ++( minHeap->count );
        buildMinHeap( minHeap );
    }
  
    // Case 3: Word is not present and heap is full. And frequency of word
    // is more than root. The root is the least frequent word in heap,
    // replace root with new word
    else if ( (*root)->frequency > minHeap->array[0]. frequency )
    {
  
        minHeap->array[ 0 ]. root->indexMinHeap = -1;
        minHeap->array[ 0 ]. root = *root;
        minHeap->array[ 0 ]. root->indexMinHeap = 0;
        minHeap->array[ 0 ]. frequency = (*root)->frequency;
  
        // delete previously allocated memory and store the new word
        delete [] minHeap->array[ 0 ]. word;
        minHeap->array[ 0 ]. word = new char [strlen( word ) + 1];
        strcpy( minHeap->array[ 0 ]. word, word );
  
        minHeapify ( minHeap, 0 );
    }
}
  
// Inserts a new word to both Trie and Heap
void insertUtil ( TrieNode** root, MinHeap* minHeap,
                        const char* word, const char* dupWord )
{
    // Base Case
    if ( *root == NULL )
        *root = newTrieNode();
  
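    //  There are still more characters in word. The word is assumed to
    //  contain only letters, since the Trie has slots for 'a' to 'z' only.
    if ( *word != '\0' )
        insertUtil ( &((*root)->child[ tolower( (unsigned char) *word ) - 'a' ]),
                         minHeap, word + 1, dupWord );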
    else // The complete word is processed
    {
        // word is already present, increase the frequency
        if ( (*root)->isEnd )
            ++( (*root)->frequency );
        else
        {
            (*root)->isEnd = 1;
            (*root)->frequency = 1;
        }
  
        // Insert in min heap also
        insertInMinHeap( minHeap, root, dupWord );
    }
}
  
  
// add a word to Trie & min heap.  A wrapper over the insertUtil
void insertTrieAndHeap(const char *word, TrieNode** root, MinHeap* minHeap)
{
    insertUtil( root, minHeap, word, word );
}
  
// A utility function to show results, The min heap
// contains k most frequent words so far, at any time
void displayMinHeap( MinHeap* minHeap )
{
    int i;
  
    // print top K word with frequency
    for( i = 0; i < minHeap->count; ++i )
    {
        printf( "%s : %u\n", minHeap->array[i].word,
                            minHeap->array[i].frequency );
    }
}
  
// The main function that takes a file as input, add words to heap
// and Trie, finally shows result from heap
void printKMostFreq( FILE* fp, int k )
{
    // Create a Min Heap of Size k
    MinHeap* minHeap = createMinHeap( k );
     
    // Create an empty Trie
    TrieNode* root = NULL;
  
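    // A buffer to store one word at a time
    char buffer[MAX_WORD_SIZE];
  
    // Read words one by one from file. The field width 29 (MAX_WORD_SIZE - 1)
    // keeps fscanf from overflowing the buffer. Non-letter characters are
    // dropped because the Trie only has slots for 'a' to 'z'.
    while( fscanf( fp, "%29s", buffer ) == 1 )
    {
        int len = 0;
        for( int i = 0; buffer[i] != '\0'; ++i )
            if ( isalpha( (unsigned char) buffer[i] ) )
                buffer[len++] = buffer[i];
        buffer[len] = '\0';
  
        // Insert the cleaned word in Trie and Min Heap
        if ( len > 0 )
            insertTrieAndHeap(buffer, &root, minHeap);
    }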
  
    // The Min Heap will have the k most frequent words, so print Min Heap nodes
    displayMinHeap( minHeap );
}
  
// Driver program to test above functions
int main()
{
    int k = 5;
    FILE *fp = fopen ("file.txt", "r");
    if (fp == NULL)
        printf ("File doesn't exist ");
    else
    {
        printKMostFreq (fp, k);
        fclose (fp); // release the file handle once we are done
    }
    return 0;
}

Output:

your : 3
well : 3
and : 4
to : 4
Geeks : 6

The above output is for a file with the following contents.

Welcome to the world of Geeks 
This portal has been created to provide well written well thought and well explained 
solutions for selected questions If you like Geeks for Geeks and would like to contribute 
here is your chance You can write article and mail your article to contribute at 
geeksforgeeks org See your article appearing on the Geeks for Geeks main page and help 
thousands of other Geeks