Find the k most frequent words from a data set in Python

Last modified: 2022-05-13 01:54:32.193000  Author: Mango

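Given a data set, we want to find its k most frequent words.

A solution to this problem already exists in "Find the k most frequent words from a file". In Python, however, we can solve it very efficiently with the help of some high-performance modules.

For this we will use the collections module, which provides specialized high-performance container data types. In particular, we will use the Counter class from this module.

Examples: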

Input : "John is the son of John second.
         Second son of John second is William second."
Output : [('John', 3), ('second', 3), ('is', 2), ('son', 2)]

Explanation :
1. The string is converted into a list of words like this
   (punctuation is assumed to have been removed):
    ['John', 'is', 'the', 'son', 'of', 'John',
     'second', 'Second', 'son', 'of', 'John',
     'second', 'is', 'William', 'second']
2. 'most_common(4)' then returns the four most
   frequent words and their counts as tuples.
   Counting is case-sensitive, so 'Second' is not
   counted as 'second'. Words with equal counts are
   listed in the order they first appear (Python 3.7+).


Input : "geeks for geeks is for geeks. By geeks
         and for the geeks."
Output : [('geeks', 5), ('for', 3)]

Explanation :
most_common(2) returns the two most frequent words and their counts.
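
Both examples above assume that punctuation has already been removed. A plain split() would keep tokens such as 'geeks.' as separate words. The following is a minimal sketch of one way to obtain the tokens shown, assuming that a word is a run of ASCII letters. The tokenize helper and its regular expression are illustrative assumptions, not part of the original approach.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the runs of ASCII letters in text (hypothetical helper)."""
    # Assumption: a "word" is one or more ASCII letters; punctuation is dropped.
    return re.findall(r"[A-Za-z]+", text)


text = "geeks for geeks is for geeks. By geeks and for the geeks."

# Plain split() keeps the trailing periods, so 'geeks.' is its own token.
raw_counts = Counter(text.split())

# Tokenizing first merges 'geeks.' with 'geeks'.
clean_counts = Counter(tokenize(text))

print(raw_counts["geeks"], raw_counts["geeks."])
print(clean_counts.most_common(2))
```

With the raw split, 'geeks' and 'geeks.' are tallied separately. After tokenizing, the two most common words match the output given for the second example.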

Approach :

1. Import the Counter class from the collections module.
2. Split the string into a list using split(); it returns the list of words.
3. Pass the list to the Counter class to create a Counter instance.
4. Call most_common() on that instance; it returns a list of the most frequent words and their counts.
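
To make the steps concrete, here is a minimal sketch on a short made-up sentence that prints the intermediate value after each step. The sentence and the variable names are chosen only for illustration.

```python
from collections import Counter

sentence = "to be or not to be"  # made-up input for illustration

# Split the string into a list of words.
words = sentence.split()
print(words)

# Build a Counter mapping each word to its count.
counts = Counter(words)
print(counts)

# Ask for the two most frequent words and their counts.
top_two = counts.most_common(2)
print(top_two)
```

Here 'to' and 'be' both occur twice, so they are the two entries returned by most_common(2).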

Below is the Python implementation of the above approach:

# Python program to find the k most frequent words
# in a data set
from collections import Counter

# Adjacent string literals inside parentheses are joined into one string.
data_set = (
    "Welcome to the world of Geeks "
    "This portal has been created to provide well written well "
    "thought and well explained solutions for selected questions "
    "If you like Geeks for Geeks and would like to contribute "
    "here is your chance You can write article and mail your article "
    " to contribute at geeksforgeeks org See your article appearing on "
    "the Geeks for Geeks main page and help thousands of other Geeks. "
)

# split() returns a list of the whitespace-separated words.
# Punctuation is kept, so the final 'Geeks.' is a separate word.
split_it = data_set.split()

# Count the words. The result gets its own name so that it
# does not shadow the Counter class.
word_counts = Counter(split_it)

# most_common(k) returns the k most frequent words with their counts.
# Words with equal counts keep the order in which they first appear
# (Python 3.7+).
most_occur = word_counts.most_common(4)

print(most_occur)

Output :

[('Geeks', 5), ('to', 4), ('and', 4), ('well', 3)]
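
Because Counter compares strings exactly, 'Geeks' and 'Geeks.' are counted as different words in the program above, and so would 'Geeks' and 'geeks'. If that is not what you want, one option is to normalize the text before counting. The following is a minimal sketch, assuming that words are runs of ASCII letters and that case should be ignored. The function name k_most_frequent and its parameters are hypothetical.

```python
import re
from collections import Counter


def k_most_frequent(text, k):
    """Return the k most frequent words in text, ignoring case.

    Hypothetical helper: lowercases the text and treats every run of
    ASCII letters as one word, so punctuation is discarded.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(k)


sample = (
    "John is the son of John second. "
    "Second son of John second is William second."
)
print(k_most_frequent(sample, 2))
```

With case ignored, 'Second' and 'second' are merged, so 'second' is counted four times and 'john' three times in this sample.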