Hyperlink-Induced Topic Search (HITS) Algorithm Using the NetworkX Module in Python

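The Hyperlink-Induced Topic Search (HITS) algorithm is a link analysis algorithm that rates web pages, developed by Jon Kleinberg. It uses the link structure of the web to discover and rank the pages relevant to a particular search.
HITS uses hubs and authorities to define a recursive relationship between web pages. Before looking at the HITS algorithm itself, we first need to understand Hubs and Authorities.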

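Given a query to a search engine, the set of highly relevant web pages is called the Roots. They are potential Authorities.
Pages that are not very relevant themselves but point to pages in the Root are called Hubs. Thus, an Authority is a page that many Hubs link to, whereas a Hub is a page that links to many Authorities.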

Algorithm -

-> Let number of iterations be k.
-> Each node is assigned a Hub score = 1 and an Authority score = 1.
-> Repeat k times:

Hub update: Each node’s Hub score = $\Sigma$ (Authority score of each node it points to).
Authority update: Each node’s Authority score = $\Sigma$ (Hub score of each node pointing to it).
Normalize the scores by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and each Authority score by the square root of the sum of the squares of all Authority scores. (optional)
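
The steps above can be sketched in plain Python. This is a minimal illustration, not the NetworkX implementation: the function name `hits_sketch` and the three-page graph are hypothetical, and the sketch assumes that both updates in a round read the scores from the previous round.

```python
from math import sqrt


def hits_sketch(edges, k, normalize=True):
    """Run k rounds of hub/authority updates on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(k):
        # Assumption: both updates use the scores from the previous round.
        new_hub = {node: 0.0 for node in nodes}
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]   # src points to dst
            new_auth[dst] += hub[src]   # dst is pointed to by src

        if normalize:
            # Divide by the square root of the sum of squares (skip if zero).
            hub_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
            auth_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
            new_hub = {n: v / hub_norm for n, v in new_hub.items()}
            new_auth = {n: v / auth_norm for n, v in new_auth.items()}

        hub, auth = new_hub, new_auth

    return hub, auth


# Hypothetical three-page graph: P links to Q and R, Q links to R.
toy_edges = [("P", "Q"), ("P", "R"), ("Q", "R")]
toy_hub, toy_auth = hits_sketch(toy_edges, k=1, normalize=False)
print("Hub Scores: ", toy_hub)
print("Authority Scores: ", toy_auth)
```

After a single unnormalized round starting from all ones, each Hub score is simply the node's number of outgoing links and each Authority score its number of incoming links; later rounds start to weight links by the scores of the pages at the other end.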

Now, let us see how to implement this algorithm using the NetworkX module.
Let us consider the following graph:

Running the HITS algorithm on it with $k = 3$ (without normalization) gives:

Initially,
Hub Scores:        Authority Scores:
A -> 1             A -> 1
B -> 1             B -> 1
C -> 1             C -> 1
D -> 1             D -> 1
E -> 1             E -> 1
F -> 1             F -> 1
G -> 1             G -> 1
H -> 1             H -> 1

After 1st iteration,
Hub Scores:        Authority Scores:
A -> 1             A -> 3
B -> 2             B -> 2
C -> 1             C -> 4
D -> 2             D -> 2
E -> 4             E -> 1
F -> 1             F -> 1
G -> 2             G -> 0
H -> 1             H -> 1

After 2nd iteration,
Hub Scores:        Authority Scores:
A -> 2             A -> 4
B -> 5             B -> 6
C -> 3             C -> 7
D -> 6             D -> 5
E -> 9             E -> 2
F -> 1             F -> 4
G -> 7             G -> 0
H -> 3             H -> 1

After 3rd iteration,
Hub Scores:        Authority Scores:
A -> 5             A -> 13
B -> 9             B -> 15
C -> 4             C -> 27
D -> 13            D -> 11
E -> 22            E -> 5
F -> 1             F -> 9
G -> 11            G -> 0
H -> 4             H -> 3
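
The unnormalized scores grow with every iteration, which is what the optional normalization step is meant to control. As a rough sketch, the snippet below applies that step to the Hub scores listed after the 3rd iteration; the helper name `l2_normalize` is hypothetical, and the input values are copied from the table above.

```python
from math import sqrt

# Hub scores after the 3rd iteration, copied from the table above.
hub_scores = {"A": 5, "B": 9, "C": 4, "D": 13, "E": 22, "F": 1, "G": 11, "H": 4}


def l2_normalize(scores):
    """Divide every score by the square root of the sum of squared scores."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)  # nothing to rescale
    return {node: v / norm for node, v in scores.items()}


normalized_hubs = l2_normalize(hub_scores)
for node, score in normalized_hubs.items():
    print(f"{node} -> {score:.4f}")
```

Normalization rescales all scores by the same factor, so the ordering of the nodes is unchanged; only the magnitudes are kept bounded.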

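The Python package NetworkX has a built-in function for running the HITS algorithm. The code below visualizes it with reference to the graph above.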

Python3

# Import the required modules
import networkx as nx
import matplotlib.pyplot as plt

# Build the directed graph from its edge list
G = nx.DiGraph()

G.add_edges_from([('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'),
                  ('D', 'C'), ('E', 'D'), ('E', 'B'), ('E', 'F'),
                  ('E', 'C'), ('F', 'C'), ('F', 'H'), ('G', 'A'),
                  ('G', 'C'), ('H', 'A')])

# Draw the graph with node labels
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, with_labels=True)

# The built-in hits function returns two dictionaries keyed by node,
# holding the hub scores and the authority scores respectively.
hubs, authorities = nx.hits(G, max_iter=50, normalized=True)

print("Hub Scores: ", hubs)
print("Authority Scores: ", authorities)

# Display the figure (needed when running as a script)
plt.show()

Output:

Hub Scores:  {'A': 0.04642540386472174, 'D': 0.133660375232863,
              'B': 0.15763599440595596, 'C': 0.037389132480584515, 
              'E': 0.2588144594158868, 'F': 0.15763599440595596,
              'H': 0.037389132480584515, 'G': 0.17104950771344754}

Authority Scores:  {'A': 0.10864044085687284, 'D': 0.13489685393050574, 
                    'B': 0.11437974045401585, 'C': 0.3883728005172019,
                    'E': 0.06966521189369385, 'F': 0.11437974045401585,
                    'H': 0.06966521189369385, 'G': 0.0}
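
To read the result as a ranking, the two dictionaries can be sorted by score. The sketch below does not call NetworkX; it copies the printed values above into literals, and the helper name `rank` is hypothetical. It also sums each dictionary, which suggests that `normalized=True` rescales the scores so that each set sums to 1 rather than dividing by the square root of the sum of squares.

```python
# Scores copied from the output above.
hubs = {'A': 0.04642540386472174, 'D': 0.133660375232863,
        'B': 0.15763599440595596, 'C': 0.037389132480584515,
        'E': 0.2588144594158868, 'F': 0.15763599440595596,
        'H': 0.037389132480584515, 'G': 0.17104950771344754}

authorities = {'A': 0.10864044085687284, 'D': 0.13489685393050574,
               'B': 0.11437974045401585, 'C': 0.3883728005172019,
               'E': 0.06966521189369385, 'F': 0.11437974045401585,
               'H': 0.06966521189369385, 'G': 0.0}


def rank(scores):
    """Return the nodes ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


hub_ranking = rank(hubs)
authority_ranking = rank(authorities)

print("Best hub:", hub_ranking[0])
print("Best authority:", authority_ranking[0])

# Each set of scores should add up to roughly 1.
hub_total = sum(hubs.values())
authority_total = sum(authorities.values())
print("Sum of hub scores:", round(hub_total, 6))
print("Sum of authority scores:", round(authority_total, 6))
```

On these values E comes out as the strongest hub and C as the strongest authority, while G, which no page links to, has an authority score of 0.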