Extracting feed details from RSS in Python


In this article, we will see how to extract the feed and post details of a Hashnode blog using its RSS feed. Although we use it for a blog hosted on Hashnode, the same approach works for other feeds as well.

RSS stands for Rich Site Summary. It is used to publish frequently changing information, such as blog posts, news, audio, and video, in a standard web format. An RSS document is usually called a feed and consists of text plus metadata such as the publication time and the author's name.

Installing feedparser:

We will use the feedparser Python library to parse the blog's RSS feed. It is a very popular library for parsing blog feeds.

pip install feedparser
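
If you want to verify the installation, you can print the installed version (a quick, optional check; recent releases of feedparser expose a __version__ attribute):

Python3
import feedparser

# print the installed feedparser version
print(feedparser.__version__)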

Let's understand this step by step:

Step 1: Get the RSS feed

Use the feedparser.parse() function to create a feed object containing the parsed blog. It takes the URL of the blog feed as its argument.

Python3
# import the feedparser library
import feedparser

# url of the blog feed
feed_url = "https://vaibhavkumar.hashnode.dev/rss.xml"

blog_feed = feedparser.parse(feed_url)

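As a quick sanity check (not part of the original steps), the object returned by feedparser.parse() carries a bozo flag that is set when the feed could not be parsed cleanly and, when the feed was fetched over HTTP, a status attribute with the HTTP status code. A minimal sketch:

Python3
# bozo is truthy when the feed was malformed or could not be fetched/parsed
if blog_feed.bozo:
    print("Problem parsing the feed:", blog_feed.bozo_exception)

# status is present only when the feed was fetched over HTTP
print(getattr(blog_feed, "status", "no HTTP status available"))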


Step 2: Get the details from the blog.

Python3

# print the title of the blog site
print(blog_feed.feed.title)

# print the link of the blog site
# and the number of entries (posts) in the feed
print(blog_feed.feed.link)
print(len(blog_feed.entries))

# details of an individual post can be
# accessed via its attribute names
print(blog_feed.entries[0].title)
print(blog_feed.entries[0].link)
print(blog_feed.entries[0].author)
print(blog_feed.entries[0].published)

# getting the lists of tags and authors of the first post
tags = [tag.term for tag in blog_feed.entries[0].tags]
authors = [author.name for author in blog_feed.entries[0].authors]
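
Not every feed fills in every field. Since feedparser entries behave like dictionaries, optional fields can be read with .get() and a default value instead of risking an AttributeError; a short sketch, reusing blog_feed from above:

Python3
first_post = blog_feed.entries[0]

# .get() returns the default instead of raising when a field is missing
author = first_post.get("author", "unknown author")
summary = first_post.get("summary", "")

print(author)
print(summary[:100])  # only the first 100 characters of the summary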

Below is the complete implementation: using the code above, write a function that takes the link of an RSS feed and returns the post details.

Python3

def get_posts_details(rss=None):
    """
    Takes the link of an RSS feed as argument
    and returns the blog and post details.
    """
    if rss is not None:

        # import the library only when a feed url is passed
        import feedparser

        # parsing the blog feed
        blog_feed = feedparser.parse(rss)

        # getting the list of blog entries via .entries
        posts = blog_feed.entries

        # dictionary for holding the post details
        posts_details = {"Blog title": blog_feed.feed.title,
                         "Blog link": blog_feed.feed.link}

        post_list = []

        # iterating over individual posts
        for post in posts:
            temp = dict()

            # if a post is missing an attribute,
            # skip the remaining fields for that post
            try:
                temp["title"] = post.title
                temp["link"] = post.link
                temp["author"] = post.author
                temp["time_published"] = post.published
                temp["tags"] = [tag.term for tag in post.tags]
                temp["authors"] = [author.name for author in post.authors]
                temp["summary"] = post.summary
            except AttributeError:
                pass

            post_list.append(temp)

        # storing the list of posts in the dictionary
        posts_details["posts"] = post_list

        return posts_details  # returning the details as a dictionary
    else:
        return None


if __name__ == "__main__":
    import json

    feed_url = "https://vaibhavkumar.hashnode.dev/rss.xml"

    # returns the blog data as a dictionary
    data = get_posts_details(rss=feed_url)

    if data:
        # printing as a JSON string with indentation level = 2
        print(json.dumps(data, indent=2))
    else:
        print("None")

Output:
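
The dictionary returned by get_posts_details() can also be persisted for later use; below is a minimal sketch that writes it to a JSON file (the file name posts.json is only an example):

Python3
import json

data = get_posts_details(rss="https://vaibhavkumar.hashnode.dev/rss.xml")

if data:
    # write the post details to a JSON file with 2-space indentation
    with open("posts.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)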