📜  Reading Selected Web Page Content Using Python Web Scraping

📅  Last modified: 2021-10-19 06:19:54             🧑  Author: Mango

Prerequisites: Downloading files with Python, Web scraping with BeautifulSoup

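Python is an easy programming language to pick up, but what makes it really appealing is the large number of open-source libraries written for it. Requests is one of the most widely used of these. It lets us open any HTTP/HTTPS website and do most of what we would normally do in a browser, and it can also preserve a session, i.e. its cookies.
A web page is just HTML code that a web server sends to our browser, which then renders it as a nicely formatted page. Once we have the HTML source, we need a way to find specific tags in it, and that is what the BeautifulSoup package provides.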
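
As a minimal sketch of that idea, the snippet below parses a small hand-written HTML string (a made-up stand-in for the page text that Requests would download) and pulls out specific tags with BeautifulSoup. The markup and variable names are illustrative assumptions and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the text of a downloaded page.
html = """
<html>
  <body>
    <h1>Sample page</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph with a <a href="/more">link</a>.</p>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; find_all() returns every match.
heading = soup.find("h1").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Tags can also be filtered by attribute, here by CSS class.
intro = soup.find("p", {"class": "intro"}).get_text()

# Attributes of a tag are read like dictionary keys.
link_targets = [a["href"] for a in soup.find_all("a")]

print(heading)
print(paragraphs)
print(intro)
print(link_targets)
```

The same `find` and `find_all` calls work unchanged on HTML obtained from a real request, since BeautifulSoup only needs the page text.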
Installation:

pip3 install requests
pip3 install beautifulsoup4

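As an example, we will read the headlines from the news website Hindustan Times.

The code can be divided into three parts.

  • Requesting the web page
  • Inspecting the tags
  • Printing the appropriate content

Steps: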

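  1. Request a web page: First, right-click on the news text in the browser and view the page source.
  2. Inspect the tags: We need to work out which part of the source contains the news section we want to scrape. It sits under a ul (unordered list) element whose class is "searchNews"; this element contains the news section.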

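    Note that the news text is the text part of the anchor tags. A closer look shows that every news item is in an li (list item) tag inside that unordered list.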

  3. Print the appropriate content: Print the content with the help of the code given below.
    import requests
    from bs4 import BeautifulSoup


    def news():
        # The page we want to open
        url = 'http://www.hindustantimes.com/top-news'

        # Fetch the page with an HTTP GET request
        resp = requests.get(url)

        # HTTP status code 200 means OK
        if resp.status_code == 200:
            print("Successfully opened the web page")
            print("The news are as follow :-\n")

            # Python's built-in HTML parser is sufficient here
            soup = BeautifulSoup(resp.text, 'html.parser')

            # news_list is the <ul class="searchNews"> element that holds the news
            news_list = soup.find("ul", {"class": "searchNews"})

            # Print only the text part of each anchor (<a>) tag in the list
            for link in news_list.find_all("a"):
                print(link.text)
        else:
            print("Error")


    news()
    

    Output

    Successfully opened the web page
    The news are as follow :-
    Govt extends toll tax suspension, use of old notes for utility bills extended till Nov 14
    Modi, Abe seal historic civil nuclear pact: What it means for India
    Rahul queues up at bank, says it is to show solidarity with common man
    IS kills over 60 in Mosul, victims dressed in orange and marked 'traitors'
    Rock On 2 review: Farhan Akhtar, Arjun Rampal's band hasn't lost its magic
    Rumours of shortage in salt supply spark panic among consumers in UP
    Worrying truth: India ranks first in pneumonia, diarrhoea deaths among kids
    To hell with romance, here's why being single is the coolest way to be
    India vs England: Cheteshwar Pujara, Murali Vijay make merry with tons in Rajkot
    Akshay-Bhumi, SRK-Alia, Ajay-Parineeti: Age difference doesn't matter anymore
    Currency ban: Only one-third have bank access; NE, backward regions worst hit
    Nepal's central bank halts transactions with Rs 500, Rs 1000 Indian notes
    Political upheaval in Punjab after SC tells it to share Sutlej water
    Let's not kid ourselves, with Trump, what we have seen is what we will get
    Want to colour your hair? Try rose gold, the hottest hair trend this winter
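
The script above assumes that the `searchNews` list is present on the page, and the site's markup may well have changed since the output above was captured. The following is a hedged sketch, not the author's code, that separates parsing from downloading so the parsing step can be tried on any HTML string. The function names `extract_headlines` and `fetch_headlines`, the 10-second timeout, and the sample markup are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_headlines(html, list_class="searchNews"):
    """Return the anchor texts found inside <ul class=list_class>."""
    soup = BeautifulSoup(html, "html.parser")
    news_list = soup.find("ul", class_=list_class)
    # find() returns None when no such list exists, so guard against it.
    if news_list is None:
        return []
    return [a.get_text(strip=True) for a in news_list.find_all("a")]


def fetch_headlines(url, timeout=10):
    """Download url and return its headlines; raises on HTTP errors."""
    resp = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses.
    resp.raise_for_status()
    return extract_headlines(resp.text)


# Hypothetical markup shaped like the structure described in step 2.
sample = """
<ul class="searchNews">
  <li><a href="/story-1">First headline</a></li>
  <li><a href="/story-2"> Second headline </a></li>
</ul>
<ul class="other"><li><a href="/x">Not news</a></li></ul>
"""

print(extract_headlines(sample))
print(extract_headlines("<p>No list here</p>"))
```

Calling `fetch_headlines('http://www.hindustantimes.com/top-news')` would apply the same parsing to the live page; whether it still returns anything depends on the site's current markup.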
    

References

  • Requests
  • Beautiful Soup
  • HTTP status codes