📜  Extract all the URLs nested within <li> tags using BeautifulSoup

📅  Last modified: 2022-05-13 01:55:11.902000             🧑  Author: Mango

Extract all the URLs nested within <li> tags using BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. In this article, we will see how to extract all the URLs from a web page that are nested within <li> tags.

Modules needed and installation:

    • BeautifulSoup: Our primary module; it parses the HTML of the page so that we can navigate and search its tags.

    pip install bs4

    • requests: Used to perform a GET request on the web page and fetch its contents.

    pip install requests

    Note: requests does not need to be installed separately; it is downloaded automatically along with bs4. If that causes a problem, you can install it manually with the command above.
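
    Before moving on, it can help to confirm that both packages import cleanly. This quick check is an optional extra, not part of the original article; both packages expose a standard __version__ attribute.

    Python3
    # Optional sanity check: both imports should succeed
    import bs4
    import requests

    print(bs4.__version__, requests.__version__)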

Approach

    1. First, import the required libraries.
    2. Perform a GET request on the desired web page, the one from which we want all the URLs.
    3. Pass the text to the BeautifulSoup function and convert it into a soup object.
    4. Using a for loop, find all the <li> tags in the web page.
    5. If a <li> tag contains an anchor tag, look for the href attribute and store its value in a list; this is the URL we are looking for.
    6. Print the list containing all the URLs.

Let's walk through the code and see what happens at each important step.

Step 1: Initialize the Python program by importing all the required libraries and setting the URL of the web page from which you want all the URLs contained in anchor tags.

    In the example below, we will use another GeeksforGeeks article on implementing web scraping with BeautifulSoup, and extract all the URLs stored in anchor tags nested within <li> tags.

    Article link: https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/



    Python3
    # Importing libraries
    import requests
    from bs4 import BeautifulSoup
      
    # setting up the URL
    URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'


Step 2: We will perform a GET request on the desired URL, pass all the text received from it to BeautifulSoup, and convert it into a soup object. We set the parser to html.parser; you can choose a different parser depending on the web page you are scraping.

    Python3

    # perform get request to the url
    reqs = requests.get(URL)
      
    # extract all the text that you received 
    # from the GET request  
    content = reqs.text
      
    # convert the text to a beautiful soup object
    soup = BeautifulSoup(content, 'html.parser')
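
    It can also help to confirm that the request actually succeeded before parsing. The lines below are an optional addition, not part of the original article; raise_for_status() is a standard requests method that raises an HTTPError for 4xx/5xx responses, and the timeout value is an illustrative choice.

    Python3
    # Optional: fail fast if the page could not be fetched
    reqs = requests.get(URL, timeout=10)
    reqs.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    soup = BeautifulSoup(reqs.text, 'html.parser')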
    

Step 3: Create an empty list to store all the URLs you will receive as your desired output. Run a for loop that iterates over all the <li> tags in the web page. Then, for each <li> tag, check whether it contains an anchor tag. If that anchor tag has an href attribute, store the value of that href in the list you created.

    Python3

    # Empty list to store the output
    urls = []
      
    # For loop that iterates over all the <li> tags
    for h in soup.findAll('li'):
        # looking for an anchor tag inside the <li> tag
        a = h.find('a')
        try:
            # looking for href inside the anchor tag
            if 'href' in a.attrs:
                # storing the value of href in a separate variable
                url = a.get('href')
                # appending the url to the output list
                urls.append(url)
        # if the <li> tag has no anchor tag, a is None and a.attrs
        # raises AttributeError, so we simply skip that tag
        except AttributeError:
            pass
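
    For comparison, the same filtering can be done in a single pass with a CSS selector. This is an optional alternative sketch, not the article's approach: BeautifulSoup's select() method accepts a selector such as li a[href], which matches only anchor tags that carry an href attribute and sit inside an <li> tag.

    Python3
    # Alternative: let a CSS selector do the filtering in one step.
    # 'li a[href]' matches <a> tags with an href attribute that
    # are descendants of an <li> tag.
    urls = [a['href'] for a in soup.select('li a[href]')]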
Step 4: Print the output by iterating over the list of URLs.

    Python3

    # print all the urls stored in the urls list
    for url in urls:
        print(url)
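
    If the page links to the same address from several list items, the output will contain duplicates. As an optional tweak that is not part of the original article, dict.fromkeys() deduplicates the list while preserving first-seen order.

    Python3
    # Optional: drop duplicate URLs while keeping first-seen order
    for url in dict.fromkeys(urls):
        print(url)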
    

    Complete code:

    Python3

    # Importing libraries
    import requests
    from bs4 import BeautifulSoup
      
    # setting up the URL
    URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
      
    # perform get request to the url
    reqs = requests.get(URL)
      
    # extract all the text that you received from
    # the GET request
    content = reqs.text
      
    # convert the text to a beautiful soup object
    soup = BeautifulSoup(content, 'html.parser')
      
    # Empty list to store the output
    urls = []
      
    # For loop that iterates over all the <li> tags
    for h in soup.findAll('li'):
        # looking for an anchor tag inside the <li> tag
        a = h.find('a')
        try:
            # looking for href inside the anchor tag
            if 'href' in a.attrs:
                # storing the value of href in a separate variable
                url = a.get('href')
                # appending the url to the output list
                urls.append(url)
        # if the <li> tag has no anchor tag, a is None and a.attrs
        # raises AttributeError, so we simply skip that tag
        except AttributeError:
            pass
      
    # print all the urls stored in the urls list
    for url in urls:
        print(url)
    Output:
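
    One caveat the article does not cover: href values can be relative (for example, /about). As an optional extra, the standard library's urllib.parse.urljoin resolves them against the page URL; the sketch below reuses the URL and urls variables from the complete code above.

    Python3
    from urllib.parse import urljoin
      
    # Optional: resolve relative hrefs against the page URL
    for url in urls:
        print(urljoin(URL, url))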