How to Scrape Nested Tags Using BeautifulSoup?

Last modified: 2022-05-13 01:54:37.572000             Author: Mango

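Nested tags can be reached with the . (dot) operator. Once the soup for a page has been created, chaining tag names with dots walks down the parse tree from an outer tag to the tags nested inside it. To scrape nested tags with BeautifulSoup, follow the steps below.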

Step-by-step approach

Step 1: Import the BeautifulSoup class from the bs4 module to parse the page, and the requests module to fetch the page from the website.

from bs4 import BeautifulSoup
import requests

Step 2: Request the URL by calling the get method of requests.

page = requests.get(sample_website)

Step 3: Create the soup by passing the page content to BeautifulSoup together with the HTML parser, which builds the HTML parse tree.

soup = BeautifulSoup(page.content, 'html.parser')
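
To see the parse tree without a network request, the same constructor can be given an HTML string directly. This is a minimal sketch; the markup in html_doc is made up for illustration and is not taken from the sample website.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string       # text inside <title>
first_paragraph = soup.p.get_text()  # text inside the first <p>

print(title_text)       # Demo
print(first_paragraph)  # Hello
```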

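Step 4: Apply the . operator repeatedly until you reach the tag you want to scrape. For example, to scrape a tag nested inside table, which is itself inside body, use the statement below, where tag stands for the name of the target tag.

soup.body.table.tag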
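
The chain from Step 4 can be tried offline. The sketch below assumes a small, made-up HTML string (html_doc) with a table nested in the body. Each dot step returns the first matching descendant, not only a direct child, which is the same result as calling find() with that tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_doc = """
<html><body>
  <table>
    <tr><td>first cell</td><td>second cell</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# body -> table -> td: each step picks the first matching descendant,
# so td is found even though it sits inside a tr
cell = soup.body.table.td
print(cell)  # <td>first cell</td>

# The dot form is shorthand for find() with the tag name
same_cell = soup.find("body").find("table").find("td")
print(cell is same_cell)  # True
```

Because only the first match is returned at each step, the second td is never reached this way; find_all("td") would be needed to collect every cell.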

Implementation

The examples below show how to scrape different nested tags from a specific URL.

Example 1:

Python3
from bs4 import BeautifulSoup
import requests
 
# sample website
sample_website = 'https://www.geeksforgeeks.org/different-ways-to-remove-all-the-digits-from-string-in-java/'
 
# call get method to request the page
page = requests.get(sample_website)
 
# create the soup with BeautifulSoup and the built-in HTML parser
soup = BeautifulSoup(page.content, 'html.parser')

# Use the . operator to scrape a tag along the path body -> ul -> i:
# take the first ul tag inside body, then the first i tag inside that ul
print(soup.body.ul.i)


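Output: the first i tag found inside the first ul tag of the page body, printed as HTML. The exact markup depends on the live page.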

Example 2:

Python3

from bs4 import BeautifulSoup
import requests
 
# sample website
sample_website = 'https://www.geeksforgeeks.org/different-ways-to-remove-all-the-digits-from-string-in-java/'
 
# call get method to request the page
page = requests.get(sample_website)
 
# create the soup with BeautifulSoup and the built-in HTML parser
soup = BeautifulSoup(page.content, 'html.parser')

# Use the . operator to scrape a tag along the path body -> a:
# take the first a tag found inside body
print(soup.body.a)

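Output: the first a tag found inside the page body, printed as HTML. The exact markup depends on the live page.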

Example 3:

Python3

from bs4 import BeautifulSoup
import requests
 
# sample website
sample_website = 'https://www.geeksforgeeks.org/different-ways-to-remove-all-the-digits-from-string-in-java/'
 
# call get method to request the page
page = requests.get(sample_website)
 
# create the soup with BeautifulSoup and the built-in HTML parser
soup = BeautifulSoup(page.content, 'html.parser')

# Use the . operator to scrape a tag along the path body -> a:
# take the first a tag found inside body
print(soup.body.a)

# Use the . operator to scrape a tag along the path body -> ul -> li -> a:
# take the first ul inside body, the first li inside that ul,
# and the first a inside that li
print(soup.body.ul.li.a)

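Output: the first a tag inside the page body, followed by the first a tag along the path body -> ul -> li, both printed as HTML. The exact markup depends on the live page.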
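
One caveat with the dot chains used in these examples: if a tag along the path is missing, that step evaluates to None, and the next dot raises AttributeError. The sketch below shows two possible alternatives on a made-up HTML string. find_nested is a hypothetical helper name, not part of BeautifulSoup, that stops at the first missing tag; select_one() is BeautifulSoup's CSS-selector lookup, used here with a descendant selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html_doc = """
<html><body>
  <ul>
    <li><a href="/first">First link</a></li>
    <li><a href="/second">Second link</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def find_nested(root, *tag_names):
    """Follow tag_names one by one; return None as soon as one is missing."""
    node = root
    for name in tag_names:
        if node is None:
            return None
        node = node.find(name)
    return node


# Same path as soup.body.ul.li.a
link = find_nested(soup, "body", "ul", "li", "a")

# There is no table in html_doc, so this returns None instead of raising
missing = find_nested(soup, "body", "table", "td")

# CSS descendant selector for the same path; returns the first match or None
css_link = soup.select_one("body ul li a")

print(link["href"])      # /first
print(missing)           # None
print(css_link is link)  # True
```

Either form lets the caller check for None once at the end instead of guarding every step of the chain.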