requests-html - Python (1)

📅 Last modified: 2023-12-03 15:34:42.760000 🧑 Author: Mango

Requests-HTML - Python

Requests-HTML is a Python package built on Requests and Pyppeteer that simplifies fetching data from websites.

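Features

Wraps HTTP requests and responses in an easy-to-use interface.
Parses HTML pages so you can search and extract content with CSS selectors or XPath.
Integrates Pyppeteer, so a page's JavaScript can be executed in a headless browser.
Compatible with Requests: HTMLSession subclasses requests.Session, so session-based Requests code can switch to it directly.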

Installation

Requests-HTML can be installed with pip:

pip install requests-html

Usage

Sending a request

from requests_html import HTMLSession

session = HTMLSession()

response = session.get('https://www.python.org')
print(response.html)

Parsing an HTML page

from requests_html import HTMLSession

session = HTMLSession()

response = session.get('https://www.python.org')
title = response.html.find('title', first=True).text
print(title)

Executing JavaScript in a headless browser

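from requests_html import HTMLSession

session = HTMLSession()

response = session.get('https://dynamicwebscraper.com/blog/')

# Links present in the HTML as served, before any JavaScript runs
for link in response.html.links:
    if '/blog/' in link and '/page/' not in link:
        print(link)

# Run the page's JavaScript in headless Chromium (downloaded on first use),
# then read the links from the rendered DOM
response.html.render()
for link in response.html.find('a'):
    href = link.attrs.get('href', '')  # some <a> tags have no href
    if '/blog/' in href and '/page/' not in href:
        print(href)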

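Summary

Requests-HTML is a capable Python package for web scraping: it sends HTTP requests, parses HTML pages, wraps Pyppeteer, and is compatible with Requests. If you are looking for an easy-to-use tool for fetching web page data, Requests-HTML is a strong choice.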