How to extract script and CSS files from web pages in Python?

Last modified: 2022-05-13 01:55:26.571000             Author: Mango

Prerequisites:

  • Requests
  • Beautiful Soup
  • File handling in Python

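In this article, we discuss how to extract script and CSS files from a web page using Python.

To do this, we locate the CSS and JavaScript files that the page's source code references. First, determine the URL of the page to scrape and send a request to it. Once the page content is retrieved, the files of each type can be collected separately (for example, in one folder per file type) and then processed as needed.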

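Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. It is not part of the Python standard library and has to be installed separately (pip install beautifulsoup4).
  • requests: Requests lets you send HTTP/1.1 requests very easily. It is also not part of the standard library (pip install requests).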
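
Before sending any request, it can help to see what Beautiful Soup returns for the two tag types. The following is a minimal sketch on a made-up HTML snippet (the markup and file paths are assumptions for illustration, not taken from a real site). It shows that inline scripts have no src attribute, and that not every link tag points to a stylesheet.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, used only for illustration
html = """
<html><head>
  <link rel="stylesheet" href="/css/main.css">
  <link rel="icon" href="/favicon.ico">
  <script src="/js/app.js"></script>
  <script>console.log("inline");</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# External scripts carry a 'src' attribute; inline scripts do not
script_srcs = [s.get("src") for s in soup.find_all("script") if s.get("src")]

# 'rel' is a multi-valued attribute, so Beautiful Soup returns it as a list
stylesheet_hrefs = [
    link.get("href")
    for link in soup.find_all("link")
    if "stylesheet" in (link.get("rel") or []) and link.get("href")
]

print(script_srcs)
print(stylesheet_hrefs)
```

Checking rel for "stylesheet" keeps icons and other link types out of the CSS list.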

Example 1:

Here we count the number of links fetched for each file type.



Python3
# Import required libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Web URL
web_url = "https://www.geeksforgeeks.org/"

# Get the HTML content
html = requests.get(web_url, timeout=10).content

# Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

js_files = []
cs_files = []

for script in soup.find_all("script"):
    # Only external scripts have a 'src' attribute
    url = script.attrs.get("src")
    if url:
        # urljoin resolves relative paths against the page URL
        # and leaves absolute URLs unchanged
        js_files.append(urljoin(web_url, url))

for css in soup.find_all("link"):
    # Keep only <link rel="stylesheet"> tags that have
    # an 'href' attribute
    _url = css.attrs.get("href")
    if _url and "stylesheet" in (css.attrs.get("rel") or []):
        cs_files.append(urljoin(web_url, _url))

print(f"Total {len(js_files)} javascript files found")
print(f"Total {len(cs_files)} CSS files found")


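Output:

The script prints two lines of the form "Total N javascript files found" and "Total N CSS files found", where each N depends on the page content at the time of the request.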
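
The src and href values found on a page can be relative, root-relative, protocol-relative, or already absolute, so appending them to the page URL as plain strings does not always produce a valid address. The sketch below shows how urllib.parse.urljoin from the standard library resolves each case. The sample attribute values and the cdn.example.com host are assumptions chosen only for illustration.

```python
from urllib.parse import urljoin

# Page the links were found on (same URL as in the examples)
page_url = "https://www.geeksforgeeks.org/"

# Hypothetical attribute values, one per case
samples = [
    "static/app.js",                   # relative path
    "/static/app.js",                  # root-relative path
    "//cdn.example.com/lib.js",        # protocol-relative URL
    "https://cdn.example.com/lib.js",  # already absolute
]

# Resolve every value against the page URL
resolved = [urljoin(page_url, s) for s in samples]

for s, r in zip(samples, resolved):
    print(f"{s} -> {r}")
```

Plain string concatenation would only give a usable result for the first sample here.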

We can also use file handling to write the fetched links to text files.

Example 2:

Python3

# Import required libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Web URL
web_url = "https://www.geeksforgeeks.org/"

# Get the HTML content
html = requests.get(web_url, timeout=10).content

# Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

js_files = []
cs_files = []

for script in soup.find_all("script"):
    # Only external scripts have a 'src' attribute
    url = script.attrs.get("src")
    if url:
        # urljoin resolves relative paths against the page URL
        # and leaves absolute URLs unchanged
        js_files.append(urljoin(web_url, url))

for css in soup.find_all("link"):
    # Keep only <link rel="stylesheet"> tags that have
    # an 'href' attribute
    _url = css.attrs.get("href")
    if _url and "stylesheet" in (css.attrs.get("rel") or []):
        cs_files.append(urljoin(web_url, _url))

# Write the links to text files, one URL per line
with open("javascript_files.txt", "w") as f:
    for js_file in js_files:
        print(js_file, file=f)

with open("css_files.txt", "w") as f:
    for css_file in cs_files:
        print(css_file, file=f)

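Output:

The script prints nothing. It creates two text files in the working directory, one for the JavaScript links and one for the CSS links, each containing one URL per line.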
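
The examples above only collect links. To actually place the files into one folder per type, as described in the introduction, something like the following sketch could be used. The helper names local_name, save_asset and download_all are hypothetical and not part of the original article. The download step is passed in as a fetch function so that the saving logic can be tried without network access.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlparse


def local_name(url, default="index"):
    """Derive a file name from the last path segment of a URL."""
    # The query string (e.g. ?ver=2) is not part of the path and is dropped
    return posixpath.basename(urlparse(url).path) or default


def save_asset(url, data, folder):
    """Write downloaded bytes into `folder`, creating it if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / local_name(url)
    target.write_bytes(data)
    return target


def download_all(urls, folder, fetch):
    """Fetch every URL with `fetch(url) -> bytes` and save it into `folder`."""
    return [save_asset(u, fetch(u), folder) for u in urls]
```

With the lists from Example 2, a call could look like download_all(js_files, "js", lambda u: requests.get(u, timeout=10).content), and likewise with cs_files and a "css" folder. Note that two URLs ending in the same file name would overwrite each other in this sketch.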