
How to Scrape All PDF Files from a Website?

Prerequisite: Implementing Web Scraping in Python with BeautifulSoup

Web scraping is a way of extracting data from websites so that the data can be used for other purposes. Several libraries and modules are available for web scraping in Python. In this article, we will learn how to scrape PDF files from a website with the help of beautifulsoup, one of the best web-scraping modules for Python, together with the requests module for GET requests. In addition, we use the PyPDF2 module to get more information about the PDF files.

Step-by-step code:

Step 1: Import all the important modules and packages.

Python3
# for getting the PDF files or URLs
import requests

# for tree-traversal scraping of the web page
from bs4 import BeautifulSoup

# for input and output operations
import io

# for getting information about the PDFs
from PyPDF2 import PdfFileReader



Step 2: Pass the URL and build an HTML parser with the help of BeautifulSoup.

Python3

# website to scrape
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"

# fetch the URL with the requests get method
read = requests.get(url)

# full HTML content of the page
html_content = read.content

# parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")

In the above code:

  • Scraping is done on the https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/ link.
  • The requests module is used to make the GET request.
  • read.content holds the full HTML of the page; printing it would output the page's source code.
  • soup holds the parsed HTML content and is used to navigate and search the HTML (a quick check is shown below).
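
As an optional sanity check, not part of the original code and continuing from the read and soup variables created above, you can confirm that the request succeeded and that the page was parsed:

Python3
# optional check: confirm the GET request succeeded and the page was parsed
print(read.status_code)   # 200 means the request was successful
print(soup.title.string)  # the <title> of the scraped page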

Step 3: We need to traverse the PDF links on the website.

Python3

# create an empty set to collect the PDF links
list_of_pdf = set()

# access the first p tag in the HTML
l = soup.find('p')

# access all the anchor tags inside that p tag
p = l.find_all('a')

# iterate through the anchors to get all the href links
for link in p:

    # original html link
    print("links: ", link.get('href'))
    print("\n")

    # convert the extension from .html to .pdf
    pdf_link = (link.get('href')[:-5]) + ".pdf"

    # converted .pdf link
    print("converted pdf links: ", pdf_link)
    print("\n")

    # add the pdf link to the set
    list_of_pdf.add(pdf_link)


Output:



In the above code:

  • list_of_pdf is an empty set used to collect all the PDF links from the web page. A set is used because it never keeps two elements with the same name and removes duplicates automatically.
  • The loop iterates over all the links and converts the .html extension to .pdf. This works because the PDF name and the HTML name differ only in their extension; everything else is the same.
  • We use a set because we need to get rid of duplicate names. A list could be used instead, in which case we would append every PDF rather than add it (see the short illustration below).
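
As a small, self-contained illustration of that last point (the link names here are made up), this is the difference between collecting the links in a list with append and in a set with add:

Python3
# illustration only: a list keeps duplicates, a set collapses them
links = ["a.pdf", "b.pdf", "a.pdf"]

as_list = []
for x in links:
    as_list.append(x)   # keeps every occurrence, including duplicates

as_set = set()
for x in links:
    as_set.add(x)       # duplicates collapse; only unique links remain

print(as_list)   # ['a.pdf', 'b.pdf', 'a.pdf']
print(as_set)    # {'a.pdf', 'b.pdf'} (order may vary)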

Step 4: Create an info function with the PyPDF2 module to get all the required information about a PDF.

Python3

def info(pdf_path):

    # use the get method to fetch the PDF file
    response = requests.get(pdf_path)

    # response.content holds the raw bytes of the PDF;
    # wrap them in an in-memory binary stream
    with io.BytesIO(response.content) as f:

        # initialize the PDF reader
        pdf = PdfFileReader(f)

        # document info and page count of the PDF
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)

    return information


In the above code:

  • The info function is responsible for producing all the required scraped output for a PDF.
  • io.BytesIO(response.content) is used because response.content is raw binary data and PdfFileReader expects a file-like object, so the bytes are wrapped in an in-memory binary stream with io.BytesIO.
  • There are several PyPDF2 functions that access different kinds of data inside a PDF (a small sketch follows the note below).

Note: Refer to Working with PDF files in Python for more details.
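
For example, besides the document info, the same reader can pull text out of individual pages. The sketch below is a hypothetical helper, not part of the article's code, and it sticks to the same older PyPDF2 interface (PdfFileReader, getPage, extractText) that the rest of the article relies on:

Python3
import io
import requests
from PyPDF2 import PdfFileReader


def first_page_text(pdf_path):
    # fetch the PDF and wrap its bytes in an in-memory stream
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        # getPage(0) returns the first page; extractText() returns its text
        return pdf.getPage(0).extractText()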

Step 5: Print the information of every PDF collected in the set.

Python3

# print the info of every PDF in the console
for i in list_of_pdf:
    info(i)
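
If you also want to save the scraped PDFs to disk rather than only printing their metadata, a minimal sketch (continuing from list_of_pdf above; the filename logic is an assumption and simply reuses the last part of the URL) could look like this:

Python3
# optional: download each scraped PDF to the current directory
for pdf_url in list_of_pdf:
    response = requests.get(pdf_url)
    filename = pdf_url.split("/")[-1] or "downloaded.pdf"
    with open(filename, "wb") as f:
        f.write(response.content)
    print("saved:", filename)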

Complete code:

Python3

import requests
from bs4 import BeautifulSoup
import io
from PyPDF2 import PdfFileReader
 
 
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")
 
list_of_pdf = set()
l = soup.find('p')
p = l.find_all('a')
 
for link in p:
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)
 
def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
 
for i in list_of_pdf:
    info(i)

Output: