📜  How to download all images from a web page in Python?

📅  Last modified: 2022-05-13 01:55:17.074000             🧑  Author: Mango

Prerequisites:

  • Requests
  • Beautiful Soup
  • OS
  • File handling

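Web scraping is a technique for fetching data from websites, that is, the automated extraction of data from web pages. Many websites do not offer a way to save their data for personal use. Copying and pasting it by hand is one option, but it is tedious and time-consuming. In this article, we discuss how to download all the images from a web page using Python.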

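Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python.
  • requests: Requests lets you send HTTP/1.1 requests very easily. This module also does not come built in with Python.
  • os: The os module provides functions for interacting with the operating system. It is part of Python's standard utility modules and offers a portable way of using operating-system-dependent functionality.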
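
Because bs4 and requests are not part of the standard library, they are usually installed first, for example with pip install requests beautifulsoup4. The following is a minimal sketch for checking that the modules can be imported before running the scraper. The helper name missing_modules is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["bs4", "requests", "os"])
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("All required modules are available.")
```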

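Approach

  • Import the modules.
  • Fetch the HTML code of the page.
  • Use the find_all method of Beautiful Soup to get the list of img tags from the HTML code (findAll is an older alias of the same method).
images = soup.find_all('img')
  • Use the mkdir method of os to create a separate folder for the downloaded images.
os.mkdir(folder_name)
  • Iterate over all the images and get the source URL of each one.
  • Once you have the source URL, the last step is to download the image:
  • Get the content of the image.
r = requests.get(source_url).content
  • Use file handling to write the image to disk.
# Enter the file name with its extension, such as jpg or png
with open("File Name", "wb") as f:
      f.write(r)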
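
The step of getting the source URL deserves a closer look, because attribute values are often relative paths or srcset-style lists rather than ready-to-use URLs. The sketch below works on plain dictionaries standing in for the attributes of an img tag, so it runs without any network access. The helper name pick_image_url and the example.com addresses are illustrative assumptions.

```python
from urllib.parse import urljoin

# Attributes checked in order of preference (the order used by this article's program)
SOURCE_ATTRIBUTES = ("data-srcset", "data-src", "data-fallback-src", "src")


def pick_image_url(attrs, page_url):
    """Return an absolute image URL from a dict of <img> attributes, or None."""
    for name in SOURCE_ATTRIBUTES:
        value = (attrs.get(name) or "").strip()
        if name == "data-srcset" and value:
            # srcset-style values look like "a.jpg 1x, b.jpg 2x";
            # keep only the first URL
            value = value.split()[0].rstrip(",")
        if value:
            # urljoin leaves absolute URLs alone and resolves relative ones
            return urljoin(page_url, value)
    return None


page = "https://example.com/gallery/index.html"  # hypothetical page address
print(pick_image_url({"src": "img/cat.png"}, page))
# prints https://example.com/gallery/img/cat.png
print(pick_image_url({"data-srcset": "/a.jpg 1x, /b.jpg 2x", "src": "x.gif"}, page))
# prints https://example.com/a.jpg
print(pick_image_url({"alt": "no source"}, page))
# prints None
```

Passing the result to requests.get only when it is not None avoids reusing a stale link from a previous loop iteration.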

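For the final file-handling step, one option is to derive the file extension from the Content-Type header of the response instead of hard-coding it. The following sketch uses only the standard library and placeholder bytes in place of a real download. The function save_image, its parameters and the fallback extension .bin are assumptions made for illustration.

```python
import mimetypes
import os
import tempfile


def save_image(data, content_type, folder, index):
    """Write `data` into `folder` and return the path of the new file."""
    # Drop parameters such as "; charset=binary" before the lookup
    mime = content_type.split(";")[0].strip().lower()
    # guess_extension returns None for unknown types, so fall back to .bin
    extension = mimetypes.guess_extension(mime) or ".bin"
    # unlike os.mkdir, makedirs with exist_ok=True tolerates an existing folder
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"image{index}{extension}")
    with open(path, "wb") as f:
        f.write(data)
    return path


# Demonstration with placeholder bytes inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b"placeholder bytes", "image/png", tmp, 1)
    print(os.path.basename(saved))
```

In a real run, data would be the content of the response and content_type the value of its Content-Type header.
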
Program:

Python3
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Attributes that may hold the image URL, in order of preference
SOURCE_ATTRIBUTES = ("data-srcset", "data-src", "data-fallback-src", "src")


# CREATE FOLDER
def folder_create():
    # keep asking until a new folder can be created
    while True:
        folder_name = input("Enter Folder Name:- ")
        try:
            os.mkdir(folder_name)
            return folder_name
        # if the folder cannot be created, ask for another name
        except OSError:
            print("Could not create that folder, it may already exist!")


# GET THE SOURCE URL OF ONE IMAGE TAG
def get_image_link(image, page_url):
    for attribute in SOURCE_ATTRIBUTES:
        value = (image.get(attribute) or "").strip()
        # a srcset value lists several candidates ("a.jpg 1x, b.jpg 2x"),
        # so keep only the first URL
        if attribute == "data-srcset" and value:
            value = value.split()[0].rstrip(",")
        if value:
            # resolve relative links against the address of the page
            return urljoin(page_url, value)
    # no source URL found
    return None


# DOWNLOAD ALL IMAGES FROM THAT URL
def download_images(images, folder_name, page_url):
    # initial count is zero
    count = 0

    # print total images found in URL
    print(f"Total {len(images)} Image Found!")
    if not images:
        return

    for i, image in enumerate(images):
        image_link = get_image_link(image, page_url)
        if image_link is None:
            continue

        # try to get the content of the image; skip it on any request error
        try:
            response = requests.get(image_link, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue

        # skip anything not served as an image (an HTML error page, for example)
        content_type = response.headers.get("Content-Type", "").lower()
        if not content_type.startswith("image/"):
            continue

        # every file is saved with a .jpg extension, whatever its real format
        with open(os.path.join(folder_name, f"images{i+1}.jpg"), "wb") as f:
            f.write(response.content)

        # counting number of images downloaded
        count += 1

    # it is possible that not all images were downloaded
    if count == len(images):
        print("All Images Downloaded!")
    else:
        print(f"Total {count} Images Downloaded Out of {len(images)}")


# MAIN FUNCTION START
def main(url):
    # content of URL
    r = requests.get(url, timeout=10)

    # Parse HTML Code
    soup = BeautifulSoup(r.text, "html.parser")

    # find all images in URL
    images = soup.find_all("img")

    # create the folder, then download the images into it
    folder_name = folder_create()
    download_images(images, folder_name, r.url)


if __name__ == "__main__":
    # take url
    url = input("Enter URL:- ")

    # CALL MAIN FUNCTION
    main(url)


Output: