Scrapy - Requests and Responses



Description

Scrapy can crawl websites using Request and Response objects. Request objects pass through the system, get executed, and return Response objects to the spider that issued them.

Request Objects

A Request object is an HTTP request that generates a response. It has the following class:

class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
   encoding = 'utf-8', priority = 0, dont_filter = False, errback])

The following table shows the parameters of the Request object:

1. url – A string that specifies the URL of the request.
2. callback – A callable function that receives the response of this request as its first parameter.
3. method – A string that specifies the HTTP method of the request.
4. headers – A dictionary with the request headers.
5. body – A string or unicode that holds the request body.
6. cookies – A list containing the request cookies.
7. meta – A dictionary that contains values for the metadata of the request.
8. encoding – A string containing the encoding (default utf-8) used to encode the URL.
9. priority – An integer; the scheduler uses the priority to define the order in which requests are processed.
10. dont_filter – A boolean specifying that the scheduler should not filter the request.
11. errback – A callable function to be called when an exception is raised while processing the request.
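For illustration, here is a minimal sketch that combines several of these parameters in a single Request; the URL, header value, and meta contents are placeholders, not part of the original example:

import scrapy

request = scrapy.Request(
   url = "http://www.something.com/some_page.html",
   method = "GET",
   headers = {"User-Agent": "demo-bot"},   # placeholder header
   meta = {"page_type": "listing"},        # placeholder metadata
   priority = 10,
   dont_filter = True
)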

Passing Additional Data to Callback Functions

The callback function of a request is called when the response of that request is downloaded, receiving the downloaded Response object as its first parameter.

For example:

def parse_page1(self, response): 
   return scrapy.Request("http://www.something.com/some_page.html", 
      callback = self.parse_page2)  

def parse_page2(self, response): 
   self.logger.info("%s page visited", response.url) 

If you want to pass arguments to the callable function and receive those arguments in the second callback, you can use the Request.meta attribute, as shown in the following example:

def parse_page1(self, response): 
   item = DemoItem() 
   item['foremost_link'] = response.url 
   request = scrapy.Request("http://www.something.com/some_page.html", 
      callback = self.parse_page2) 
   request.meta['item'] = item 
   return request  

def parse_page2(self, response): 
   item = response.meta['item'] 
   item['other_link'] = response.url 
   return item

Using errbacks to Catch Exceptions in Request Processing

The errback is a callable function to be called when an exception is raised while processing a request.

The following example demonstrates this:

import scrapy  

from scrapy.spidermiddlewares.httperror import HttpError 
from twisted.internet.error import DNSLookupError 
from twisted.internet.error import TimeoutError, TCPTimedOutError  

class DemoSpider(scrapy.Spider): 
   name = "demo" 
   start_urls = [ 
      "http://www.httpbin.org/",              # HTTP 200 expected 
      "http://www.httpbin.org/status/404",    # Webpage not found  
      "http://www.httpbin.org/status/500",    # Internal server error 
      "http://www.httpbin.org:12345/",        # timeout expected 
      "http://www.httphttpbinbin.org/",       # DNS error expected 
   ]  
   
   def start_requests(self): 
      for u in self.start_urls: 
         yield scrapy.Request(u, callback = self.parse_httpbin, 
         errback = self.errback_httpbin, 
         dont_filter=True)  
   
   def parse_httpbin(self, response): 
      self.logger.info('Received response from {}'.format(response.url)) 
      # ...  
   
   def errback_httpbin(self, failure): 
      # logs failures 
      self.logger.error(repr(failure))  
      
      if failure.check(HttpError): 
         response = failure.value.response 
         self.logger.error("HttpError occurred on %s", response.url)  
      
      elif failure.check(DNSLookupError): 
         request = failure.request 
         self.logger.error("DNSLookupError occurred on %s", request.url) 

      elif failure.check(TimeoutError, TCPTimedOutError): 
         request = failure.request 
         self.logger.error("TimeoutError occurred on %s", request.url) 

Request.meta Special Keys

The request.meta special keys are a set of special meta keys recognized by Scrapy.

The following table shows some of the keys of Request.meta:

1. dont_redirect – When set to true, the request is not redirected based on the status of the response.
2. dont_retry – When set to true, failed requests are not retried and are ignored by the retry middleware.
3. handle_httpstatus_list – Defines, on a per-request basis, which response status codes are allowed.
4. handle_httpstatus_all – When set to true, any response status code is allowed for the request.
5. dont_merge_cookies – When set to true, the request cookies are not merged with the existing cookies.
6. cookiejar – Used to keep multiple cookie sessions per spider.
7. dont_cache – When set to true, the HTTP request and its response are not cached by the cache policy.
8. redirect_urls – Contains the URLs through which the request has passed.
9. bindaddress – The IP of the outgoing address that can be used to perform the request.
10. dont_obey_robotstxt – When set to true, the request is not filtered out by the robots.txt exclusion standard, even if ROBOTSTXT_OBEY is enabled.
11. download_timeout – Sets the timeout (in seconds) that the downloader will wait for this request before it times out.
12. download_maxsize – Sets the maximum response size (in bytes) that the downloader will download for this request.
13. proxy – Sets the HTTP proxy to be used for this request.
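As a brief sketch of how these keys are used in practice, the following passes a few of them through the meta dictionary of a single Request; the URL and proxy address are placeholders:

import scrapy

request = scrapy.Request(
   "http://www.something.com/some_page.html",
   meta = {
      'dont_redirect': True,               # do not follow 3xx redirects
      'handle_httpstatus_list': [404],     # let the callback receive 404 pages
      'proxy': 'http://127.0.0.1:8080'     # send this request through a proxy
   }
)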

Request Subclasses

You can implement your own custom functionality by subclassing the Request class. The built-in Request subclasses are as follows:

FormRequest Objects

The FormRequest class deals with HTML forms by extending the base Request. It has the following class:

class scrapy.http.FormRequest(url[,formdata, callback, method = 'GET', headers, body, 
   cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])

Following is the parameter:

formdata – A dictionary having HTML form data that is assigned to the body of the request.

The remaining parameters are the same as in the Request class and are explained in the Request Objects section.

In addition to the Request methods, FormRequest objects support the following class method:

classmethod from_response(response[, formname = None, formnumber = 0, formdata = None, 
   formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])

The following table shows the parameters of the above class method:

1. response – An object whose HTML form is used to pre-populate the form fields.
2. formname – A string; if specified, the form with the matching name attribute is used.
3. formnumber – An integer index of the form to use when there are multiple forms in the response.
4. formdata – A dictionary of fields that override the data in the form.
5. formxpath – A string; if specified, the form matching the XPath is used.
6. formcss – A string; if specified, the form matching the CSS selector is used.
7. clickdata – A dictionary of attributes used to locate the clicked control.
8. dont_click – When set to true, the form data is submitted without clicking any element.
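For instance, when a page contains several forms, formcss or formnumber selects the right one. The following is a minimal sketch; the spider name, URL, CSS selector, and field name are placeholder assumptions:

import scrapy

class FormDemoSpider(scrapy.Spider):
   name = 'form_demo'
   start_urls = ['http://www.something.com/search.php']

   def parse(self, response):
      # Use the form matching the CSS selector rather than the first form
      return scrapy.FormRequest.from_response(
         response,
         formcss = 'form#search',
         formdata = {'query': 'scrapy'},
         callback = self.after_search
      )

   def after_search(self, response):
      self.logger.info("Search results from %s", response.url)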

Examples

Following are some of the request usage examples:

Using FormRequest to send data via HTTP POST

The following code demonstrates how to return a FormRequest object when you want to replicate an HTML form POST in your spider:

return [FormRequest(url = "http://www.something.com/post/action", 
   formdata = {'firstname': 'John', 'lastname': 'dave'}, 
   callback = self.after_post)]

Using FormRequest.from_response() to simulate a user login

Websites normally provide pre-populated form fields through elements such as <input type="hidden">.

When you want those fields to be filled in automatically while scraping, you can use the FormRequest.from_response() method.

The following example demonstrates this.

import scrapy  
class DemoSpider(scrapy.Spider): 
   name = 'demo' 
   start_urls = ['http://www.something.com/users/login.php']  
   def parse(self, response): 
      return scrapy.FormRequest.from_response( 
         response, 
         formdata = {'username': 'admin', 'password': 'confidential'}, 
         callback = self.after_login 
      )  
   
   def after_login(self, response): 
      if b"authentication failed" in response.body:  # response.body is bytes
         self.logger.error("Login failed") 
         return  
      # You can continue scraping here

Response Objects

A Response object indicates an HTTP response, which is fed to the spiders for processing. It has the following class:

class scrapy.http.Response(url[, status = 200, headers, body, flags])

The following table shows the parameters of the Response object:

1. url – A string that specifies the URL of the response.
2. status – An integer that contains the HTTP status of the response.
3. headers – A dictionary containing the response headers.
4. body – A string with the response body.
5. flags – A list containing the flags of the response.
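Inside a spider callback, these parameters are available as attributes of the received response. A minimal sketch follows; the spider name and start URL are placeholders:

import scrapy

class InspectSpider(scrapy.Spider):
   name = 'inspect'
   start_urls = ['http://www.httpbin.org/']

   def parse(self, response):
      # Each attribute corresponds to a constructor parameter above
      self.logger.info("URL: %s", response.url)
      self.logger.info("Status: %d", response.status)
      self.logger.info("Content-Type: %s", response.headers.get('Content-Type'))
      self.logger.info("Body size: %d bytes", len(response.body))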

Response Subclasses

You can implement your own custom functionality by subclassing the Response class. The built-in Response subclasses are as follows:

TextResponse Objects

TextResponse objects add encoding capability to the base Response class, which is meant only for binary data such as images and sounds. It has the following class:

class scrapy.http.TextResponse(url[, encoding[, status = 200, headers, body, flags]])

Following is the parameter:

encoding – A string with the encoding that is used to encode the response.

The remaining parameters are the same as in the Response class and are explained in the Response Objects section.

The following table shows the attributes supported by TextResponse objects in addition to the Response methods:

1. text – The response body as text, where response.text can be accessed multiple times.
2. encoding – A string containing the encoding of the response.
3. selector – An attribute instantiated on first access that uses the response as its target.

The following table shows the methods supported by TextResponse objects in addition to the Response methods:

1. xpath(query) – A shortcut to TextResponse.selector.xpath(query).
2. css(query) – A shortcut to TextResponse.selector.css(query).
3. body_as_unicode() – The response body available as a method, equivalent to accessing response.text.
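The following sketch builds a TextResponse by hand to show these attributes and shortcuts; the URL and body are placeholder values:

from scrapy.http import TextResponse

response = TextResponse(
   url = "http://www.something.com/",
   body = b"<html><head><title>Demo</title></head></html>",
   encoding = 'utf-8'
)

print(response.text)                            # decoded body
print(response.encoding)                        # 'utf-8'
print(response.xpath('//title/text()').get())   # shortcut to selector.xpath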

HtmlResponse Objects

This object supports encoding and auto-discovery by looking at the meta http-equiv attribute of the HTML. Its parameters are the same as in the Response class and are explained in the Response Objects section. It has the following class:

class scrapy.http.HtmlResponse(url[,status = 200, headers, body, flags])

XmlResponse Objects

This object supports encoding and auto-discovery by looking at the XML declaration line. Its parameters are the same as in the Response class and are explained in the Response Objects section. It has the following class:

class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])
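As an illustration of the auto-discovery described above, the following sketch builds both subclasses by hand; the URLs and bodies are placeholders:

from scrapy.http import HtmlResponse, XmlResponse

html = HtmlResponse(
   url = "http://www.something.com/page.html",
   body = b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=utf-8"></head></html>'
)
print(html.encoding)   # 'utf-8', read from the meta http-equiv attribute

xml = XmlResponse(
   url = "http://www.something.com/feed.xml",
   body = b'<?xml version="1.0" encoding="utf-8"?><items/>'
)
print(xml.encoding)    # 'utf-8', read from the XML declaration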