📜  Scrapy − Spiders

📅  Last modified: 2020-10-31 14:32:13             🧑  Author: Mango


Description

Spider is the class responsible for defining how to follow the links through a website and extract the information from the pages.

The default spiders of Scrapy are as follows −

scrapy.Spider

It is the spider from which every other spider must inherit. It has the following class −

class scrapy.spiders.Spider

The following table shows the fields of the scrapy.Spider class −

Sr.No   Field & Description

1. name − It is the name of your spider.

2. allowed_domains − It is a list of domains on which the spider crawls.

3. start_urls − It is a list of URLs from which the spider begins to crawl; they act as the roots for the later crawls.

4. custom_settings − These settings override the project-wide configuration when the spider is run.

5. crawler − It is an attribute that links to the Crawler object to which the spider instance is bound.

6. settings − These are the settings for running a spider.

7. logger − It is a Python logger used to send log messages.

8. from_crawler(crawler, *args, **kwargs) − It is a class method which creates your spider. The parameters are −

   • crawler − A crawler to which the spider instance will be bound.

   • args (list) − These arguments are passed to the method __init__().

   • kwargs (dict) − These keyword arguments are passed to the method __init__().

9. start_requests() − When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method.

10. make_requests_from_url(url) − It is a method used to convert URLs to requests.

11. parse(response) − This method processes the response and returns the scraped data along with more URLs to follow.

12. log(message[, level, component]) − It is a method that sends a log message through the spider's logger.

13. closed(reason) − This method is called when the spider closes.
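For reference, a minimal spider built directly on scrapy.Spider might look like the sketch below; the domain, URLs and XPath expressions are assumptions used only for illustration −

import scrapy

class MinimalSpider(scrapy.Spider):
   name = "minimal"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   def parse(self, response):
      # Yield the text and href of every link on the page (illustrative XPaths)
      for link in response.xpath("//a"):
         yield {
            "title": link.xpath("text()").extract_first(),
            "link": link.xpath("@href").extract_first(),
         }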

Spider Arguments

Spider arguments are used to specify start URLs and are passed using the crawl command with the -a option, shown as follows −

scrapy crawl first_scrapy -a group=accessories

The following code demonstrates how a spider receives arguments −

import scrapy 

class FirstSpider(scrapy.Spider): 
   name = "first" 
   
   def __init__(self, group=None, *args, **kwargs): 
      super(FirstSpider, self).__init__(*args, **kwargs) 
      self.start_urls = ["http://www.example.com/group/%s" % group]
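
With the command shown above, for example, the spider's start_urls becomes −

["http://www.example.com/group/accessories"]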

Generic Spiders

You can use generic spiders to subclass your spiders from. Their aim is to follow all the links on the website based on certain rules to extract data from all the pages.

For the examples used in the following spiders, let's assume we have a project with the following fields −

import scrapy 
from scrapy.item import Item, Field 
  
class First_scrapyItem(scrapy.Item): 
   product_title = Field() 
   product_link = Field() 
   product_description = Field() 

CrawlSpider

CrawlSpider defines a set of rules to follow the links and scrape more than one page. It has the following class −

class scrapy.spiders.CrawlSpider

Following are the attributes of the CrawlSpider class −

rules

It is a list of rule objects that defines how the crawler follows the links.

The following table shows the parameters of a CrawlSpider rule −

Sr.No   Parameter & Description

1. LinkExtractor − It specifies how the spider follows the links and extracts the data.

2. callback − It is to be called after each page is scraped.

3. follow − It specifies whether to continue following links or not.

parse_start_url(response)

It returns either an item or a request object by parsing the initial responses.

Note − While writing the rules, make sure you name the callback function something other than parse, because the parse function is used by CrawlSpider to implement its logic.

Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all the pages and links, and parsing them with the parse_item method −

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      # Follow pagination links and scrape every page they lead to
      Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='next']",)),
         callback="parse_item", follow=True),
   )

   def parse_item(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("//a/text()").extract()
      item["product_link"] = response.xpath("//a/@href").extract()
      item["product_description"] = response.xpath("//div[@class='desc']/text()").extract()
      return item

XMLFeedSpider

It is the base class for spiders that scrape from XML feeds and iterate over the nodes. It has the following class −

class scrapy.spiders.XMLFeedSpider

The following table shows the class attributes used to set an iterator and a tag name −

Sr.No   Attribute & Description

1. iterator − It defines the iterator to be used. It can be either iternodes, html or xml. The default is iternodes.

2. itertag − It is a string with the name of the node to iterate over.

3. namespaces − It is a list of (prefix, uri) tuples that automatically registers namespaces using the register_namespace() method.

4. adapt_response(response) − It receives the response as soon as it arrives from the spider middleware and can modify the response body before the spider starts parsing it.

5. parse_node(response, selector) − It is called for each node matching the provided tag name and receives the response and a selector. Note − Your spider won't work if you don't override this method.

6. process_results(response, results) − It receives the response and a list of results returned by the spider, and allows performing any last-minute processing on the results before they are returned.
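
As a minimal sketch, an XMLFeedSpider that reads a product feed might look as follows; the feed URL and the node and child tag names (product, title, link, description) are assumptions used only for illustration, while DemoItem is the item class used in the other examples −

from scrapy.spiders import XMLFeedSpider
from demoproject.items import DemoItem

class DemoXMLSpider(XMLFeedSpider):
   name = "demo_xml"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.xml"]
   iterator = "iternodes"   # default iterator
   itertag = "product"      # iterate over each <product> node

   def parse_node(self, response, node):
      # Build one item per <product> node (illustrative child tags)
      item = DemoItem()
      item["product_title"] = node.xpath("title/text()").extract_first()
      item["product_link"] = node.xpath("link/text()").extract_first()
      item["product_description"] = node.xpath("description/text()").extract_first()
      return item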

CSVFeedSpider

It receives a CSV file as the response, iterates through each of its rows, and calls the parse_row() method. It has the following class −

class scrapy.spiders.CSVFeedSpider

The following table shows the options that can be set regarding the CSV file −

Sr.No   Option & Description

1. delimiter − It is the string used to separate the fields; it defaults to a comma (',').

2. quotechar − It is the string used to quote the fields; it defaults to a double quotation mark ('"').

3. headers − It is a list of the column names in the CSV file, from which the fields are extracted.

4. parse_row(response, row) − It receives the response and each row as a dict keyed by the headers.

CSVFeedSpider Example

from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem  

class DemoSpider(CSVFeedSpider): 
   name = "demo" 
   allowed_domains = ["www.demoexample.com"] 
   start_urls = ["http://www.demoexample.com/feed.csv"] 
   delimiter = ";" 
   quotechar = "'" 
   headers = ["product_title", "product_link", "product_description"]  
   
   def parse_row(self, response, row): 
      self.logger.info("This is row: %r", row)  
      item = DemoItem() 
      item["product_title"] = row["product_title"] 
      item["product_link"] = row["product_link"] 
      item["product_description"] = row["product_description"] 
      return item

SitemapSpider

With the help of Sitemaps, SitemapSpider crawls a website by locating the URLs from robots.txt. It has the following class −

class scrapy.spiders.SitemapSpider

The following table shows the fields of SitemapSpider −

Sr.No   Field & Description

1. sitemap_urls − A list of URLs pointing to the sitemaps which you want to crawl.

2. sitemap_rules − It is a list of tuples (regex, callback), where regex is a regular expression and callback is used to process the URLs matching that regular expression.

3. sitemap_follow − It is a list of regexes of the sitemaps to follow.

4. sitemap_alternate_links − It specifies whether alternate links for a single url should be followed.

SitemapSpider Example

The following SitemapSpider processes all the URLs −

from scrapy.spiders import SitemapSpider  

class DemoSpider(SitemapSpider):
   name = "demo"
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   def parse(self, response):
      # You can scrape items here
      pass

The following SitemapSpider processes some URLs with the callback −

from scrapy.spiders import SitemapSpider  

class DemoSpider(SitemapSpider):
   name = "demo"
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   sitemap_rules = [
      ("/item/", "parse_item"),
      ("/group/", "parse_group"),
   ]

   def parse_item(self, response):
      # You can scrape items here
      pass

   def parse_group(self, response):
      # You can scrape groups here
      pass

The following code shows the sitemaps in robots.txt whose URL contains /sitemap_company −

from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   name = "demo"
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   sitemap_follow = ["/sitemap_company"]

   def parse_company(self, response):
      # You can scrape companies here
      pass

You can even combine SitemapSpider with other URLs, as shown in the following code.

import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   name = "demo"
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   other_urls = ["http://www.demoexample.com/contact-us"]

   def start_requests(self):
      requests = list(super(DemoSpider, self).start_requests())
      requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
      return requests

   def parse_company(self, response):
      # You can scrape companies here
      pass

   def parse_other(self, response):
      # You can scrape other pages here
      pass