
📅  Last modified: 2023-12-03 15:34:52.242000             🧑  Author: Mango

Scrapy XPath: Following a rel="next" Link

Introduction

Scrapy is a fast, powerful, open-source web crawling framework that lets developers scrape data from a wide range of sources quickly and easily. One of its most useful features is data extraction with XPath selectors. In this article, we will show how to use Scrapy's XPath selectors to navigate through a paginated list by following its rel="next"-style link.

Requirements

You will need the following tools to follow this tutorial:

  • Python 3 or higher
  • Scrapy 2.5.0 or higher
Step-by-Step Guide
1. Create a Scrapy project

The first step is to create a Scrapy project. You can create a new project using the scrapy startproject command:

scrapy startproject scrapy_xpath_rel_next
2. Create a spider

Next, create a spider using the scrapy genspider command. In this example, we will create a spider named quotes to scrape quotes from the Quotes to Scrape website:

scrapy genspider quotes quotes.toscrape.com
3. Define the spider settings

Open the quotes.py file in your favorite code editor and define the spider settings. The idiomatic way is to declare start_urls and allowed_domains as class attributes (setting them in __init__ also works, but then you must remember to call super().__init__() first):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']
4. Extract data from the first page

To extract data from the first page, we will define a parse method in the spider. We will use XPath selectors to extract the quotes and authors from the HTML response:

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    
    # Define start_urls and allowed_domains
    
    def parse(self, response):
        # Extract the quote and author using Xpath selectors
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'quote': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//span/small/text()').get()
            }
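If you want to experiment with these XPath expressions outside a running spider, the standard library's xml.etree.ElementTree supports a limited XPath subset that covers this example. This is only a sketch against a hand-written, well-formed stand-in for the site's markup; Scrapy itself uses the more complete parsel/lxml engine:

```python
import xml.etree.ElementTree as ET

# Minimal, well-formed stand-in for one quote block on quotes.toscrape.com
html = """
<html><body>
  <div class="quote">
    <span class="text">&#8220;Be yourself.&#8221;</span>
    <span>by <small class="author">Oscar Wilde</small></span>
  </div>
</body></html>
"""

root = ET.fromstring(html)
for quote in root.findall(".//div[@class='quote']"):
    # Same expressions as in the spider, minus text(), which
    # ElementTree replaces with the .text attribute
    text = quote.find(".//span[@class='text']").text
    author = quote.find(".//span/small").text
    print(text, '-', author)
```

Note that ElementTree requires well-formed XML and only a subset of XPath, so this is for quick experimentation only, not for scraping real pages.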
5. Extract the URL of the next page

To navigate to the next page of quotes, we need to extract the URL of the next-page link. Sites that implement the rel="next" convention expose it directly as a[rel="next"]; on quotes.toscrape.com the link sits inside an li element with class next. Since this article is about XPath, we extract it with response.xpath (the CSS equivalent would be response.css('li.next a::attr(href)')):

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    
    # Define start_urls and allowed_domains
    
    def parse(self, response):
        # Extract the quote and author using Xpath selectors
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'quote': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//span/small/text()').get()
            }
        
        # Extract the URL of the next page (on sites that mark pagination
        # with rel="next", '//a[@rel="next"]/@href' works directly)
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
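The same standard-library XPath subset can illustrate the next-page extraction. The snippet below uses hand-written pagination markup that combines the li.next structure quotes.toscrape.com actually uses with an explicit rel="next" attribute, as some sites provide; it is a sketch, not Scrapy's own selector engine:

```python
import xml.etree.ElementTree as ET

# Hand-written pagination markup: li.next as on quotes.toscrape.com,
# plus a rel="next" attribute on the anchor for illustration
html = """
<html><body>
  <ul class="pager">
    <li class="next"><a rel="next" href="/page/2/">Next</a></li>
  </ul>
</body></html>
"""

root = ET.fromstring(html)

# Analogue of response.xpath('//li[@class="next"]/a/@href').get()
next_by_class = root.find(".//li[@class='next']/a").get('href')

# Analogue of response.xpath('//a[@rel="next"]/@href').get()
next_by_rel = root.find(".//a[@rel='next']").get('href')

print(next_by_class, next_by_rel)  # both: /page/2/
```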
6. Follow the URL of the next page

Once we have extracted the URL of the next page, we can yield a new request that follows the link, reusing the same parse method as the callback:

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    
    # Define start_urls and allowed_domains
    
    def parse(self, response):
        # Extract the quote and author using Xpath selectors
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'quote': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//span/small/text()').get()
            }
        
        # Extract the URL of the next page (on sites that mark pagination
        # with rel="next", '//a[@rel="next"]/@href' works directly)
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        
        # Follow the URL of the next page; response.follow resolves
        # the relative href against the current page URL, which a bare
        # scrapy.Request would not do
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse)
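Note that the href extracted from the page is a relative URL such as /page/2/. response.follow joins it against the current page URL before issuing the request; with a plain scrapy.Request you would call response.urljoin yourself. The standard library's urljoin, which Scrapy builds on, illustrates the join (the URLs below are the ones this tutorial's site would produce):

```python
from urllib.parse import urljoin

current_page = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'  # what the selector returns: a relative URL

# response.follow performs this join internally; with scrapy.Request
# you would write response.urljoin(next_href) yourself
next_url = urljoin(current_page, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```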
7. Run the spider

To run the spider, use the scrapy crawl command followed by the spider name:

scrapy crawl quotes
Conclusion

In this tutorial, we showed how to use Scrapy's XPath selectors to follow a rel="next"-style pagination link. With this pattern, you should be able to extract data from every page of a paginated list using Scrapy and XPath.