Make Scrapy Follow Links In Order

March 05, 2024

I wrote a script and used Scrapy to find links in the first phase and follow the links and extract something from the page in the second phase. Scrapy DOES it BUT it follows the li

Solution 1:

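Requests yielded from a callback are not processed synchronously: Scrapy schedules and downloads them asynchronously, so the responses do not necessarily come back in the order the links were found. To keep each extracted page paired with the link it came from, pass the link text along with the request using meta. Doc: https://doc.scrapy.org/en/latest/topics/request-response.html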
Code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        # Each <a> element selected here is one link to follow.
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            # Attach the link text to the request so the callback can pair it with the page.
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        # Join the text fragments and strip newlines and tabs.
        sContent = "".join(lContent).replace("\n", "").replace("\t", "")
        # Read back the link text that parse() attached to this request.
        linkText = response.meta['LinkText']
        yield {"LinkContent": sContent, "LinkText": linkText}
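
If the output itself also has to appear in the original link order, one possible approach (not part of the answer above, just a sketch) is to carry the link's position through meta as well and sort the items once the crawl is done. The "LinkIndex" key and the restore_link_order helper below are hypothetical names chosen for illustration. In the spider you would fill the key by looping with enumerate(body), adding "LinkIndex": i to the meta dict, and copying it into the yielded item. The sorting step on its own can be sketched in plain Python:

```python
from operator import itemgetter


def restore_link_order(items):
    """Return the scraped items sorted by the position of their link on the index page.

    Assumes every item carries a hypothetical "LinkIndex" key that the spider
    copied from response.meta.
    """
    return sorted(items, key=itemgetter("LinkIndex"))


# Made-up sample items, listed in the order the responses happened to arrive.
scraped = [
    {"LinkIndex": 2, "LinkText": "third", "LinkContent": "c"},
    {"LinkIndex": 0, "LinkText": "first", "LinkContent": "a"},
    {"LinkIndex": 1, "LinkText": "second", "LinkContent": "b"},
]

ordered = restore_link_order(scraped)
print([item["LinkText"] for item in ordered])
```

Sorting after the crawl keeps the downloads concurrent. Limiting the spider to one request at a time would be another way to get closer to in-order crawling, at the cost of speed.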
