Separate Output File For Every Url Given In Start_urls List Of Spider In Scrapy
Solution 1:
I'd implement a more explicit approach (not tested):
- configure the list of possible categories in `settings.py`:

  ```python
  CATEGORIES = ['Arts', 'Business', 'Computers']
  ```

- define your `start_urls` based on the setting:

  ```python
  start_urls = ['http://www.dmoz.org/%s' % category
                for category in settings.CATEGORIES]
  ```

- add a `category` `Field` to the `Item` class
- in the spider's `parse` method, set the `category` field according to the current `response.url` (a fuller spider sketch follows this list), e.g.:

  ```python
  def parse(self, response):
      ...
      item['category'] = next(category for category in settings.CATEGORIES
                              if category in response.url)
      ...
  ```
- in the pipeline, open up exporters for all categories and choose which exporter to use based on `item['category']`:

  ```python
  def spider_opened(self, spider):
      ...
      self.exporters = {}
      for category in settings.CATEGORIES:
          file = open('output/%s.xml' % category, 'w+b')
          exporter = XmlItemExporter(file)
          exporter.start_exporting()
          self.exporters[category] = exporter

  def spider_closed(self, spider):
      for exporter in self.exporters.values():
          exporter.finish_exporting()

  def process_item(self, item, spider):
      self.exporters[item['category']].export_item(item)
      return item
  ```
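To make the first four steps concrete, here's a minimal spider sketch. It assumes the project's `settings` module is importable and uses an illustrative `DmozItem` with a `title` field; neither comes from the original answer:

```python
import scrapy

from myproject import settings  # assumption: the project settings module is importable

class DmozItem(scrapy.Item):
    category = scrapy.Field()
    title = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    start_urls = ['http://www.dmoz.org/%s' % category
                  for category in settings.CATEGORIES]

    def parse(self, response):
        item = DmozItem()
        # Derive the category from the URL this response came from.
        item['category'] = next(category for category in settings.CATEGORIES
                                if category in response.url)
        item['title'] = response.css('title::text').get()  # illustrative field
        yield item
```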
You would probably need to tweak it a bit to make it work, but I hope you get the idea: store the category inside the item being processed, and choose the file to export to based on the item's category value. Wired together with signal handlers, the pipeline might look like the sketch below.
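A minimal sketch of the complete pipeline, assuming the handlers are connected via `from_crawler` (as Solution 3 below does); the class name and the `output/` directory are illustrative:

```python
from scrapy import signals
from scrapy.exporters import XmlItemExporter

from myproject import settings  # assumption: the project settings module is importable

class PerCategoryXmlExportPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # One output file and one exporter per configured category.
        self.files = {}
        self.exporters = {}
        for category in settings.CATEGORIES:
            file = open('output/%s.xml' % category, 'w+b')
            self.files[category] = file
            exporter = XmlItemExporter(file)
            exporter.start_exporting()
            self.exporters[category] = exporter

    def spider_closed(self, spider):
        for category, exporter in self.exporters.items():
            exporter.finish_exporting()
            self.files[category].close()

    def process_item(self, item, spider):
        # Route the item to the exporter that matches its category.
        self.exporters[item['category']].export_item(item)
        return item
```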
Hope that helps.
Solution 2:
As long as you don't store it in the item itself, you can't really know the starting URL. The following solution should work for you:
- redefine `make_requests_from_url` to send the starting URL with each `Request` you make; you can store it in the `meta` attribute of your `Request` and pass this starting URL along with each following `Request`
- as soon as you decide to pass the element to the pipeline, fill in the starting URL for the item from `response.meta['start_url']` (see the sketch below)
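A minimal sketch of the idea. Note that `make_requests_from_url` has since been deprecated in Scrapy, so this sketch attaches the `meta` value in `start_requests` instead, which has the same effect; the spider name and the yielded fields are illustrative:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://www.dmoz.org/Arts', 'http://www.dmoz.org/Business']

    def start_requests(self):
        # Attach the originating start URL to every initial request.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'start_url': url})

    def parse(self, response):
        start_url = response.meta['start_url']
        # Propagate the start URL to every follow-up request...
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse,
                                  meta={'start_url': start_url})
        # ...and stamp it onto each item handed to the pipeline.
        yield {'url': response.url, 'start_url': start_url}
```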
Hope it helps. The following link may be helpful:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.make_requests_from_url
Solution 3:
Here's how I did it for my project, without setting the category in the item:
Pass the argument from the command line like this:
```
scrapy crawl reviews_spider -a brand_name=apple
```
Receive the argument and set it on the spider in `my_spider.py`:
```python
import json

import scrapy

class ReviewsSpider(scrapy.Spider):
    name = 'reviews_spider'

    def __init__(self, brand_name, *args, **kwargs):
        self.brand_name = brand_name
        super(ReviewsSpider, self).__init__(*args, **kwargs)
        # I am reading start_urls from an external file depending on the passed argument
        with open('make_urls.json') as f:
            self.start_urls = json.loads(f.read())[self.brand_name]
```
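This assumes `make_urls.json` maps each brand name to its list of start URLs, along these lines (the contents are purely illustrative):

```json
{
    "apple": ["https://example.com/reviews/apple"],
    "samsung": ["https://example.com/reviews/samsung"]
}
```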
In `pipelines.py`:
```python
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class ReviewSummaryItemPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # Change the output file name based on the argument.
        self.file = open(f'reviews_summary_{spider.brand_name}.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```
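Remember that the pipeline also has to be enabled in `settings.py` for it to run; the module path here is an assumption about the project layout:

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ReviewSummaryItemPipeline': 300,
}
```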