Separate Output File for Every URL Given in the start_urls List of a Spider in Scrapy
Solution 1:
I'd implement a more explicit approach (not tested):
- configure the list of possible categories in settings.py:

    CATEGORIES = ['Arts', 'Business', 'Computers']

- define your start_urls based on the setting:

    start_urls = ['http://www.dmoz.org/%s' % category
                  for category in settings.CATEGORIES]

- add a category Field to the Item class

- in the spider's parse method, set the category field according to the current response.url, e.g.:

    def parse(self, response):
        ...
        item['category'] = next(category for category in settings.CATEGORIES
                                if category in response.url)
        ...

- in the pipeline, open up exporters for all categories and choose which exporter to use based on item['category']:

    def spider_opened(self, spider):
        ...
        self.exporters = {}
        for category in settings.CATEGORIES:
            file = open('output/%s.xml' % category, 'w+b')
            exporter = XmlItemExporter(file)
            exporter.start_exporting()
            self.exporters[category] = exporter

    def spider_closed(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()

    def process_item(self, item, spider):
        self.exporters[item['category']].export_item(item)
        return item
You would probably need to tweak it a bit to make it work, but I hope you get the idea: store the category inside the item being processed, then choose the file to export to based on the item's category value.
Hope that helps.
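To make the pipeline piece concrete, here is a minimal, self-contained sketch of such a pipeline; the class name and the from_crawler wiring are illustrative additions, not part of the original snippets, and it assumes an output/ directory already exists:

    from scrapy import signals
    from scrapy.exporters import XmlItemExporter

    class CategoryExportPipeline(object):
        """Keep one XmlItemExporter per category and route each item
        to the exporter matching item['category']."""

        def __init__(self, categories):
            self.categories = categories
            self.files = {}
            self.exporters = {}

        @classmethod
        def from_crawler(cls, crawler):
            # read CATEGORIES from the settings and hook up the spider signals
            pipeline = cls(crawler.settings.getlist('CATEGORIES'))
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            for category in self.categories:
                file = open('output/%s.xml' % category, 'wb')  # assumes output/ exists
                exporter = XmlItemExporter(file)
                exporter.start_exporting()
                self.files[category] = file
                self.exporters[category] = exporter

        def spider_closed(self, spider):
            for exporter in self.exporters.values():
                exporter.finish_exporting()
            for file in self.files.values():
                file.close()

        def process_item(self, item, spider):
            self.exporters[item['category']].export_item(item)
            return item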
Solution 2:
Unless you store it in the item itself, you can't really know the starting url. The following solution should work for you:
- redefine make_requests_from_url to send the starting url with each Request you make. You can store it in the meta attribute of your Request. Pass this starting url along with each following Request.
- as soon as you decide to pass the element to the pipeline, fill in the starting url for the item from response.meta['start_url'] (a sketch of both steps follows below)
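A minimal sketch of this approach; the spider name, start urls and link selector are placeholders, and note that newer Scrapy versions deprecate make_requests_from_url in favour of overriding start_requests:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = 'dmoz'
        start_urls = ['http://www.dmoz.org/Arts', 'http://www.dmoz.org/Business']

        def make_requests_from_url(self, url):
            # attach the starting url to the very first request
            request = super(DmozSpider, self).make_requests_from_url(url)
            request.meta['start_url'] = url
            return request

        def parse(self, response):
            # propagate the starting url with every follow-up request
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse,
                                     meta={'start_url': response.meta['start_url']})
            # record the starting url on the item before it reaches the pipeline
            yield {'url': response.url, 'start_url': response.meta['start_url']}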
Hope it helps. The following link may be helpful:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.make_requests_from_url
Solution 3:
Here's how I did it for my project without setting the category in the item:
Pass the argument from the command line like this:
scrapy crawl reviews_spider -a brand_name=apple
Receive the argument and set it as a spider attribute in my_spider.py:
import json

# inside the ReviewsSpider class in my_spider.py:
def __init__(self, brand_name, *args, **kwargs):
    self.brand_name = brand_name
    super(ReviewsSpider, self).__init__(*args, **kwargs)
    # read start_urls from an external file depending on the passed argument
    with open('make_urls.json') as f:
        self.start_urls = json.loads(f.read())[self.brand_name]
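make_urls.json simply maps each brand name to its list of starting urls; a hypothetical example:

    {
        "apple": ["https://example.com/reviews/apple"],
        "samsung": ["https://example.com/reviews/samsung"]
    }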
In pipelines.py:
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class ReviewSummaryItemPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # change the output file name based on the argument
        self.file = open(f'reviews_summary_{spider.brand_name}.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
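For this to take effect, the pipeline must also be enabled in settings.py; the module path below assumes a hypothetical project package named reviews, so adjust it to your project layout:

    ITEM_PIPELINES = {
        'reviews.pipelines.ReviewSummaryItemPipeline': 300,
    }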