python - Scrapy: retain original order of scraped items in the output
I have the following Scrapy spider that gets the status of the pages from a list of URLs in the file urls.txt:
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from pegaslinks.items import StatusLinkItem

class FindErrorsSpider(CrawlSpider):
    handle_httpstatus_list = [404, 400, 401, 500]
    name = "finderrors"
    allowed_domains = ["domain-name.com"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        item = StatusLinkItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item

Here's the items.py file:
import scrapy

class StatusLinkItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()

I use the following command to get the output of the items in CSV:
scrapy crawl finderrors -o file.csv

The order of items in the output file is different from the order of the corresponding URLs in the urls.txt file. How can I retain the original order, or add some kind of global variable to items.py that represents the ID of the URLs, so that I would be able to restore the original order later?
You cannot rely on the order of URLs in start_urls. Scrapy schedules and downloads requests concurrently, so items are produced in whatever order the responses come back.
You can do the following: override the start_requests method in your spider and add an index parameter to the meta dictionary of the created Request objects.
from scrapy.http import Request

def start_requests(self):
    for index, url in enumerate(self.start_urls):
        yield Request(url, dont_filter=True, meta={'index': index})

Later you can access meta in the parse function using response.meta.
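Putting it together, here is a minimal sketch of how the index can flow from start_requests through response.meta into the item itself, so the CSV rows can be sorted back into the original order. The extra index field on the item and the plain scrapy.Spider base class are assumptions added for illustration; they are not in the original code.

import scrapy
from scrapy.http import Request

class StatusLinkItem(scrapy.Item):
    index = scrapy.Field()  # assumed extra field: position of the URL in urls.txt
    url = scrapy.Field()
    status = scrapy.Field()

class FindErrorsSpider(scrapy.Spider):
    handle_httpstatus_list = [404, 400, 401, 500]
    name = "finderrors"
    allowed_domains = ["domain-name.com"]

    def start_requests(self):
        # Read the URL list here rather than at class-definition time,
        # and tag each request with its original position.
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        for index, url in enumerate(urls):
            yield Request(url, dont_filter=True, meta={'index': index})

    def parse(self, response):
        item = StatusLinkItem()
        item['index'] = response.meta['index']  # recover the tag set in start_requests
        item['url'] = response.url
        item['status'] = response.status
        yield item

With the index column exported to file.csv, restoring the original order is then a simple sort on that column, for example sorted(rows, key=lambda r: int(r['index'])) after reading the file with csv.DictReader.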