python - Scrapy: retain original order of scraped items in the output
I have the following Scrapy spider, which checks the status of the pages listed in the file urls.txt:
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider

    from pegaslinks.items import StatusLinkItem


    class FindErrorsSpider(CrawlSpider):
        handle_httpstatus_list = [404, 400, 401, 500]
        name = "finderrors"
        allowed_domains = ["domain-name.com"]

        f = open("urls.txt")
        start_urls = [url.strip() for url in f.readlines()]
        f.close()

        def parse(self, response):
            item = StatusLinkItem()
            item['url'] = response.url
            item['status'] = response.status
            yield item
Here's the items.py file:
    import scrapy


    class StatusLinkItem(scrapy.Item):
        url = scrapy.Field()
        status = scrapy.Field()
I use the following command to output the items to CSV:

    scrapy crawl finderrors -o file.csv
The order of the items in the output file is different from the order of the corresponding URLs in the urls.txt file. How can I retain the original order, or add some kind of global variable to items.py that would represent the ID of each URL, so that I'm able to restore the original order later?
You cannot rely on the order of the URLs in start_urls: Scrapy sends requests asynchronously, so responses come back (and items are exported) in whatever order they happen to complete.
But you can do the following: override the start_requests method in your spider and add an index parameter to the meta dictionary of the created Request objects.
    from scrapy.http import Request

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            yield Request(url, dont_filter=True, meta={'index': index})
Later you can access the index in your parse function using response.meta.
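
For completeness, here is a minimal sketch of how the pieces could fit together, assuming you also add an index field to the item so the position survives into the CSV. The index field and the final sorting step are my additions, not part of the original answer, and the old scrapy.contrib import path from the question is kept (newer Scrapy versions use scrapy.spiders instead):

    import scrapy
    from scrapy.http import Request
    from scrapy.contrib.spiders import CrawlSpider


    class StatusLinkItem(scrapy.Item):
        url = scrapy.Field()
        status = scrapy.Field()
        index = scrapy.Field()  # assumed extra field: position of the URL in urls.txt


    class FindErrorsSpider(CrawlSpider):
        handle_httpstatus_list = [404, 400, 401, 500]
        name = "finderrors"
        allowed_domains = ["domain-name.com"]

        with open("urls.txt") as f:
            start_urls = [url.strip() for url in f]

        def start_requests(self):
            # Attach each URL's position to its request via the meta dict
            for index, url in enumerate(self.start_urls):
                yield Request(url, dont_filter=True, meta={'index': index})

        def parse(self, response):
            item = StatusLinkItem()
            item['url'] = response.url
            item['status'] = response.status
            item['index'] = response.meta['index']  # read the position back out
            yield item

Since every CSV row now carries its index, you can restore the original order after the crawl, e.g. by sorting file.csv on the index column with a spreadsheet, the Unix sort command, or a few lines of Python.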