python - Scrapy: retain original order of scraped items in the output
I have the following Scrapy spider that gets the status of the pages from a list of URLs in the file urls.txt:
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from pegaslinks.items import StatusLinkItem

class FindErrorsSpider(CrawlSpider):
    handle_httpstatus_list = [404, 400, 401, 500]
    name = "finderrors"
    allowed_domains = ["domain-name.com"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        item = StatusLinkItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item

Here's the items.py file:
import scrapy

class StatusLinkItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()

I use the following command to get the output of the items in CSV:
scrapy crawl finderrors -o file.csv

The order of items in the output file is different from the order of the corresponding URLs in the urls.txt file. How can I retain the original order, or add some kind of global variable to items.py that represents the ID of the URLs, so that I would be able to restore the original order later?
You cannot rely on the order of URLs in start_urls. Scrapy schedules and downloads requests concurrently, so items are produced in whatever order the responses come back.
You can do the following: override the start_requests method in your spider and add an index parameter to the meta dictionary of the created Request objects.
from scrapy.http import Request

def start_requests(self):
    for index, url in enumerate(self.start_urls):
        yield Request(url, dont_filter=True, meta={'index': index})

Later you can access meta in the parse function using response.meta.
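Putting it together, here is a minimal sketch of how the index can flow from start_requests through response.meta into the item itself, so the CSV rows can be sorted back into the original order. The extra index field on the item and the plain scrapy.Spider base class are assumptions added for illustration; they are not in the original code.

import scrapy
from scrapy.http import Request

class StatusLinkItem(scrapy.Item):
    index = scrapy.Field()  # assumed extra field: position of the URL in urls.txt
    url = scrapy.Field()
    status = scrapy.Field()

class FindErrorsSpider(scrapy.Spider):
    handle_httpstatus_list = [404, 400, 401, 500]
    name = "finderrors"
    allowed_domains = ["domain-name.com"]

    def start_requests(self):
        # Read the URL list here rather than at class-definition time,
        # and tag each request with its original position.
        with open("urls.txt") as f:
            urls = [line.strip() for line in f if line.strip()]
        for index, url in enumerate(urls):
            yield Request(url, dont_filter=True, meta={'index': index})

    def parse(self, response):
        item = StatusLinkItem()
        item['index'] = response.meta['index']  # recover the tag set in start_requests
        item['url'] = response.url
        item['status'] = response.status
        yield item

With the index column exported to file.csv, restoring the original order is then a simple sort on that column, for example sorted(rows, key=lambda r: int(r['index'])) after reading the file with csv.DictReader.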