6

I am using Scrapy to scrape a site.

I wrote a spider that fetches all the items from the page and saves them to a CSV file. Now I want to record the total execution time Scrapy takes to run the spider. When the spider finishes, the terminal shows some results such as the start time and end time, so in my program I need to calculate the total time Scrapy took to run the spider and store it somewhere.

Can anyone show me how to do this with an example?

Thanks in advance.

4

3 Answers

6

This might be useful:

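# Written against the Scrapy API of 2012 (Python 2); the scrapy.xlib.pydispatch
# and scrapy.stats modules are not available in current Scrapy releases.
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.stats import stats
from datetime import datetime

def handle_spider_closed(spider, reason):
    print 'Spider closed:', spider.name, stats.get_stats(spider)
    # 'start_time' is recorded in UTC, so compare it against utcnow().
    print 'Work time:', datetime.utcnow() - stats.get_stats(spider)['start_time']


# Run the handler whenever a spider finishes.
dispatcher.connect(handle_spider_closed, signals.spider_closed)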
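
The start_time stat is recorded in UTC, so subtracting it from a local-time datetime.now() can be off by the UTC offset. Below is a minimal sketch of the subtraction on its own, assuming a stats dict shaped like the one get_stats() returns. The helper name elapsed_since_start and the sample dict are hypothetical, and no Scrapy import is needed to try it:

```python
from datetime import datetime, timezone


def elapsed_since_start(stats, now=None):
    """Return the time elapsed since stats['start_time'] as a timedelta.

    Assumes 'start_time' is a UTC datetime, either naive or timezone-aware.
    If `now` is given, it must be timezone-aware.
    """
    start = stats["start_time"]
    if start.tzinfo is None:
        # Treat a naive timestamp as UTC so the subtraction is well defined.
        start = start.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - start


# Hypothetical stats dict standing in for the collected stats.
sample_stats = {"start_time": datetime(2012, 6, 28, 13, 40, 0)}
fixed_now = datetime(2012, 6, 28, 13, 43, 0, tzinfo=timezone.utc)
print("Work time:", elapsed_since_start(sample_stats, now=fixed_now))
```

Passing `now` explicitly keeps the example reproducible; inside a real handler it can be left out so the current UTC time is used.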
answered 2012-06-28T13:43:00.773
1

I'm quite a beginner, but I did this in a slightly simpler way. I hope it makes sense.

import datetime

Then use two instance attributes, self.starting_time and self.ending_time.

Inside the spider class's constructor, set the start time like this:

def __init__(self, name=None, **kwargs):
    # Let the base Spider class finish its own setup first.
    super().__init__(name=name, **kwargs)
    self.starting_time = datetime.datetime.now()

After that, use the closed method to compute the difference between the end and start times, i.e.

def closed(self, reason):
    # Scrapy passes the reason the spider closed, not a response.
    self.ending_time = datetime.datetime.now()
    duration = self.ending_time - self.starting_time
    print(duration)

That's pretty much it. The closed method is called right after the spider finishes its work. See the Scrapy documentation on the closed method for details.
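
The same start/stop pattern can be tried outside Scrapy. Here is a minimal sketch, assuming a hypothetical RunTimer helper (not part of Scrapy) whose start() would be called from __init__ and whose stop() would be called from closed. It uses time.monotonic(), which is not affected by system clock adjustments, in place of datetime.now():

```python
import time
from datetime import timedelta


class RunTimer:
    """Hypothetical helper: measures the duration between start() and stop()."""

    def __init__(self):
        self._started = None
        self.duration = None

    def start(self):
        # Call this where the spider is constructed.
        self._started = time.monotonic()

    def stop(self):
        # Call this from the spider's closed() method.
        if self._started is None:
            raise RuntimeError("stop() called before start()")
        self.duration = timedelta(seconds=time.monotonic() - self._started)
        return self.duration


timer = RunTimer()
timer.start()
time.sleep(0.1)  # stands in for the crawl
print("Duration:", timer.stop())
```

Returning a timedelta keeps the result printable in the same form as the datetime subtraction above.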

answered 2019-06-01T17:53:05.013
0

The simplest way I've found so far:

import scrapy

class StackoverflowSpider(scrapy.Spider):
    name = "stackoverflow"

    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for title in response.css(".summary .question-hyperlink::text").getall():
            yield {"Title":title}

    def close(self, reason):
        # Read the timestamps Scrapy recorded in the crawler stats.
        start_time = self.crawler.stats.get_value('start_time')
        finish_time = self.crawler.stats.get_value('finish_time')
        print("Total run time:", finish_time - start_time)
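
Since the question also asks about storing the total time somewhere, here is a minimal sketch that appends one row per run to a CSV file. The helper name save_run_time, the column layout, and the file name run_times.csv are assumptions for illustration; in a spider it could be called from the close handler with the two stats values:

```python
import csv
import os
import tempfile
from datetime import datetime


def save_run_time(path, spider_name, start_time, finish_time):
    """Append spider name, start, finish and duration in seconds to a CSV file."""
    duration = (finish_time - start_time).total_seconds()
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["spider", "start_time", "finish_time", "seconds"])
        writer.writerow(
            [spider_name, start_time.isoformat(), finish_time.isoformat(), duration]
        )
    return duration


# Example with made-up timestamps; real values would come from the stats.
log_path = os.path.join(tempfile.mkdtemp(), "run_times.csv")
seconds = save_run_time(
    log_path,
    "stackoverflow",
    datetime(2020, 12, 25, 20, 0, 0),
    datetime(2020, 12, 25, 20, 4, 30),
)
print("Saved run time in seconds:", seconds)
```

Appending rather than overwriting keeps a history of runs, which makes it easy to compare execution times later.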
answered 2020-12-25T20:04:56.487