I currently have the following rules:
# Matches the first comments page under a user overview, e.g.
# http://lookbook.nu/user/50784-Adam-G/comments/
Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]+/comments/?$'), deny=(r'\?locale=')),
     callback='parse_model_comments'),
# Matches the paginated comments pages, e.g.
# http://lookbook.nu/user/50784-Adam-G/comments?page=2
Rule(SgmlLinkExtractor(allow=(r'/user/\d+[^/]+/comments\?page=\d+$'), deny=(r'\?locale=')),
     callback='parse_model_comments'),
And here is my callback definition:
def parse_model_comments(self, response):
    log.msg("Inside parse_model_comments")
    hxs = HtmlXPathSelector(response)
    model_url = hxs.select('//div[@id="userheader"]/h1/a/@href').extract()[0]
    comments_hxs = hxs.select(
        '//div[@id="profile_comments"]/div[@id="comments"]/div[@class="comment"]')
    if comments_hxs:
        log.msg("Yielding next page." + LookbookSpider.next_page(response.url))
        yield Request(LookbookSpider.next_page(response.url))
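LookbookSpider.next_page is not shown here. Judging only from the log output, where .../comments becomes .../comments?page=2, a hypothetical stand-in might look roughly like the sketch below. The function body is an assumption for illustration, and the real helper may differ.

```python
import re


def next_page(url):
    """Hypothetical stand-in for LookbookSpider.next_page (assumed behavior)."""
    match = re.search(r'[?&]page=(\d+)$', url)
    if match is None:
        # No page parameter yet, so treat the current page as page 1.
        return url.rstrip('/') + '?page=2'
    # Replace the trailing page number with the next one.
    return url[:match.start(1)] + str(int(match.group(1)) + 1)


first = 'http://lookbook.nu/user/1363501-Rachael-Jane-H/comments'
print(next_page(first))
```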
This is the log from an actual run:
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> (referer: None)
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> (referer: http://lookbook.nu/user/1363501-Rachael-Jane-H/comments)
2012-11-26 18:52:46-0800 [scrapy] INFO: Inside parse_model_comments
2012-11-26 18:52:46-0800 [scrapy] INFO: Yielding next page.http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Scraped from <200 http://lookbook.nu/user/1363501-Rachael-Jane-H/comments>
{'model_url': u'http://lookbook.nu/rachinald',
'posted_at': u'2012-11-26T13:21:49-05:00',
'target_url': u'http://lookbook.nu/look/4290423-Blackout-Challenge-One',
'text': u"Thanks Justina :) They're actually purple - the whole premise is to not wear black all week ^^",
'type': 2}
...
2012-11-26 18:52:47-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2> (referer: http://lookbook.nu/user/1363501-Rachael-Jane-H/comments)
2012-11-26 18:52:48-0800 [lookbook] INFO: Closing spider (finished)
2012-11-26 18:52:48-0800 [lookbook] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2072,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 51499,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 11, 27, 2, 52, 48, 43058),
'item_scraped_count': 14,
'log_count/DEBUG': 23,
'log_count/INFO': 6,
'request_depth_max': 3,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2012, 11, 27, 2, 52, 44, 446851)}
2012-11-26 18:52:48-0800 [lookbook] INFO: Spider closed (finished)
Even though ?page=2 was crawled, parse_model_comments was not called for that response: "Inside parse_model_comments" was logged only once, for the first comments page.
I checked the pattern with
re.search('/user/\d+[^/]+/comments\?page=\d+$', 'http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2')
and confirmed that it does match.
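For anyone who wants to reproduce that check, here is a minimal standalone sketch. The patterns are copied from the two rules; the variable names are only illustrative. It also shows that the first rule's pattern does not match the ?page=2 URL, so only the second rule could apply to it.

```python
import re

# Patterns copied from the allow= arguments of the two rules.
FIRST_PAGE_PATTERN = r'/user/\d+[^/]+/comments/?$'
PAGED_PATTERN = r'/user/\d+[^/]+/comments\?page=\d+$'
DENY_PATTERN = r'\?locale='

url = 'http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2'

paged_match = re.search(PAGED_PATTERN, url)
first_page_match = re.search(FIRST_PAGE_PATTERN, url)
deny_match = re.search(DENY_PATTERN, url)

print(paged_match is not None)       # the paginated rule's pattern
print(first_page_match is not None)  # the first-page rule's pattern
print(deny_match is not None)        # the deny pattern
```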
Is there any reason why ?page=2 would be crawled without the callback being called?