I'm running into a strange issue while trying to crawl a particular site. The code works perfectly when I crawl a few pages with a BaseSpider, but when I switch to a CrawlSpider, the spider finishes without errors and the callback is never invoked, so nothing gets scraped.
The following code works fine:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log

class hushBabiesSpider(BaseSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    def parse(self, response):
        print response.body
        print "Inside parse Item"
        return []
The following code does not work:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log

class hushBabiesSpider(CrawlSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=()),
             'parseItem',
             follow=True,
        ),
    )

    def parseItem(self, response):
        print response.body
        print "Inside parse Item"
        return []
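To check whether the link extractor finds anything at all on these pages, I also put together the standalone test below (a minimal sketch; fetching with urllib2 and wrapping the body in an HtmlResponse is just my debugging setup, not part of the spider):

import urllib2
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Fetch the page outside of Scrapy and wrap it in an HtmlResponse so the
# link extractor can be exercised in isolation.
url = "http://www.hushbabies.com"
body = urllib2.urlopen(url).read()
response = HtmlResponse(url=url, body=body)

# Compare the raw <a href> count against what SgmlLinkExtractor returns;
# a large gap would suggest the extractor is choking on this site's markup.
hrefs = HtmlXPathSelector(response).select('//a/@href').extract()
links = SgmlLinkExtractor(allow=()).extract_links(response)
print "hrefs via XPath:", len(hrefs)
print "links via SgmlLinkExtractor:", len(links)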
Here is the output from running Scrapy:
scrapy crawl hushbabies
2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline
2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened
2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats:
{'downloader/request_bytes': 634,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 44395,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965),
'scheduler/memory_enqueued': 2,
'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)}
2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished)
2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 27820032, 'memusage/startup': 27652096}
If I change the site from hushbabies.com to any other site, the code starts working.
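For reference, the next thing I plan to try is adding the method below to hushBabiesSpider, to print how many links each rule extracts from the start pages (a sketch; parse_start_url is the CrawlSpider hook called for start URLs, and self._rules is an internal attribute, so this is purely for debugging):

    def parse_start_url(self, response):
        # For each compiled rule, run its link extractor against the start
        # page and report the count; zero here would mean the extractor
        # finds no links to follow, which matches the behaviour above.
        for n, rule in enumerate(self._rules):
            links = rule.link_extractor.extract_links(response)
            print "rule %d extracted %d links from %s" % (n, len(links), response.url)
        return []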