
I am running into a strange problem while trying to crawl a particular site. When I crawl some of its pages using BaseSpider, the code works perfectly, but as soon as I change it to use CrawlSpider, the spider finishes without errors and the callback is never invoked.

The following code works fine:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.loader import XPathItemLoader
    from dirbot.items import Website
    from urlparse import urlparse
    from scrapy import log


    class hushBabiesSpider(BaseSpider):
        name = "hushbabies"
        #download_delay = 10
        allowed_domains = ["hushbabies.com"]
        start_urls = [
            "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
            "http://www.hushbabies.com/category/mommy-newborn.html",
            "http://www.hushbabies.com",
        ]

        def parse(self, response):
            print response.body
            print "Inside parse Item"
            return []

The following code does not work:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.loader import XPathItemLoader
    from dirbot.items import Website
    from urlparse import urlparse
    from scrapy import log


    class hushBabiesSpider(CrawlSpider):
        name = "hushbabies"
        #download_delay = 10
        allowed_domains = ["hushbabies.com"]
        start_urls = [
            "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
            "http://www.hushbabies.com/category/mommy-newborn.html",
            "http://www.hushbabies.com",
        ]
        rules = (
            Rule(SgmlLinkExtractor(allow=()), 'parseItem', follow=True),
        )

        def parseItem(self, response):
            print response.body
            print "Inside parse Item"
            return []

Here is the output from the Scrapy run:

    scrapy crawl hushbabies
    2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot)
    2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider
    2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline
    2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened
    2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None)
    2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None)
    2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None)
    2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished)
    2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats:
            {'downloader/request_bytes': 634,
             'downloader/request_count': 3,
             'downloader/request_method_count/GET': 3,
             'downloader/response_bytes': 44395,
             'downloader/response_count': 3,
             'downloader/response_status_count/200': 3,
             'finish_reason': 'finished',
             'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965),
             'scheduler/memory_enqueued': 2,
             'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)}
    2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished)
    2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats:
            {'memusage/max': 27820032, 'memusage/startup': 27652096}

If I change the site from hushbabies.com to a different site, the code starts working.


1 Answer


There seems to be a problem with sgmllib, the SGML parser that SgmlLinkExtractor is built on.

The following code returns zero links:

    >>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    >>> fetch('http://www.hushbabies.com/')
    >>> len(SgmlLinkExtractor().extract_links(response))
    0
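
To double-check that sgmllib itself is at fault, rather than the extractor logic built on top of it, you can count how many anchor tags sgmllib manages to parse out of the page. Here is a minimal sketch, meant to be run in the same scrapy shell session (where `response` is already populated); if sgmllib is choking on the site's markup, the count will come out far below the number of `<a>` tags actually present in the HTML:

    import sgmllib

    class AnchorCounter(sgmllib.SGMLParser):
        """Counts the <a> tags that sgmllib actually parses."""
        def reset(self):
            sgmllib.SGMLParser.reset(self)
            self.count = 0
        # SGMLParser dispatches <a ...> tags to start_a()
        def start_a(self, attrs):
            self.count += 1

    parser = AnchorCounter()
    parser.feed(response.body_as_unicode())
    parser.close()
    print parser.count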

As an alternative, you can try Slybot's link extractor, which is built on Scrapely:

    >>> from slybot.linkextractor import LinkExtractor
    >>> from scrapely.htmlpage import HtmlPage
    >>> p = HtmlPage(body=response.body_as_unicode())
    >>> sum(1 for _ in LinkExtractor().links_to_follow(p))
    314
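
To use this inside the spider itself, one option is to stay with BaseSpider and yield the follow-up requests manually instead of relying on CrawlSpider's rules. Below is a minimal sketch under the assumption that each extracted link exposes a `url` attribute, as Scrapy's Link objects do; verify this against your slybot version:

    from scrapy.http import Request
    from scrapy.spider import BaseSpider
    from scrapely.htmlpage import HtmlPage
    from slybot.linkextractor import LinkExtractor

    class hushBabiesSpider(BaseSpider):
        name = "hushbabies"
        allowed_domains = ["hushbabies.com"]
        start_urls = ["http://www.hushbabies.com"]

        def parse(self, response):
            print "Inside parse Item"
            # Build a scrapely HtmlPage and let slybot extract the links,
            # bypassing the sgmllib-based SgmlLinkExtractor entirely.
            page = HtmlPage(response.url, body=response.body_as_unicode())
            for link in LinkExtractor().links_to_follow(page):
                # Assumption: each extracted link exposes a `url` attribute.
                yield Request(link.url, callback=self.parse)

Offsite links are still filtered by OffsiteMiddleware via allowed_domains, so following every extracted link here stays confined to the target site.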
answered 2012-07-23T21:20:09.573