
I'm having two main problems:

1) The parse_item method is not called/executed after a page is crawled.
2) When callback='self.parse_item' is included in the rule, scrapy does not keep following links; instead it only follows the links that are immediately available from the start_urls.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from sheprime.items import SheprimeItem

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx",
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='self.parse_item')
    ]

    def parse_item(self, response):
        print "some message"

    # I have put in this simple parse function, because I just want to get it to work

Thanks for your help,

L


1 Answer


Your code:

Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='self.parse_item')

should be:

Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='parse_item')
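When CrawlSpider compiles its rules, a string callback is looked up as an attribute on the spider instance, so only the bare method name works; a dotted name like 'self.parse_item' silently resolves to nothing, which is why parse_item never runs. Here is a minimal sketch of that lookup, using a hypothetical Demo class (a paraphrase, not the actual Scrapy source):

    class Demo(object):
        def parse_item(self, response):
            print "some message"

    d = Demo()
    # Roughly what CrawlSpider does with a string callback:
    print getattr(d, 'parse_item', None)       # bound method -> gets called
    print getattr(d, 'self.parse_item', None)  # None -> no attribute named "self.parse_item"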

This works for me:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class HerroomSpider(CrawlSpider):
    name = "herroom"
    allowed_domains = ["herroom.com"]
    start_urls = [
        "http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml",
        "http://www.herroom.com/hosiery.aspx"
    ]


    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )), callback='parse_item')
    ]

    def parse_item(self, response):
        print "some message"

Result:

vic@wic:~/projects/test$ scrapy crawl herroom
2012-07-09 08:08:51+0400 [scrapy] INFO: Scrapy 0.15.1 started (bot: domains_scraper)
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled extensions: LogStats, CloseSpider, CoreStats, SpiderState
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-09 08:08:51+0400 [scrapy] DEBUG: Enabled item pipelines: Pipeline
2012-07-09 08:08:51+0400 [herroom] INFO: Spider opened
2012-07-09 08:08:51+0400 [herroom] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-09 08:08:52+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml> (referer: None)
2012-07-09 08:08:54+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/hosiery.aspx> (referer: None)
2012-07-09 08:08:55+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
2012-07-09 08:08:56+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p300-trocadero-strapless-bra.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
2012-07-09 08:08:57+0400 [herroom] DEBUG: Crawled (200) <GET http://www.herroom.com/simone-perele-12p342-trocadero-push-up-bra-with-racerback.shtml> (referer: http://www.herroom.com/simone-perele-12p314-trocadero-sheer-seamless-racerback-bra.shtml)
some message
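Regarding your second problem (the crawl only going one level deep): when a Rule is given a callback, its follow argument defaults to False, so matched pages are parsed but the links on them are not followed any further. If you want parse_item to run and the spider to keep crawling deeper, set follow=True explicitly, for example:

    rules = [
        Rule(SgmlLinkExtractor(allow=(r'/[A-Za-z0-9\-]+\.shtml', )),
             callback='parse_item', follow=True)
    ]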