python - Scrapy Spider returns None instead of Item

Question

The answer is below. In short: the indentation in my ItemPipeline was wrong, so it returned None.

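I had never used Python before and was trying to write a CrawlSpider with Scrapy. The spider crawls, calls the callback function, extracts the data and fills the item, but it always returns None. I tested it with a print article call, and everything looked fine. I tried this with both yield and return (although I still don't understand the difference). Frankly, I'm out of ideas. Below is the callback function. Edit: I have added the spider code as well.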

class ZeitSpider(CrawlSpider):
    name = 'xxxx'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/%d/%d' % (JAHR, 39)]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="teaserlist"]/li[@class="archiveteaser"]/h4[@class="title"]')), callback='parse_url', follow=True),)


    def parse_url(self,response):
        hxs = HtmlXPathSelector(response)

        article = Article()

        article['url']= response.url.encode('UTF-8',errors='strict')

        article['author']= hxs.select('//div[@id="informatives"]/ul[@class="tools"]/li[@class="author first"]/text()').extract().pop().encode('UTF-8',errors='strict')
        article['title']= hxs.select('//div[@class="articleheader"]/h1/span[@class="title"]/text()').extract().pop().encode('UTF-8',errors='strict')

        article['text']= hxs.select('//div[@id="main"]/p/text()').extract().pop().encode('UTF-8',errors='strict')

        article['excerpt'] = hxs.select('//p[@class="excerpt"]/text()').extract().pop().encode('UTF-8',errors='strict')
        yield article

and the item definition:

class Article(Item):
    url=Field()
    author=Field()
    text=Field()
    title=Field()
    excerpt=Field()
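
On the yield-versus-return point raised above, here is a minimal sketch in plain Python (no Scrapy involved). The function names and the dict standing in for an Article are made up for illustration. A function that uses return hands back the object itself, while a function that uses yield hands back a generator that produces the object when iterated. As far as I know, Scrapy accepts either form from a callback, so this choice is unlikely to be the cause of a None item.

```python
def parse_with_return(url):
    """Hypothetical callback that returns the item directly."""
    article = {'url': url}
    return article  # the caller receives the dict itself


def parse_with_yield(url):
    """Hypothetical callback that yields the item."""
    article = {'url': url}
    yield article  # the caller receives a generator producing the dict


returned = parse_with_return('http://www.example.com/')
items = list(parse_with_yield('http://www.example.com/'))

print(returned)
print(items)
```

Neither variant produces None on its own; both deliver the same dict, just wrapped differently.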

score 3 · Accepted Answer

OK, after stepping through the program with pdb, I found the error:

I have multiple spiders, so I wanted to write multiple ItemPipelines. To tell the spiders apart in each pipeline, I used:

if spider.name == 'SpiderName':
    return item

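Note the indentation. Because return item sat inside the if block, the pipeline returned nothing for any other spider, and a Python function that returns nothing returns None, so the output was None.

After fixing the indentation, the spider worked fine. Another example of PEBCAC.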
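
For anyone hitting the same thing, here is a rough sketch of the mistake and the fix. The full pipeline is not shown in the post, so the class names, the processed_by key, and the SimpleNamespace stand-ins for spiders are all assumptions made for illustration; only process_item(self, item, spider) is the method Scrapy actually calls on a pipeline.

```python
from types import SimpleNamespace


class BuggyPipeline:
    """Reproduces the mistake: `return item` is nested inside the `if`."""

    def process_item(self, item, spider):
        if spider.name == 'SpiderName':
            return item
        # Any other spider falls through here, so Python returns None implicitly.


class FixedPipeline:
    """`return item` sits at method level, so every spider gets its item back."""

    def process_item(self, item, spider):
        if spider.name == 'SpiderName':
            # Hypothetical spider-specific step, not from the original post.
            item['processed_by'] = 'FixedPipeline'
        return item


# Stand-ins for real Scrapy spiders: only the `name` attribute is used here.
matching = SimpleNamespace(name='SpiderName')
other = SimpleNamespace(name='OtherSpider')

buggy_result = BuggyPipeline().process_item({'url': 'http://www.example.com/'}, other)
fixed_result = FixedPipeline().process_item({'url': 'http://www.example.com/'}, other)

print(buggy_result)
print(fixed_result)
```

With the buggy version, an item from a spider whose name does not match is replaced by None; with the fixed version, it passes through unchanged.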
