python - Returning complex items with Scrapy (web crawler)

Question

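I am trying to write a fairly specialized web crawler with Scrapy that returns result objects. I am stuck, and I may well be going about this completely backwards.

More specifically, for each subforum on TheScienceForum.com (Mathematics, Physics, etc.), I want to collect the titles of all threads in that subforum and end up with an object that holds the forum name and a list of all thread titles in that forum.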

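The ultimate goal is to run text analysis on the thread titles to identify the most common terms and jargon associated with each forum. Eventually I would like to analyze the threads themselves as well.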
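To make that goal concrete, here is a minimal sketch of the kind of term counting that could be run once the titles are collected. It uses only the standard library; the function name `most_common_terms`, the stopword list, and the sample titles are all made up for illustration and are not part of the crawler.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would need a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "for", "on"}


def most_common_terms(titles, n=3):
    """Return the n most frequent non-stopword terms across a list of titles."""
    words = []
    for title in titles:
        # Lowercase the title and keep runs of letters/apostrophes as words.
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(n)


# Made-up thread titles standing in for one subforum's scraped data.
sample_titles = [
    "Proof of the prime number theorem",
    "Is this proof correct?",
    "Prime gaps and twin primes",
]
top_terms = most_common_terms(sample_titles, n=2)
print(top_terms)
```

Running this per forum (one list of titles per forum name) would give the per-forum vocabulary described above.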

I have a single Item class, defined as follows:

from scrapy.item import Item, Field

class ProjectItem(Item):
    name = Field() #the forum name
    titles = Field() #the titles

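I may be misunderstanding how items work, but I want to create one item per subforum and collect all of that subforum's thread titles in a list on that same item.

The crawler I wrote looks like this, but it does not work as expected: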

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    from individualProject.items import ProjectItem

    class TheScienceForum(CrawlSpider):
        name = "TheScienceForum.com"
        allowed_domains = ["theScienceForum.com"]
        start_urls = ["http://www.thescienceforum.com"]
        rules = [Rule(SgmlLinkExtractor(restrict_xpaths=['//h2[@class="forumtitle"]/a']), 'parse_one'),Rule(SgmlLinkExtractor(restrict_xpaths=['//div[@class="threadpagenav"]']), 'parse_two')]

        def parse_one(self, response):
            Sel = HtmlXPathSelector(response)
            forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()

            items = []
            for forumName in forumNames:
                item = projectItem()
                item['name'] = forumName
                items.append(item)
            yield items

        def parse_two(self, response):
            Sel = HtmlXPathSelector(response)
            threadNames = Sel.select('////h3[@class="threadtitle"]/a/text()').extract()
            for item in items:
                for title in titles:
                    if Sel.select('//h1/span[@class="forumtitle"]/text()').extract()==item.name:
                        item['titles'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
            return items

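The idea is to start at the site's main page, which lists all the subforum names. The first rule only allows links to the first page of each subforum and associates them with the first parse function. That function is meant to create one item per subforum and store the forum name in its 'name' attribute.

For the following requests, the second rule restricts the spider to navigating the pages that contain all of a subforum's threads (the pagination links). The second parse method is meant to add the thread titles to the item whose name matches the name of the current subforum (Sel.select('//h1/span[@class="forumtitle"]/text()').extract()).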

The spider crawls all the main forum pages, but on each page I get the following error:

2013-11-01 13:05:37-0400 [TheScienceForum.com] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.thescienceforum.com/mathematics/>

Any help or advice would be appreciated. Thanks!

score 2 · Accepted Answer

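I found a solution to the crawling problem I was having. The code below starts the spider on the forum home page and creates a new item for each subforum. The spider then follows the links to each page of each subforum, collecting thread titles along the way (adding them to the relevant item, with the full list of items passed along with each subsequent request). Here is the code: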

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

from individualProject.items import ProjectItem

class TheScienceForum(BaseSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["www.thescienceforum.com"]
    start_urls = ["http://www.thescienceforum.com"]
    #rules = [Rule(SgmlLinkExtractor(restrict_xpaths=['//h2[@class="forumtitle"]/a']), 'parse_one'),Rule(SgmlLinkExtractor(restrict_xpaths=['//div[@class="threadpagenav"]']), 'parse_two')]

    def parse(self, response):
        Sel = HtmlXPathSelector(response)
        # Create one item per subforum, each starting with an empty list of titles.
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            item['titles'] = []
            items.append(item)

        # Follow each subforum link, passing the shared list of items along in meta.
        forums = Sel.select('//h2[@class="forumtitle"]/a/@href').extract()
        itemDict = {}
        itemDict['items'] = items
        for forum in forums:
            yield Request(url=forum,meta=itemDict,callback=self.addThreadNames)

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        # extract() returns a list, so join it into one string before comparing with the name.
        currentForum = ''.join(Sel.select('//h1/span[@class="forumtitle"]/text()').extract()).strip()
        for item in items:
            if currentForum==item['name'].strip():
                # 'titles' is the field declared on ProjectItem.
                item['titles'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        self.log(items)

        # Follow the "next page" link, if any, and keep passing the same items along.
        itemDict = {}
        itemDict['items'] = items
        threadPageNavs = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for threadPageNav in threadPageNavs:
            yield Request(url=threadPageNav,meta=itemDict,callback=self.addThreadNames)
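As an aside, the name-matching step in addThreadNames can be sketched without Scrapy at all. The version below is only an illustration under assumptions: it keeps titles in a plain dictionary keyed by forum name rather than in a list of items, so no scan over every item is needed. The function name `add_thread_names` and the sample data are hypothetical.

```python
from collections import defaultdict


def add_thread_names(titles_by_forum, current_forum, page_titles):
    """Append one page's thread titles to the list stored under the forum's name."""
    # Strip whitespace so " Mathematics " and "Mathematics" land under the same key.
    key = current_forum.strip()
    titles_by_forum[key].extend(title.strip() for title in page_titles)
    return titles_by_forum


# Made-up data standing in for what each crawled page would provide.
titles_by_forum = defaultdict(list)
add_thread_names(titles_by_forum, "Mathematics", ["Prime gaps", "Is 0.999... = 1?"])
add_thread_names(titles_by_forum, " Mathematics ", ["Proof help"])
add_thread_names(titles_by_forum, "Physics", ["Entropy question"])

for forum, titles in titles_by_forum.items():
    print(forum, len(titles))
```

In the spider, the same dictionary could be passed through meta in place of the list of items; whether that is preferable depends on how the results are meant to be exported.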

The problem I am facing now is how to store the data I plan to classify (for later analysis). I have opened a separate question about that here.

score 0 · Accepted Answer

As Christian Temus suggests, please describe the problem you are facing more specifically. Having looked over the code, I can offer a few suggestions:

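Instead of returning a list of items, you should do "yield item" inside the for loop.
Use CrawlSpider.
If you use CrawlSpider, rename the 'parse' method to something else, such as parse_titles, because CrawlSpider uses parse internally to apply its rules.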
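The first suggestion can be illustrated without Scrapy. In the sketch below, plain dictionaries stand in for ProjectItem and the forum names are made-up sample data. The first generator mirrors the question's parse_one and hands back the whole list as a single object, which is what produces the "got 'list'" error. The second yields each item on its own.

```python
def parse_returning_list(forum_names):
    """Mirrors the question's parse_one: the whole list is yielded as ONE object."""
    items = []
    for forum_name in forum_names:
        items.append({"name": forum_name, "titles": []})
    yield items


def parse_yielding_items(forum_names):
    """Yields each item separately, as the first suggestion recommends."""
    for forum_name in forum_names:
        yield {"name": forum_name, "titles": []}


forum_names = ["Mathematics", "Physics"]  # made-up sample data

bad_output = list(parse_returning_list(forum_names))
good_output = list(parse_yielding_items(forum_names))

print(type(bad_output[0]).__name__)   # the single yielded object is a list
print(type(good_output[0]).__name__)  # each yielded object is one item (a dict here)
```

Scrapy inspects every object a callback yields, so a yielded list is rejected while individually yielded items are accepted.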
