python - Scrapy - parsing items that are paginated

Question

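I have a URL of the form:

example.com/foo/bar/page_1.html

There are 53 pages in total, and each of them has up to 20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also goes one page deeper per item to get more info about the item: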

  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url,rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is: how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

score 48 · Accepted Answer

You have two options to solve your problem. The general one is to use yield instead of return to generate new requests. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
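The difference between the two is easy to see outside Scrapy. Here is a minimal sketch in plain Python, where `first_only` and `all_links` are hypothetical stand-ins for a spider callback and the strings stand in for Request objects:

```python
def first_only(links):
    # `return` ends the function on the first iteration,
    # so only one "request" ever leaves the callback.
    for link in links:
        return "request:" + link


def all_links(links):
    # `yield` turns the function into a generator, so the caller
    # receives one "request" per link.
    for link in links:
        yield "request:" + link


links = ["page_1.html", "page_2.html", "page_3.html"]
single = first_only(links)
many = list(all_links(links))
```

In the question's parse method, the corresponding change is replacing `return request` with `yield request` inside the loop.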

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
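On Python 3, `xrange` no longer exists, so the same list would be built with `range`. A small sketch using the URL pattern from the question (the name `start_urls` mirrors the spider attribute, and nothing here depends on Scrapy itself):

```python
# Pages are numbered 1..53, so the exclusive upper bound is 54.
start_urls = [
    "http://example.com/foo/bar/page_%d.html" % page
    for page in range(1, 54)
]
```

The resulting list can be assigned to `start_urls` in the spider class exactly as above.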

score 12 · Answer

You could use CrawlSpider instead of BaseSpider and use SgmlLinkExtractor to extract the pagination pages.

For instance:

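start_urls = ["http://www.example.com/page1"]
rules = (
    # follow the "next page" links; no callback is called for them
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)), follow=True),
    # call parse_call for every link found inside the matching div
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)), callback='parse_call'),
)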

The first rule tells Scrapy to follow the links contained in the XPath expression; the second rule tells Scrapy to call parse_call on the links contained in the XPath expression.

For more information, see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

score 10 · Answer

There can be two use cases for "scrapy - parsing items that are paginated".

A) We just want to move across the table and fetch data. This is relatively straightforward.

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']
    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

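Observe the last 4 lines. Here:

1. We get the link to the next page from the XPath of the "Next" pagination button.
2. The if condition checks that we have not reached the end of the pagination.
3. We join this link (the one we got in step 1) with the main URL using urljoin.
4. We make a recursive call to the parse callback method.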
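Step 3 matters because the `href` on a "next" link is often relative. As far as I know, `response.urljoin` resolves it against the response URL in much the same way as the standard library's `urljoin`, so the effect can be sketched without Scrapy. The URLs below reuse the example pattern from the question and are only illustrative:

```python
from urllib.parse import urljoin

current_page = "http://example.com/foo/bar/page_1.html"

# A relative href replaces the last path segment of the current page.
relative_next = urljoin(current_page, "page_2.html")

# A root-relative href is resolved against the site root.
rooted_next = urljoin(current_page, "/foo/bar/page_3.html")

# An absolute href is returned unchanged.
absolute_next = urljoin(current_page, "http://example.com/foo/bar/page_4.html")
```

All three results are absolute URLs, which is what `scrapy.Request` needs.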

B) Not only do we want to move across pages, we also want to extract data from one or more links on each page.

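from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']
    rules = (
        # follow the "next page" link on every listing page
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # call parse_trains for every link that looks like /trains/<number>
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''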

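Here, observe that:

1. We are using the CrawlSpider subclass of the scrapy.Spider parent class.
2. We have set "rules":

a) The first rule just checks whether there is a "next_page" available and follows it.

b) The second rule requests all the links on a page that are in the format of, say, /trains/12343, and then calls parse_trains to perform the parsing operation.
Important: Note that we don't want to use the regular parse method here, because we are using the CrawlSpider subclass. This class also has a parse method, so we don't want to override it. Just remember to name your callback method something other than parse.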
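To get a feel for which links the `allow` pattern in the second rule would pick up, the regular expression can be tried on its own. This is only a sketch: the sample URLs are made up, and it assumes the pattern is searched anywhere in the absolute URL, which is my understanding of how `LinkExtractor` applies `allow`.

```python
import re

allow = re.compile(r"/trains/\d+$")

sample_urls = [
    "http://example.com/trains/12343",          # detail page
    "http://example.com/trains/12343/reviews",  # extra path segment after the id
    "http://example.com/trains/",               # listing page, no numeric id
    "http://example.com/stations/42",           # different section
]

# Keep only the URLs the pattern matches.
matched = [url for url in sample_urls if allow.search(url)]
```

Only the first URL ends in /trains/ followed by digits, so under that assumption it is the only one that would reach parse_trains.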
