python - Using Scrapy to recursively get links across multiple pages

Translated from: https://stackoverflow.com/questions/24493773 2014-06-30T15:32:54.343

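I am using the following code, which I found online, to recursively scrape links across multiple pages. It is supposed to follow every page and return all the links I need, but I end up with at most 100 links. Any advice would be helpful.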

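# Imports for the old (pre-1.0) Scrapy API that this code uses.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# Assumption: the item class is defined in the project's items module.
from craigslist_sample.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://seattle.craigslist.org/search/jjj?is_parttime=1"]

    # Follow only links inside the "next" button whose URL matches the allow regex.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",),
                               restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        items = []
        # Build one item per listing; avoid reusing the name "titles" for the loop variable.
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items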
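
One thing that may be worth checking is which "next page" URLs the `allow` pattern in the rule above actually accepts, since the link extractor only follows URLs that match it. The sketch below applies the same regex with Python's `re` module to a few candidate URLs. The URLs and the helper name `is_allowed` are made up for illustration; they are not taken from the site or from Scrapy.

```python
import re

# Same pattern as the `allow` argument in the rule, written as a raw string.
ALLOW_PATTERN = re.compile(r"index\d00\.html")


def is_allowed(url):
    """Return True if the regex is found anywhere in the URL (hypothetical helper)."""
    return ALLOW_PATTERN.search(url) is not None


# Hypothetical "next page" URLs, invented for illustration only.
candidates = [
    "http://seattle.craigslist.org/search/jjj/index100.html",
    "http://seattle.craigslist.org/search/jjj/index1000.html",
    "http://seattle.craigslist.org/search/jjj?s=100&is_parttime=1",
]

for url in candidates:
    print(url, is_allowed(url))
```

If the real pagination links look like the second or third form, the pattern would reject them and the crawl would stop early. Printing the `href` of the "next" button from the live page and running it through a check like this is a quick way to confirm or rule that out.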
