python - Scrapy link extractor: rejecting previously scraped links

Question

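I am building a crawler with Scrapy's CrawlSpider class. The link extractor seems to be looping over the same links again and again. Is there a way to restrict the link extractor so that it rejects links that have already been scraped? Can this be done without a regex in the deny argument?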

My Rules look like this:

{

rules = (
    # Rule(SgmlLinkExtractor(allow='profile'), follow=True),
    Rule(SgmlLinkExtractor(deny=r'feedback\.html'), callback='parse_item', follow=True),
)

}
And my parse_item is:

{

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    element = hxs.select('//table[@id="profilehead"]/tr/td/a/@href').extract()
    try:
        # Append the first matching link; the file is closed automatically.
        with open('urls.txt', 'a') as url_file:
            url_file.write(element[0] + '\n')
    except IndexError:
        # The page has no link to another website.
        pass

}

score 0 · Accepted Answer

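I think Scrapy does not follow links it has already visited. However, if you want to restrict which of the not-yet-followed links get extracted, you can try something like this: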

restrict_xpaths=('//a[starts-with(@title,"Next ")]')),
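To make it clearer which anchors an expression like `//a[starts-with(@title,"Next ")]` is meant to pick out, here is a small standard-library sketch. It does not use Scrapy at all; the class name `NextLinkParser` and the sample HTML are made up for illustration, and a real link extractor would also turn the matched hrefs into absolute URLs.

```python
from html.parser import HTMLParser


class NextLinkParser(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags whose
    title attribute starts with "Next " (mirrors the XPath condition)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attributes without a value are reported as None, so fall back to "".
        title = attributes.get("title") or ""
        href = attributes.get("href")
        if title.startswith("Next ") and href:
            self.hrefs.append(href)


# Made-up page fragment: only the first anchor has a title starting with "Next ".
sample_html = """
<a href="/page/2" title="Next page">next</a>
<a href="/feedback.html" title="Feedback">feedback</a>
<a href="/profile/7">profile</a>
"""

parser = NextLinkParser()
parser.feed(sample_html)
print(parser.hrefs)
```

Only the pagination link survives, so a rule restricted this way would follow "Next" links and ignore everything else on the page.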

http://doc.scrapy.org/en/latest/topics/link-extractors.html
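On the "without a regex" part of the question: Scrapy's default duplicate filter already drops requests it has seen before, so the sketch below only illustrates the underlying idea and is not Scrapy's implementation. The class name `SeenLinkFilter` and the example URLs are assumptions made for this example.

```python
from urllib.parse import urldefrag


class SeenLinkFilter:
    """Hypothetical helper: remembers every URL it has been shown."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        # Drop the #fragment so page.html and page.html#top count as one page.
        key, _fragment = urldefrag(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


link_filter = SeenLinkFilter()
links = [
    "http://example.com/profile/1",
    "http://example.com/profile/2",
    "http://example.com/profile/1#top",
    "http://example.com/profile/2",
]

# Keep only the first occurrence of each page.
fresh = [url for url in links if link_filter.is_new(url)]
print(fresh)
```

In a CrawlSpider, a filter along these lines could probably be hooked in through the `process_links` argument of `Rule`, which receives the list of extracted links before they are followed; whether that is needed on top of the built-in duplicate filtering depends on the site.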
