python - Scrapy spider bypasses my deny rules

Question

Hi, I am trying to use CrawlSpider and have written my own deny rules:

class MySpider(CrawlSpider): 
    name = "craigs" 
    allowed_domains = ["careers-cooperhealth.icims.com"] 
    start_urls = ["careers-cooperhealth.icims.com"] 
    d= [0-9] 
    path_deny_base = [ '.(login)', '.(intro)', '(candidate)', '(referral)', '(reminder)', '(/search)',] 
    rules = (Rule (SgmlLinkExtractor(deny = path_deny_base, 
                                     allow=('careers-cooperhealth.icims.com/jobs/…;*')), 
                                     callback="parse_items", 
                                     follow= True), )

But my spider still crawled pages like https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login, even though it should not crawl login pages. What is the problem here?

score 2 · Accepted Answer

Just change it like this (without the dots and parentheses):

deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']

rules = (Rule (SgmlLinkExtractor(deny = deny, 
                                 allow=allow, 
                                 restrict_xpaths=('*')), 
                                 callback="parse_items", 
                                 follow= True),)

This means that no extracted link may contain 'login', 'intro', and so on, and that only links containing 'jobs' are extracted.
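
To see what these patterns accept and reject without running a crawl, here is a minimal sketch using only the standard re module. It assumes each allow/deny entry is a regular expression searched anywhere in the absolute URL, with deny taking precedence over allow, which is how the Scrapy documentation describes link extractors. The helper should_extract and the sample URLs other than the two from this thread are hypothetical and are not part of Scrapy.

```python
import re

# Patterns from the answer; each entry is treated as a regular expression.
ALLOW = ['jobs']
DENY = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']


def should_extract(url, allow=ALLOW, deny=DENY):
    """Hypothetical helper that approximates allow/deny link filtering."""
    # With a non-empty allow list, the URL must match at least one pattern.
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    # deny takes precedence: a single deny match rejects the URL.
    return not any(re.search(pattern, url) for pattern in deny)


if __name__ == "__main__":
    base = "https://careers-cooperhealth.icims.com"
    samples = (
        base + "/jobs/22660/registered-nurse-prn",        # assumed job page
        base + "/jobs/22660/registered-nurse-prn/login",  # URL from the question
        base + "/jobs/intro?hashed=0",                    # URL from the answer
        base + "/about",                                  # assumed non-job page
    )
    for url in samples:
        print(url, "->", should_extract(url))
```

Because the patterns are searched rather than anchored, a short deny entry such as 'search' also rejects any URL that merely contains that substring, so it is worth checking a few real job URLs against the list.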

Here is the complete spider code, which crawls the link https://careers-cooperhealth.icims.com/jobs/intro?hashed=0 and prints "YAHOO!":

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "craigs" 
    allowed_domains = ["careers-cooperhealth.icims.com"] 
    start_urls = ["https://careers-cooperhealth.icims.com"]

    deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
    allow = ['jobs']

    rules = (Rule (SgmlLinkExtractor(deny = deny,
                                     allow=allow,
                                     restrict_xpaths=('*')),
                                     callback="parse_items",
                                     follow= True),)

    def parse_items(self, response):
        print "YAHOO!"

Hope that helps.
