python - Scrapy Yahoo Group spider

Question

I'm trying to scrape Y! Groups. I can get data from one page, but that's it. I have some basic rules, but they are clearly not right. Has anyone already solved this?

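# Legacy Scrapy (0.x) import paths, matching the APIs used below.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


# Assumed definition: the original post did not show YgroupItem.
class YgroupItem(Item):
    title = Field()
    pubDate = Field()
    desc = Field()


class YgroupSpider(CrawlSpider):
    name = "yahoo.com"
    allowed_domains = ["launch.groups.yahoo.com"]
    start_urls = [
        "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
    ]

    rules = (
        # Links containing "message" or "messages", excluding "mygroups"; no callback.
        Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups', ))),
        # Any remaining link is passed to parse_item.
        Rule(SgmlLinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = YgroupItem()
        # The XPaths are absolute (they start with //), so query the whole page once.
        item['title'] = hxs.select('//title').extract()
        item['pubDate'] = hxs.select('//abbr[@class="updated"]/text()').extract()
        item['desc'] = hxs.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
        return item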

score 0 · Accepted Answer

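It looks like you have almost no idea what you're doing. I'm new to Scrapy myself, but I think you want something like:

Rule(SgmlLinkExtractor(allow=(r'http\://example\.com/message/.*\.aspx', )), callback='parse_item'),

Try to make the regular expression match the complete URL of the links you want. Also, it looks like you only need one rule: add the callback to the first one. The link extractor picks up every link that matches a regex in allow, then drops those that match deny. Each remaining page is then loaded and passed to parse_item.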
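To make the allow/deny behaviour concrete, here is a small sketch in plain Python. It is a simplified model, not Scrapy's actual code (real link extractors also handle domain filtering, deduplication and more), and the message URLs and the single-rule regex are made up for illustration. It also models the documented CrawlSpider behaviour that only the first rule matching a link is used, which is why the two rules in the question never send message pages to parse_item.

```python
import re


def link_matches(url, allow=(), deny=()):
    """Keep a URL if it matches any allow pattern (or allow is empty)
    and matches no deny pattern. Simplified model of a link extractor."""
    allowed = not allow or any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


def first_matching_callback(url, rules):
    """Return the callback of the first rule that accepts the URL,
    or "<dropped>" if no rule accepts it."""
    for allow, deny, callback in rules:
        if link_matches(url, allow, deny):
            return callback
    return "<dropped>"


# Hypothetical links found on the start page.
links = [
    "http://launch.groups.yahoo.com/group/random_public_ygroup/message/42",
    "http://launch.groups.yahoo.com/group/random_public_ygroup/messages/1",
    "http://launch.groups.yahoo.com/mygroups",
]

# The two rules from the question: (allow, deny, callback).
original_rules = [
    (("message", "messages"), ("mygroups",), None),
    ((), (), "parse_item"),
]

# One rule with the callback attached; the regex is an assumed example.
single_rule = [
    ((r"/group/[^/]+/messages?(/|$)",), ("mygroups",), "parse_item"),
]

original_result = {u: first_matching_callback(u, original_rules) for u in links}
single_result = {u: first_matching_callback(u, single_rule) for u in links}

for url in links:
    print(url, original_result[url], single_result[url])
```

With the original rules, the message links are claimed by the first rule, which has no callback, so they are followed but never parsed. With the single rule, the same links go to parse_item and the mygroups link is dropped.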

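I'm saying all this without knowing anything about the pages you are mining or the nature of the data you need. This kind of spider is what you want when a page links to other pages that contain the data you need.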
