
I have a web crawler that crawls a website for news stories.

I know how to use XPathSelector to scrape certain information from elements on the page.

However, I cannot figure out how to store the URL of the page that was just crawled.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class spidey(CrawlSpider):
    name = 'spidey'
    start_urls = ['http://nytimes.com'] # URLs from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for http://nytimes.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://nytimes.com/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_articles'),
    ]

I want to store every link that matches those rules.

What would I need to add to parse_articles to store the link in my item?

def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = ???
    return item

1 Answer


response.url is what you are looking for.

See the documentation on Response objects and check this simple example.
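For illustration, here is a minimal sketch of the callback using response.url (it assumes SpideyItem is an Item with a link field, which the question does not show):

    def parse_articles(self, response):
        item = SpideyItem()
        # response.url is the URL of the page this callback was invoked for
        item['link'] = response.url
        return item

Note that only the rule with callback='parse_articles' stores links this way; the follow-only rule has no callback, so pages it matches are crawled for further links but never passed to parse_articles.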
