I have a web crawler that crawls for news stories on a website.
I know how to use an XPathSelector to scrape specific information from the elements on a page, but I cannot figure out how to store the URL of the page that was just crawled.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class spidey(CrawlSpider):
    name = 'spidey'
    start_urls = ['http://nytimes.com']  # URLs from which the spider will start crawling

    rules = [
        # r'page/\d+' : regular expression for http://nytimes.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://nytimes.com/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_articles'),
    ]
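SpideyItem itself isn't shown above; a minimal sketch of how it might be declared, assuming all it needs is a single link field:

from scrapy.item import Item, Field

class SpideyItem(Item):
    link = Field()  # will hold the URL of the crawled article page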
I want to store every link that passes those rules.
What would I need to add to parse_articles to store the link in my item?
def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = ???
    return item
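From what I can tell in the Scrapy docs, the response object passed to a callback exposes the crawled page's address as response.url, so a sketch like the following might be all that's needed (assuming the SpideyItem sketch above):

def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = response.url  # URL of the page this response came from
    return item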