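I am new to Scrapy and am trying to build a spider that crawls a website and collects every phone number, email address, PDF link, and so on (I want it to follow every link from the main page, so that it searches the whole domain).
A similar problem was raised in this question, but it was never resolved: Why does the Scrapy crawler stop?
Here is the code for my spider: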
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re


class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = (
        'http://www.hyatt.com/',
    )

    #follow only non-javascript links
    rules = (
        Rule(SgmlLinkExtractor(deny = ('.*\.jsp.*')), follow = True, callback = 'parse_item'),
    )

    def parse_item(self, response):
        #self.log('The current url is %s' % response.url)
        selector = Selector(response)
        item = MobilesuitesItem()
        #get url
        item['url'] = response.url
        #get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()
        #get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall('\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall("[^\s@]+@[^\s@]+\.[^\s@]+", text))
        item['pdfs'] = '|'.join(re.findall("[^\s\"<]*\.pdf[^\s\">]*", text))
        #check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall("\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s", text))
        return item
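To check the extraction step separately from the crawl, the regexes in parse_item can be run on a plain string outside Scrapy. This is only a minimal sketch, assuming Python 3 and a page body that has already been decoded to str. The helper name extract_fields and the sample text are made up for illustration; the patterns are the ones from the spider, written as raw strings.

```python
import re

# Patterns copied from parse_item, compiled once and written as raw strings.
PHONE_RE = re.compile(
    r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}'
    r'|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}'
    r'|\d{3}[-\.\s]\d{4}'
)
EMAIL_RE = re.compile(r'[^\s@]+@[^\s@]+\.[^\s@]+')
PDF_RE = re.compile(r'[^\s"<]*\.pdf[^\s">]*')
DINING_RE = re.compile(r'\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s')


def extract_fields(text):
    """Hypothetical helper: apply the spider's regexes to one page body (str)."""
    return {
        'phone': '|'.join(PHONE_RE.findall(text)),
        'email': '|'.join(EMAIL_RE.findall(text)),
        'pdfs': '|'.join(PDF_RE.findall(text)),
        'dining': bool(DINING_RE.search(text)),
    }


if __name__ == '__main__':
    # Made-up sample text, not taken from the crawled site.
    sample = ('Call 312-555-0100 or write to info@example.com for the '
              'dining menu <a href="/files/menu.pdf">PDF</a>')
    print(extract_fields(sample))
```

Feeding a saved copy of a page body through this helper shows how the patterns behave on that page without starting a crawl.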
Here is the last part of the crawl log before it hangs:
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
{'email': '',
'phone': '',
'title': [u'24/7 Gallery Menu'],
'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)