I am working on a spider that first needs to log in and then parse a list of orders. The site I am scraping sometimes presents a captcha after a login attempt: either the captcha alone, or a second login form that also asks for the captcha. The spider below works as expected in that it attempts to log in and, in the check_login_response method, checks whether the login succeeded; if not, it calls self.login() again. The spider is normally given a list of order URLs, which are loaded into start_urls at runtime in the __init__ method.
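For reference, the order URLs reach the spider through the qwargs argument (the actual runner code is omitted; this is a minimal sketch under the assumption that qwargs is a plain dict with an 'order_details_urls' key, which is what __init__ suggests):

```python
# Hypothetical example of the argument __init__ consumes: a dict whose
# 'order_details_urls' key holds the order URLs to visit after login.
qwargs = {'order_details_urls': [
    'https://www.example.com/gp/css/summary/edit.html?orderID=111-0000000-0000000',
]}

# This mirrors what __init__ does with it:
start_urls = qwargs.get('order_details_urls')
```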
What happens now is that the spider runs and then stops in the parse_page method. I can see the URLs printed by the line log.msg('request %s' % url), but the spider never goes on to run the parse method on the start_urls list.
The problem only occurs when a captcha retry happens. In the normal login scenario everything works fine and the parse method is called.
Any advice would be appreciated.
PS: I have tried both the Spider and CrawlSpider classes, with the same result.
from datetime import datetime

from django.utils.timezone import utc
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request, FormRequest

# project-local imports (OrderItem, OrderDetailsNoCaptchaException,
# select_captcha_from_image) omitted


class SomeCrawlSpider(CrawlSpider):
    """ SomeSpider
    """
    name = 'some_order_details'
    allowed_domains = ['example.com']
    start_urls = []
    login_url = 'https://www.example.com/login/'
    login_attempts = 0

    def __init__(self, qwargs, *args, **kwargs):
        # load the order URLs passed in at runtime
        self.start_urls = qwargs.get('order_details_urls')
        super(SomeCrawlSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        """ start_requests
        @return:
        """
        log.msg('starting requests')
        return self.init_request()

    def init_request(self):
        """ init_request
        @return: list
        """
        log.msg('init requests')
        return [Request(url=self.login_url, callback=self.login)]

    def check_login_response(self, response):
        """ check if the response shows we are logged in
        @param response:
        @return:
        """
        log.msg('check requests')
        # first we check if we got the captcha page again
        if "Type the characters you see in this image" in response.body_as_unicode() \
                or "What is your e-mail address?" in response.body_as_unicode():
            return self.login(response)
        return self.parse_page(response)

    def parse_page(self, response):
        """ parse_page
        @param response:
        @return:
        """
        self.login_attempts += 1
        if "Your Orders" in response.body_as_unicode():
            log.msg('user is logged in')
            self.credentials.last_used_at = datetime.utcnow().replace(tzinfo=utc)
            self.credentials.save()
            for url in self.start_urls:
                log.msg('request %s' % url)
                yield self.make_requests_from_url(url)
        yield OrderItem(auth_failed=True)

    def login(self, response):
        """ login
        @param response:
        @return:
        """
        # check the existence of credentials:
        if not any([self.credentials, self.credentials.username, self.credentials.password]):
            log.msg('Credentials are not set correctly')
            return OrderItem(auth_failed=True)
        log.msg('Trying to login')
        # check if the response is a captcha page
        if "Type the characters you see in this image" in response.body_as_unicode():
            # captcha page before login
            log.msg('Captcha detected and guessing pass through')
            self.crawler.engine.pause()
            captcha = select_captcha_from_image(response)
            self.crawler.engine.unpause()
            log.msg('captcha detected: %s' % str(captcha))
            if not captcha:
                # captcha decoding returned None
                log.msg('captcha was not decoded')
                raise OrderDetailsNoCaptchaException
            if "What is your e-mail address?" in response.body_as_unicode():
                log.msg('logging in via form: captcha + credentials')
                return FormRequest.from_response(response,
                                                 formdata={'guess': str(captcha),
                                                           'email': 'XXXX',
                                                           'password': 'XXXX'},
                                                 callback=self.check_login_response)
            else:
                log.msg('posting captcha')
                return FormRequest.from_response(response,
                                                 formdata={'field-keywords': str(captcha)},
                                                 callback=self.check_login_response)
        if 'What is your e-mail address?' in response.body_as_unicode():
            log.msg('logging in via form')
            return FormRequest.from_response(response,
                                             formdata={'email': 'XXX',
                                                       'password': 'XXXX'},
                                             callback=self.check_login_response)
        return OrderItem(auth_failed=True)

    def parse(self, response):
        """ parse the order details page into an item
        @param response: Response object
        @return: dictionary
        """
        log.msg('Parsing items invoked')
        # here I parse the response into an item and yield it back
        yield OrderItem(**item)
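For clarity, here is the shape of the callback chain involved, modeled in plain Python with no Scrapy: parse_page is a generator function, and check_login_response returns its result rather than yielding from it, so the caller receives a generator object whose body only runs once something iterates it (names below are simplified stand-ins, not the real spider):

```python
# Simplified stand-in for the two callbacks (no Scrapy involved).
def parse_page(response):
    # generator function, like the real parse_page with its yields
    yield 'request-1'
    yield 'request-2'

def check_login_response(response):
    # returns whatever parse_page returns: a generator object,
    # not the requests themselves
    return parse_page(response)

result = check_login_response(None)
is_generator = hasattr(result, '__next__') or hasattr(result, 'next')
items = list(result)  # the yields only run when the generator is iterated
```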
EDIT (adding console output)
This is the case where a captcha is detected:
[2014-12-15 18:10:01,972: INFO/MainProcess] Received task: product.parse_messages_to_purchases[72927959-35e1-4053-917b-f25de318faf7]
[2014-12-15 18:10:03,327: WARNING/Worker-5] 2014-12-15 18:10:03-0600 [scrapy] INFO: Enabled extensions: LogStats, CloseSpider, SpiderState
[2014-12-15 18:10:04,331: WARNING/Worker-5] 2014-12-15 18:10:04-0600 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
[2014-12-15 18:10:04,331: WARNING/Worker-5] 2014-12-15 18:10:04-0600 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[2014-12-15 18:10:04,332: WARNING/Worker-5] 2014-12-15 18:10:04-0600 [scrapy] INFO: Enabled item pipelines: PurchaseWriterPipeline
[2014-12-15 18:10:04,341: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:04-0600 [scrapy] INFO: starting requests
[2014-12-15 18:10:04,342: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:04-0600 [scrapy] INFO: init requests
[2014-12-15 18:10:04,343: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:04-0600 [some_order_details] INFO: Spider opened
[2014-12-15 18:10:04,346: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:04-0600 [some_order_details] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x113120908> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS. Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
[2014-12-15 18:10:04,351: WARNING/ProductCrawlerScript-5:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x113120908> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS. Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
[2014-12-15 18:10:06,063: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:06-0600 [scrapy] INFO: Trying to login
[2014-12-15 18:10:06,063: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:06-0600 [scrapy] INFO: logging in via form
[2014-12-15 18:10:06,500: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:06-0600 [scrapy] INFO: check requests
[2014-12-15 18:10:06,500: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:06-0600 [scrapy] INFO: Trying to login
[2014-12-15 18:10:06,500: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:06-0600 [scrapy] INFO: Captcha detected and guessing pass through
[2014-12-15 18:10:22,701: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:22-0600 [scrapy] INFO: captcha detected: cpm6jf
[2014-12-15 18:10:22,701: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:22-0600 [scrapy] INFO: logging in via form: captcha + credentials
[2014-12-15 18:10:25,598: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:25-0600 [scrapy] INFO: check requests
[2014-12-15 18:10:25,601: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:25-0600 [scrapy] INFO: user is logged in
[2014-12-15 18:10:25,621: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:25-0600 [scrapy] INFO: request https://www.some.com/gp/css/summary/edit.html?ie=UTF8&orderID=111-1260932-6725022&ref_=oh_aui_or_o02_&
[2014-12-15 18:10:25,624: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:25-0600 [some_order_details] INFO: Closing spider (finished)
[2014-12-15 18:10:25,625: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:25-0600 [some_order_details] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7000,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 65353,
'downloader/response_count': 6,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 2,
'request_depth_max': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6}
[2014-12-15 18:10:25,625: WARNING/ProductCrawlerScript-5:1] 2014-12-15 18:10:25-0600 [some_order_details] INFO: Spider closed (finished)
[2014-12-15 18:10:25,639: INFO/MainProcess] Task product.parse_messages_to_purchases[72927959-35e1-4053-917b-f25de318faf7] succeeded in 23.665501486s: u'Message with id 16650 completed parsing'
This is the case without a captcha:
[2014-12-15 18:10:43,229: INFO/MainProcess] Received task: product.parse_messages_to_purchases[c06bafdc-0f39-4e43-8f74-ad899f30e799]
[2014-12-15 18:10:44,557: WARNING/Worker-1] 2014-12-15 18:10:44-0600 [scrapy] INFO: Enabled extensions: LogStats, CloseSpider, SpiderState
[2014-12-15 18:10:45,560: ERROR/Worker-1] Unable to read instance data, giving up
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[2014-12-15 18:10:45,561: WARNING/Worker-1] 2014-12-15 18:10:45-0600 [scrapy] INFO: Enabled item pipelines: PurchaseWriterPipeline
[2014-12-15 18:10:45,570: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [scrapy] INFO: starting requests
[2014-12-15 18:10:45,570: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [scrapy] INFO: init requests
[2014-12-15 18:10:45,571: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [some_order_details] INFO: Spider opened
[2014-12-15 18:10:45,574: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:45-0600 [some_order_details] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x11311c3f8> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS. Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
[2014-12-15 18:10:45,580: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py:159: DeprecationWarning: <scrapy.core.downloader.contextfactory.ScrapyClientContextFactory instance at 0x11311c3f8> was passed as the HTTPS policy for an Agent, but it does not provide IPolicyForHTTPS. Since Twisted 14.0, you must pass a provider of IPolicyForHTTPS.
connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
[2014-12-15 18:10:50,563: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:50-0600 [scrapy] INFO: Trying to login
[2014-12-15 18:10:50,563: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:50-0600 [scrapy] INFO: logging in via form
[2014-12-15 18:10:52,899: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: check requests
[2014-12-15 18:10:52,904: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: user is logged in
[2014-12-15 18:10:52,924: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:52-0600 [scrapy] INFO: request https://www.some.com/gp/css/summary/edit.html?ie=UTF8&orderID=111-1260932-6725022&ref_=oh_aui_or_o02_&
[2014-12-15 18:10:54,164: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [scrapy] INFO: Parsing items invoked
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-08 00:00:00) while time zone support is active.
RuntimeWarning)
[2014-12-15 18:10:54,191: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-08 00:00:00) while time zone support is active.
RuntimeWarning)
[2014-12-15 18:10:54,262: INFO/MainProcess] Received task: semantic.get_product_and_create[939f73f6-f5bd-4d31-8a1b-6544de37b7b2] eta:[2014-12-16 00:12:34.246354+00:00]
/Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-09 00:00:00) while time zone support is active.
RuntimeWarning)
[2014-12-15 18:10:54,268: WARNING/ProductCrawlerScript-1:1] /Users/mo/Projects/pythonic/env/lib/python2.7/site-packages/django/db/models/fields/__init__.py:903: RuntimeWarning: DateTimeField Purchase.shipping_date received a naive datetime (2014-12-09 00:00:00) while time zone support is active.
RuntimeWarning)
[2014-12-15 18:10:54,286: INFO/MainProcess] Received task: semantic.get_product_and_create[acae6d98-2769-4d27-bcaa-a30ded095e4a] eta:[2014-12-16 00:12:34.285050+00:00]
[2014-12-15 18:10:54,288: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Closing spider (finished)
[2014-12-15 18:10:54,289: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7527,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 5,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 79284,
'downloader/response_count': 6,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 2,
'request_depth_max': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6}
[2014-12-15 18:10:54,290: WARNING/ProductCrawlerScript-1:1] 2014-12-15 18:10:54-0600 [some_order_details] INFO: Spider closed (finished)