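This must be some lxml parsing issue, because the text node you want is found with
//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()
but not with
//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()
In the HTML source, the p you want sits inside an h1 element, but in the tree the selector sees, it does not.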
Here is the HTML source for that area of the page:
<div class="product-shop detail-right">
<div class="prcdt-overview">
<div class="title">
<h1>
<div class="htag">Vincent Chase</div>
<p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
</h1>
<span style="text-align:center;color:#329C92;font-size:12px;padding-top:5px">Product Id: 73871</span>
</div>
<div id="container2" style="display: none;">
<div class="product-options" id="product-options-wrapper">
Have a look at this scrapy shell session:
paul@wheezy:~$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 13:16:33+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 13:16:34+0200 [default] INFO: Spider opened
2013-10-15 13:16:35+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s] item {}
[s] request <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] response <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x354c310>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()').extract()
Out[1]:
[u'\n ',
u'Vincent Chase',
u'\n ']
In [2]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()').extract()
Out[2]:
[u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses',
u'Enter the details below as they appear on your prescription from your doctor. ',
u'Understand Your Prescription.',
u'Retail Store Price - Rs 1600',
u'You Save - Rs 800',
u'Retail Store Price - Rs 4500',
u'You Save - Rs 1010',
u'STATUS: ',
u'READY TO SHIP\t',
u'(LIMITED STOCK)',
u' ',
u'Delivered By 20 Oct,2013']
In [4]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//div[@class="htag"]//text()').extract()
Out[4]: [u'Vincent Chase']
In [5]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//p//text()').extract()
Out[5]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']
In [6]:
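To double-check how the tags nest in the raw source, independently of whatever tree lxml builds, one option is to walk the markup with the standard library's html.parser, which reports tags as written and does no tree repair. This is only a sketch: SNIPPET is a trimmed copy of the markup quoted earlier, with closing tags added so it balances, and NestingTracker is a hypothetical helper name, not part of Scrapy or lxml.

```python
from html.parser import HTMLParser

# Trimmed copy of the page markup quoted in the answer (assumption: closing
# tags were added here so the fragment is balanced).
SNIPPET = """
<div class="prcdt-overview">
<div class="title">
<h1>
<div class="htag">Vincent Chase</div>
<p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
</h1>
</div>
</div>
"""


class NestingTracker(HTMLParser):
    """Record the stack of open tags each time non-blank text is seen."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.seen = []  # (ancestor path, stripped text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # The fragment is balanced, so popping the matching tag is enough.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.seen.append(("/".join(self.stack), data.strip()))


tracker = NestingTracker()
tracker.feed(SNIPPET)
tracker.close()
for path, text in tracker.seen:
    print(path, "->", text)
```

As written in the source, the product name is reported under div/div/h1/p, so the h1//text() result in the session above most likely reflects how the parsed tree was built rather than how the page was authored.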
Suggestion:
This website/page uses the "itemscope" and "itemtype" attributes (see http://schema.org/docs/gs.html#microdata_itemscope_itemtype), so I suggest using them to extract the data you need.
For example, you can use the following XPath expression:
//*[@itemscope and @itemtype="http://schema.org/Product"]
//*[@itemprop="name"]/text()
With HtmlXPathSelector, that gives:
In [1]: ''.join(hxs.select('//*[@itemscope and @itemtype="http://schema.org/Product"]//*[@itemprop="name"]/text()').extract()).strip()
Out[1]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'
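The ''.join(...).strip() step is there because extract() returns a list of text nodes, often with leading or trailing whitespace. A small helper can make that reusable. This is a sketch only: clean_text is a hypothetical name, and unlike the one-liner above it also collapses whitespace inside the string, which may or may not be what you want.

```python
def clean_text(parts):
    """Join extracted text nodes and collapse runs of whitespace.

    `parts` is a list of strings such as the result of .extract().
    """
    return " ".join("".join(parts).split())


# Sample inputs copied from the shell sessions in this answer.
name_parts = [" Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses"]
h1_parts = ["\n ", "Vincent Chase", "\n "]

print(clean_text(name_parts))
print(clean_text(h1_parts))
```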
Example scrapy shell session:
paul@wheezy:~$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 12:47:30+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 12:47:31+0200 [default] INFO: Spider opened
2013-10-15 12:47:32+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s] item {}
[s] request <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] response <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x3f54310>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: hxs.select("""
...: //*[@itemscope and @itemtype="http://schema.org/Product"]
...: //*[@itemprop="name"]/text()""")
Out[1]: [<HtmlXPathSelector xpath='\n//*[@itemscope and @itemtype="http://schema.org/Product"]\n //*[@itemprop="name"]/text()' data=u' Colorato VC 5134 Matt Black Grey Gradie'>]
In [2]: hxs.select("""
//*[@itemscope and @itemtype="http://schema.org/Product"]
//*[@itemprop="name"]/text()""").extract()
Out[2]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']
In [3]: ''.join(hxs.select("""
//*[@itemscope and @itemtype="http://schema.org/Product"]
//*[@itemprop="name"]/text()""").extract()).strip()
Out[3]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'
In [4]: