python - Unable to scrape text inside a p tag/element using Python Scrapy

Question

http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
I want to extract the product name from this page using the XPath
//*[@id="product_addtocart_form"]/div[7]/div/div[1]/h1/p

I tried the following, but it returns nothing:

item['pname'] = ' '.join(hxs.select('//*[@id="product_addtocart_form"]/div[7]/div/div[1]/h1/p/text()').extract()).strip()

score 0 · Accepted Answer

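This must be some lxml parsing issue, because the text node you want is matched by

//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()

but not by

//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()

even though, in the page source, the p you want sits inside the h1 element. A likely cause is that p is not valid content for h1 in HTML, so the parser closes the h1 before the p when it repairs the tree.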

Here is the HTML source for this area of the page:

<div class="product-shop detail-right">
    <div class="prcdt-overview">
        <div class="title">
                                    <h1>
                <div class="htag">Vincent Chase</div>
                <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
            </h1>
            <span style="text-align:center;color:#329C92;font-size:12px;padding-top:5px">Product Id: 73871</span>
        </div>               

        <div id="container2" style="display: none;">
            <div class="product-options" id="product-options-wrapper">
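To confirm how the markup is nested as written, before any tree-building parser repairs it, one option is to walk the snippet with the standard library's event-based html.parser.HTMLParser, which only reports tags in source order. This is a minimal sketch: the class name NestingRecorder and the shortened SNIPPET are made up for illustration and are not part of the page or of Scrapy.

```python
from html.parser import HTMLParser

# Shortened copy of the title block from the page source shown above.
SNIPPET = """
<div class="title">
  <h1>
    <div class="htag">Vincent Chase</div>
    <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
  </h1>
</div>
"""


class NestingRecorder(HTMLParser):
    """Record the chain of open tags at the moment each text node is seen."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = {}   # stripped text -> "a/b/c" path of open tags

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # The snippet is properly nested, so popping the last tag is enough.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.paths[text] = "/".join(self.stack)


recorder = NestingRecorder()
recorder.feed(SNIPPET)
recorder.close()

for text, path in recorder.paths.items():
    print(path, "->", text)
```

In the source as written, the product name is under div/h1/p, so an XPath that fails to find it under h1 points at the parsed tree rather than at the markup.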

Look at this scrapy shell session:

paul@wheezy:~$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 13:16:33+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 13:16:34+0200 [default] INFO: Spider opened
2013-10-15 13:16:35+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s]   item       {}
[s]   request    <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   response   <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x354c310>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()').extract()
Out[1]: 
[u'\n                            ',
 u'Vincent Chase',
 u'\n                            ']

In [2]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()').extract()
Out[2]: 
[u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses',
 u'Enter the details below as they appear on your prescription from your doctor. ',
 u'Understand Your Prescription.',
 u'Retail Store Price - Rs 1600',
 u'You Save - Rs 800',
 u'Retail Store Price - Rs 4500',
 u'You Save - Rs 1010',
 u'STATUS: ',
 u'READY TO SHIP\t',
 u'(LIMITED STOCK)',
 u'    ',
 u'Delivered By 20 Oct,2013']

In [4]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//div[@class="htag"]//text()').extract()
Out[4]: [u'Vincent Chase']

In [5]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//p//text()').extract()
Out[5]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']

In [6]:
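The lists returned by //text() in this session contain whitespace-only entries and stray tabs, so they usually need tidying before being stored in an item. A minimal sketch of one way to do that with plain Python; the helper name clean_text_nodes is hypothetical, and the sample lists are copied from the output above.

```python
def clean_text_nodes(nodes):
    """Join text nodes into one string and collapse runs of whitespace."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs,
    # newlines) and drops empty pieces, so whitespace-only nodes disappear.
    return " ".join(" ".join(nodes).split())


# Text nodes as returned for the h1 element in the session above.
h1_nodes = ["\n                            ",
            "Vincent Chase",
            "\n                            "]

# A few of the p text nodes, including one with a trailing tab.
status_nodes = ["STATUS: ", "READY TO SHIP\t", "(LIMITED STOCK)"]

print(clean_text_nodes(h1_nodes))
print(clean_text_nodes(status_nodes))
```

Compared with ' '.join(...).strip(), this also removes the padding between nodes, not only at the two ends.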

Suggestion:

This website/web page uses the "itemscope" and "itemtype" attributes (see http://schema.org/docs/gs.html#microdata_itemscope_itemtype), so I suggest using them to extract the data you need.

For example, you can use the following XPath expression:

//*[@itemscope and @itemtype="http://schema.org/Product"]
    //*[@itemprop="name"]/text()

With HtmlXPathSelector, this gives:

In [1]: ''.join(hxs.select('//*[@itemscope and @itemtype="http://schema.org/Product"]//*[@itemprop="name"]/text()').extract()).strip()
Out[1]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'
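The same microdata-based lookup can be tried offline, without Scrapy, as a rough sketch using the standard library's xml.etree.ElementTree, which supports a subset of XPath. The assumptions are mine: SAMPLE is a simplified, well-formed stand-in for the page (the real page is HTML, where itemscope has no value), and since ElementTree has no "and" operator the two conditions are written as chained predicates.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the product page (an assumption,
# not the real page source). itemscope gets an empty value to be valid XML.
SAMPLE = """
<body>
  <div itemscope="" itemtype="http://schema.org/Product">
    <div class="title">
      <h1>
        <div class="htag">Vincent Chase</div>
        <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
      </h1>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Chained predicates stand in for:
#   [@itemscope and @itemtype="http://schema.org/Product"]
NAME_PATH = (
    ".//*[@itemscope][@itemtype='http://schema.org/Product']"
    "//*[@itemprop='name']"
)

name_element = root.find(NAME_PATH)
# Guard against a missing element or empty text before stripping.
name = (name_element.text or "").strip() if name_element is not None else ""
print(name)
```

Selecting by itemprop does not depend on how deeply the element is nested, so it keeps working if the parser moves the p out of the h1. In current Scrapy releases the same XPath would typically be passed to response.xpath(...) rather than hxs.select(...).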

Example scrapy shell session:

paul@wheezy:~$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 12:47:30+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 12:47:31+0200 [default] INFO: Spider opened
2013-10-15 12:47:32+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s]   item       {}
[s]   request    <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   response   <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x3f54310>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: hxs.select("""
   ...: //*[@itemscope and @itemtype="http://schema.org/Product"]
   ...:     //*[@itemprop="name"]/text()""")
Out[1]: [<HtmlXPathSelector xpath='\n//*[@itemscope and @itemtype="http://schema.org/Product"]\n    //*[@itemprop="name"]/text()' data=u' Colorato VC 5134 Matt Black Grey Gradie'>]

In [2]: hxs.select("""
//*[@itemscope and @itemtype="http://schema.org/Product"]
    //*[@itemprop="name"]/text()""").extract()
Out[2]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']

In [3]: ''.join(hxs.select("""
//*[@itemscope and @itemtype="http://schema.org/Product"]
    //*[@itemprop="name"]/text()""").extract()).strip()
Out[3]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'

In [4]:
