
I have this link, https://www.google.com/about/careers/search#!t=jo&jid=34154&, and I need to extract the content under "Job details".

Job details

Team or role: Software Engineering // How to write the xpath?
Job type: Full-time // How to write the xpath?
Last updated: Oct 17, 2014 // How to write the xpath?
Job location(s): Seattle, WA, USA; Kirkland, WA, USA // How to write a regex to extract city, state, and country separately for each job? I also need to filter USA, Canada, and UK jobs separately.
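For the location question, one approach (a plain-Python sketch, independent of Scrapy, with the sample string above as hypothetical input) is to split on ";", capture the three comma-separated parts with a regex, and then filter on the country:

```python
import re

# Hypothetical input in the "City, ST, Country; City, ST, Country" format above.
locations = "Seattle, WA, USA; Kirkland, WA, USA"

parsed = []
for loc in locations.split(";"):
    # Three comma-separated groups: city, state, country (surrounding spaces trimmed).
    m = re.match(r"\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*([^,]+?)\s*$", loc)
    if m:
        parsed.append(m.groups())

# Keep only jobs in the USA, Canada, or the UK.
wanted = {"USA", "Canada", "UK"}
filtered = [p for p in parsed if p[2] in wanted]

print(parsed)  # [('Seattle', 'WA', 'USA'), ('Kirkland', 'WA', 'USA')]
```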

Here is the HTML for the content above:

<div class="detail-content">
<div>
<div class="greytext info" style="display: inline-block;">Team or role:</div>
<div class="info-text" style="display: inline-block;">Software Engineering</div> // How to write the xpath for this one?
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job type:</div>
<div class="info-text" style="display: inline-block;" itemprop="employmentType">Full-time</div> // How to write the xpath for the job type?
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Job level:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Salary:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div>
<div class="greytext info" style="display: inline-block;">Last updated:</div>
<div class="info-text" style="display: inline-block;" itemprop="datePosted"> Oct 17, 2014</div> // How to write the xpath for the posted date?
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job location(s):</div>
<div class="info-text" style="display: inline-block;">Seattle, WA, USA; Kirkland, WA, USA</div> // How to write a regex to extract city, state, and country separately?
</div>
</div>
</div>
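For the xpath questions: anchoring on the `itemprop` attributes is more robust than positional paths like `.//*[@id='75015001']/div[2]/...`, which break whenever the layout shifts. In Scrapy that would be e.g. `response.xpath("//div[@itemprop='employmentType']/text()").extract()`. The sketch below checks the same idea against a trimmed copy of the fragment using the stdlib ElementTree (style attributes omitted for brevity):

```python
import xml.etree.ElementTree as ET

# Trimmed fragment mirroring the page structure above.
fragment = """
<div class="detail-content">
  <div>
    <div class="greytext info">Job type:</div>
    <div class="info-text" itemprop="employmentType">Full-time</div>
  </div>
  <div>
    <div class="greytext info">Last updated:</div>
    <div class="info-text" itemprop="datePosted"> Oct 17, 2014</div>
  </div>
</div>
"""

root = ET.fromstring(fragment)

# Look up each field by its itemprop attribute instead of its position.
job_type = root.find(".//div[@itemprop='employmentType']").text
posted = root.find(".//div[@itemprop='datePosted']").text.strip()

print(job_type)  # Full-time
print(posted)    # Oct 17, 2014
```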

Here is the spider code:

def parse_listing_page(self, response):
        selector = Selector(response)
        item = googleSpiderItem()
        item['CompanyName'] = "Google"
        item['JobDetailUrl'] = response.url
        item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.)\,.')
        item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('\,(.)')
        item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
        Description = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['Description'] = [d.encode('UTF-8') for d in Description]
        print "Done!"
        yield item

Here is the output:

	Traceback (most recent call last):
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
	    taskObj._oneWorkUnit()
	  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
	    result = next(self._iterator)
	  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
	    work = (callable(elem, *args, **named) for elem in iterable)
	--- <exception caught here> ---
	  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
	    yield next(it)
	  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
	    for x in result:
	  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
	    return (_set_referer(r) for r in result or ())
	  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
	    return (r for r in result or () if _filter(r))
	  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
	    return (r for r in result or () if _filter(r))
	  File "/home/sureshp/Downloads/wwwgooglecom/wwwgooglecom/spiders/googlepage.py", line 49, in parse_listing_page
	    item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.')
	exceptions.AttributeError: 'list' object has no attribute 're'
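The traceback points at the root cause: `.extract()` returns a plain Python list, and `.re()` is a method on Scrapy's `SelectorList`, so it must be called instead of `.extract()`, i.e. `selector.xpath(...).re(...)`. A plain-Python sketch of roughly what `SelectorList.re()` does (the helper name is hypothetical):

```python
import re

# Strings as .extract() would return them from the location selector.
extracted = [u'Seattle, WA, USA', u'Kirkland, WA, USA']

# extracted.re(...) raises AttributeError: a list has no .re() method.
# Scrapy's SelectorList.re() is roughly equivalent to this helper:
def selectorlist_re(strings, pattern):
    # Flatten all regex matches found in each string into one list.
    return [g for s in strings for g in re.findall(pattern, s)]

cities = selectorlist_re(extracted, r'^([^,]+),')
print(cities)  # ['Seattle', 'Kirkland']
```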


1 Answer


I noticed a typo in your parsing code.

After fixing it, the output looks like this:

{'City': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'CompanyName': 'Google',
 'Description': [u"Google's software engineers develop the next-generation technologies that change how millions of users connect, explore, and interact with information and one another. Our ambitions reach far beyond just Search. Our products need to handle information at the the scale of the web. We're looking for ideas from every area of computer science, including information retrieval, artificial intelligence, natural language processing, distributed computing, large-scale system design, networking, security, data compression, and user interface design; the list goes on and is growing every day. As a software engineer, you work on a small team and can switch teams and projects as our fast-paced business grows and evolves. We need our engineers to be versatile and passionate to tackle new problems as we continue to push technology forward.?\nWith your technical expertise you manage individual projects priorities, deadlines and deliverables. You design, develop, test, deploy, maintain, and enhance software solutions.\n\nSeattle/Kirkland engineering teams are involved in the development of several of Google?s most popular products: Cloud Platform, Hangouts/Google+, Maps/Geo, Advertising, Chrome OS/Browser, Android, Machine Intelligence. Our engineers need to be versatile and willing to tackle new problems as we continue to push technology forward."],
 'JobDetailUrl': 'https://www.google.com/about/careers/search?_escaped_fragment_=t%3Djo%26jid%3D34154%26',
 'Jobtype': [],
 'State': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
 'Title': [u'Software Engineer']}

Here is the modified code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from Google.items import GoogleItem
import re

class DmozSpider(Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = [
        "https://www.google.com/about/careers/search#!t=jo&jid=34154&",
    ]

    def parse(self, response):
        selector = Selector(response)
        item = GoogleItem()
        item['Description'] = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['CompanyName'] = "Google"
        item['JobDetailUrl'] = response.url
        item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()

        yield item

To get City, State, and Nation separately, you can loop over the selector results:

for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    city,state,nation= p.split(',')
    item['City'] =  city
    item['State'] =  state
    item['Nation'] =  nation
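Note that this loop overwrites `item['City']` on every pass, so only the last location survives. If all locations should be kept, one option (a sketch using a hypothetical `locations` list in place of the selector call) is to accumulate each part into its own list:

```python
# Stand-in for the .extract() result of the location selector.
locations = [u'Seattle, WA, USA', u'Kirkland, WA, USA']

cities, states, nations = [], [], []
for p in locations:
    # Split "City, ST, Country" and trim the surrounding spaces.
    city, state, nation = [part.strip() for part in p.split(',')]
    cities.append(city)
    states.append(state)
    nations.append(nation)

print(cities, states, nations)
```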
answered 2014-11-13T10:10:28.103