I'm trying to use lxml to parse the search results returned from this search URL:

http://www.rte.ie/player/ie/search/?q=news

The article tags returned in the HTML look like this:

  <article class="search-result clearfix"><a
    href="/player/ie/show/10117771/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/0005d4bf-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117771/">elev8</a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117771/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">Ivan and Sean talk to future basketball sensation Julian Newman and the <span class="search-highlight">News</span> Dudes are in the loft with some crazy <span class="search-highlight">news</span> stories.</p>
     <span
    class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10118015/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/000716b2-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10118015/">Wed 06 Mar 2013</a></p>
    <!-- p class="search-programme-date">06/03/2013</p -->
    <p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
    <span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10117836/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/00071614-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117836/"><span class="search-highlight">News</span> on Two and World Forecast</a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117836/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">All the <span class="search-highlight">news</span> and sport from home and abroad.</p>
     <span
    class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10117816/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/000715f2-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117816/">Nine <span class="search-highlight">News</span></a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117816/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">The Nine <span class="search-highlight">News</span> followed by Weather.</p>
    <span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10117789/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/000715ae-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117789/">Six One <span class="search-highlight">News</span></a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117789/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">The Six One <span class="search-highlight">News</span> and Sport followed by Weather.</p>
    <span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10117784/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/000715a0-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117784/">Nuacht and <span class="search-highlight">News</span> with Signing</a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117784/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">Nuacht and <span class="search-highlight">News</span> with Signing.</p>
    <span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10117770/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/0007158d-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117770/"><span class="search-highlight">News</span>2Day</a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117770/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">Domestic and international <span class="search-highlight">news</span> items of interest to younger viewers.</p>
     <span
    class="sprite logo-rte-two search-channel-icon">RTÉ 2</span>
  </article>



  <article class="search-result clearfix"><a
    href="/player/ie/show/10117728/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/0007154e-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10117728/">One <span class="search-highlight">News</span></a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10117728/">Tue 05 Mar 2013</a></p>
    <!-- p class="search-programme-date">05/03/2013</p -->
    <p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
    <span class="sprite logo-rte-one search-channel-icon">RTÉ 1</span>
  </article>

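I wrote the following code to try to parse the returned results, but what I get back is inconsistent. The part I'm interested in is the repeated article tags. The problem is that the results contain the search text wrapped in an added <span class="search-highlight"> tag, and that is throwing off my parsing.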

import urllib2
from lxml import etree

url = "http://www.rte.ie/player/ie/search/?q=news"
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = str(response.read())
response.close()

parser = etree.HTMLParser(encoding='utf-8')
tree   = etree.fromstring(html, parser)

for elem in tree.xpath('//article[@class="search-result clearfix"]'):
    icon_url = str(elem[0][1].attrib.get('src'))
    print 'icon_url ', icon_url

    name_tmp = str(elem[1][0].text)
    print 'name_tmp ', name_tmp

    stream = str(elem[1][0].attrib.get('href'))
    print 'stream ', stream

    date_tmp = str(elem[2][0].text)
    print 'date_tmp ', date_tmp

    short_tmp = elem[4].text
    print 'short_tmp ', short_tmp

    channel =  elem[5].text
    print 'channel ', channel

The problem fields are name_tmp and short_tmp. Because of the search-highlight span tags, part of the text is being dropped. Can anyone think of a way to get the full text, or to ignore the span tags?

Sorry for the very long post...

3 Answers

I think you can call the itertext() method on the node to get the content of all its descendant text nodes.
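
As a rough illustration of why .text loses part of the title: in the ElementTree model that lxml follows, .text holds only the text that comes before the first child element, and text after a child is stored in that child's .tail. The sketch below parses two of the h3 titles from the question as standalone fragments and compares .text with itertext(). The names TITLES, full_text and results are made up for this example, and the import falls back to the standard library's xml.etree.ElementTree, which provides the same methods, if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same Element API for this example
    import xml.etree.ElementTree as etree

# Two title fragments copied from the search results in the question.
TITLES = [
    '<h3 class="search-programme-title"><a href="/player/ie/show/10118015/">'
    'One <span class="search-highlight">News</span></a></h3>',
    '<h3 class="search-programme-title"><a href="/player/ie/show/10117836/">'
    '<span class="search-highlight">News</span> on Two and World Forecast</a></h3>',
]


def full_text(elem):
    """Join the text of elem and all of its descendants, in document order."""
    return ''.join(elem.itertext())


results = []
for fragment in TITLES:
    link = etree.fromstring(fragment)[0]  # the <a> inside the <h3>
    # .text stops at the first child; itertext() also walks children and tails.
    results.append((link.text, full_text(link)))
    print(repr(link.text), '->', repr(full_text(link)))
```

For the first link .text is only 'One ', and for the second it is None because the link starts with the span, which is where the "None" in name_tmp comes from. itertext() recovers the complete title in both cases.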

answered 2013-03-06T16:32:23.013

You are looking for the Element.itertext() method:

name_tmp = ''.join(elem[1][0].itertext())

short_tmp = ''.join(elem[4].itertext())

With these two changes, your code prints:

icon_url  http://img.rasset.ie/0005d4bf-261.jpg
name_tmp  elev8
stream  /player/ie/show/10117771/
date_tmp  Tue 05 Mar 2013
short_tmp  Ivan and Sean talk to future basketball sensation Julian Newman and the News Dudes are in the loft with some crazy news stories.
channel  RTÉ 2
icon_url  http://img.rasset.ie/000716b2-261.jpg
name_tmp  One News
stream  /player/ie/show/10118015/
date_tmp  Wed 06 Mar 2013
short_tmp  The One O'Clock News followed by Weather.
channel  RTÉ 1

answered 2013-03-06T16:32:42.617

You can use lxml.html to make this a bit more readable and more robust:

from lxml import html

tree = html.parse("http://www.rte.ie/player/ie/search/?q=news")
for article in tree.xpath('//article[@class="search-result clearfix"]'):
    select = lambda expr: article.cssselect(expr)[0]
    title = select(".search-programme-title")
    info = dict(
        icon_url=select("img.thumbnail").get('src'),
        name=title.text_content(),
        stream=title.find('a').get('href'),
        date=select(".search-programme-episodes").text_content(),
        short=select(".search-programme-description").text_content(),
        channel=select(".search-channel-icon").text_content())
    print(info)

Output:

{'short': 'Ivan and Sean talk to future basketball sensation Julian Newman and the News Dudes are in the loft with some crazy news stories.', 'stream': '/player/ie/show/10117771/', 'name': 'elev8', 'date': 'Tue 05 Mar 2013', 'icon_url': 'http://img.rasset.ie/0005d4bf-261.jpg', 'channel': 'RTÉ 1'}
{'short': "The One O'Clock News followed by Weather.", 'stream': '/player/ie/show/10118015/', 'name': 'One News', 'date': 'Wed 06 Mar 2013', 'icon_url': 'http://img.rasset.ie/000716b2-261.jpg', 'channel': 'RTÉ 1'}
...
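
A variation on the same idea, in case it helps: in recent lxml releases cssselect() relies on the separately installed cssselect package, so the sketch below uses plain XPath with text_content() instead. It runs offline against one article copied from the question rather than fetching the live page. FRAGMENT, first and records are names invented for this example, and it assumes the class names stay as shown in the question.

```python
from lxml import html

# One search result copied from the question, so the example runs offline.
FRAGMENT = """
<article class="search-result clearfix"><a
    href="/player/ie/show/10118015/" class="thumbnail-programme-link"><span
        class="sprite thumbnail-icon-play">Watch Now</span><img class="thumbnail" alt="Watch Now"
        src="http://img.rasset.ie/000716b2-261.jpg"></a>
    <h3 class="search-programme-title"><a href="/player/ie/show/10118015/">One <span class="search-highlight">News</span></a></h3>
    <p class="search-programme-episodes"><a href="/player/ie/show/10118015/">Wed 06 Mar 2013</a></p>
    <p class="search-programme-description">The One O'Clock <span class="search-highlight">News</span> followed by Weather.</p>
    <span class="sprite logo-rte-one search-channel-icon">RTE 1</span>
</article>
"""


def first(node, expr):
    """Return the first XPath match below node (raises IndexError if none)."""
    return node.xpath(expr)[0]


doc = html.document_fromstring(FRAGMENT)
records = []
# contains() is a substring test on the class attribute, which is good
# enough for the class names used on this page.
for article in doc.xpath('//article[contains(@class, "search-result")]'):
    title = first(article, './/h3[@class="search-programme-title"]')
    records.append({
        'icon_url': first(article, './/img[@class="thumbnail"]/@src'),
        'name': title.text_content(),  # includes text inside nested spans
        'stream': first(title, './a/@href'),
        'date': first(article, './/p[@class="search-programme-episodes"]').text_content(),
        'short': first(article, './/p[@class="search-programme-description"]').text_content(),
        'channel': first(article, './/span[contains(@class, "search-channel-icon")]').text_content(),
    })

for record in records:
    print(record)
```

Looking elements up by class rather than by position (elem[1][0], elem[4]) means an extra or missing child inside an article does not shift every field.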
answered 2013-03-06T17:12:21.607