python - Parsing a date string from HTML with lxml

Question

 s = """
      <tbody>
      <tr>
       <td style="border-bottom: none">
       <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
        <span class="graytext" style="font-size: 11px">
        05/13/09  2:02am
        <br>
       </span>
      </td>
     </tr>
    </tbody>
 """

I need to extract the date string from this HTML string.

I tried it like this:

  import lxml.html
  doc = lxml.html.fromstring(s)
  doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')

But this doesn't work. I need to get only the date string.

score 1 · Accepted Answer

Your query selects the span element. You still need to get the text out of it.

>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]

Most queries return a sequence. I usually use a helper function that returns the first item.

from lxml import etree
s = """
<tbody>
 <tr>
   <td style="border-bottom: none">
   <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
    <span class="graytext" style="font-size: 11px">
    05/13/09  2:02am
    <br>
   </span>
  </td>
 </tr>
</tbody>
"""
doc = etree.HTML(s)

def first(sequence, default=None):
  # Return the first item of the sequence, or default if it is empty.
  for item in sequence:
    return item
  return default
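
To illustrate why such a helper can be handy, here is a minimal sketch comparing it with plain indexing when a query matches nothing. The empty list `matches` is only a stand-in (an assumption for illustration) for an XPath result with no hits.

```python
def first(sequence, default=None):
    """Return the first item of sequence, or default if it is empty."""
    for item in sequence:
        return item
    return default

# Stand-in for an XPath query that matched nothing.
matches = []

# The helper falls back to the default instead of failing.
fallback = first(matches, '')

# Plain indexing on an empty result raises IndexError.
try:
    matches[0]
    indexing_failed = False
except IndexError:
    indexing_failed = True

print(repr(fallback), indexing_failed)
```

Because the fallback is an empty string, chaining `.strip()` onto the result stays safe even when nothing matched.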

With that helper in place:

>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')
['\n    05/13/09  2:02am\n    ']
>>> first(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()'),'').strip()
'05/13/09  2:02am'
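
If an actual date object is needed rather than the string, the extracted value could be passed to `datetime.strptime`. The sketch below assumes the format is month/day/two-digit year followed by a 12-hour time with an am/pm suffix, which is a guess based on the single sample value. The `%p` directive depends on the locale, so this also assumes an English or default C locale.

```python
from datetime import datetime

# The string extracted from the span, including its double space.
raw = '05/13/09  2:02am'

# Collapse runs of whitespace so the format string stays simple.
normalized = " ".join(raw.split())

# Assumed format: MM/DD/YY, then a 12-hour clock with am/pm.
parsed = datetime.strptime(normalized, "%m/%d/%y %I:%M%p")

print(parsed.isoformat())  # 2009-05-13T02:02:00
```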

score 0 · Accepted Answer

Instead of your last line, try the following:

print(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')[0])

The first part of your XPath expression is correct: //span[@class="graytext" and @style="font-size: 11px"] selects all matching span nodes. You then need to specify what to select from those nodes. Here, text() selects the text content of each node.
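
The difference between the two queries can be sketched as follows. The markup is a shortened stand-in for the HTML in the question, wrapped in a `div` for illustration, and the variable names are made up for this example.

```python
from lxml import etree

# Shortened, hypothetical version of the markup from the question.
html = """
<div>
  <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
  <span class="graytext" style="font-size: 11px">
    05/13/09  2:02am
    <br>
  </span>
</div>
"""
doc = etree.HTML(html)
query = '//span[@class="graytext" and @style="font-size: 11px"]'

# Without text(), the query yields element objects.
elements = doc.xpath(query)

# With text(), it yields the text nodes that are direct children of the span.
text_nodes = doc.xpath(query + '/text()')

# Join the text nodes and strip the surrounding whitespace.
date_string = "".join(text_nodes).strip()
print(date_string)
```

Joining the text nodes before stripping avoids depending on how many whitespace-only nodes the parser keeps around the `<br>`.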
