python-2.7 - Using XPath, how do I get the tags that follow a specific tag in HTML?

Question

I have this HTML code:

<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>

I use code like the following to get the text after a header tag:

for header in tree.iter('h3'):
    paragraph = header.xpath('(.//following::p)[1]')
    if (header.text=="apple"):
        print "%s: %s" % (header.text, paragraph[0].text)

This does not work when there are multiple <p> tags. How can I find out how many <p> tags follow a heading and get all of them?

I am using Python 2.7 and XPath.

score 2 · Accepted Answer

It is probably easier to use lxml's itersiblings() and work with siblings rather than descendants, and then with the descendants of those siblings if needed.

You can try something like this:

>>> for heading in root.iter("h3"):
...     print "----", heading
...     for sibling in heading.itersiblings():
...         if sibling.tag == 'h3':
...             break
...         print sibling
... 
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>
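To answer the "how many <p> tags and what do they contain" part directly, the same loop can collect the paragraph texts per heading. This is only a minimal sketch: the <div> wrapper, the HTML constant, and the helper name paragraphs_by_heading are assumptions made so the example is self-contained, and it is written to run on Python 3 as well as 2.7.

```python
from lxml import etree

# The question's HTML, wrapped in a <div> (an assumption) so it has one root.
HTML = """<div>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</div>"""


def paragraphs_by_heading(root):
    """Map each <h3> text to the texts of the <p> siblings before the next <h3>."""
    result = {}
    for heading in root.iter("h3"):
        texts = []
        for sibling in heading.itersiblings():
            if sibling.tag == "h3":
                break  # the next section starts here
            if sibling.tag == "p":
                texts.append((sibling.text or "").strip())
        result[(heading.text or "").strip()] = texts
    return result


root = etree.fromstring(HTML)
sections = paragraphs_by_heading(root)
for title, texts in sections.items():
    print("%s: %d paragraph(s): %s" % (title, len(texts), texts))
```

len(texts) gives the number of <p> tags under each heading, and other siblings such as <a> are skipped.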

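If you want to use XPath, you can use the EXSLT set extensions, which are available in lxml (via the "http://exslt.org/sets" namespace). The idea is much the same as above:

select all following siblings ( following-sibling::* ),
but exclude ( set:difference() ) the following <h3> siblings ( following-sibling::h3 ) and ( | XPath operator ) all the siblings that come after them ( following-sibling::h3/following-sibling::* ).

This can be used as follows: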

>>> following_siblings_untilh3 = lxml.etree.XPath("""
...         set:difference(
...             following-sibling::*,
...             (following-sibling::h3|following-sibling::h3/following-sibling::*))""",
...         namespaces={"set": "http://exslt.org/sets"})
>>> 
>>> for heading in root.iter("h3"):
...     print "----", heading
...     for e in following_siblings_untilh3(heading): print e
... 
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>

I'm sure this can be simplified. (I have not found a following-sibling-or-self::h3 axis...)
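One possible simplification, which is not part of the answer above, is plain XPath 1.0 without EXSLT: keep only the following <p> siblings that have the same number of preceding <h3> siblings as the heading, counting the heading itself. This is a sketch under assumptions: the <div> wrapper, the variable name $n, and the names paragraphs_until_next_h3 and found are illustrative, and it selects only <p> elements rather than every sibling.

```python
from lxml import etree

# The question's HTML, wrapped in a <div> (an assumption) so it has one root.
HTML = """<div>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</div>"""

# Inside the predicate, preceding-sibling::h3 is evaluated relative to each <p>.
paragraphs_until_next_h3 = etree.XPath(
    "following-sibling::p[count(preceding-sibling::h3) = $n]")

root = etree.fromstring(HTML)
found = {}
for heading in root.iter("h3"):
    # Number of <h3> siblings up to and including this heading.
    n = heading.xpath("count(preceding-sibling::h3)") + 1
    paragraphs = paragraphs_until_next_h3(heading, n=n)
    found[heading.text.strip()] = [p.text.strip() for p in paragraphs]

for title, texts in found.items():
    print("%s: %s" % (title, texts))
```

The count-based predicate assumes that the headings and paragraphs are siblings under one parent, as in the question's HTML.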
