python - Python/Etree: Getting text from an element and its children

Question

I need to work with HTML like the following.

<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, <a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> (trying something to find out about it) <i>"a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"</i></li>

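The problem is that I need to get the text from both the children (the <a>s, <i>s, etc.) and the text nodes (the ", " pieces between the children, etc.).

What I can do is either get the text from each child and join it together (which gives me everything except the text nodes), or get only the text nodes (and not the text inside the <a>s and <i>s). Is there a way to get both?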

score 1 · Accepted Answer

The lxml changelog indicates that lxml v2.3 is compatible with Python 3.1.2 and later.

You can also use the regular expression re.sub(r'<[^>]*?>', '', val), a rough Python equivalent of PHP's strip_tags.
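As a minimal sketch of the regular-expression approach mentioned above: the helper name strip_tags and the shortened sample string are assumptions made for illustration, not part of the original answer. It only deletes anything that looks like a tag, so it does not decode entities such as &amp; and can misfire on a ">" inside an attribute value.

```python
import re

def strip_tags(val):
    # Hypothetical helper: remove every <...> run, keeping the text between tags.
    return re.sub(r'<[^>]*?>', '', val)

# Shortened version of the HTML from the question (illustrative only).
html = '<li><a href="#">trial</a>, <b>test</b> (trying something)</li>'
text = strip_tags(html)
print(text)
```

For real-world HTML a parser is usually the safer choice; the regex is only reasonable when the markup is simple and trusted.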

score 0 · Accepted Answer

You can do this with lxml.html.

In [1]: import lxml.html

In [2]: el = lxml.html.fromstring('<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, <a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> (trying something to find out about it) <i>"a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"</i></li>')

In [3]: print(el.text_content())
S: (n) trial, trial run, test, tryout (trying something to find out about it) "a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"
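If lxml is not available, a similar result can probably be had from the standard library, since the snippet in the question happens to be well-formed XML. The sketch below is not from the original answer: it uses xml.etree.ElementTree and Element.itertext(), with a shortened sample string, to show why joining each child's .text loses the text between children while itertext() keeps it.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the HTML from the question (illustrative only).
snippet = ('<li><a href="#">trial</a>, <a href="#">trial run</a>, '
           '<b>test</b> (trying something) <i>"a free trial"</i></li>')
el = ET.fromstring(snippet)

# Joining only each child's .text drops the ", " pieces stored in .tail.
children_only = ''.join(child.text or '' for child in el)

# itertext() yields text and tail strings in document order.
full_text = ''.join(el.itertext())
print(full_text)
```

This only works when the input parses as XML; for typical messy HTML, lxml.html as shown above is the more forgiving option.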
