beautifulsoup - Python BeautifulSoup non-recursive text

Question

I have a span element like the one below. How can I extract only the text that lies outside the anchor (a) tags:

# print soup.prettify()
<span class="1">
    text_wanted         
    <a data-toggle="notify" href="https://www.abc.com/1" class="class1"><span>text1</span></a>
    <a data-toggle="notify" href="https://www.abc.com/2" class="class2"><span>text2</span></a>
</span>

I am considering the following solution.

text_all = soup.text.encode('utf-8')
text_strip_list = [a.text.encode('utf-8').strip() for a in soup.find_all('a')]
for text_strip in text_strip_list:
    text_all = text_all.replace(text_strip, '').strip()

I suspect there is a simpler way to get the text I need without digging into the anchor tags.

Thanks in advance.

score 1 · Accepted Answer

Assuming html is a BeautifulSoup object containing the parsed HTML,

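# BeautifulSoup 4; with BeautifulSoup 3 the import was: from BeautifulSoup import NavigableString
from bs4 import NavigableString

# Keep only the plain text nodes that are direct children of the first span
print([node for node in html.find('span').contents if type(node) is NavigableString])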

yields the text nodes that sit directly inside the outermost span, without descending into its child tags.
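As a minimal, self-contained sketch of this approach, the snippet below parses the markup from the question with BeautifulSoup 4 and the built-in html.parser, then strips whitespace and drops empty nodes. The helper name direct_text and the variable names are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString

# Markup copied from the question
HTML = """
<span class="1">
    text_wanted
    <a data-toggle="notify" href="https://www.abc.com/1" class="class1"><span>text1</span></a>
    <a data-toggle="notify" href="https://www.abc.com/2" class="class2"><span>text2</span></a>
</span>
"""


def direct_text(tag):
    """Return stripped, non-empty text nodes that are direct children of tag.

    Hypothetical helper for illustration: text inside nested tags is ignored.
    """
    return [
        node.strip()
        for node in tag.contents
        # Exact type check, so subclasses such as Comment are skipped
        if type(node) is NavigableString and node.strip()
    ]


soup = BeautifulSoup(HTML, "html.parser")
outer_span = soup.find("span")  # first span in document order, i.e. the outer one
print(direct_text(outer_span))  # ['text_wanted']
```

The whitespace-only nodes between the anchor tags are also direct children of the span, which is why the helper filters on node.strip(). In BeautifulSoup 4, outer_span.find_all(string=True, recursive=False) should select the same direct text nodes, though it would also return comments if any were present.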
