python - Using BeautifulSoup to find an HTML tag that contains certain text

Question

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the previous element would match when using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

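I'm able to get all of the text that matches (see the line above). But I want the parent element of the text to match, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all of the h2 elements to be returned, not the text matches.

Any ideas?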

score 21 · Accepted Answer

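BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.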

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
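
For readers on Python 3, the same parent lookup can be sketched with Beautiful Soup 4. This is only an illustration under stated assumptions: the bs4 package is installed, the built-in "html.parser" backend is used, and the search uses the string= argument (the bs4 name for text=). Searching by string alone, without a tag name, yields NavigableString objects, and .parent then gives the enclosing tag.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# Assumption: the stdlib-backed "html.parser" is an acceptable parser here.
soup = BeautifulSoup(html_text, "html.parser")

# The OP's pattern: a space, '#', then 11 non-whitespace characters.
pattern = re.compile(r" #\S{11}")

# Searching on the string alone returns NavigableString objects...
matched_strings = soup.find_all(string=pattern)

# ...and each one exposes the tag that contains it via .parent.
parents = [s.parent for s in matched_strings]

for tag in parents:
    print(tag.name, "->", tag.get_text())
```

Each entry in parents is a Tag, so it can serve as the starting point for further tree traversal.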

score 4 · Accepted Answer

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re  # needed for re.compile below
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
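
One caveat worth sketching, assuming Python 3 and a bs4 version that accepts string=: when a tag name is combined with a string filter, the pattern is tested against the tag's .string, which is None if the tag has more than one child (for example, a nested <b>). The helper h2_with_code and the sample markup below are hypothetical, added only to illustrate a function filter that checks the full text instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the third <h2> contains a nested tag.
html_text = """
<h2>this is cool #12345678901</h2>
<p>first paragraph</p>
<h2>this is nothing</h2>
<h2>this is <b>nested</b> #124445678901</h2>
"""

pattern = re.compile(r" #\S{11}")
soup = BeautifulSoup(html_text, "html.parser")

# Tag name plus string filter: returns Tag objects, but only for
# tags whose .string is a single text node matching the pattern.
direct = soup.find_all("h2", string=pattern)

def h2_with_code(tag):
    """Hypothetical filter: match <h2> tags on their full text, nested tags included."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None

with_nested = soup.find_all(h2_with_code)

# A matched Tag is a starting point for traversal, as the OP wanted.
following = direct[0].find_next_sibling("p")

print(len(direct), len(with_nested), following.get_text())
```

The function filter trades a little verbosity for matching headings whose text is split across child elements.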
