python - Complex Beautiful Soup query

Question

Here is a snippet of an HTML file I'm exploring with Beautiful Soup.

<td width="50%">
    <strong class="sans"><a href="http:/website">Site</a></strong> <br />

I'd like to get the <a href> for any line that has <strong class="sans"> and that is inside a <td width="50%">.

Is it possible to query an HTML file for these multiple conditions using Beautiful Soup?

score 12 · Accepted Answer

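BeautifulSoup's search mechanisms accept a callable, which the docs appear to recommend for your case: "If you need to impose complex or interlocking restrictions on a tag's attributes, pass in a callable object for name, ...". (OK, they're talking about attributes specifically, but the advice reflects the underlying spirit of the BeautifulSoup API.)

If you want a one-liner: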

soup.findAll(lambda tag: tag.name == 'a' and \
tag.findParent('strong', 'sans') and \
tag.findParent('strong', 'sans').findParent('td', attrs={'width':'50%'}))

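I've used a lambda in this example, but in practice you may want to define a callable function if you have multiple chained requirements. This lambda has to call findParent('strong', 'sans') twice to avoid raising an exception when an <a> tag has no strong parent. With a proper function, you can make the test more efficient.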
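
The named-function approach described above could look something like the following. This is a minimal sketch, assuming the bs4 package and its snake_case method names (the one-liner uses the older camelCase aliases). The helper name is_target_link and the second td in the sample HTML are hypothetical, added only for illustration. The function looks up the strong parent once and reuses the result.

```python
from bs4 import BeautifulSoup

# Sample HTML: the first cell is from the question; the second cell
# (width="30%") is an assumption added to show a non-matching case.
html_doc = """<td width="50%">
    <strong class="sans"><a href="http:/website">Site</a></strong> <br />
</td>
<td width="30%">
    <strong class="sans"><a href="http:/other">Other</a></strong>
</td>"""


def is_target_link(tag):
    """Hypothetical predicate: <a> inside <strong class="sans"> inside <td width="50%">."""
    if tag.name != "a":
        return False
    # Look up the enclosing <strong class="sans"> only once.
    strong = tag.find_parent("strong", class_="sans")
    if strong is None:
        return False
    # The <strong> must itself sit inside a <td width="50%">.
    return strong.find_parent("td", attrs={"width": "50%"}) is not None


soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all(is_target_link)
hrefs = [a["href"] for a in links]
print(hrefs)
```

Because the function returns early when there is no matching strong parent, the second lookup is skipped for most tags in a large document.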

score 0 · Accepted Answer

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup("""<html><td width="50%">
...     <strong class="sans"><a href="http:/website">Site</a></strong> <br />
... </html>""")
>>> soup
<html><td width="50%">
<strong class="sans"><a href="http:/website">Site</a></strong> <br />
</td></html>
>>> # Loops must run outermost first: td, then strong, then a.
>>> [a for td in soup.findAll("td", width="50%")
...    for strong in td.findAll("strong", attrs={"class": "sans"})
...    for a in strong.findAll("a")]
[<a href="http:/website">Site</a>]

score 0 · Accepted Answer

from bs4 import BeautifulSoup
html_doc = """<td width="50%">
<strong class="sans"><a href="http:/website">Site</a></strong> <br /> 
"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.select('td[width="50%"] .sans [href]')
# Out[24]: [<a href="http:/website">Site</a>]

Documentation
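
The selector in this answer uses descendant combinators only, so it matches any element with an href attribute anywhere under a .sans element inside the td. A slightly stricter variant might look like this. It is a sketch, assuming a bs4 version whose select() supports the > child combinator. The extra plain link in the sample HTML is hypothetical, added to show what gets excluded.

```python
from bs4 import BeautifulSoup

# The second <a> is an assumption added for illustration: it sits in the
# same cell but is not wrapped in <strong class="sans">.
html_doc = """<td width="50%">
    <strong class="sans"><a href="http:/website">Site</a></strong> <br />
    <a href="http:/plain">Not inside strong</a>
</td>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Require an <a> with an href that is a direct child of <strong class="sans">,
# which in turn is somewhere under <td width="50%">.
selector = 'td[width="50%"] strong.sans > a[href]'
hrefs = [a["href"] for a in soup.select(selector)]
print(hrefs)
```

Naming the tags (strong.sans, a[href]) keeps the query from matching other elements that happen to carry the same class or attribute.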
