python - Searching for several strings with BeautifulSoup and returning the tags that contain them

Question

Suppose you have the following HTML:

<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>

I want to be able to find every tag that contains all of the keywords I'm searching for. For example (examples 2 and 3 don't work):

>>> len(soup.find_all(text="world"))
2

>>> len(soup.find_all(text="world puzzle"))
1

>>> len(soup.find_all(text="world puzzle book"))
0

I've been trying to come up with a regular expression that searches for all of the keywords, but ANDing doesn't seem to be possible (only ORing).

Thanks in advance!

score 5 · Accepted Answer

The simplest way to do a complex match like this is to write a function that performs the match and pass that function as the value of the text argument.

def must_contain_all(*strings):
    def must_contain(markup):
        return markup is not None and all(s in markup for s in strings)
    return must_contain

Now you can get the matching strings:

print(soup.find_all(text=must_contain_all("world", "puzzle")))
# ["\nWho in the world am I? Ah, that's the great puzzle.\n"]

To get the tags that contain those strings, use the .parent attribute:

print([text.parent for text in soup.find_all(text=must_contain_all("world", "puzzle"))])
# [<p>Who in the world am I? Ah, that's the great puzzle.</p>]
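
For anyone who wants to try this end to end, here is a minimal self-contained sketch under a few assumptions: Python 3, the bs4 package, and the built-in "html.parser" backend. The helper name tags_containing_all is made up for illustration, and string= is used as the newer spelling of text=.

```python
from bs4 import BeautifulSoup

HTML = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""


def must_contain_all(*strings):
    """Build a matcher that accepts a text node only if it contains every string."""
    def must_contain(markup):
        return markup is not None and all(s in markup for s in strings)
    return must_contain


def tags_containing_all(soup, *keywords):
    """Hypothetical helper: parent tag of each text node containing all keywords."""
    matcher = must_contain_all(*keywords)
    # `string=` is the newer name for the `text=` argument (Beautiful Soup 4.4.0+).
    return [node.parent for node in soup.find_all(string=matcher)]


soup = BeautifulSoup(HTML, "html.parser")
for keywords in (("world",), ("world", "puzzle"), ("world", "puzzle", "book")):
    # Prints the keyword tuple and how many tags contain all of its keywords.
    print(keywords, len(tags_containing_all(soup, *keywords)))
```

Note that the match is a plain substring test on each individual text node, so keywords split across nested tags (say, one inside a <b>) would not be found by this sketch.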

score 1 · Accepted Answer

I'd suggest considering lxml instead of BeautifulSoup. With lxml you can find elements using XPath.

With this boilerplate setup:

import lxml.html as LH
import re

html = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""

doc = LH.fromstring(html)

This finds the text of every <p> tag that contains the string world:

print(doc.xpath('//p[contains(text(),"world")]/text()'))
['\nIf everybody minded their own business, the world would go around a great deal faster than it does.\n', "\nWho in the world am I? Ah, that's the great puzzle.\n"]

And this finds all of the text in every <p> tag that contains both world and puzzle:

print(doc.xpath('//p[contains(text(),"world") and contains(text(),"puzzle")]/text()'))
["\nWho in the world am I? Ah, that's the great puzzle.\n"]
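
If the keyword list is not fixed, the predicate can be assembled programmatically. The following is only a sketch: xpath_for_all is a hypothetical helper, not part of lxml, and it assumes keywords contain no double quotes. It also uses contains(., ...) rather than contains(text(), ...), because in XPath 1.0 the latter only looks at the first text node of the element.

```python
def xpath_for_all(keywords, tag="p"):
    """Build an XPath selecting `tag` elements whose text contains every keyword."""
    conditions = []
    for keyword in keywords:
        if '"' in keyword:
            # XPath 1.0 string literals have no escape syntax; keep the sketch simple.
            raise ValueError("keywords containing double quotes are not supported")
        # contains(., ...) tests the element's whole string value, not only its
        # first text node as contains(text(), ...) does.
        conditions.append('contains(., "{}")'.format(keyword))
    predicate = " and ".join(conditions)
    return "//{}[{}]".format(tag, predicate) if predicate else "//{}".format(tag)


print(xpath_for_all(["world", "puzzle"]))
# //p[contains(., "world") and contains(., "puzzle")]
```

The resulting string can be passed straight to doc.xpath(...) with the setup above.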

score 0 · Accepted Answer

This is probably not the most efficient way, but you could try a set intersection:

len(set(soup.find_all(text="world"))
    & set(soup.find_all(text="book"))
    & set(soup.find_all(text="puzzle")))
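
One caveat: a plain string passed as the text filter matches only text nodes that are exactly equal to it, so for substring matching each keyword needs to be wrapped in a compiled regular expression. A minimal sketch of the intersection idea under that assumption (Python 3, bs4 with "html.parser"; the function name strings_matching_all is invented for illustration):

```python
import re

from bs4 import BeautifulSoup

HTML = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""


def strings_matching_all(soup, keywords):
    """Hypothetical helper: text nodes that contain every keyword as a substring."""
    # One set of matching text nodes per keyword; re.escape keeps keywords literal.
    sets = [
        set(soup.find_all(string=re.compile(re.escape(keyword))))
        for keyword in keywords
    ]
    # Text nodes hash like ordinary strings, so identical texts collapse into one.
    return set.intersection(*sets) if sets else set()


soup = BeautifulSoup(HTML, "html.parser")
print(len(strings_matching_all(soup, ["world", "puzzle"])))
```

This walks the document once per keyword, which is why the function-based filter is likely the cheaper option for long keyword lists.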

score 0 · Accepted Answer

A bit of a skeleton (I'm using lxml here rather than BeautifulSoup, but you can adapt it using soup.findAll):

html = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""

import lxml.html
import re

fragment = lxml.html.fromstring(html)
d = dict(
    (node, set(re.findall(r'\S+', node.text_content())))
    for node in fragment.xpath('//p'))

for node, words in d.items():
    # then use set logic to go from here...
    pass
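
One way the "set logic" step might be filled in, sketched here with BeautifulSoup (since the answer says the idea can be adapted) rather than lxml. Everything below is an assumption for illustration: the helper name tags_with_all_words, the case-insensitive comparison, and the switch from \S+ to \w+ so that trailing punctuation ("puzzle.") does not stop a keyword from matching.

```python
import re

from bs4 import BeautifulSoup

HTML = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""


def tags_with_all_words(soup, keywords, tag="p"):
    """Hypothetical helper: tags whose word set is a superset of the keywords."""
    wanted = {keyword.lower() for keyword in keywords}
    matches = []
    for node in soup.find_all(tag):
        # \w+ (rather than \S+) drops punctuation, so "puzzle." becomes "puzzle".
        words = set(re.findall(r"\w+", node.get_text().lower()))
        if wanted <= words:  # subset test: every keyword appears as a whole word
            matches.append(node)
    return matches


soup = BeautifulSoup(HTML, "html.parser")
print(len(tags_with_all_words(soup, ["world", "puzzle"])))
```

Unlike the substring approaches, this matches whole words only, so "puzz" would not match "puzzle".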
