python - Remove a tag using BeautifulSoup but keep its contents

Question

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

The problem is that I don't want to throw away the contents inside the invalid tags. How do I remove the tag itself but keep its contents when calling soup.renderContents()?

score 63 · Accepted Answer

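The strategy I used is to replace an invalid tag with its contents: children of type NavigableString are kept as they are, and any child that is not a NavigableString is recursed into so that it is also reduced to its contents. Try this: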

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
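For comparison, here is a minimal sketch of the same "drop the tag, keep what is inside" idea using only the Python 3 standard library. It is not the BeautifulSoup approach from the answer above but a swapped-in technique based on html.parser. The names TagStripper and strip_tags_stdlib are hypothetical, and the sketch assumes that comments and doctype declarations can be discarded.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, dropping the tags in invalid_tags but keeping their contents."""

    def __init__(self, invalid_tags):
        # convert_charrefs=False so entities reach the handlers below unchanged
        super().__init__(convert_charrefs=False)
        self.invalid_tags = set(invalid_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.invalid_tags:
            # get_starttag_text() returns the start tag as written in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit them once, unchanged
        if tag not in self.invalid_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.invalid_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always kept, whether or not its enclosing tag is valid
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_tags_stdlib(html, invalid_tags):
    parser = TagStripper(invalid_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(strip_tags_stdlib(html, ["b", "i", "u"]))
# prints: <p>Good, bad, and ugly</p>
```

Because this works on the token stream rather than on a parsed tree, it does not repair malformed markup the way BeautifulSoup does.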

score 19 · Accepted Answer

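This has already been mentioned by others in the comments, but I wanted to post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is much nicer than using BeautifulSoup for this.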

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

score 7 · Accepted Answer

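Presumably you need to move the tag's children so that they become children of the tag's parent before you remove the tag. Is that what you mean?

If so, inserting the contents in the right place is a little tricky, but something like this should work: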

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        for i, x in enumerate(tag.parent.contents):
          if x == tag: break
        else:
          print "Can't find", tag, "in", tag.parent
          continue
        for r in reversed(tag.contents):
          tag.parent.insert(i, r)
        tag.extract()
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.

score 7 · Accepted Answer

You can use soup.text

.text removes all tags and concatenates all the text.
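Note that this discards every tag, including the ones the question wants to keep, so it only fits when no markup should survive. As a rough illustration of "remove all tags and concatenate the text", here is a sketch using only the Python 3 standard library. It is an assumption-laden stand-in, not BeautifulSoup's implementation, and the names TextExtractor and get_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, ignoring every tag."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities arrive as plain text
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def get_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(get_text("<p>Good, <b>bad</b>, and <i>ugly</i></p>"))
# prints: Good, bad, and ugly
```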

score 3 · Accepted Answer

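Use unwrap.

unwrap() removes only the tag it is called on, even when that tag occurs several times in the document, and keeps its contents in place.

Example: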

>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
>> soup
<html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
>> soup.nobr.unwrap()
<nobr></nobr>
>> soup
<html><body><p>Hi. This is a nobr </p></body></html>
