python - Grouped backreferences in Python

Question

I'm cleaning up html output that presumably came from a WYSIWYG editor. For my own sanity, I'd like to remove the many empty formatting tags it contains.

For example:

<em></em> Here's some text <strong>   </strong> and here's more <em> <span></span></em>

Thanks to Regular-Expressions.info, I have a neat regex with a backreference that unwraps one layer at a time.

import re

# Returns a string minus one level of empty formatting tags
def remove_empty_html_tags(input_string):
    return re.sub(r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>', r'\1', input_string)

However, I'd like to be able to unwrap all the layers of <em> <span></span></em> at once, and there may be five or more layers of nested empty tags.

Is there a way to group the backref a la (?:<?P<tagBackRef>strong|span|em)\b[^>]>(\s)*)+ (or something) and use it later with (</(?P=tagBackRef>)+ to remove multiple nested but matching empty html tags?

For posterity:

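This was probably an XY question: the tool I was hoping to use to get the result I wanted is not the one anyone else would have picked. Henry's answer answered the question, but he and everyone else would point you to an html parser rather than regex for parsing html. =)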
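
As a side note on why the repeated-group idea does not work directly in Python's `re`: a capture group inside a repeated group keeps only its last iteration, so the backreference can only refer to the innermost tag. The sketch below illustrates this; the patterns `opening` and `nested` are hypothetical, written in the spirit of the question rather than taken from it.

```python
import re

# Hypothetical patterns: a repeated group of opening tags, optionally
# followed by a repeated backreference for the closing tags.
opening = re.compile(r'(?:<(?P<tag>strong|span|em)\b[^>]*>\s*)+')
nested = re.compile(
    r'(?:<(?P<tag>strong|span|em)\b[^>]*>\s*)+(?:</(?P=tag)>)+'
)

# The named group is overwritten on every repetition, so only the last
# tag name matched is available afterwards.
last_tag = opening.match('<em> <span>').group('tag')
print(last_tag)

# Properly nested markup is rejected, because the backreference only
# knows about the innermost tag and </em> does not match it.
proper = nested.fullmatch('<em> <span></span></em>')
print(proper)

# Mismatched markup that repeats the innermost closing tag is accepted.
mismatched = nested.fullmatch('<em> <span></span></span>')
print(mismatched is not None)
```

This is why the answers below either repeat a one-level substitution or switch to a parser.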

score 4 · Accepted Answer

This is much easier to do with an HTML parser such as BeautifulSoup. For example:

from bs4 import BeautifulSoup

# "lxml" matches the output shown below; "html.parser" would not add the <html> wrapper.
soup = BeautifulSoup("""
<body>
    <em></em> Here's some <span><strong>text</strong></span> <strong>   </strong> and here's more <em> <span></span></em>
</body>
""", "lxml")

# Remove formatting tags that have no child tags and no non-whitespace text.
for element in soup.find_all(name=['strong', 'span', 'em']):
    if element.find(True) is None and (not element.string or not element.string.strip()):
        element.extract()

print(soup)

Prints:

<html><body>
 Here's some <span><strong>text</strong></span>  and here's more <em> </em>
</body></html>

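As you can see, the span, strong, and em tags whose content was empty (or consisted only of whitespace) were removed. Note that <em> </em> is left over in the output: the outer em still contained the span when it was checked, so a single pass only removes the innermost empty tags.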
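
In the output above, `<em> </em>` is still present, because the outer `em` contained a `span` at the time it was examined. A minimal sketch of one way to handle arbitrary nesting is to repeat the pass until nothing more is removed. The function name `strip_empty_tags` and the choice of `html.parser` are assumptions made for this illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

FORMATTING_TAGS = ["strong", "span", "em"]


def strip_empty_tags(html):
    """Remove empty formatting tags, repeating until a pass removes nothing."""
    # html.parser ships with Python, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    removed = True
    while removed:
        removed = False
        for element in soup.find_all(FORMATTING_TAGS):
            has_child_tag = element.find(True) is not None
            has_text = bool(element.get_text().strip())
            if not has_child_tag and not has_text:
                # Only leaf tags are extracted, so nothing still queued in
                # this pass can be a descendant of a removed element.
                element.extract()
                removed = True
    return str(soup)


sample = (
    "<em></em> Here's some <span><strong>text</strong></span> "
    "<strong>   </strong> and here's more <em> <span></span></em>"
)
cleaned = strip_empty_tags(sample)
print(cleaned)
```

Unlike the regex in the question, which keeps the whitespace found inside a removed tag, `extract()` drops the tag together with its contents.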

See also:

Strip/remove/extract empty tags

score 1 · Accepted Answer

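If you really don't want to use an HTML parser, and you're not overly concerned with speed (which I assume you aren't, or you wouldn't be using regular expressions to clean up your HTML), you can just modify the code you've already written. Put the substitution in a loop (or recursion; your preference) and return when nothing changes.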

import re

# Returns a string minus all levels of empty formatting tags
def remove_empty_html_tags(input_string):
    matcher = r'<(?P<tag>strong|span|em)\b[^>]*>(\s*)</(?P=tag)>'
    old_string = input_string
    new_string = re.sub(matcher, r'\1', old_string)
    # Each pass strips the innermost empty tags; stop once a pass changes nothing.
    while new_string != old_string:
        old_string = new_string
        new_string = re.sub(matcher, r'\1', new_string)
    return new_string
