python - BeautifulSoup: remove specified attributes, but preserve the tag and its contents

Question

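I am trying to "defrontpagify" the HTML of a website generated by MS FrontPage, and I'm writing a BeautifulSoup script to do it.

However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains it. The code snippet: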

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

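It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone know what the problem is here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.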

score 9 · Accepted Answer

The line

for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to use findAll for this; I'm not sure. However, this works:

import BeautifulSoup
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError: 
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

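Note that this code assumes tag.attrs is a list of (key, value) pairs, which is how BeautifulSoup 3 (import BeautifulSoup) exposes it. If tag.attrs is a dict in your setup, as it is in BeautifulSoup 4, see Nóra's answer below.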

score 6 · Accepted Answer

Here is a Python 2 version of unutbu's answer, for the case where tag.attrs is a dict rather than a list:

from bs4 import BeautifulSoup  # BeautifulSoup 4, where tag.attrs is a dict

REMOVE_ATTRIBUTES = ['lang','language','onmouseover']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

soup = BeautifulSoup(doc, 'html.parser')

for tag in soup.recursiveChildGenerator():
    # only nodes that carry an attrs mapping are rewritten
    if hasattr(tag, 'attrs'):
        # iteritems() is Python 2 only; use items() on Python 3
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                    if key not in REMOVE_ATTRIBUTES}
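
For readers on Python 3 with BeautifulSoup 4, the same idea might look roughly like the sketch below. It is not part of the original answer: it assumes the bs4 package with the built-in html.parser, swaps iteritems() for items(), and uses a shortened attribute set and sample document chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened attribute set and sample document, for illustration only.
REMOVE_ATTRIBUTES = {'lang', 'language', 'onmouseover', 'onmouseout', 'align'}

doc = ('<html><body><p id="firstpara" align="center">This is '
       '<i>paragraph</i> <a onmouseout="">one</a>.</p></body></html>')

soup = BeautifulSoup(doc, 'html.parser')

# find_all(True) returns Tag objects only, so text nodes never reach the loop.
for tag in soup.find_all(True):
    # Rebuild the attribute dict without the unwanted keys.
    tag.attrs = {key: value for key, value in tag.attrs.items()
                 if key not in REMOVE_ATTRIBUTES}

print(soup)
```

Using a set for REMOVE_ATTRIBUTES is a small design choice here: membership tests stay cheap even when the list of attributes grows.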

score 6 · Accepted Answer

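Just for the record: the problem here is that when you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags that have an attribute literally named attribute, because the variable is not expanded.

This is why

- hard-coding the attribute name worked [0]
- the code does not fail; the search simply doesn't match any tags

To fix the problem, pass the attribute you are looking for as a dict: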

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

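Hope this helps someone in the future, dtk

[0]: although in your example it needs to be find_all(style=True), without the quotes, because otherwise you get SyntaxError: keyword can't be an expression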
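
The difference between the two calls can be seen in a small experiment. This is only a sketch, assuming BeautifulSoup 4 (bs4) with the built-in html.parser; the sample markup, including the tag that carries an attribute literally called "attribute", is made up for the demonstration.

```python
from bs4 import BeautifulSoup

# Assumption: made-up markup; the <b> tag has an attribute literally named "attribute".
doc = '<p style="color:red" align="center">text</p><b attribute="x">bold</b>'
soup = BeautifulSoup(doc, 'html.parser')

attribute = 'style'

# The keyword is taken literally: this searches for an HTML attribute
# called "attribute" and ignores the value of the Python variable.
literal_matches = soup.find_all(attribute=True)

# Passing a dict lets the variable's value act as the attribute name.
dict_matches = soup.find_all(attrs={attribute: True})

print([t.name for t in literal_matches])
print([t.name for t in dict_matches])
```

In the question's document no tag has an attribute named "attribute", which is why the original loop ran cleanly yet removed nothing.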

score 2 · Accepted Answer

I use this method to remove a list of attributes. It is very compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del: 
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]
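
A possible variation on the snippet above walks the document once instead of once per attribute, and avoids building throwaway lists. This is a sketch rather than the answerer's method: it assumes BeautifulSoup 4 with html.parser, and the shortened list and sample table are illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: shortened list and sample markup, for illustration only.
attributes_to_del = ["style", "border", "width", "align"]

html = '<table border="1" id="t"><tr><td align="left" style="x">cell</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Visit each tag once; pop() with a default does nothing when the
# attribute is absent, so no membership test is needed.
for tag in soup.find_all(True):
    for attr in attributes_to_del:
        tag.attrs.pop(attr, None)

print(soup)
```

A plain loop is used because the removal is a side effect; a list comprehension would work too, but it would allocate a list that is never read.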

score 1 · Accepted Answer

I use this:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")

Thanks to https://stackoverflow.com/a/22497855/1907997
