python - Error with Beautiful Soup's extract()

Question

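I'm working on some screen-scraping software and have run into a problem with Beautiful Soup. I'm using Python 2.4.3 and BeautifulSoup 3.0.7a.

I need to remove the <hr> tags, but they can carry different attributes, so a simple replace() call won't remove them all.

Given the following HTML: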

<h1>foo</h1>
<h2><hr/>bar</h2>

And the following code:

soup = BeautifulSoup(string)

bad_tags = soup.findAll('hr');
[tag.extract() for tag in bad_tags] 

for i in soup.findAll(['h1', 'h2']):
    print i
    print i.string

The output is:

<h1>foo</h1>
foo
<h2>bar</h2>
None

Am I misunderstanding the extract function, or is this a bug in Beautiful Soup?

score 2 · Accepted Answer

It may be a bug. Fortunately, there is another way to get the string:

from BeautifulSoup import BeautifulSoup

string = \
"""<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)

bad_tags = soup.findAll('hr')
for tag in bad_tags:
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i, i.next

# <h1>foo</h1> foo
# <h2>bar</h2> bar
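For comparison, here is a minimal sketch of the same task with the newer bs4 package on Python 3. That is an assumption on my part, since the question targets BeautifulSoup 3.0.7a, and the helper name strip_tags is made up for this example. In bs4, .string is computed from the tag's current contents, so it should reflect the removal, and get_text() is available as well.

```python
from bs4 import BeautifulSoup

HTML = "<h1>foo</h1>\n<h2><hr/>bar</h2>"


def strip_tags(html, name):
    """Parse html, remove every tag called `name`, and return the soup.

    strip_tags is a hypothetical helper written for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(name):
        # decompose() removes the tag from the tree and destroys it
        tag.decompose()
    return soup


soup = strip_tags(HTML, "hr")
for heading in soup.find_all(["h1", "h2"]):
    # .string is the single child string; get_text() joins all strings
    print(heading, heading.string, heading.get_text())
```

decompose() is used here instead of extract() because the removed tags are not needed afterwards; extract() would work the same way if you want to keep them.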

score 0 · Answer

I have the same problem. I don't know why, but I think it has to do with the empty elements created by BS.

For example, given the following code:

from bs4 import BeautifulSoup

html ='            \
<a>                \
    <b test="help">            \
        hello there!  \
        <d>        \
        now what?  \
        </d>    \
        <e>        \
            <f>        \
            </f>    \
        </e>    \
    </b>        \
    <c>            \
    </c>        \
</a>            \
'

soup = BeautifulSoup(html,'lxml')
#print(soup.find('b').attrs)

print(soup.find('b').contents)

t = soup.find('b').findAll()
#t.reverse()
for c in t:
    gb = c.extract()

print(soup.find('b').contents)

soup.find('b').text.strip()

I got the following error:

AttributeError: 'NoneType' object has no attribute 'next_element'

On the first print I got:

>>> print(soup.find('b').contents)
[u' ', <d> </d>, u' ', <e> <f> </f> </e>, u' ']

And on the second I got:

>>> print(soup.find('b').contents)
[u' ', u' ', u' ']

I'm fairly sure it is the empty elements in the middle that cause the problem.

The workaround I found is to recreate the soup:

soup = BeautifulSoup(str(soup))
soup.find('b').text.strip()

Now it prints:

>>> soup.find('b').text.strip()
u'hello there!'

Hope that helps.
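As a possible alternative to rebuilding the soup, here is a rough sketch that removes only the direct child tags of the parent, so nested tags such as <f> leave the tree together with their parent <e> and are never extracted on their own. It assumes a recent bs4 release on Python 3 and the built-in html.parser; the helper name text_without_child_tags and the shortened HTML string are made up for this example, and I have not confirmed that this is what triggers the error above.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup used above (an assumption for this sketch)
HTML = (
    "<a><b test='help'> hello there! "
    "<d> now what? </d> <e> <f> </f> </e> </b><c> </c></a>"
)


def text_without_child_tags(html, name):
    """Return the text of the first `name` tag with its child tags removed.

    text_without_child_tags is a hypothetical helper for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find(name)
    # recursive=False limits the search to direct children of `parent`
    for child in parent.find_all(True, recursive=False):
        child.extract()
    # get_text() joins the remaining strings; strip() trims the whitespace
    return parent.get_text().strip()


print(text_without_child_tags(HTML, "b"))
```

find_all() returns a list-like result that is built before the loop starts, so extracting elements while iterating over it does not disturb the iteration.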
