html - BeautifulSoup cannot extract the src attribute from an img tag

Question

This is my code:

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
soup = BeautifulSoup(html)
imgs = soup.findAll('img')

print imgs[0].attrs

It prints [(u'onload', u'javascript:if(this.width>950) this.width=950')]

Where is the src attribute?

If I replace html with something like html = '''<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'''

I get the correct result: [(u'src', u'/image/fluffybunny.jpg'), (u'title', u'Harvey the bunny'), (u'alt', u'a cute little fluffy bunny')]

I am completely new to HTML and BeautifulSoup. Am I missing some background knowledge? Thanks for any ideas.

score 8 · Accepted Answer

I tested this with both version 3 and version 4 of BeautifulSoup and found that bs4 (version 4) seems to repair your HTML better than version 3 does.

With BeautifulSoup 3:

>>> html = """<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">"""
>>> soup = BeautifulSoup(html) # Version 3 of BeautifulSoup
>>> print soup
<img onload="javascript:if(this.width&gt;950) this.width=950" />950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"&gt;

Notice how the > is now &gt; and some bits are out of place.

Also, when you call BeautifulSoup(), the markup gets split up. If you print soup.img, you get:

<img onload="javascript:if(this.width&gt;950) this.width=950" />

So everything after the first > in the onload value, including src, is no longer part of the img tag.

With bs4 (BeautifulSoup 4, the current version):

>>> html = '''<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
>>> soup = BeautifulSoup(html) 
>>> print soup
<html><body><img onload="javascript:if(this.width&gt;950) this.width=950" src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"/></body></html>

Now for .attrs: in BeautifulSoup 3 it returns a list of tuples, as you have discovered. In BeautifulSoup 4 it returns a dictionary:

>>> print soup.findAll('img')[0].attrs # Version 3
[(u'onload', u'javascript:if(this.width>950) this.width=950')]

>>> print soup.findAll('img')[0].attrs # Version 4
{'onload': 'javascript:if(this.width>950) this.width=950', 'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg'}
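
Since the two versions disagree on the type of .attrs, code that has to cope with either can normalize the value first. A minimal sketch, assuming a hypothetical helper named attrs_as_dict (it is not part of BeautifulSoup); it relies only on the fact that dict() accepts both a list of pairs and a mapping:

```python
def attrs_as_dict(attrs):
    """Return tag attributes as a plain dict.

    Hypothetical helper, not part of BeautifulSoup. Accepts either the
    list of (name, value) tuples shown for version 3 or the dict shown
    for version 4.
    """
    return dict(attrs)


# Attribute shapes copied from the two outputs above.
v3_attrs = [(u'onload', u'javascript:if(this.width>950) this.width=950')]
v4_attrs = {
    'onload': 'javascript:if(this.width>950) this.width=950',
    'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg',
}

print(attrs_as_dict(v3_attrs).get('src'))  # None: version 3 never saw src
print(attrs_as_dict(v4_attrs).get('src'))
```

This only smooths over the type difference. It cannot bring back the src that version 3 already dropped while parsing.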

So what should you do? Get BeautifulSoup 4. It parses this HTML much better.

By the way, if all you want is the src, you don't need to call .attrs:

>>> print soup.findAll('img')[0].get('src')
http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg
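
For reference, the same extraction on Python 3 with BeautifulSoup 4 might look like the sketch below. It assumes the bs4 package is installed and passes the standard-library "html.parser" explicitly. The answer above does not name a parser, so that choice is an assumption. find_all is the bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''

# Name the parser explicitly so the result does not depend on which
# optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html, "html.parser")

# .get() returns None instead of raising if an <img> has no src.
srcs = [img.get("src") for img in soup.find_all("img")]
print(srcs)
```

Using .get("src") rather than img["src"] keeps the loop from raising KeyError on an img tag that lacks the attribute.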
