python - Extract the img inside the description tag of an XML file using BeautifulSoup

Question

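I am parsing an XML file and want to get the image inside the description tag. I am using urllib and BeautifulSoup. I can get images that sit in another tag, but I cannot get the image inside the description tag, where the markup is entity-encoded.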

XML code

<item>
         <title>Kidnapped NDC member and political activist tells his story</title>
         <link>http://www.yementimes.com/en/1724/news/3065</link>
         <description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
‘I kept telling them that they would never break me and that the change we demanded in 2011 would come whether they wanted it or not’
&lt;br clear="all"&gt;</description>

views.py

for q in b.findAll('item'):
    d={}
    d['desc']=strip_tags(q.description.string).strip('&nbsp')
    if q.guid:
        d['link']=q.guid.string
    else:
        d['link']=strip_tags(q.comments)
    d['title']=q.title.string
    for r in q.findAll('enclosure'):
        d['image']=r['url']
    arr.append(d)

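Can anyone give me an idea of how to do this?
The code above is what I wrote to parse an image that sits inside another tag. I tried to get the image when it is inside the description, but I could not.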

score 0 · Accepted Answer

You can extract all the content from <description>, use it to create a new BeautifulSoup object, and then look up the src attribute of the first <img> element.

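from bs4 import BeautifulSoup
import sys
import html

# Parse the XML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for i in soup.find_all('item'):
    # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9.
    # The unescaped description is then parsed as HTML in its own right.
    d = BeautifulSoup(html.unescape(i.description.string), 'html.parser')
    print(d.img['src'])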

Run it like this:

python3 script.py xmlfile

This produces the following result:

http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg
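
If some items have no image in their description, d.img is None and d.img['src'] raises a TypeError. Below is a minimal sketch of a more defensive variant, assuming the same feed layout as in the question. The helper name description_image and the second, image-less item in SAMPLE are made up for illustration and are not part of the original feed.

```python
import html

from bs4 import BeautifulSoup

# Inline sample modelled on the question's feed. The second item is a
# hypothetical addition that shows the "no image" case.
SAMPLE = """<item>
<title>Kidnapped NDC member and political activist tells his story</title>
<description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
Some text
&lt;br clear="all"&gt;</description>
</item>
<item>
<title>No image here</title>
<description>Plain text only</description>
</item>"""


def description_image(item):
    """Return the src of the first <img> in an item's description, or None."""
    desc = item.find('description')
    if desc is None:
        return None
    # Re-parse the entity-decoded description text as its own HTML fragment.
    inner = BeautifulSoup(html.unescape(desc.get_text()), 'html.parser')
    img = inner.find('img')
    return img.get('src') if img else None


soup = BeautifulSoup(SAMPLE, 'html.parser')
images = [description_image(item) for item in soup.find_all('item')]
print(images)
```

In the question's loop, the same helper could be used as d['image'] = description_image(q), which leaves d['image'] as None for items whose description has no image.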
