python - Unwrap "a" tags from images without losing the content

Question

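I wanted to remove the "a" tags (links) from every image found. So, for performance, I build a list of all the images in the HTML, look for the wrapping tag, and simply remove the link.

I am using BeautifulSoup, but instead of removing the tag it seems to remove the content inside it, and I can't figure out what I am doing wrong.

This is what I did: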

from bs4 import BeautifulSoup

html = '''<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> <a href="http://somelink2"><img src="http://imgsrc2.jpg /></a>"  '''
soup = BeautifulSoup(html)
for img in soup.find_all('img'):
    print 'THIS IS THE BEGINING /////////////// '
    #print img.find_parent('a').unwrap()
    print img.parent.unwrap()

This gives the following output:

>>> print img.parent()
<a href="http://somelink"><img src="http://imgsrc.jpg" /></a> 
<a href="http://somelink2"><img src="http://imgsrc2.jpg /></a>

>>> print img.parent.unwrap()
<a href="http://somelink"></a> 
<a href="http://somelink2"></a>

I have also tried replaceWith and replaceWithChildren, but they don't work when I use object.parent or findParent.

I don't know what I am doing wrong. I only started with Python a few weeks ago.

score 2 · Accepted Answer

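The unwrap() function returns the tag that was removed. The tree itself has been modified correctly. Quoting the unwrap() documentation:

Like replace_with(), unwrap() returns the tag that was replaced.

In other words, it works correctly. Print the new parent of img instead of the return value of unwrap() to see that the <a> tag has indeed been removed: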

>>> from bs4 import BeautifulSoup
>>> html = '''<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> <a href="http://somelink2"><img src="http://imgsrc2.jpg /></a>"  '''
>>> soup = BeautifulSoup(html)
>>> for img in soup.find_all('img'):
...     img.parent.unwrap()
...     print img.parent
... 
<a href="http://somelink"></a>
<div> <img src="http://imgsrc.jpg"/> <a href="http://somelink2"><img src="http://imgsrc2.jpg /&gt;&lt;/a&gt;"/></a></div>
<a href="http://somelink2"></a>
<div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg /&gt;&lt;/a&gt;"/></div>

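Here Python echoes the return value of img.parent.unwrap(), followed by the output of the print statement, which shows that the parent of the <img> tag is now the <div> tag. The first print shows the other <img> tag still wrapped in its link; the second shows both <img> tags as direct children of the <div> tag.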
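The same check can be written as a script rather than an interactive session. This is only a minimal sketch under a few assumptions that are not in the question: it uses Python 3, restores the missing closing quote in the second src attribute, closes the <div>, and names "html.parser" explicitly. The list removed_tags is a hypothetical name used purely for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a well-formed variant of the question's markup
# (closing quote restored on the second src, <div> closed).
html = (
    '<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> '
    '<a href="http://somelink2"><img src="http://imgsrc2.jpg" /></a> </div>'
)

# Assumption: the parser is named explicitly so the result does not depend
# on which optional parsers happen to be installed.
soup = BeautifulSoup(html, "html.parser")

removed_tags = []  # hypothetical name: collects what unwrap() returns
for img in soup.find_all("img"):
    link = img.parent
    # Only unwrap when the direct parent really is a link.
    if link is not None and link.name == "a":
        removed_tags.append(link.unwrap())

# unwrap() handed back the detached, now empty <a> tags.
print(removed_tags)

# The images themselves are still in the tree, directly under the <div>.
parent_names = [img.parent.name for img in soup.find_all("img")]
print(parent_names)  # ['div', 'div']
print(soup)
```

The guard on link.name is a design choice rather than something the question asked for: without it, an image that is not wrapped in a link would have its actual parent (for example the <div>) unwrapped instead.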

score 1 · Accepted Answer

I'm not sure what output you are looking for. Is it this?

from bs4 import BeautifulSoup

html = '''<div> <a href="http://somelink"><img src="http://imgsrc.jpg" /></a> <a href="http://somelink2"><img src="http://imgsrc2.jpg" /></a>  '''
soup = BeautifulSoup(html)
for img in soup.find_all('img'):
    img.parent.unwrap()
print(soup)

yields

<html><body><div> <img src="http://imgsrc.jpg"/> <img src="http://imgsrc2.jpg"/></div></body></html>

score 0 · Accepted Answer

I don't use Python much, but unwrap seems to return the HTML that was removed, not the img tag you are looking for. Try calling soup.prettify() and check whether the links were removed after all.
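For what it's worth, that check could look roughly like the sketch below. It assumes Python 3, a shortened one-image version of the markup, and an explicitly named "html.parser"; none of those come from the question.

```python
from bs4 import BeautifulSoup

# Assumption: a reduced, well-formed sample with a single linked image.
html = '<div><a href="http://somelink"><img src="http://imgsrc.jpg" /></a></div>'
soup = BeautifulSoup(html, "html.parser")

# Ignore the return value of unwrap(); it is the removed <a> tag.
soup.find("img").parent.unwrap()

# Inspect the tree itself instead.
pretty = soup.prettify()
print(pretty)
```

If the link was removed, the printed markup contains the <img> tag directly inside the <div> and no <a> tag.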
