python - Parse and modify HTML with BeautifulSoup or lxml: wrap text that sits directly under a tag in an HTML tag

Question

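I'm a beginner working with Python 2.7. I want to parse and modify an HTML file. I'm using Beautiful Soup for this, and lxml is also an option. My question is whether I can modify the HTML so that bare text gets wrapped in an HTML tag. The text in question sits directly under the "body" tag. Whatever text is directly under body, I want to change the HTML so that the text ends up inside a tag of my choosing. That way I can parse the result and locate the text easily.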

<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br />
<b>Price</b>
$117.80<br />
<b>You Save:</b>
$32.20(21%)<br />
<font size="-1" color="#009900">In Stock</font>
<br />
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br />
Gift-wrap available.<br /></body></html>

In this example, I want to wrap the text "$117.80" and "$32.20" in a user-defined HTML tag. How can I achieve this with BeautifulSoup or lxml?

score 0 · Accepted Answer

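I think what you need to wrap is the tail text. In lxml, text that follows an element's closing tag is stored in that element's tail attribute, and in my view lxml is better suited to handling it than BeautifulSoup. The following script looks for any element that has tail text, creates a new <div> tag (choose your own), and moves the text into it. It uses a regular expression to check that the text looks like a price, so the tail text "Ships from and sold by Amazon.com" and "Gift-wrap available." is skipped.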

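from lxml import etree
import re

# Parse the file (the sample markup is well-formed, so the XML parser accepts it).
tree = etree.parse('htmlfile')
root = tree.getroot()

# Matches text that looks like a price, e.g. "$117.80".
price = re.compile(r'\$\d+')

for elem in root.iter('*'):
    # elem.tail is the text that follows elem's closing tag.
    if elem.tail is not None and elem.tail.strip() and price.search(elem.tail):
        # Move the tail text into a new <div> placed right after elem.
        e = etree.Element('div')
        e.text = elem.tail
        elem.tail = ''
        elem.addnext(e)

print(etree.tostring(root))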

This gives the following result:

<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br/>
<b>Price</b><div>
$117.80</div><br/>
<b>You Save:</b><div>
$32.20(21%)</div><br/>
<font size="-1" color="#009900">In Stock</font>
<br/>
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br/>
Gift-wrap available.<br/></body></html>
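
Since the question also mentions BeautifulSoup, here is a minimal sketch of the same idea with bs4. It assumes the bs4 package is installed and uses Python's built-in html.parser. The shortened sample markup and the names price and wrapped are illustrative only and are not part of the original answer.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the markup from the question (illustrative).
html = """<html><body>
<b>Price</b>
$117.80<br />
<b>You Save:</b>
$32.20(21%)<br />
Ships from and sold by Amazon.com<br />
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Same price check as in the lxml script.
price = re.compile(r"\$\d+")

# Snapshot the direct children first, because wrapping modifies the tree.
for node in list(soup.body.children):
    # Only bare text nodes directly under <body> are candidates.
    if isinstance(node, NavigableString) and price.search(node):
        # wrap() replaces the text node with the new tag and puts the text inside it.
        node.wrap(soup.new_tag("div"))

# Collect the wrapped text to confirm which strings were picked up.
wrapped = [div.get_text(strip=True) for div in soup.body.find_all("div")]
print(wrapped)
```

Replace "div" with whichever tag name you want to search for later. Text that does not match the price pattern, such as the shipping line, is left where it was.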
