I have the following HTML:

<div class="dialog">
<div class="title title-with-sort-row">
    <h2>Description</h2>
    <div class="dialog-search-sort-bar">
    </div>
</div>
<div class="content"><div style="margin-right: 20px; margin-left: 30px;">
    <span class="description2">
        With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
        She is made available under a Creative Commons License that gives endless opportunities for further development. 
        This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
        The result is a figure that has very good bending and morphing behavior.
        <br />
    </span>
</div>
</div>

I need to find this div among several div class="dialog" elements and pull out the text of the span class="description2".

When I use this code:

description = soup.find(text = re.compile('Description'))
if description != None:
    someEl = description.parent
    parent1 = someEl.parent
    parent2 = parent1.parent
    description = parent2.find('span', {'class' : 'description2'})
    print 'Description: ' + str(description)

I get:

<span class="description2">
    With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior.
    <br/>
</span>

When I try to get just the text, without the HTML and the non-ASCII characters, using:

description = description.get_text()

I get (UnicodeEncodeError): 'ascii' codec can't encode character u'\x93'

How can I convert this HTML block to straight ASCII?

1 Answer

#!/usr/bin/env python
# -*- coding: utf-8 -*-

foo = u'With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.'

print foo.encode('ascii', 'ignore')

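Three things to note.

The first is the 'ignore' parameter to the encode method. It instructs the method to drop any characters that fall outside the range of the chosen encoding (in this case ascii, to be safe).

The second is the u prefix at the start of the string, which explicitly makes foo a unicode string.

The third is the explicit file encoding directive: # -*- coding: utf-8 -*-.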
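
The effect of the error handler passed to encode can be sketched roughly as follows. This is only an illustration, written so that it should run on both Python 2 and Python 3; the shortened sample string and the variable names are assumptions made for the example, and the \u201c, \u201d and \u2013 escapes stand in for the curly quotes and the en dash.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Shortened sample text (an assumption for this sketch).
# \u201c / \u201d are curly quotes, \u2013 is an en dash.
sample = u'With \u201cAntonia Polygon \u2013 Standard\u201d, you have a figure'

# With no second argument the handler is 'strict', which raises
# UnicodeEncodeError on the first character ASCII cannot represent.
try:
    sample.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops every character that ASCII cannot represent.
ascii_bytes = sample.encode('ascii', 'ignore')

print(strict_failed)
print(ascii_bytes.decode('ascii'))
```

Note that the quotes and the dash simply vanish, so a double space is left where the dash used to be.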

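Also, do read Daenyth's very good point in the comments attached to this answer; it would be silly to skip it. If the output is going to be used in HTML/XML, you can use xmlcharrefreplace in place of ignore above.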
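
As a rough comparison of the two handlers, the sketch below encodes the same short string with each one. The sample string and the variable names are made up for the illustration, and the code is meant to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Short sample (an assumption for this sketch): curly quotes and an en dash.
text = u'\u201cAntonia Polygon \u2013 Standard\u201d'

# 'ignore' throws the non-ASCII characters away.
dropped = text.encode('ascii', 'ignore')

# 'xmlcharrefreplace' keeps them as numeric character references,
# which an HTML or XML consumer can turn back into the original characters.
escaped = text.encode('ascii', 'xmlcharrefreplace')

print(dropped.decode('ascii'))
print(escaped.decode('ascii'))
```

Both results are pure ASCII, but only the second one preserves the information carried by the original characters.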

answered 2012-05-07T12:31:04.390