python - Python - Struggling with string encoding and accented quotes/apostrophes

Question

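I have a simple RSS feed script that fetches the content of each article and runs some simple processing on it before saving it to a database.

The problem is that after running the text through the following code, all of the accented apostrophes and quotes are removed from the text.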

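# imports this snippet needs (Python 2, BeautifulSoup 3)
import unicodedata
from BeautifulSoup import BeautifulSoup

# this is just an example string, I use feed_parser to download the feeds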
string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

text = BeautifulSoup(string)
# does some simple soup processing

string = text.renderContents()
string = string.decode('utf-8', 'ignore')
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print "".join([x for x in string if ord(x)<128])

The result is:

> <p>  </p><p>This is a sentence. This is a sentence. I'm a programmer. Im a programmer, however I dont graphic design.</p>

All of the quotes/apostrophes written as HTML entities are stripped out. How can I fix this?

score 1 · Accepted Answer

The following code works for me. You probably missed the convertEntities argument of the BeautifulSoup constructor:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

text = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES) # see the convertEntities argument
# does some simple soup processing

string = text.renderContents()
string = string.decode('utf-8')
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
# I don't know why you are doing this
#string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print string
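
For anyone on Python 3, here is a rough standard-library sketch of the same idea. It is not the code above: it swaps in html.unescape for BeautifulSoup 3's convertEntities, and the names naive_ascii, to_ascii_keeping_quotes and ASCII_PUNCTUATION are made up for illustration. It also shows why the apostrophes disappeared in the question's version: NFKD turns the non-breaking space (U+00A0) into a plain space, but the right single quotation mark (U+2019) has no ASCII decomposition, so a filter that keeps only code points below 128 drops it. Mapping the typographic quotes to ASCII first is only needed if the output really has to be ASCII.

```python
import html
import unicodedata

# Hypothetical mapping for this sketch: typographic quotes -> ASCII equivalents.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (what &#8217; decodes to)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def naive_ascii(markup):
    """Mimic the question's approach: decode entities, NFKD, drop non-ASCII."""
    text = unicodedata.normalize("NFKD", html.unescape(markup))
    return "".join(ch for ch in text if ord(ch) < 128)


def to_ascii_keeping_quotes(markup):
    """Decode entities, map curly quotes to ASCII, then drop what is left."""
    text = html.unescape(markup).translate(ASCII_PUNCTUATION)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


sample = "&#160; <p>I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"
print(naive_ascii(sample))              # apostrophes are lost
print(to_ascii_keeping_quotes(sample))  # apostrophes survive as '
```

Note that html.unescape also decodes entities such as &lt; and &amp;, so on real markup it is safer to apply it to extracted text rather than to the raw HTML. If the database can store Unicode, skipping the ASCII step entirely and keeping the decoded string is simpler.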
