python - Handling variable special characters in Python from a pulled HTML page

Question

I use Mechanize to fetch HTML data and store it in a variable; let's call it "HTML_RESPONSE". Once that is done, I parse it and extract three things: the title, a short description, and a long description.

The problem I am facing is that the short description or the long description may contain characters such as &, £, $, and so on.

The trouble starts when I try to put these into XML and save it, because Python crashes when I try to decode them.

For example, the short description of one page is:

S_DESC = "Senior VP of Treasury and Corporate Finance & ERM, 
RTL Group, has been invited to the above conference to present a Case Study 
on Integrating Strategy and Risk into Enterprise Risk Management"

This is how I am decoding it:

#!/usr/bin/python
# -*- coding: ISO-8859-1 -*-

print S_DESC.decode('UTF-8').encode('ascii','xmlcharrefreplace')

This works fine for the ampersand. But when I then get an S_DESC containing a pound sign, the script breaks with the following output:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

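This code fails in one part of the script (every time I get a pound sign, the exception above is thrown on the last line). I would like to know whether there is a universal way to tell Python to handle these characters on its own. Writing 100 functions, one for each potentially incompatible character, is not an option. Likewise, I am not prepared to sift through the entire website (2k+ articles) to identify every special character that could "break" my code.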

XML = """
    <MAIN>
        <ITEM>
            <Author>{0}</Author>
            <Author_UN>{1}</Author_UN>
            <Date_Modified>{2}</Date_Modified>
            <Date_Published>{3}</Date_Published>
            <Default_Group_Rights>
                {4}
            </Default_Group_Rights>
            <attachment>
                <file_name>{5}</file_name>
                <file_extension>{6}</file_extension>
                <file_stored_local>{7}</file_stored_local>
            </attachment>
            <title>{8}</title>
            <sm_desc>{9}</sm_desc>
            <lg_desc>
                <![CDATA[
                {10}
                ]]>
            </lg_desc>
        </ITEM>
    </MAIN>""".format(author_soup,  username,  date_modified,  published_date,  xrights,  attachment_text,  file_extension,  localstore,  item_title.decode('UTF-8').encode('ascii','xmlcharrefreplace'), short_description.decode('UTF-8').encode('ascii','xmlcharrefreplace'),  long_description.decode('UTF-8').encode('ascii','xmlcharrefreplace'))
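
One possible way to treat every field uniformly, offered as a sketch rather than a confirmed fix: decode the bytes to text once, escape the XML markup characters, and only then replace non-ASCII characters with numeric character references. The helper name `to_xml_text` and the default `utf-8` source encoding are assumptions for illustration; `xml.sax.saxutils.escape` is part of the standard library.

```python
# -*- coding: utf-8 -*-
from xml.sax.saxutils import escape


def to_xml_text(raw, encoding="utf-8"):
    """Return ASCII-only text that is safe to place inside an XML element.

    `raw` may be a byte string (decoded with `encoding`) or already-decoded text.
    """
    # Decode only when we were handed bytes; text passes through unchanged.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # escape() rewrites &, < and > as entities.
    escaped = escape(text)
    # xmlcharrefreplace turns every non-ASCII character into &#NNN; form.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_xml_text(u"\u00a316 Billion & counting"))
```

With a helper like this, each value would be converted before it is passed to `.format()`, so no per-character handling is needed. Content placed inside a CDATA section should not be escaped this way, since entities are not expanded there.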

[Edit]

Here is sample code I wrote that reproduces the error exactly:

# -*- coding: UTF-8 -*-
# TESTING GROUND

author_soup = "John Smith"
username = "jsmith"
date_modified = "25 December 2012, 15:42 PM"
published_date = "25 December 2012, 15:42 PM"
xrights = "r-w-x-x"
attachment_text = "Random Attachment"
file_extension = "txt"
localstore = "../Local"
item_title = "The NEw Financial Reforms of 2012"
short_description = " £16 Billion Spent on new reforms backfire."
long_description = '[<p>fullstory</p>, <p><a class="external-link" href="http://business.timesonline.co.uk/tol/business/industry_sectors/banking_and_finance/article4526065.ece">http://business.timesonline.co.uk/tol/business/industry_sectors/banking_and_finance/article4526065.ece</a></p>]'

XML = """
<MAIN>
    <ITEM>
        <Author>{0}</Author>
        <Author_UN>{1}</Author_UN>
        <Date_Modified>{2}</Date_Modified>
        <Date_Published>{3}</Date_Published>
        <Default_Group_Rights>
            {4}
        </Default_Group_Rights>
        <attachment>
            <file_name>{5}</file_name>
            <file_extension>{6}</file_extension>
            <file_stored_local>{7}</file_stored_local>
        </attachment>
        <title>{8}</title>
        <sm_desc>{9}</sm_desc>
        <lg_desc>
            <![CDATA[
            {10}
            ]]>
        </lg_desc>
    </ITEM>
</MAIN>""".format(author_soup,  username,  date_modified,  published_date,  xrights,  attachment_text,  file_extension,  localstore,  item_title.decode('UTF-8'), short_description.decode('UTF-8'),  long_description.decode('UTF-8'))
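
A possible explanation for the failure in this sample, stated as an assumption rather than a confirmed diagnosis: in Python 2, calling `.format()` on a byte-string template with `unicode` arguments makes Python encode those arguments with the default ASCII codec, which cannot represent `u'\xa3'`. A minimal sketch of one way around this keeps the template itself as text and encodes once at the very end. The shortened template and the names `TEMPLATE`, `xml_text`, and `xml_bytes` are illustrative only.

```python
# -*- coding: utf-8 -*-
from xml.sax.saxutils import escape

# Text (unicode) template, so no implicit ASCII encoding happens in format().
TEMPLATE = u"<sm_desc>{0}</sm_desc>"

# Already-decoded text; \xa3 is the pound sign.
short_description = u" \xa316 Billion Spent on new reforms backfire."

# Escape markup characters, then substitute into the text template.
xml_text = TEMPLATE.format(escape(short_description))

# Encode once, replacing non-ASCII characters with numeric references.
xml_bytes = xml_text.encode("ascii", "xmlcharrefreplace")

print(xml_bytes.decode("ascii"))
```

The resulting bytes are plain ASCII, so they can be written to a file without further encoding decisions.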
