python - Exporting Gmail using Python imaplib - text mangled by a line-break issue

Question

I'm using the following code to export all the emails in a specific Gmail folder.

It works well in that it pulls out all the emails I expect, but it (or I) seems to be mangling the encoding of the CR / line feeds.

Code:

# Python 2 code: fetch every message in a Gmail label and save each one as .eml
import imaplib
import email

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('myUser@gmail.com', 'myPassword')  # user / password
mail.list()
mail.select("myFolder")  # connect to folder with matching label

result, data = mail.uid('search', None, "ALL")  # search and return uids instead
uids = data[0].split()

for x, email_uid in enumerate(uids):
    result, email_data = mail.uid('fetch', email_uid, '(RFC822)')
    raw_email = email_data[0][1]
    email_message = email.message_from_string(raw_email)
    save_string = "C:\\googlemail\\boxdump\\email_" + str(x) + ".eml"  # save location
    myfile = open(save_string, 'a')
    myfile.write(str(email_message))  # write the message text, not the Message object
    myfile.close()

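My problem is that by the time it gets to the object it's littered with "=0A", which I assume are incorrectly interpreted newline or carriage return flags.

I can find it in hex, [d3 03 03 0a], but because this is not "characters" I can't find a way to get str.replace() to take the parts out. I don't actually need the newline flags.

I could convert the entire string to hex and do some sort of replace / regex, but that seems like overkill when the problem is in the encoding / reading of the source data.

What I see: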

====
CAUTION:  This email message and any attachments con= tain information that may be confidential and may be LEGALLY PRIVILEGED. If yo= u are not the intended recipient, any use, disclosure or copying of this messag= e or attachments is strictly prohibited. If you have received this email messa= ge in error please notify us immediately and erase all copies of the message an= d attachments. Thank you.
====

What I want:

====
CAUTION:  This email message and any attachments contain information that may be confidential and may be LEGALLY PRIVILEGED. If you are not the intended recipient, any use, disclosure or copying of this message or attachments is strictly prohibited. If you have received this email message in error please notify us immediately and erase all copies of the message and attachments. Thank you.
====

score 2 · Accepted Answer

What you are seeing is quoted-printable encoding.
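To see where the stray "=" signs come from, here is a minimal sketch using the standard-library quopri module. It assumes Python 3, and the sample sentence is made up for illustration. Quoted-printable wraps long lines by ending them with "=" (a soft line break), which a decoder is expected to remove again.

```python
import quopri

# Hypothetical long line, similar to the disclaimer in the question.
original = (
    b"CAUTION: This email message and any attachments contain information "
    b"that may be confidential and may be LEGALLY PRIVILEGED.\n"
)

# Encoding wraps the long line; each wrapped piece ends with "=" + newline.
encoded = quopri.encodestring(original)
print(encoded.decode("ascii"))

# Decoding joins the wrapped pieces back together.
decoded = quopri.decodestring(encoded)
print(decoded == original)
```

Reading the encoded text without decoding it is what produces fragments like "con= tain" in the output shown in the question.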

Try changing:

email_message = email.message_from_string(raw_email)

to:

email_message = str(email.message_from_string(raw_email)).decode("quoted-printable")

For more information, see the Standard Encodings section of the Python codecs module documentation.
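The str.decode("quoted-printable") call relies on Python 2 string codecs. On Python 3, a rough equivalent (a sketch, not the answerer's code) is to decode bytes with codecs.decode or quopri.decodestring. The sample bytes below are made up for illustration: "=" followed by a line break is a soft line break, and "=0A" is an encoded line feed.

```python
import codecs
import quopri

# Hypothetical quoted-printable body, for illustration only.
body = b"If yo=\r\nu are not the intended recipient, erase the messag=\r\ne.=0A"

# Both calls are bytes-to-bytes transforms and should give the same result.
via_codecs = codecs.decode(body, "quopri_codec")
via_quopri = quopri.decodestring(body)

print(via_quopri.decode("ascii"))
```

Note that this decodes a whole byte string as quoted-printable, headers included if you pass a full message, so it is usually safer to apply it to a body only.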

score 0 · Accepted Answer

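Just two additional items, after a day spent working through this pain.

1. Do the decoding at the payload level, so that you can still process email_message to get email addresses and so on out of the mail.

2. You need to decode the character set as well. I struggled with emails where HTML had been copied and pasted from web pages, or content pasted from Word documents and the like.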

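text = u''
maintype = email_message.get_content_maintype()
if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_type() == 'text/plain':
            # Python 2: undo quoted-printable, then decode with the part's charset
            charset = part.get_content_charset() or 'utf-8'
            text += part.get_payload().decode("quoted-printable").decode(charset)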

Hope this helps someone!

Dave
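For readers on Python 3, the same payload-level idea can be sketched with get_payload(decode=True), which undoes the transfer encoding and returns bytes that are then decoded with the part's charset. The helper name extract_plain_text, the UTF-8 fallback, and the sample message are assumptions for illustration, not part of the answers above.

```python
import email

def extract_plain_text(raw_bytes):
    """Return the decoded text/plain content of a raw RFC 822 message."""
    message = email.message_from_bytes(raw_bytes)
    chunks = []
    # walk() visits the message itself and every nested part.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes quoted-printable or base64 and returns bytes.
            payload = part.get_payload(decode=True)
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "".join(chunks)

# Hypothetical message, for illustration only.
sample = (
    b"From: sender@example.com\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9 con=\r\ntain\r\n"
)
print(extract_plain_text(sample))
```

Because the headers stay on the parsed message object, fields such as From remain available through the usual message["From"] lookup while only the body text is decoded.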
