python - Parsing Gmail emails in Python when the body contains Unicode characters

Question

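I wrote a script to parse emails. It works fine for mail received from the Mac OS X Mail client (the only one I have tested so far), but the parser fails when the body of the message contains Unicode characters.

For example, I sent a message whose content was ąčę.

Here is the part of my script that parses the body and the attachments in one pass: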

from email.feedparser import FeedParser

from django.core.files.base import ContentFile  # assumed source of ContentFile

p = FeedParser()
p.feed(msg)  # msg holds the raw message text here
msg = p.close()  # msg is now the parsed Message object
attachments = []
body = None
for part in msg.walk():
    # multipart containers only wrap other parts
    if part.get_content_type().startswith('multipart/'):
        continue
    try:
        filename = part.get_filename()
    except Exception:
        # unicode letters in filename, set default name then
        filename = 'Mail attachment'

    if part.get_content_type() == "text/plain" and not body:
        body = part.get_payload(decode=True)
    elif filename is not None:
        attachments.append(ContentFile(part.get_payload(decode=True), filename))

if body is None:
    body = ''

As I said, this works for mail sent from OS X Mail, but not for mail sent from Gmail.

Traceback:

Traceback (most recent call last):
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/core/handlers/base.py", line 116, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/views/decorators/http.py", line 41, in inner
    return func(request, *args, **kwargs)
  File "/Users/aemdy/PycharmProjects/rezervavau/bms/messages/views.py", line 66, in accept
    Message.accept(request.POST.get('msg'))
  File "/Users/aemdy/PycharmProjects/rezervavau/bms/messages/models.py", line 261, in accept
    thread=thread
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/manager.py", line 149, in create
    return self.get_query_set().create(**kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/query.py", line 391, in create
    obj.save(force_insert=True, using=self.db)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/base.py", line 532, in save
    force_update=force_update, update_fields=update_fields)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/base.py", line 627, in save_base
    result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/manager.py", line 215, in _insert
    return insert_query(self.model, objs, fields, **kwargs)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/query.py", line 1633, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 920, in execute_sql
    cursor.execute(sql, params)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/backends/util.py", line 47, in execute
    sql = self.db.ops.last_executed_query(self.cursor, sql, params)
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/operations.py", line 201, in last_executed_query
    return cursor.query.decode('utf-8')
  File "/Users/aemdy/virtualenvs/django1.5/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 115: invalid continuation byte

My script gives me the following body: ��. How can I decode it to get ąčę back?

score 6 · Accepted Answer

Well, I found a solution myself. I will run some tests now and let you know if anything fails.

I needed to decode the body a second time, using the charset declared on the part:

body = part.get_payload(decode=True).decode(part.get_content_charset())
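
One caveat with the line above: `get_content_charset()` returns `None` when the part declares no charset, and `.decode(None)` then raises. Below is a minimal sketch of a more defensive variant, assuming Python 3 (the question itself uses Python 2.7). The helper name `decode_text_part`, the `fallback` parameter, and the sample message are hypothetical. The sample uses ISO-8859-13, where ą is byte 0xE0, which is consistent with the byte in the traceback but is only a guess at what Gmail actually sent.

```python
from email import message_from_bytes


def decode_text_part(part, fallback="utf-8"):
    """Return the part's payload as text, tolerating a missing or unknown charset."""
    # get_payload(decode=True) undoes base64 / quoted-printable and returns bytes
    raw = part.get_payload(decode=True)
    if raw is None:  # multipart containers have no payload of their own
        return ""
    charset = part.get_content_charset() or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # charset name that Python does not recognise
        return raw.decode(fallback, errors="replace")


# Hypothetical sample: "ąčę" in ISO-8859-13, quoted-printable encoded.
sample = message_from_bytes(
    b"Content-Type: text/plain; charset=ISO-8859-13\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=E0=E8=E6\r\n"
)
decoded = decode_text_part(sample).strip()
print(decoded)
```

Using `errors="replace"` trades exactness for robustness: undecodable bytes become U+FFFD instead of raising, which may or may not be what you want before saving to a database.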

score 1 · Accepted Answer

You might want to try using this:

from email.iterators import typed_subpart_iterator


def get_charset(message, default="ascii"):
    """Get the message charset"""

    if message.get_content_charset():
        return message.get_content_charset()

    if message.get_charset():
        # get_charset() returns a Charset object; str() gives its name
        return str(message.get_charset())

    return default

def get_body(message):
    """Get the body of the email message"""

    if message.is_multipart():
        #get the plain text version only
        text_parts = [part
                      for part in typed_subpart_iterator(message,
                                                         'text',
                                                         'plain')]
        body = []
        for part in text_parts:
            charset = get_charset(part, get_charset(message))
            body.append(unicode(part.get_payload(decode=True),
                                charset,
                                "replace"))

        return u"\n".join(body).strip()

    else: # if it is not multipart, the payload will be a string
          # representing the message body
        body = unicode(message.get_payload(decode=True),
                       get_charset(message),
                       "replace")
        return body.strip()
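
For comparison, here is a rough sketch of the same idea on Python 3.6 or later (an assumption; the code above targets Python 2). With `policy.default` the parser returns an `EmailMessage`, whose `get_body()` picks the preferred body part and whose `get_content()` decodes it using the declared charset. The function name `get_plain_body` and the sample message are hypothetical.

```python
from email import message_from_bytes, policy


def get_plain_body(raw_bytes):
    """Return the text/plain body of a raw message as str ('' if there is none)."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    # get_content() decodes with the part's declared charset;
    # an unknown charset name would raise LookupError here.
    return part.get_content().strip()


# Hypothetical multipart/alternative sample with an ISO-8859-13 text part.
raw_message = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-13\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=E0=E8=E6\r\n"
    b"--b1\r\n"
    b"Content-Type: text/html; charset=ISO-8859-13\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>=E0=E8=E6</p>\r\n"
    b"--b1--\r\n"
)
plain_body = get_plain_body(raw_message)
print(plain_body)
```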

score 0 · Accepted Answer

You might want to take a look at email.iterators (though I'm not sure whether it will solve your encoding problem).
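
To illustrate what `email.iterators` does and does not do, here is a small sketch, assuming Python 3. `typed_subpart_iterator` only selects the matching parts; it does not decode anything, so the declared charset still has to be applied to each payload. The helper name `describe_plain_parts` and the sample message are hypothetical.

```python
from email import message_from_bytes
from email.iterators import typed_subpart_iterator


def describe_plain_parts(msg):
    """List (content type, declared charset) for every text/plain part."""
    return [
        (part.get_content_type(), part.get_content_charset())
        for part in typed_subpart_iterator(msg, "text", "plain")
    ]


# Hypothetical two-part sample message.
multipart_sample = message_from_bytes(
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-13\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=E0=E8=E6\r\n"
    b"--b1\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"AAEC\r\n"
    b"--b1--\r\n"
)
parts = describe_plain_parts(multipart_sample)
print(parts)
```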
