python - Python script receiving UnicodeEncodeError: 'ascii' codec can't encode character

Question

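I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it started failing, and I assume the cause is a formatting issue in someone's reddit title. The error I am receiving is: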

  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

And here is my script:

# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help is greatly appreciated. Thanks in advance!

score 4 · Accepted Answer

Consider this simple program:

print(u'\u201c' + "python")

If you print to a terminal (one with an appropriate character encoding), you get:

“python

But if you try to redirect the output to a file, you get a UnicodeEncodeError:

script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

When you print to a terminal, Python uses the terminal's character encoding to encode the unicode. (A terminal can only output bytes, so unicode has to be encoded before it can be printed.)

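When you redirect the output to a file, Python cannot determine a character encoding, because files have no declared encoding. So by default Python2 implicitly encodes all unicode with the ascii encoding before writing to the file. Since u'\u201c' cannot be ascii-encoded, a UnicodeEncodeError is raised. (ascii can encode only the first 128 Unicode code points, U+0000 through U+007F.)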
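The limit of the ascii codec can be checked directly. The following is a minimal sketch; the helper name can_encode_ascii is made up for illustration and is not part of the question's script.

```python
# Sketch: which text can the ascii codec encode?
# can_encode_ascii is a hypothetical helper, not from the original script.

def can_encode_ascii(text):
    """Return True if every character in text is in the ASCII range."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

left_quote = u'\u201c'  # LEFT DOUBLE QUOTATION MARK, the character in the traceback

print(can_encode_ascii(u'python'))               # only ASCII characters
print(can_encode_ascii(u'\x7f'))                 # last code point ascii can encode
print(can_encode_ascii(u'\x80'))                 # first code point ascii cannot encode
print(can_encode_ascii(left_quote + u'python'))  # fails like the redirected print

# UTF-8 can represent the same character as three bytes.
utf8_bytes = left_quote.encode('utf-8')
print(repr(utf8_bytes))
```

The same text that ascii rejects encodes without error under UTF-8, which is why choosing the encoding explicitly avoids the failure.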

This issue is explained in detail in the Why Print Fails wiki.

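To fix the problem, first avoid adding unicode and byte strings together. Doing so causes an implicit conversion using the ascii codec in Python2, and raises an exception in Python3. To future-proof your code, it is better to be explicit. For example, encode post explicitly before formatting and printing the bytes: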

url = post_dict[post]  # look up the URL first: the dict key is the original unicode title
post = post.encode('utf-8')
print('{} {} #python'.format(post, url))
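One way to make the output independent of whether stdout is a terminal or a redirected file is to encode the text yourself and write bytes. The sketch below assumes UTF-8 is the desired output encoding; write_utf8_line is a hypothetical helper, and an in-memory io.BytesIO stands in for the real output stream.

```python
# Sketch: encode explicitly, then write bytes to a byte stream.
# write_utf8_line is a hypothetical helper; UTF-8 is an assumed choice.
import io

def write_utf8_line(stream, text):
    """Encode text as UTF-8 and write it, plus a newline, to a byte stream."""
    stream.write(text.encode('utf-8') + b'\n')

buf = io.BytesIO()  # stand-in for a file opened in binary mode
write_utf8_line(buf, u'\u201cpython\u201d http://example.com #python')
print(repr(buf.getvalue()))
```

On Python2, sys.stdout accepts bytes directly; on Python3, the underlying byte stream is sys.stdout.buffer.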

score 1 · Accepted Answer

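This problem probably comes from mixing byte strings and unicode strings when concatenating. As an alternative to prefixing every string literal with u, maybe

from __future__ import unicode_literals

fixes things for you. For a more detailed explanation, see here, and decide whether it is an option for you.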
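Another way to avoid mixing the two string types is to convert everything to unicode text before concatenating and to encode only at output. This is a sketch under assumptions: to_text and build_status are hypothetical helpers, and any byte strings are assumed to be UTF-8.

```python
# Sketch: normalize inputs to unicode text before joining them.
# to_text and build_status are hypothetical; UTF-8 input is an assumption.

def to_text(value, encoding='utf-8'):
    """Return value as unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def build_status(title, url):
    """Join title and url into one unicode status line."""
    return u'{} {} #python'.format(to_text(title), to_text(url))

# A unicode title combined with a byte-string URL.
status = build_status(u'\u201cquoted\u201d title', b'http://example.com/post')
print(repr(status.encode('utf-8')))  # encode once, at the output boundary
```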
