python-2.7 - UnicodeDecodeError in NLTK's word_tokenize despite forcing the encoding

Question

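I first convert a PDF to plain text (I print it out and everything looks fine), and then I get a UnicodeDecodeError when I try to run word_tokenize() from NLTK.

I get that error even though I call decode('utf-8').encode('utf-8') on the plain text beforehand. In the traceback I noticed that the line of code in word_tokenize() that first raises the error is plaintext.split('\n'). That is why I tried to reproduce the error by running split('\n') on the plain text myself, but that does not raise any error.

So I understand neither what is causing the error nor how to avoid it.

Any help would be greatly appreciated! :) Could I perhaps avoid it by changing something in the pdf_to_txt file?

Here is the code I use to tokenize: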

from cStringIO import StringIO
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import os
import string
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

stopset = stopwords.words('english')
path = 'my_folder'
listing = os.listdir(path)
for infile in listing:
        text = self.convert_pdf_to_txt(path+infile)
        text = text.decode('utf-8').encode('utf-8').lower()
        print text
        splitted = text.split('\n')
        filtered_tokens = [i for i in word_tokenize(text) if i not in stopset and i not in string.punctuation]

Here is the method I call to convert from PDF to text:

def convert_pdf_to_txt(self, path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    ret = retstr.getvalue()
    retstr.close()
    return ret

Here is the traceback of the error I get:

Traceback (most recent call last):
  File "/home/iammyr/opt/workspace/task-logger/task_logger/nlp/pre_processing.py", line 65, in <module>
    obj.tokenizeStopWords()
  File "/home/iammyr/opt/workspace/task-logger/task_logger/nlp/pre_processing.py", line 29, in tokenizeStopWords
    filtered_tokens = [i for i in word_tokenize(text) if i not in stopset and i not in string.punctuation]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 93, in word_tokenize
    return [token for sent in sent_tokenize(text)
  [...]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 9: ordinal not in range(128)

Thanks a million and loads of good karma to you! ;)

score 4 · Accepted Answer

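You are turning a perfectly good Unicode string (back) into a bunch of untyped bytes. Python has no idea how to handle these, but desperately tries to apply the ASCII codec to them. Remove the .encode('utf-8') and you should be fine.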
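
A minimal sketch of that change, assuming convert_pdf_to_txt returns a UTF-8 encoded byte string (as its codec = 'utf-8' setting suggests). The helper name to_text and the sample bytes are hypothetical and not part of NLTK or pdfminer. NLTK is deliberately not imported, so the snippet runs on its own under Python 2.7 or 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not part of NLTK or pdfminer): decode once at the
# boundary and keep the text as a Unicode string from then on.

def to_text(raw, encoding='utf-8'):
    """Return a lowercased Unicode string for `raw` (bytes or Unicode)."""
    if isinstance(raw, bytes):      # in Python 2, `bytes` is an alias of `str`
        raw = raw.decode(encoding)  # bytes -> Unicode, done exactly once
    return raw.lower()              # note: no .encode('utf-8') afterwards


# Stand-in for the output of convert_pdf_to_txt(): UTF-8 bytes containing a
# non-breaking space (0xc2 0xa0), the kind of byte named in the traceback.
sample = b'Some\xc2\xa0PDF Text\n'
text = to_text(sample)

# `text` is now a Unicode string; this is the value that would be handed
# to word_tokenize(text) instead of a re-encoded byte string.
```

In the loop from the question, this amounts to replacing the decode/encode line with text = text.decode('utf-8').lower().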

See also http://nedbatchelder.com/text/unipain.html.
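
For illustration, the failure can be reproduced without NLTK. In Python 2, when a byte string meets a Unicode string in a single operation, the bytes are implicitly decoded with the ASCII codec. That is a likely explanation for why plaintext.split('\n') fails inside punkt.py but not in the question's own script: assuming that module enables unicode_literals (as NLTK 3 releases do, to my knowledge), its '\n' is a Unicode string, whereas '\n' in the script is a plain byte string. The sketch below makes the hidden decode explicit so that it behaves the same under Python 2.7 and 3; the sample bytes are made up.

```python
# -*- coding: utf-8 -*-
raw = b'some text\xc2\xa0more'  # made-up UTF-8 bytes; 0xc2 sits at index 9

# Explicit version of the decode that Python 2 performs behind the scenes
# when byte strings and Unicode strings are mixed in one operation.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc  # 'ascii' codec cannot decode byte 0xc2

# Decoding with the codec the bytes were actually written in succeeds.
decoded = raw.decode('utf-8')
```

Here error.start holds the offset of the offending byte, which corresponds to the "position" reported in the traceback message.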
