python - Unicode decode error when trying to use Tkinter (Python)

Question

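I made a simple program that reads a file, asks the user to enter a word, and then reports how many times that word is used. I want to improve it so that I don't have to type in the exact directory every time. I imported Tkinter and used the code fileName= filedialog.askopenfilename() so that a box pops up and lets me choose the file. Every time I try to use it, though, I get the following error...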

Traceback (most recent call last):
  File "/Users/AshleyStallings/Documents/School Work/Computer Programming/Side Projects/How many? (Python).py", line 24, in <module>
    for line in fileScan.read().split():   #reads a line of the file and stores
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 12: ordinal not in range(128)

The only time I don't get this error is when I try to open a .txt file. But I also want to open .docx files. Thanks in advance for your help :)

# Name: Ashley Stallings
# Program description: Asks user to input a word to search for in a specified
# file and then tells how many times it's used.
from tkinter import filedialog

print ("Hello! Welcome to the 'How Many' program.")
fileName= filedialog.askopenfilename()  #Gets file name


cont = "Yes"

while cont == "Yes":
    word=input("Please enter the word you would like to scan for. ") #Asks for word
    capitalized= word.capitalize()  
    lowercase= word.lower()
    accumulator = 0

    print ("\n")
    print ("\n")        #making it pretty
    print ("Searching...")

    fileScan= open(fileName, 'r')  #Opens file

    for line in fileScan.read().split():   #reads a line of the file and stores
        line=line.rstrip("\n")
        if line == capitalized or line == lowercase:
            accumulator += 1
    fileScan.close()  # close() must be called; without the parentheses this line does nothing

    print ("The word", word, "is in the file", accumulator, "times.")

    cont = input ('Type "Yes" to check for another word or \
"No" to quit. ')  #deciding next step
    cont = cont.capitalize()

    if cont != "No" and cont != "Yes":
        print ("Invalid input!")

print ("\n")
print ("Thanks for using How Many!")  #ending

PS: I don't know if it matters, but I'm running OS X.

score 3 · Accepted Answer

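> The only time I don't get this error is when I try to open a .txt file. But I also want to open .docx files.

A docx file is not just a text file. It is an Office Open XML file: a zip file containing an XML document and other supporting files. Trying to read it as a text file isn't going to work.

For example, the first 4 bytes of the file are going to be: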

b'PK\x03\x04'

You can't interpret this as UTF-8, ASCII, or anything else without getting a bunch of garbage. You're certainly not going to find your words in it.
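
As a rough illustration of the point above, the sketch below builds a tiny zip archive in memory (a stand-in, not a valid Word document), looks at its first 4 bytes, and shows that decoding zip-like bytes as ASCII fails. The archive contents and the sample byte string are made up for the demonstration; only the b'PK\x03\x04' signature and the 0x8e byte come from the post.

```python
import io
import zipfile

# Build a tiny zip archive in memory so the example needs no real .docx file.
# (The member contents are placeholders, not a valid Word document.)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('word/document.xml', '<doc/>')
data = buffer.getvalue()

# A non-empty zip archive (and therefore a .docx) starts with this signature.
magic = data[:4]
print(magic)

# zipfile.is_zipfile accepts a path or a file-like object.
looks_like_zip = zipfile.is_zipfile(io.BytesIO(data))
print(looks_like_zip)

# Decoding raw bytes as ASCII fails as soon as a byte >= 0x80 appears,
# which is what the traceback in the question reports for byte 0x8e.
sample = b'PK\x03\x04\x8e'
try:
    sample.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

A check like zipfile.is_zipfile(fileName) before opening the file is one possible way to decide whether to read it as plain text or as an archive.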

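You can do some of the processing yourself: use zipfile to get at word/document.xml inside the archive, use an XML parser to get the text nodes, and then rejoin them so that you can split them on whitespace. For example: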

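import itertools
import zipfile
import xml.etree.ElementTree as ET

with zipfile.ZipFile('foo.docx') as z:
    # The main body of the document lives in word/document.xml
    with z.open('word/document.xml') as document:
        tree = ET.parse(document)

# The visible text is stored in w:t elements; the {...} part is the w namespace URI
textnodes = tree.findall('.//{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t')
# node.text is None for an empty element, so fall back to an empty string
text = itertools.chain.from_iterable((node.text or '').split() for node in textnodes)
for word in text:
    print(word)  # replace with your own per-word processing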

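Of course, it would be better to actually parse the xmlns declarations and register the w namespace properly so that you can just write 'w:t'. But if you know what that means, you already know that, and if you don't, this isn't the place for a tutorial on XML namespaces and ElementTree.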
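
For readers who want to see the 'w:t' form, here is a minimal sketch that passes a prefix-to-URI mapping to findall instead of spelling out the full namespace in braces, and applies it to the word-counting task from the question. The helper name count_word and the in-memory sample document are hypothetical and exist only for illustration. The comparison is case-insensitive, which is slightly broader than the capitalized/lowercase check in the original program, and punctuation attached to a word is not stripped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the URI is the one quoted in the answer.
NS = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def count_word(docx_file, word):
    """Count case-insensitive occurrences of word in a .docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as z:
        with z.open('word/document.xml') as document:
            tree = ET.parse(document)
    target = word.lower()
    total = 0
    for node in tree.findall('.//w:t', NS):
        # node.text is None for an empty <w:t/> element
        for token in (node.text or '').split():
            if token.lower() == target:
                total += 1
    return total

# Stand-in document built in memory for demonstration; not a complete .docx.
xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Hello world</w:t></w:r>'
    '<w:r><w:t>hello again</w:t></w:r></w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('word/document.xml', xml)
buffer.seek(0)

hello_count = count_word(buffer, 'hello')
print(hello_count)
```

With a real file, the call would look like count_word(fileName, word), where fileName is the path returned by the file dialog.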

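So, how are you supposed to know that it's a zip file full of files, that the actual text is in the file word/document.xml, that the actual text within that file is in .//w:t nodes, that the namespace w is mapped to http://schemas.openxmlformats.org/wordprocessingml/2006/main, and so on? You can read all of the relevant documentation and work it out with the help of a few sample files. But if you haven't done that yet, there's a big learning curve ahead of you.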
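
One way to start exploring a sample file is simply to list what is inside the archive. The sketch below does this on a stand-in archive built in memory. The member names are ones a .docx typically contains, but the contents here are placeholders. With a real document you would pass its path to zipfile.ZipFile instead.

```python
import io
import zipfile

# Stand-in archive with a few member names typical of a .docx.
# The contents are placeholders, not real WordprocessingML.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('[Content_Types].xml', '<Types/>')
    z.writestr('word/document.xml', '<document/>')
    z.writestr('word/styles.xml', '<styles/>')

with zipfile.ZipFile(buffer) as z:
    # List every member of the archive
    names = z.namelist()
    for name in names:
        print(name)
    # Peek at the beginning of the main document part
    head = z.read('word/document.xml')[:40]
    print(head)
```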

Even if you do know what you're doing, it's probably better to search PyPI for a docx parser module and just use that.
