python - Are there pitfalls in my code for detecting a text file's encoding with Python?

Question

I know more about bicycle repair, chainsaw use and trench safety than I do about Python or text encoding. With that in mind...

Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with different encodings with Python?, and others I have read: 1, 2). My own attempt at guessing the encoding is below.

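In limited testing, this code seems to work for my purposes* without my having to know much about the first three bytes of a text encoding, or about the situations where those bytes are not informative.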

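*My purposes are:

To have a dependency-free snippet I can use with a moderate-to-high success rate.
To scan a local workstation for text-based log files in any encoding and identify them as files of interest based on their contents (which requires opening each file with the proper encoding).
For the challenge of getting this to work.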
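For context, the "first three bytes" mentioned above refer to a byte order mark (BOM): the UTF-8 BOM is three bytes long, while the UTF-16 and UTF-32 marks are two and four bytes. A minimal sketch of a BOM check using only the standard library follows. The names `BOMS` and `sniff_bom` are made up for this illustration and are not part of the code in the question. When a file has no BOM the function returns None, which is exactly the uninformative case.

```python
import codecs

# The UTF-32 marks are tested before the UTF-16 marks because
# BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]


def sniff_bom(raw):
    """Return a codec name if the raw bytes start with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


# The utf_16 and utf_32 codecs write a BOM when encoding.
with_bom = u"GPS log".encode("utf_16")
without_bom = b"GPS log"
```

Here `sniff_bom(with_bom)` reports `utf_16`, while `sniff_bom(without_bom)` returns None and some other heuristic has to take over.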

Question: What are the pitfalls of using what I assume to be a klutzy method of comparing and counting characters, as I do below? Any input is greatly appreciated.

def guess_encoding_debug(file_path):
    """
    DEBUG - returns many 2 value tuples
    Will return list of all possible text encodings with a count of the number of chars
    read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """

    import codecs
    import string
    from operator import itemgetter

    READ_LEN = 1000
    ENCODINGS = ['ascii','cp1252','mac_roman','utf_8','utf_16','utf_16_le',\
                 'utf_16_be','utf_32','utf_32_le','utf_32_be']

    #chars in the regular ascii printable set are BY FAR the most common
    #in most files written in English, so their presence suggests the file
    #was decoded correctly.
    nonsuspect_chars = string.printable

    #to be a list of 2 value tuples
    results = []

    for e in ENCODINGS:
        #some encodings will cause an exception with an incompatible file,
        #they are invalid encoding, so use try to exclude them from results[]
        try:
            with codecs.open(file_path, 'r', e) as f:

                #sample from the beginning of the file
                data = f.read(READ_LEN)

                nonsuspect_sum = 0

                #count the number of printable ascii chars in the
                #READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)

                #if there are more chars than READ_LEN
                #the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
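                    results.append((e, nonsuspect_sum))
        except (UnicodeError, LookupError, IOError):
            #decoding failed, the codec name is unknown, or the file is unreadable
            pass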

    #sort results descending based on nonsuspect_sum portion of
    #tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)

    return results


def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
             outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """

    results = guess_encoding_debug(file_path)

    #no candidate encoding could decode the sample
    if not results:
        return None

    #return the encoding name (index 0 of the tuple) from the first,
    #highest-scoring entry in the descending list
    return results[0][0]

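I suspect it is slow compared to chardet, which I am not particularly familiar with. It is also less accurate. By design, it will not work well for Roman-alphabet languages that use accents, umlauts and the like, and it will be hard to know when it fails. However, most English text, including most programming code, is written largely with the characters in string.printable, which this code relies on.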
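One concrete form of the "hard to know when it fails" problem can be sketched as follows. This is a simplified stand-in that scores raw bytes rather than files; the helper `score` and the sample strings are assumptions made for illustration, not the code from the question. For pure ASCII input every ASCII-compatible codec gets the same score, and for accented cp1252 input cp1252 and mac_roman tie even though they decode to different text.

```python
import string

ENCODINGS = ['ascii', 'cp1252', 'mac_roman', 'utf_8']
PRINTABLE = set(string.printable)


def score(raw, encoding):
    """Count printable-ASCII characters after decoding, or None on failure."""
    try:
        text = raw.decode(encoding)
    except UnicodeError:
        return None
    return sum(1 for ch in text if ch in PRINTABLE)


# Hypothetical samples: one pure-ASCII log line, one accented line in cp1252.
ascii_log = b"2011-05-01 12:00:00 GPS fix acquired\n"
accented = u"caf\u00e9 na\u00efve\n".encode("cp1252")

ascii_scores = [(e, score(ascii_log, e)) for e in ENCODINGS]
accented_scores = [(e, score(accented, e)) for e in ENCODINGS]
```

In `ascii_scores` all four codecs tie, so the "winner" is decided only by the order of the encoding list, because Python's sort is stable. In `accented_scores` ascii and utf_8 fail to decode, while cp1252 and mac_roman get the same count; the printable-character count cannot tell which of the two produced the right accented letters.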

External libraries may become an option in the future, but for now I want to avoid them for the following reasons:

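This script will be run on multiple company computers, both on and off the network, with varying versions of Python, so the fewer complications the better. When I say "company" I mean a small non-profit organization of social scientists.
I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator. She is not a Python programmer, and the less of her time I take up the better.
The Python installation generally available at my company is the one installed with a GIS software package.
My requirements are not very strict. I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents into memory to manipulate, append to, or rewrite them.
It seems like a high-level programming language should have some way of accomplishing this on its own. "Seems like" is a shaky foundation for any endeavor, but I wanted to see whether I could get it to work.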

score 0 · Accepted Answer

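Probably the simplest way to find out how well your code works is to take the test suites of other existing libraries and use them as a base for building your own comprehensive test suite. That will tell you whether your code works for all of those cases, and you can also test every case you care about.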
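As a rough sketch of what such a home-grown test harness could look like, the snippet below writes a known text to temporary files in several encodings and checks whether the detector's guess decodes each file back to the original. The names `check_detector` and `always_utf8` are hypothetical; `always_utf8` is a trivial stand-in detector, and a function such as `guess_encoding` from the question could be passed in its place. Success is judged by round-tripping the text rather than by comparing codec names, since pure ASCII text is decoded correctly by several codecs.

```python
import io
import os
import tempfile


def check_detector(detector, text, encodings):
    """Write text in each encoding, run detector(path), and report round-trips.

    Returns a dict mapping the true encoding to True when the guessed
    encoding decodes the file back to the original text.
    """
    outcome = {}
    for enc in encodings:
        fd, path = tempfile.mkstemp(suffix=".log")
        os.close(fd)
        try:
            # newline="" disables newline translation in both directions
            with io.open(path, "w", encoding=enc, newline="") as f:
                f.write(text)
            guess = detector(path)
            try:
                with io.open(path, "r", encoding=guess, newline="") as f:
                    outcome[enc] = (f.read() == text)
            except (UnicodeError, LookupError):
                outcome[enc] = False
        finally:
            os.remove(path)
    return outcome


def always_utf8(path):
    """Stand-in detector for the demonstration: ignores the file entirely."""
    return "utf_8"


report = check_detector(always_utf8, u"GPS fix acquired\n",
                        ["ascii", "utf_8", "utf_16", "cp1252"])
```

With this stand-in, `report` marks the utf_16 file as a failure and the other three as successes. Adding samples with accented characters, or files taken from another library's test data, would extend the same loop.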
