java - Refactoring automatic detection of a file's encoding

Question

I need to check the encoding of files. This code works, but it is a bit long. How can I refactor this logic? Is there another approach I could use for this purpose?

Code:

class CharsetDetector implements Checker {

    Charset detectCharset(File currentFile, String[] charsets) {
        Charset charset = null;

        for (String charsetName : charsets) {
            charset = detectCharset(currentFile, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }

        return charset;
    }

    private Charset detectCharset(File currentFile, Charset charset) {
        try {
            BufferedInputStream input = new BufferedInputStream(
                    new FileInputStream(currentFile));

            CharsetDecoder decoder = charset.newDecoder();
            decoder.reset();

            byte[] buffer = new byte[512];
            boolean identified = false;
            while ((input.read(buffer) != -1) && (!identified)) {
                identified = identify(buffer, decoder);
            }

            input.close();

            if (identified) {
                return charset;
            } else {
                return null;
            }

        } catch (Exception e) {
            return null;
        }
    }

    private boolean identify(byte[] bytes, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }

    @Override
    public boolean check(File fileChack) {
        if (charsetDetector(fileChack)) {
            return true;
        }
        return false;
    }

    private boolean charsetDetector(File currentFile) {
        String[] charsetsToBeTested = { "UTF-8", "windows-1253", "ISO-8859-7" };

        CharsetDetector charsetDetector = new CharsetDetector();
        Charset charset = charsetDetector.detectCharset(currentFile,
                charsetsToBeTested);

        if (charset != null) {
            try {
                InputStreamReader reader = new InputStreamReader(
                        new FileInputStream(currentFile), charset);

                @SuppressWarnings("unused")
                int valueReaders = 0;
                while ((valueReaders = reader.read()) != -1) {
                    return true;
                }

                reader.close();
            } catch (FileNotFoundException exc) {
                System.out.println("File not found!");
                exc.printStackTrace();
            } catch (IOException exc) {
                exc.printStackTrace();
            }
        } else {
            System.out.println("Unrecognized charset.");
            return false;
        }

        return true;
    }
}

Questions:

How would you refactor this program logic?
What other ways are there to detect the encoding (UTF-16 sequences, for example)?

score 5 · Accepted Answer

The best way to refactor this code is to bring in a third-party library that does the character detection for you. See this question for some options.

score 3 · Accepted Answer

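As has been pointed out, you can't "know" or "detect" the encoding of a file. In most cases there are byte sequences that are ambiguous with respect to several character encodings, so to be completely accurate you have to be told what the encoding is.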
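To make the ambiguity concrete, here is a minimal sketch (the class name `AmbiguousBytesDemo` and the sample bytes are chosen purely for illustration). The same two bytes are perfectly legal in both UTF-8 and ISO-8859-1, they just mean different text:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AmbiguousBytesDemo {

    // Decodes the bytes with the given charset. new String(...) never throws;
    // it substitutes a replacement character for malformed input.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };

        // As UTF-8 this is one character, U+00E9 (e with acute accent).
        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        // As ISO-8859-1 it is two characters, U+00C3 followed by U+00A9.
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length() + " char(s) as UTF-8");
        System.out.println(asLatin1.length() + " char(s) as ISO-8859-1");
    }
}
```

Neither decoding is "wrong" as far as the bytes are concerned, which is why any detector can only ever make a guess.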

Detecting UTF-8 versus ISO8859-1 is discussed further in this SO question. The essential answer is to check each byte sequence in the file for compatibility with the expected encoding. For the UTF-8 byte encoding rules, see http://en.wikipedia.org/wiki/UTF-8.
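In Java, one way to run that compatibility check without hand-writing the bit tests is a strict `CharsetDecoder`. The following is only a sketch; the class and method names (`StrictDecodeCheck`, `isCompatible`) are made up for this example. It decodes the whole byte array in one call, because feeding fixed-size chunks (as the code in the question does with its 512-byte buffer) can split a multi-byte sequence across a chunk boundary and report a false failure:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeCheck {

    /** Returns true if the complete byte array decodes without errors in the given charset. */
    static boolean isCompatible(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));                 // treats the input as complete
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] loneHighByte = { (byte) 0xE9 };
        System.out.println(isCompatible(loneHighByte, StandardCharsets.UTF_8));
        System.out.println(isCompatible(loneHighByte, StandardCharsets.ISO_8859_1));
    }
}
```

Note that a successful check only says the bytes are compatible with the charset, not that the file was actually written in it. Single-byte encodings such as ISO-8859-1 accept every byte value, so they should be tried last.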

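In particular, there is a very interesting paper on character encoding/set detection. The price is a very complex detection system, with knowledge of character frequencies in different languages, that won't fit in the 30 lines the OP has hinted is the right code size. Apparently the detection algorithm is built into Mozilla, so you can probably find and extract it.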

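We settled for a much simpler scheme: a) if you are told what the character set is, believe it; b) if not, check for a BOM and, if one is present, believe what it says; otherwise sniff the bytes for a short list of candidate encodings, ending with ISO 8859, in that order. You can build an ugly routine that does this in one pass over the file.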
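A rough Java sketch of that scheme might look like the following. It is not the routine described above: the names (`SimpleCharsetGuess`, `fromBom`, `guess`) are hypothetical, the fallback order after the BOM check (strict UTF-8, then ISO-8859-1) is an assumption made for illustration, and it works on a byte array already in memory rather than in a single pass over the file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SimpleCharsetGuess {

    /** Returns the charset named by a leading byte-order mark, or null if there is none. */
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            // A UTF-32LE BOM starts with the same two bytes; that case is not handled here.
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Strict decode: a fresh decoder reports malformed input instead of replacing it. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** declared may be null when nobody told us the charset. */
    static Charset guess(byte[] bytes, Charset declared) {
        if (declared != null) {
            return declared;                      // a) believe what we were told
        }
        Charset bom = fromBom(bytes);
        if (bom != null) {
            return bom;                           // b) believe the BOM
        }
        if (decodesCleanly(bytes, StandardCharsets.UTF_8)) {
            return StandardCharsets.UTF_8;        // assumed first fallback
        }
        return StandardCharsets.ISO_8859_1;       // assumed last resort; accepts any bytes
    }
}
```

A caller would read the file with `Files.readAllBytes(path)` and pass `null` for `declared` when no charset information came with the file.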

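(I think the problem gets worse over time. Unicode has new revisions every year, with really subtle differences in valid code points. To do this right, you have to check every code point for validity. If we're lucky, they are all backward compatible.)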

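[EDIT: The OP appears to be having trouble coding this in Java. Our solution and the sketch on the other page are not coded in Java, so I can't just copy and paste an answer. I'm going to draft a Java version here based on his code. It has not been compiled or tested. YMMV]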

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Both methods below are meant to live inside a class.

// Java-version of character-sniffing test on other page
// This only checks for UTF8 compatible bit-pattern layout
// A tighter test (what we actually did) would check for valid UTF-8 code points
// Returns the byte length of the sequence starting at buf_index, or 0 if it is not valid
static int UTF8size(byte[] buffer, int buf_index)
{   int first_character=buffer[buf_index];

    // This first character test might be faster as a switch statement
    if ((first_character & 0x80) == 0) return 1; // ASCII subset character, fast path
    else if ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (buf_index+3>=buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80)
         && ((buffer[buf_index + 3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (buf_index+2>=buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence
        if (buf_index+1>=buffer.length) return 0;
        if ((buffer[buf_index + 1] & 0xC0) == 0x80)
            return 2;
    }
    return 0; // stray continuation byte, invalid lead byte, or truncated sequence
}

public static boolean isUTF8 ( File file ) {
    if (null == file) {
        throw new IllegalArgumentException ("input file can't be null");
    }
    if (file.isDirectory ()) {
        throw new IllegalArgumentException ("input file refers to a directory");
    }

    // read the whole input file; a single InputStream.read() call may return fewer bytes
    byte [] buffer;
    try {
        buffer = Files.readAllBytes ( file.toPath () ) ;
    }
    catch ( IOException e ) {
        throw new IllegalArgumentException ("Can't read input file, error = " + e.getLocalizedMessage () );
    }

    int buf_index=0;
    int step;

    while (buf_index<buffer.length) {
       step=UTF8size(buffer,buf_index);
       if (step==0) return false; // definitely not UTF-8 file
       buf_index+=step;
    }

   return true ; // appears to be UTF-8 file
}
