java - OutOfMemoryError when detecting UTF-8 encoding

Question

The class below has to check currentFile and detect its encoding. If the result is UTF-8, it should return true.

When I run it, the output is java.lang.OutOfMemoryError: Java heap space.

The data is read with Files.readAllBytes(path), which requires JDK 7.

Code:

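import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class EncodingsCheck implements Checker {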

    @Override
    public boolean check(File currentFile) {
        return isUTF8(currentFile);
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException(
                    "input file refers to a directory");
        }

        // read input file
        byte[] buffer;
        try {
            buffer = readUTFHeaderBytes(file);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        if (buffer.length == 0) {
            return false; // empty file: no bytes to inspect (assumed "not detected")
        }

        // Only the first character of the file is inspected.
        if (0 == (buffer[0] & 0x80)) {
            return true; // ASCII subset character, fast path
        } else if (0xF0 == (buffer[0] & 0xF8)) { // start of 4-byte sequence
            if (buffer.length < 4) {
                return false; // sequence is truncated
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))
                    && (0x80 == (buffer[3] & 0xC0))) {
                return true;
            }
        } else if (0xE0 == (buffer[0] & 0xF0)) { // start of 3-byte sequence
            if (buffer.length < 3) {
                return false; // sequence is truncated
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))) {
                return true;
            }
        } else if (0xC0 == (buffer[0] & 0xE0)) { // start of 2-byte sequence
            if (buffer.length < 2) {
                return false; // sequence is truncated
            }
            if (0x80 == (buffer[1] & 0xC0)) {
                return true;
            }
        }

        return false;
    }

    private static byte[] readUTFHeaderBytes(File input) throws IOException {
        // read data
        Path path = Paths.get(input.getAbsolutePath());
        byte[] data = Files.readAllBytes(path);
        return data;
    }
}

Questions:

How can I solve this problem?
How can I check for UTF-16 in the same way (do I need to worry about it at all, or is it just needless trouble)?

score 2 · Accepted Answer

You don't need to read the whole file. The check only looks at the first four bytes, so read only those:

private static byte[] readUTFHeaderBytes(File input) throws IOException {
    // try-with-resources (JDK 7+) closes the stream even if read() throws
    try (FileInputStream fileInputStream = new FileInputStream(input)) {
        byte[] firstBytes = new byte[4];
        int count = fileInputStream.read(firstBytes);
        if (count < 4) {
            // read() returns -1 for an empty file, or a smaller count for a very short one
            throw new IOException("File is shorter than 4 bytes");
        }
        return firstBytes;
    }
}
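
Looking at the first character alone cannot confirm that the rest of the file is UTF-8. If a stricter check is wanted, the whole file can still be validated without loading it into memory by decoding it through a fixed-size buffer. The following is only a sketch under that assumption: the class name StreamingUtf8Check and the method isValidUtf8 are made up for illustration and are not part of the original answer.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class StreamingUtf8Check {

    /** Returns true if the whole file decodes as UTF-8 without errors. */
    static boolean isValidUtf8(String fileName) throws IOException {
        // REPORT makes the decoder throw on bad input instead of
        // silently substituting a replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        char[] chunk = new char[8192]; // fixed size, independent of the file size
        try (Reader reader = new InputStreamReader(
                new FileInputStream(fileName), decoder)) {
            while (reader.read(chunk) != -1) {
                // Decoded characters are discarded; only validity matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // a malformed byte sequence was found
        }
    }
}
```

Memory use stays constant regardless of file size, at the cost of reading every byte once. Note that a pure ASCII file, and possibly a file in some other encoding, may also decode cleanly, so a true result means "valid as UTF-8", not "certainly written as UTF-8".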

To detect the other UTF encodings, look for their specific byte patterns (byte order marks) at the start of the file:

Bytes          Encoding form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
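
The table above could be turned into code roughly as follows. This is a minimal sketch, not the answerer's implementation: the class name BomSniffer and the method detectBom are hypothetical, and the returned strings are just labels chosen for this example.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original answer.
class BomSniffer {

    /**
     * Returns a label for the byte order mark at the start of head,
     * or null if the first length bytes match none of the known marks.
     */
    static String detectBom(byte[] head, int length) {
        // Test the 4-byte UTF-32 marks before the 2-byte UTF-16 marks:
        // FF FE 00 00 (UTF-32LE) also begins with FF FE (UTF-16LE).
        if (startsWith(head, length, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, length, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, length, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, length, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, length, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int length, int... mark) {
        if (length < mark.length) return false;
        for (int i = 0; i < mark.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != mark[i]) return false;
        }
        return true;
    }

    /** Reads at most four bytes from the file and inspects them. */
    static String detectBom(String fileName) throws IOException {
        byte[] head = new byte[4];
        int total = 0;
        try (InputStream in = new FileInputStream(fileName)) {
            int n;
            while (total < head.length
                    && (n = in.read(head, total, head.length - total)) != -1) {
                total += n;
            }
        }
        return detectBom(head, total);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + detectBom(name));
        }
    }
}
```

A null result only means that no byte order mark is present. UTF-8 files in particular are often written without one, so this check complements the byte-pattern test from the question rather than replacing it.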
