java - Java Charset InputStreamReader, File Channel Differences

Question

I'm trying to read a (Japanese) file that is encoded as a UTF-16 file.

When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:

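    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    try {
        InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
        BufferedReader in = new BufferedReader(read);
        String str;
        // Read and print the file line by line
        while ((str = in.readLine()) != null) {
            System.out.println(str);
        }
        in.close();
    } catch (Exception e) {
        System.out.println(e);
    }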

However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:

    import java.io.File;
    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.Charset;

    File f = new File("JapanTest.txt");
    FileInputStream fis = new FileInputStream(f);
    FileChannel channel = fis.getChannel();
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
    buffer.position(0);
    // Copy up to 1024 bytes from the current buffer position
    int get = Math.min(buffer.remaining(), 1024);
    byte[] barray = new byte[1024];
    buffer.get(barray, 0, get);
    Charset charSet = Charset.forName("UTF-16");
    // endOfLinePos is a calculated value and defines the number of bytes to read
    String rowString = new String(barray, 0, endOfLinePos, charSet);
    System.out.println(rowString);

The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven't faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?

More Details: I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.

score 1 · Accepted Answer

Possibly the InputStreamReader performs some conversions that a plain new String(...) does not. As a workaround (and to verify this assumption), you could try wrapping the data read from the channel, like new InputStreamReader(new ByteArrayInputStream(barray)).

Edit: Forget that :) Channels.newReader() would be the way to go.
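A minimal sketch of the Channels.newReader() suggestion, assuming the file is read sequentially from the start of the channel. The class name ChannelReaderSketch and the helper readLines are made up for illustration; only Channels.newReader, FileChannel.open and BufferedReader are standard library calls.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question or answer.
class ChannelReaderSketch {

    // Reads every line from the file through a FileChannel, letting the
    // Reader returned by Channels.newReader do the charset decoding.
    static List<String> readLines(Path path, String charsetName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
             BufferedReader in = new BufferedReader(Channels.newReader(channel, charsetName))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "JapanTest.txt" is the file name used in the question.
        for (String line : readLines(Paths.get("JapanTest.txt"), "UTF-16")) {
            System.out.println(line);
        }
    }
}
```

Because the reader starts at the beginning of the channel, the "UTF-16" decoder sees the byte order mark and picks the byte order from it, just as the InputStreamReader version does. This reads the file in order; it does not by itself provide random access to an arbitrary line.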

score 1 · Accepted Answer

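The code unit of UTF-16 is 2 bytes, not a single byte as in UTF-8. Its bit patterns and one-byte code unit make UTF-8 self-synchronizing: you can start reading correctly at any point, and if you land on a continuation byte you can either backtrack or lose only a single character.

With UTF-16 you must always work with pairs of bytes; you cannot start or stop reading at an odd byte. You also need to know the endianness: when you are not reading from the start of the file there is no BOM, so you must use either UTF-16LE or UTF-16BE explicitly.

You could also encode the file as UTF-8.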
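The points above can be sketched as follows: inspect the BOM once at the start of the file, then decode any later byte range with the explicit UTF-16LE or UTF-16BE charset. This is only an illustration under stated assumptions. The class Utf16SliceDecoder and its methods are hypothetical names, the line offsets are assumed to be computed elsewhere, and a file without a BOM is assumed to be big-endian (which is also what Java's "UTF-16" decoder assumes).

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class Utf16SliceDecoder {

    // Chooses the byte order from the BOM in the first two bytes of the file.
    // Assumption: no BOM means big-endian.
    static Charset detectByteOrder(ByteBuffer file) {
        if (file.limit() >= 2) {
            int b0 = file.get(0) & 0xFF;
            int b1 = file.get(1) & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return StandardCharsets.UTF_16BE;
    }

    // Decodes the bytes in [start, end) of the buffer with an explicit byte order.
    // Both offsets must be even so that no 2-byte code unit is split.
    static String decode(ByteBuffer file, int start, int end, Charset charset) {
        if ((start & 1) != 0 || (end & 1) != 0) {
            throw new IllegalArgumentException("UTF-16 offsets must be even");
        }
        // duplicate() shares the content but has its own position and limit,
        // so the caller's buffer is left untouched.
        ByteBuffer slice = file.duplicate();
        slice.limit(end);
        slice.position(start);
        return charset.decode(slice).toString();
    }
}
```

With a MappedByteBuffer from channel.map(...), the first line would start at offset 2 when a BOM is present, since UTF-16LE and UTF-16BE decode the BOM as an ordinary character rather than skipping it. When building the list of line-ending positions, the newline should be matched as a 2-byte unit at an even offset (00 0A in big-endian, 0A 00 in little-endian), because a lone 0x0A byte can also appear inside other characters such as U+300A.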
