java - Partially loading a large text file with different encodings

Question

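I'm writing a Java text component and, for speed reasons, I'm trying to load a large text file partially, starting from somewhere in the middle.

My question is about text in a multi-byte encoding such as UTF-8, Big5, or GBK: how can I align the bytes so that the text decodes correctly?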

score 2 · Accepted Answer

I can't speak for the other formats, but UTF-8 shouldn't be too hard.

Just look at the first byte of the chunk you grabbed and work out where you are from there.

Taken from Wikipedia:

Binary              Hex     Decimal Meaning
00000000-01111111   00-7F   0-127   US-ASCII (single byte)
10000000-10111111   80-BF   128-191 2nd, 3rd, or 4th byte of a multi-byte sequence
11000000-11000001   C0-C1   192-193 Start of a 2-byte sequence, but code point <= 127 (overlong, invalid in UTF-8)
11000010-11011111   C2-DF   194-223 Start of a 2-byte sequence
11100000-11101111   E0-EF   224-239 Start of a 3-byte sequence
11110000-11110100   F0-F4   240-244 Start of a 4-byte sequence
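For what it's worth, the table maps fairly directly onto a small Java method. This is only a sketch: the class name `Utf8ByteKind` and the `Kind` enum are made up for illustration, and I'm treating F5-FF (not listed in the table) as invalid as well, since those bytes never occur in valid UTF-8.

```java
final class Utf8ByteKind {

    // Hypothetical names, one per row of the table above.
    enum Kind { ASCII, CONTINUATION, INVALID, LEAD_2, LEAD_3, LEAD_4 }

    static Kind classify(byte b) {
        int u = b & 0xFF; // Java bytes are signed, so mask to get 0-255

        if (u <= 0x7F) return Kind.ASCII;        // 00-7F
        if (u <= 0xBF) return Kind.CONTINUATION; // 80-BF
        if (u <= 0xC1) return Kind.INVALID;      // C0-C1, overlong 2-byte lead
        if (u <= 0xDF) return Kind.LEAD_2;       // C2-DF
        if (u <= 0xEF) return Kind.LEAD_3;       // E0-EF
        if (u <= 0xF4) return Kind.LEAD_4;       // F0-F4
        return Kind.INVALID;                     // F5-FF, outside the table
    }
}
```

The `& 0xFF` mask is the part that is easy to forget: without it, every byte from 80 upwards compares as a negative number.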

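If the byte is in the 2nd group (80-BF), you have landed in the middle of a character: skip forward until you reach a byte outside that group. If it is in the 1st, 4th, 5th, or 6th group, you are at the start of a character. Bytes in the 3rd group (C0-C1) never occur in valid UTF-8, so seeing one suggests the data is malformed or not UTF-8 at all. Proceed accordingly from there.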
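A rough sketch of that alignment step in Java, assuming the chunk has already been read into a `byte[]` and really is UTF-8. The class and method names (`Utf8ChunkAligner`, `firstCharStart`, `decodeFromCharStart`) are invented for this example. It skips any leading continuation bytes (80-BF) and decodes from the first byte that can start a character.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ChunkAligner {

    // Continuation bytes have the bit pattern 10xxxxxx (80-BF).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Returns the index of the first byte in buf[from, to) that is not a
    // continuation byte, or 'to' if every byte in the range is one.
    static int firstCharStart(byte[] buf, int from, int to) {
        int i = from;
        while (i < to && isContinuation(buf[i])) {
            i++;
        }
        return i;
    }

    // Decodes the chunk, dropping the partial character (if any) at its start.
    static String decodeFromCharStart(byte[] chunk) {
        int start = firstCharStart(chunk, 0, chunk.length);
        return new String(chunk, start, chunk.length - start, StandardCharsets.UTF_8);
    }
}
```

In well-formed UTF-8 the loop skips at most three bytes. Note this only fixes the front of the chunk: the end can cut a character in half too, and `new String(...)` will turn such a truncated tail into a replacement character, so you may want to trim or re-read the last few bytes the same way.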
