java - How to detect illegal UTF-8 byte sequences and replace them in a Java input stream?

Question

The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding). I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences. These should be replaced with the replacement character.

I think this is not an easy task and that it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper that does what I need: UTF8ValidationFilter (javadoc).

Is something like that available (commercially or as free software)?

Thanks
-Stephan

Solution:

// istream is the InputStream of the untrusted file.
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Substitute the replacement character (U+FFFD) instead of throwing on bad input.
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);

score 12 · Accepted Answer

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions for different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
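
As a rough illustration of those error actions, here is a minimal sketch that decodes a byte array twice: once with CodingErrorAction.REPLACE to substitute U+FFFD, and once with CodingErrorAction.REPORT to detect malformed input. The class and method names (Utf8DecodeSketch, decodeReplacing, isValidUtf8) are made up for this example and are not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8DecodeSketch {

    /** Decodes leniently: each malformed sequence becomes U+FFFD. */
    static String decodeReplacing(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Should not happen when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true only if the whole array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        System.out.println(isValidUtf8(bad));                        // false
        System.out.println(decodeReplacing(bad).equals("a\uFFFDb")); // true
    }
}
```

REPORT is the variant to reach for when you only need to know whether a file is clean; REPLACE is the one that matches the question.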

The decoded output can be written to an OutputStream, which you can pipe into an InputStream using java.io.PipedOutputStream, effectively creating a filtered InputStream.
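
One possible reading of the piping idea, sketched under some assumptions: a background thread decodes the raw bytes with REPLACE, re-encodes the characters as UTF-8, and writes them into a PipedOutputStream whose other end is handed to the caller. CharsetDecoder itself produces characters rather than bytes, so the re-encoding step through OutputStreamWriter is an addition of this sketch. The class name Utf8SanitizingStreams and the method sanitize are hypothetical, and error handling is deliberately simplified.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8SanitizingStreams {

    /** Returns a stream of well-formed UTF-8; bad sequences become U+FFFD. */
    static InputStream sanitize(final InputStream raw) throws IOException {
        final PipedOutputStream sink = new PipedOutputStream();
        final PipedInputStream filtered = new PipedInputStream(sink);

        Thread worker = new Thread(() -> {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Simplification: the pipe is closed on exit, so the consumer
                // just sees end-of-stream. Real code should surface the error.
            }
        }, "utf8-sanitizer");
        worker.setDaemon(true);
        worker.start();
        return filtered;
    }

    public static void main(String[] args) throws IOException {
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        try (InputStream in = sanitize(new ByteArrayInputStream(bad))) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.printf("%02X ", b); // 61 EF BF BD 62
            }
        }
    }
}
```

The extra thread is needed because piped streams expect the writer and the reader to run on different threads. If the consumer can work with a Reader instead of an InputStream, the pipe is unnecessary.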

score 0 · Accepted Answer

One way would be to read the first few bytes and check for a Byte Order Mark (BOM), if one exists. More information on the BOM: http://en.wikipedia.org/wiki/Byte_order_mark. At that URL you will find a table of the BOM bytes. One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition (reading a few bytes at a time, 8 bits each). Either way, this is the complicated solution.
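
For what it is worth, the BOM check could look roughly like the sketch below. It peeks at the first three bytes, consumes them if they are the UTF-8 BOM (EF BB BF), and pushes them back otherwise. The names Utf8BomSketch and skipUtf8Bom are invented for this example. A BOM is only a hint, and its absence says nothing about the encoding, so this does not validate the rest of the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper class, for illustration only.
final class Utf8BomSketch {

    /** Wraps the stream and consumes a leading EF BB BF if present. */
    static InputStream skipUtf8Bom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() may return fewer.
        int n = 0;
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // No BOM: put the bytes back so the caller sees the full content.
        if (!hasBom && n > 0) {
            in.unread(head, 0, n);
        }
        return in;
    }
}
```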

score 0 · Accepted Answer

The behavior you want is already the default for InputStreamReader, so there is no need to specify it yourself. This is enough:

final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
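
A quick way to check this claim is a small test like the following sketch, which feeds a lone continuation byte (0x80) through an InputStreamReader built with only StandardCharsets.UTF_8. The class name DefaultReplacementDemo and the helper readAll are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class DefaultReplacementDemo {

    /** Reads the whole stream as UTF-8 without configuring a decoder. */
    static String readAll(InputStream istream) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new BufferedInputStream(istream), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0x80 is a continuation byte with no lead byte, so it is malformed.
        byte[] bad = {'o', 'k', (byte) 0x80, '!'};
        String text = readAll(new ByteArrayInputStream(bad));
        System.out.println(text.equals("ok\uFFFD!")); // true
    }
}
```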
