utf-8 - Invalid UTF-8 bytes

Question

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:
1. the red invalid bytes in the above table
2. an unexpected continuation byte
3. a start byte not followed by enough continuation bytes
4. an Overlong Encoding as described above
5. A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF
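
As a rough illustration of the five cases above, the sketch below feeds one hand-picked malformed sequence per case to Python's built-in strict UTF-8 decoder. The names MALFORMED_EXAMPLES and is_valid_utf8 and the sample byte values are assumptions chosen for this example; they are not part of the quoted list.

```python
# One hand-picked example per failure case; the byte values are illustrative only.
MALFORMED_EXAMPLES = {
    "invalid byte": b"\xff",
    "unexpected continuation byte": b"\x80",
    "start byte without enough continuation bytes": b"\xe2\x82",
    "overlong encoding": b"\xc0\xaf",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for label, raw in MALFORMED_EXAMPLES.items():
    print(f"{label}: {raw.hex(' ')} valid={is_valid_utf8(raw)}")
```

Every example here should be reported as invalid, while a well-formed sequence such as C3 80 is accepted.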

According to the codepage layout, 0xC0 and 0xC1 are invalid and must never appear in a valid UTF-8 sequence. Here is what I have for code points 0xC0 and 0xC1:

Byte 2   Byte 1      Num   Char
11000011 10000000    192   À
11000011 10000001    193   Á

These byte sequences have corresponding characters, even though they supposedly should not exist. Am I doing something wrong?

score 9 · Accepted Answer

You are just mixing up terms:

The code point U+00C0 is the character "À", and U+00C1 is "Á".
Encoded in UTF-8, these are the byte sequences C3 80 and C3 81, respectively.
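
To make the distinction concrete, here is a minimal Python sketch that prints each code point next to its UTF-8 byte sequence. The helper name utf8_hex is made up for this example.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of a single character as spaced uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


for ch in ("À", "Á"):
    # ord() gives the code point (a number); encode() gives the bytes.
    print(f"U+{ord(ch):04X} {ch} -> {utf8_hex(ch)}")
# U+00C0 À -> C3 80
# U+00C1 Á -> C3 81
```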

The bytes C0 and C1 never appear in the UTF-8 encoding.
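
One way to see why: C0 and C1 have the bit pattern of a 2-byte lead byte (110xxxxx), but any pair they start could only carry a value from 0x00 to 0x7F, which already fits in a single byte, so it would be an overlong encoding. The sketch below shows that arithmetic and checks that Python's strict decoder rejects such pairs. The function names are hypothetical, and naive_two_byte_decode deliberately skips validation to show what a careless decoder would compute.

```python
def naive_two_byte_decode(lead: int, cont: int) -> int:
    """Combine the payload bits of a 110xxxxx 10xxxxxx pair, with no validation."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def strict_decode_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(hex(naive_two_byte_decode(0xC3, 0x80)))  # 0xc0, the code point of "À"
print(hex(naive_two_byte_decode(0xC1, 0xBF)))  # 0x7f, the largest value C1 could carry

# Every pair that starts with C0 or C1 is rejected by the strict decoder.
rejected = all(
    not strict_decode_ok(bytes([lead, cont]))
    for lead in (0xC0, 0xC1)
    for cont in range(0x80, 0xC0)
)
print(rejected)  # True
```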

Code points denote characters independently of bytes. Bytes are just bytes.
