unicode - Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?

Question

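I know this is probably a stupid question, but I need to be sure about this. For example, if a programming language says that its String type uses the UTF-16 encoding, I need to know whether that means the following:

It uses 2 bytes for code points in the range U+0000 to U+FFFF.
It uses surrogate pairs for code points greater than U+FFFF (4 bytes per code point).

Or do some programming languages use their own "tricks" when encoding, and not follow this standard 100%?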

score 3 · Accepted Answer

UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.

I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit is made up of two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
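To make the two layers concrete, here is a minimal Python 3 sketch. The helper names `to_utf16_code_units` and `serialize` are made up for illustration and are not part of any library. The first function maps a code point to 16-bit code units, the second turns code units into bytes in a chosen byte order.

```python
def to_utf16_code_units(cp):
    """Return the UTF-16 code units (as ints) for one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0xFFFF:
        return [cp]                    # one code unit
    v = cp - 0x10000                   # 20 bits remain
    high = 0xD800 | (v >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]


def serialize(units, byteorder):
    """Second layer: write each 16-bit code unit as two 8-bit bytes."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)


units = to_utf16_code_units(0x1F600)
print([hex(u) for u in units])         # ['0xd83d', '0xde00']
print(serialize(units, "big").hex())   # d83dde00
print(serialize(units, "little").hex())  # 3dd800de
```

The byte output agrees with Python's built-in `utf-16-be` and `utf-16-le` codecs for the same character.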

But looking the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (code points from 0 to 0x10FFFF, so every value is below 2²¹), and it's up to you how to represent and serialize those.

I would happily make the case that UTF-16 is a historic accident that we probably wouldn't have done if we had to redo everything now: It is a variable-length encoding just like UTF-8, so unlike UTF-32 it gives you no random access to code points, yet it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.
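A quick way to see the size trade-off is to compare encoded lengths in Python 3. This is only a sketch: `encoded_sizes` is a hypothetical helper, and the three sample characters are arbitrary picks (one ASCII, one other BMP character, one outside the BMP).

```python
def encoded_sizes(text):
    """Byte counts of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return (
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le")),
    )


for label, ch in [("U+0041", "A"), ("U+20AC", "\u20ac"), ("U+1F600", "\U0001F600")]:
    print(label, encoded_sizes(ch))

# Endianness matters for UTF-16 but not for UTF-8:
print("A".encode("utf-16-be"), "A".encode("utf-16-le"))
```

For these samples, UTF-16 takes 2, 2 and 4 bytes, so it is variable-length like UTF-8 (1, 3 and 4 bytes), while UTF-32 is a fixed 4 bytes per code point.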

The only reason (in my opinion) that UTF-16 exists is that at some early point people believed that 16 bits would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (like UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring maybe UCS-2 for pure Asian text.)

score 1 · Accepted Answer

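UTF-16 itself is a standard. However, most languages whose strings are based on 16-bit code units (whether or not they claim to "support" UTF-16) allow any sequence of code units, including invalid surrogates. For example, this is typically an acceptable string literal: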

"x \uDC00 y \uD800 z"

You usually only get an error when you try to write it out in another encoding.
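As an illustration in Python 3 (whose str is not based on 16-bit code units on current versions, but which still accepts lone surrogates in literals), the following sketch shows the literal being accepted and the error appearing only at encode time. `can_encode` is a hypothetical helper written for this example.

```python
s = "x \udc00 y \ud800 z"   # lone surrogates, in the wrong order
print(len(s))               # the literal itself is accepted


def can_encode(text, encoding):
    """Return True if text encodes cleanly with the strict error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode(s, "utf-8"))    # False: surrogates not allowed
print(can_encode(s, "utf-16"))   # False as well

# The "surrogatepass" error handler writes the code units through anyway,
# producing bytes that are not well-formed UTF-16.
raw = s.encode("utf-16-le", "surrogatepass")
print(raw.hex())
```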

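Python's optional surrogateescape option for encoding/decoding uses such invalid surrogates to smuggle tokens representing the single bytes 0x80–0xFF into standalone surrogate code units U+DC80–U+DCFF, which results in a string like this. It is typically only used internally, so you're unlikely to meet it in files or on the network. It also only applies to UTF-16 insofar as Python's str data type is based on 16-bit code units (which is the case in "narrow" builds between 3.0 and 3.3).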
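A short Python 3 sketch of that round trip follows. The sample bytes are an arbitrary example chosen here: Latin-1 encoded text that is not valid UTF-8.

```python
raw = b"caf\xe9 au lait"    # 0xE9 is not valid UTF-8 at this position

# Decoding: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler turns U+DCE9 back into the byte 0xE9,
# so the original bytes survive the round trip unchanged.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Without the handler, the smuggled surrogate is rejected.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)
```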

I am not aware of any other commonly used extensions or variants of UTF-16.
