php - What is normalized UTF-8 all about?

Question

The ICU project (which now also has a PHP library) contains the classes needed to normalize UTF-8 strings, which makes it easier to compare values when searching.

However, I'm trying to figure out what this means for applications. For example, in which cases do I want "canonical equivalence" instead of "compatibility equivalence", or vice versa?

score 192 · Accepted Answer


answered 2011-10-28T20:13:31.347

score 44 · Accepted Answer

Some characters, for example a letter with an accent (say, é) can be represented in two ways - a single code point U+00E9 or the plain letter followed by a combining accent mark U+0065 U+0301. Ordinary normalization will choose one of these to always represent it (the single code point for NFC, the combining form for NFD).
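A minimal sketch of the two spellings and what NFC and NFD do with them. Python's standard `unicodedata` module is used here purely for illustration (PHP's intl extension exposes the same forms through its `Normalizer` class); the helper name `code_points` is made up for this example.

```python
import unicodedata

def code_points(s):
    """Return the code points of s as 'U+XXXX' strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]

precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The two spellings are different strings before normalization.
print(precomposed == combining)

# NFC composes to the single code point; NFD decomposes to base + mark.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)
print(code_points(nfc))  # ['U+00E9']
print(code_points(nfd))  # ['U+0065', 'U+0301']
```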

For characters that could be represented by multiple sequences of base characters and combining marks (say, "s, dot below, dot above" vs. putting dot above then dot below, or using a base character that already has one of the dots), NFD will also pick one of these (below goes first, as it happens).
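The reordering can be checked with a short sketch, again in Python for illustration only. The five spellings below are my own choice of inputs for the "s with two dots" case; the canonical combining class reported by `unicodedata.combining` is what decides which mark goes first.

```python
import unicodedata

# Five spellings of "s with dot below and dot above".
spellings = [
    "s\u0323\u0307",   # s, dot below, dot above
    "s\u0307\u0323",   # s, dot above, dot below
    "\u1e63\u0307",    # precomposed s-with-dot-below, then dot above
    "\u1e61\u0323",    # precomposed s-with-dot-above, then dot below
    "\u1e69",          # fully precomposed
]

# Every spelling collapses to one NFD sequence and one NFC sequence.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
print(len(nfd_forms), len(nfc_forms))

# Marks are sorted by combining class: dot below (220) before dot above (230).
print(unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))
```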

The compatibility decompositions include a number of characters that "shouldn't really" be characters but are because they were used in legacy encodings. Ordinary normalization won't unify these (to preserve round-trip integrity; this isn't an issue for the combining forms because no legacy encoding, except a handful of Vietnamese encodings, used both), but compatibility normalization will. Think of the "kg" kilogram sign that appears in some East Asian encodings (or the halfwidth/fullwidth katakana and alphabet), or the "fi" ligature in MacRoman.
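As a rough illustration (Python's `unicodedata`, not the ICU or PHP API), the sketch below runs a few such legacy characters through NFC and NFKC. The sample characters are my own picks matching the examples named above.

```python
import unicodedata

samples = {
    "kilogram sign": "\u338f",           # ㎏
    "fi ligature": "\ufb01",             # ﬁ
    "halfwidth katakana ka": "\uff76",   # ｶ
    "fullwidth letter A": "\uff21",      # Ａ
}

for name, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    # NFC leaves each compatibility character alone; NFKC replaces it.
    print(name, nfc == ch, repr(nfkc))
```

Note that the NFKC step loses information: once "㎏" has become "kg", the original character cannot be recovered, which is the round-trip concern mentioned above.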

See http://unicode.org/reports/tr15/ for more details.

score 13 · Accepted Answer

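Normal forms (of Unicode, not databases) deal primarily (exclusively?) with characters that have diacritical marks. Unicode provides some characters with "built-in" diacritical marks, such as U+00C0, "Latin Capital A with Grave". The same character can be created from a "Latin Capital A" (U+0041) with a "Combining Grave Accent" (U+0300). That means even though the two sequences produce the same resulting character, a byte-by-byte comparison will show them as being completely different.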

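Normalization is an attempt at dealing with that. Normalizing assures (or at least tries to) that all the characters are encoded the same way: either all using a separate combining diacritical mark where needed, or all using a single code point wherever possible. From the viewpoint of comparison, it doesn't really matter a whole lot which you choose; pretty much any normalized string will compare properly with another normalized string, as long as both use the same form.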
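For the search use case in the question, this amounts to normalizing both the stored value and the query to the same form before comparing. A minimal sketch in Python (for illustration only; the dictionary `stored`, the sample text, and the helper `nfc` are hypothetical):

```python
import unicodedata

def nfc(s):
    """Normalize s to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", s)

# Hypothetical stored value, written with the precomposed U+00C0.
stored = {nfc("\u00c0 la carte"): 1}
query = "A\u0300 la carte"   # same text typed with a combining grave accent

print(query in stored)        # raw lookup misses
print(nfc(query) in stored)   # normalized lookup hits

# Normalizing the two sides to different forms does not help.
print(unicodedata.normalize("NFD", query) in stored)
```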

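"Compatibility" in this case means compatibility with code that assumes that one code point equals one character. If you have code like that, you probably want to use the compatibility normal form. Although I've never seen it stated directly, the names of the normal forms imply that the Unicode Consortium considers it preferable to use separate combining diacritical marks. This requires more intelligence to count the actual characters in a string (as well as things like breaking a string intelligently), but it is more versatile.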

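If you're making full use of ICU, chances are that you want to use the canonical normal form. If you're trying to write code on your own that (for example) assumes a code point equals a character, then you probably want the compatibility normal form, which makes that true as often as possible.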
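How often "one code point equals one character" holds after normalizing can be checked directly. The sketch below is a rough experiment in Python (for illustration only, and not necessarily what the answer's author had in mind); the two sample strings are my own choice.

```python
import unicodedata

e_acute = "e\u0301"   # has a precomposed form (U+00E9)
q_tilde = "q\u0303"   # Unicode has no precomposed q-with-tilde

# Count code points for each sample under every normalization form.
lengths = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    lengths[form] = (
        len(unicodedata.normalize(form, e_acute)),
        len(unicodedata.normalize(form, q_tilde)),
    )
    print(form, lengths[form])
```

In this experiment the composed forms bring "é" down to one code point, while "q̃" stays at two code points under every form, so counting user-visible characters still needs more than a normalization pass.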

score 5 · Accepted Answer

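If two Unicode strings are canonically equivalent, the strings are really the same, only using different Unicode sequences. For example, Ä can be represented either using the single character Ä or a combination of A and ◌̈.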

If the strings are only compatibility equivalent, the strings aren't necessarily the same, but they may be the same in some contexts. For example, ﬀ could be considered the same as ff.

So, if you are comparing strings, you should use canonical equivalence, because compatibility equivalence isn't real equivalence.

But if you want to sort a set of strings, it might make sense to use compatibility equivalence.
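The two kinds of equivalence can be expressed as two small predicates. This is a sketch in Python for illustration; the function names are made up, and comparing NFD (or NFKD) results is one straightforward way to test equivalence, not the only one.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# Ä as one code point vs. A + COMBINING DIAERESIS
print(canonically_equivalent("\u00c4", "A\u0308"))
# The ff ligature vs. two plain letters f
print(canonically_equivalent("\ufb00", "ff"))
print(compatibility_equivalent("\ufb00", "ff"))
```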

score 5 · Accepted Answer

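This is actually fairly simple. UTF-8 has several different representations of the same "character". (I put character in quotes because byte-wise they are different, but practically they are the same.) An example is given in the linked document.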

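The character "Ç" can be represented as the byte sequence 0xc387. But it can also be represented by a C (0x43) followed by the byte sequence 0xcca7. So you can say that 0xc387 and 0x43cca7 are the same character. The reason this works is that 0xcca7 is a combining mark; that is to say, it takes the character before it (a C here) and modifies it.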
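The byte sequences quoted above can be reproduced with a few lines of Python (used for illustration only):

```python
import unicodedata

precomposed = "\u00c7"   # Ç as a single code point
combined = "C\u0327"     # C followed by a combining mark

# UTF-8 bytes of each spelling, shown as hex.
precomposed_hex = precomposed.encode("utf-8").hex()
combined_hex = combined.encode("utf-8").hex()
print(precomposed_hex)   # c387
print(combined_hex)      # 43cca7

# The trailing two bytes encode U+0327, a combining mark.
print(unicodedata.name("\u0327"))

# After NFC both spellings are the same string.
print(unicodedata.normalize("NFC", combined) == precomposed)
```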

Now, as for the difference between canonical equivalence and compatibility equivalence, we need to look at characters in general.

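There are two types of characters: those that convey meaning through their value, and those that take another character and alter it. 9 is a meaningful character. A superscript ⁹ takes that meaning and alters it by presentation. So canonically they have different meanings, but they still represent the base character.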

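So canonical equivalence is where the byte sequences render the same character with the same meaning. Compatibility equivalence is where the byte sequences render a different character with the same base meaning (even though it may be altered). 9 and ⁹ are compatibility equivalent since they both mean "9", but they are not canonically equivalent since they don't have the same representation.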
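The Unicode character database records this distinction: a canonical decomposition is listed as plain code points, while a compatibility decomposition carries a tag such as `<super>`. A small sketch using Python's `unicodedata` (for illustration only):

```python
import unicodedata

# Canonical mapping: no tag in front of the code points.
print(unicodedata.decomposition("\u00c7"))   # '0043 0327'
# Compatibility mapping: carries a formatting tag.
print(unicodedata.decomposition("\u2079"))   # '<super> 0039'

nine, superscript_nine = "9", "\u2079"
# Canonical normalization keeps the superscript; compatibility folds it to 9.
canonical_same = unicodedata.normalize("NFC", superscript_nine) == nine
compat_same = unicodedata.normalize("NFKC", superscript_nine) == nine
print(canonical_same, compat_same)
```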

score 4 · Accepted Answer

Whether canonical equivalence or compatibility equivalence is more relevant to you depends on your application. The ASCII way of thinking about string comparisons roughly maps to canonical equivalence, but Unicode represents a lot of languages. I don't think it is safe to assume that Unicode encodes all languages in a way that allows you to treat them just like Western European ASCII.

Figures 1 and 2 provide good examples of the two types of equivalence. Under compatibility equivalence, it looks like the same number in subscript and superscript form would compare equal. But I'm not sure that solves the same problem as the cursive Arabic forms or the rotated characters.

The hard truth of Unicode text processing is that you have to think deeply about your application's text processing requirements, and then address them as well as you can with the available tools. That doesn't directly address your question, but a more detailed answer would require linguistic experts for each of the languages you expect to support.

score 2 · Accepted Answer

The problem of comparing strings: two strings whose content is equivalent for the purposes of most applications may contain differing character sequences.

See Unicode's canonical equivalence: if the comparison algorithm is simple (or must be fast), Unicode equivalence is not checked. This problem occurs, for instance, in canonical XML comparison; see http://www.w3.org/TR/xml-c14n

To avoid this problem... which standard should you use? "Expanded UTF-8" or "compact UTF-8"?
Use "ç" or "c+◌̧"?

W3C and others (e.g. for file names) suggest using the composed form as the canonical one (keep in mind the C as in "most compact", i.e. shorter strings)... So,

The standard is C! When in doubt, use NFC

For interoperability, and for "convention over configuration" choices, the recommendation is to use NFC to "canonize" external strings. To store canonical XML, for example, store it in "FORM_C". The W3C's CSV on the Web Working Group also recommends NFC (section 7.2).

PS: "FORM_C" is the default form in most libraries, e.g. in PHP's Normalizer::isNormalized().

The term "composition form" (FORM_C) is used both to say that "a string is in the C-canonical form" (the result of an NFC transformation) and to say that a transforming algorithm is used... See http://www.macchiato.com/unicode/nfc-faq

(...) each of the following sequences (the first two being single-character sequences) represent the same character:

U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE

U+212B ( Å ) ANGSTROM SIGN

U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. (...) A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
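The quoted toNFC(S) and isNFC(S) can be sketched in a few lines. This is an illustration in Python, not the FAQ's own code; `to_nfc` and `is_nfc` are names chosen here to mirror the abbreviations.

```python
import unicodedata

def to_nfc(s):
    """Transform s into Normalization Form C."""
    return unicodedata.normalize("NFC", s)

def is_nfc(s):
    """Test whether s is already in NFC (it is unchanged by to_nfc)."""
    return to_nfc(s) == s

# The three canonically equivalent sequences from the quote.
sequences = ["\u00c5", "\u212b", "A\u030a"]

print([is_nfc(s) for s in sequences])      # only U+00C5 is already NFC
print({to_nfc(s) for s in sequences})      # all three map to U+00C5
```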

Note: to test the normalization of short strings (pure UTF-8 or XML entity references), you can use this test/normalize online converter.
