perl - Sorting multilingual text in Perl on Windows using locales

Question

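I'm building software to sort book indexes in various languages. I'm using Perl and keying off the locale. I'm developing on Unix, but it needs to be portable to Windows. Should this work in principle, or by relying on locale am I barking up the wrong tree? Bottom line: I need this to work on Windows, but I'm more comfortable developing in a UNIX environment.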

score 11 · Accepted Answer

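Assuming your starting point is Unicode, because you have been very careful to decode all incoming data whatever its native encoding might be, it is easy to use the Unicode::Collate module as a starting point.

If you want locale tailoring, you probably want to start with Unicode::Collate::Locale instead.

Decoding into Unicode

If you run in an all-UTF8 environment, this is easy. But if you are subject to the vicissitudes of random so-called "locales" (or even worse, the ugly things Microsoft calls "code pages"), you might want to get the CPAN Encode::Locale module to help you. For example: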

 use Encode;
 use Encode::Locale;

 # use "locale" as an arg to encode/decode
 @ARGV = map { decode(locale =>  $_) } @ARGV;

 # or as a stream for binmode or open
 binmode $some_fh, ":encoding(locale)";

 binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
 binmode STDOUT, ":encoding(console_out)"  if -t STDOUT;
 binmode STDERR, ":encoding(console_out)"  if -t STDERR;

(If it were me, I would just use ":utf8" for the output.)
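When the source encoding is known in advance, the core Encode module is enough on its own. The following is a minimal sketch of the decode-early, encode-late pattern; the sample bytes and the choice of cp1252 are assumptions made for illustration, not something taken from the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of "Müller" as a cp1252 file would store them.
my $bytes = "M\xFCller";

# Decode bytes into Perl character strings as early as possible.
my $text = decode("cp1252", $bytes);

# Internally we now work with characters: length counts characters, not bytes.
my $chars = length $text;

# Encode only at the output boundary; here to UTF-8.
my $utf8   = encode("UTF-8", $text);
my $octets = length $utf8;    # one more than $chars: U+00FC needs two bytes in UTF-8

binmode STDOUT, ":raw";       # we are printing already-encoded bytes
print $utf8, "\n";
```

Encode::Locale only adds the step of working out which encoding name to pass; the decode and encode calls themselves stay the same.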

Standard collation, plus locales and tailoring

The point is, once you have everything decoded into Perl's internal format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:

   use v5.14;
   use utf8;
   use Unicode::Collate;
   binmode STDOUT, ":utf8";   # avoid "Wide character" warnings on output
   my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
   @exes = Unicode::Collate->new->sort(@exes);
   say "@exes";

   # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹

Or they can be pretty fancy. Here is one that tries to cope with book titles: it strips leading articles and zero-pads numbers.

 my $collator = Unicode::Collate->new(
     upper_before_lower => 1,
     preprocess => sub {
         local $_ = shift;
         s/^ (?: The | An? ) \h+ //x;            # strip leading articles
         s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers to 20 digits
         return $_;
     },
 );

Then you just use that object's sort method to sort with.
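As a rough end-to-end sketch of that step, here is a title collator in the same spirit applied to a handful of titles. The titles are made up for illustration, and the preprocess callback uses a named variable rather than $_ purely as a stylistic choice.

```perl
use v5.14;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        my $title = shift;
        $title =~ s/^ (?: The | An? ) \h+ //x;            # strip leading articles
        $title =~ s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
        return $title;
    },
);

# Hypothetical sample data.
my @titles = ("The Hobbit", "Catch 22", "A Tale of Two Cities", "Catch 9");
my @sorted = $collator->sort(@titles);

# preprocess only affects how keys are built; the original strings come back.
say for @sorted;
```

With these inputs, "Catch 9" should come before "Catch 22" because both numbers are padded to the same width before comparison, and "The Hobbit" files under H rather than T.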

Sometimes you need to turn the sort inside out. For example:

 my $collator = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = 
        $collator->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

You have to do that because you are sorting records with several fields. The binary sort key lets you use the cmp operator on data that has been through your chosen or custom collator object.
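A self-contained version of the same idea might look like the sketch below. The records are invented for illustration; the only Unicode::Collate call involved is getSortKey, as in the fragment above.

```perl
use v5.14;
use utf8;
use Unicode::Collate;

binmode STDOUT, ":utf8";

# Made-up records for illustration.
my @recs = (
    { NAME => "Zoë",  AGE => 30 },
    { NAME => "adam", AGE => 30 },
    { NAME => "Bob",  AGE => 40 },
);

my $collator = Unicode::Collate->new;

# Compute each sort key once, up front, rather than inside the comparator.
$_->{NAME_key} = $collator->getSortKey( $_->{NAME} ) for @recs;

my @srecs = sort {
    $b->{AGE}      <=> $a->{AGE}         # oldest first
                   ||
    $a->{NAME_key} cmp $b->{NAME_key}    # then by collated name
} @recs;

say "$_->{AGE} $_->{NAME}" for @srecs;
```

A plain cmp on the NAME fields would put "Zoë" ahead of "adam", since uppercase letters precede lowercase ones in code point order; comparing the sort keys gives alphabetical order instead.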

The full constructor for the collator object has all of the following as its formal syntax:

      $Collator = Unicode::Collate->new(
         UCA_Version => $UCA_Version,
         alternate => $alternate, # alias for 'variable'
         backwards => $levelNumber, # or \@levelNumbers
         entry => $element,
         hangul_terminator => $term_primary_weight,
         highestFFFF => $bool,
         identical => $bool,
         ignoreName => qr/$ignoreName/,
         ignoreChar => qr/$ignoreChar/,
         ignore_level2 => $bool,
         katakana_before_hiragana => $bool,
         level => $collationLevel,
         minimalFFFE => $bool,
         normalization  => $normalization_form,
         overrideCJK => \&overrideCJK,
         overrideHangul => \&overrideHangul,
         preprocess => \&preprocess,
         rearrange => \@charList,
         rewrite => \&rewrite,
         suppress => \@charList,
         table => $filename,
         undefName => qr/$undefName/,
         undefChar => qr/$undefChar/,
         upper_before_lower => $bool,
         variable => $variable,
      );

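But you usually don't have to worry about most of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.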

 use Unicode::Collate::Locale;
 $coll = Unicode::Collate::Locale->
           new(locale => "fr");
 @french_text = $coll->sort(@french_text);

See how easy that is?
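To make the effect of tailoring visible, here is a small sketch that sorts the same list with the default collator and with a Swedish one. The word list and the choice of the "sv" locale are assumptions picked for illustration.

```perl
use v5.14;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ":utf8";

# Hypothetical sample words.
my @words = qw( zebra öl orm );

# Default UCA order: ö differs from o only at the secondary level,
# so it sorts among the o's.
my @default = Unicode::Collate->new->sort(@words);

# Swedish tailoring: ö is a separate letter that sorts after z.
my @swedish = Unicode::Collate::Locale->new(locale => "sv")->sort(@words);

say "default: @default";
say "sv:      @swedish";
```

Nothing here consults the operating system's locale settings, so the two orderings should come out the same on Unix and on Windows.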

But you can do other cool things, too.

 use v5.14;
 use utf8;
 use Unicode::Collate::Locale;
 binmode STDOUT, ":utf8";
 my $Collator = Unicode::Collate::Locale->new(
                  locale => "de__phonebook",
                  level  => 1,
                  normalization => undef,
                );

 my $full = "Ich müß Perl studieren.";
 my $sub = "MUESS";
 if (my ($pos,$len) = $Collator->index($full, $sub)) {
     my $match = substr($full, $pos, $len);
     say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
 }

When run, that displays:

 Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
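The level => 1 setting is what makes that search ignore case and accents. As a small sketch of the same effect on plain comparisons, the collator's eq method can be tried at different levels; the sample strings are made up for illustration.

```perl
use v5.14;
use utf8;
use Unicode::Collate;

# level => 1 compares base letters only, ignoring accents (level 2)
# and case (level 3).
my $loose  = Unicode::Collate->new(level => 1);
my $strict = Unicode::Collate->new;    # default: all levels are compared

my $loose_equal  = $loose->eq("résumé", "RESUME")  ? 1 : 0;
my $strict_equal = $strict->eq("résumé", "RESUME") ? 1 : 0;

say "level 1: ", $loose_equal  ? "equal" : "different";
say "default: ", $strict_equal ? "equal" : "different";
```

The same collator objects also offer cmp, lt, gt and friends, which is handy when the comparison is needed outside a sort.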

The locales available as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage, are:

 locale name       description
--------------------------------------------------------------
 af                Afrikaans
 ar                Arabic
 as                Assamese
 az                Azerbaijani (Azeri)
 be                Belarusian
 bg                Bulgarian
 bn                Bengali
 bs                Bosnian
 bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
 ca                Catalan
 cs                Czech
 cy                Welsh
 da                Danish
 de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
 ee                Ewe
 eo                Esperanto
 es                Spanish
 es__traditional   Spanish ('ch' and 'll' as a grapheme)
 et                Estonian
 fa                Persian
 fi                Finnish (v and w are primary equal)
 fi__phonebook     Finnish (v and w as separate characters)
 fil               Filipino
 fo                Faroese
 fr                French
 gu                Gujarati
 ha                Hausa
 haw               Hawaiian
 hi                Hindi
 hr                Croatian
 hu                Hungarian
 hy                Armenian
 ig                Igbo
 is                Icelandic
 ja                Japanese [1]
 kk                Kazakh
 kl                Kalaallisut
 kn                Kannada
 ko                Korean [2]
 kok               Konkani
 ln                Lingala
 lt                Lithuanian
 lv                Latvian
 mk                Macedonian
 ml                Malayalam
 mr                Marathi
 mt                Maltese
 nb                Norwegian Bokmal
 nn                Norwegian Nynorsk
 nso               Northern Sotho
 om                Oromo
 or                Oriya
 pa                Punjabi
 pl                Polish
 ro                Romanian
 ru                Russian
 sa                Sanskrit
 se                Northern Sami
 si                Sinhala
 si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
 sk                Slovak
 sl                Slovenian
 sq                Albanian
 sr                Serbian
 sr_Latn           Serbian in Latin (tailored as Croatian)
 sv                Swedish (v and w are primary equal)
 sv__reformed      Swedish (v and w as separate characters)
 ta                Tamil
 te                Telugu
 th                Thai
 tn                Tswana
 to                Tonga
 tr                Turkish
 uk                Ukrainian
 ur                Urdu
 vi                Vietnamese
 wae               Walser
 wo                Wolof
 yo                Yoruba
 zh                Chinese
 zh__big5han       Chinese (ideographs: big5 order)
 zh__gb2312han     Chinese (ideographs: GB-2312 order)
 zh__pinyin        Chinese (ideographs: pinyin order) [3]
 zh__stroke        Chinese (ideographs: stroke order) [3]
 zh__zhuyin        Chinese (ideographs: zhuyin order) [3]

   Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
   it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
   (Zulu).

   Note

   [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and halfwidth forms are identical to their regular form.  The
   difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
   and then "katakana_before_hiragana" has no effect.

   [2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
   (level 2) greater than, the corresponding hangul syllable.

   [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.

   Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.

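In summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that does not rely on the random settings of the user's console window for correct behavior.

Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by its author's kind permission. :)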

score 1 · Accepted Answer

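Win32::OLE::NLS gives you access to that part of the system. It provides CompareString and the tools you need to obtain the necessary locale ID.

In case you want or need to find the system documentation, the name of the underlying system call is CompareStringEx.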
