perl - Normalizing ASCII characters

Question

I need to normalize a string such as "quée" and I can't seem to convert the extended ASCII characters such as é, á, í, etc. into Roman/English versions. I've tried several different methods, but nothing has worked so far. There is a fair amount of material on this general subject, but I can't find a working answer to this problem.

Here's my code:

#transliteration solution (works great with standard chars but doesn't find the 
#special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\\x{130}/e/;

#converting into array, then iterating through and replacing the specific char
#( same result as the above solution )
my @breakdown = split( "",$mystring );

foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "",@breakdown );
}

score 9 · Accepted Answer

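1) This article should provide a pretty good (if complicated) way.

It provides a solution for converting every accented Unicode character into a base character plus a separate accent. Once that's done, you can simply remove the accent characters on their own.

2) Another option is CPAN: Text::Unaccent::PurePerl (an improved pure-Perl version of Text::Unaccent).

3) Also, this SO answer suggests Text::Unidecode: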

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
  ete
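The decompose-then-strip idea from option 1 can be sketched with the core Unicode::Normalize module. This is a minimal sketch, not the code from the linked article; the helper name strip_accents is hypothetical, and it assumes the input is already a decoded character string rather than raw bytes.

```perl
use strict;
use warnings;
use utf8;                            # this source contains literal UTF-8 characters
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose each character, then drop the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);     # "é" becomes "e" + U+0301 COMBINING ACUTE ACCENT
    $decomposed =~ s/\p{Mn}//g;      # remove nonspacing marks (the accents)
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("quée"), "\n";   # prints "quee"
```

Characters that have no canonical decomposition (for example ø or ß) pass through unchanged, which is one reason a module such as Text::Unidecode can be the more convenient choice.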

score 7 · Accepted Answer

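The reason your original code doesn't work is that \x{130} is not é. It's LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130, or İ). You meant \x{E9}, or just \xE9 (the braces are optional for two-digit numbers), which is LATIN SMALL LETTER E WITH ACUTE (U+00E9).

Also, you have an extra backslash in your tr; it should look like tr/\xE9/e/.

With those changes, your code will work, although I'd still recommend using one of the modules on CPAN for this sort of thing. I prefer Text::Unidecode myself, as it handles a lot more than just accented characters.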
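As a rough illustration of the corrected tr, here is a minimal sketch. The helper name fold_accents and the second, longer tr list are assumptions added for illustration; the sketch also assumes the string holds decoded characters rather than UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;   # literal accented characters appear in the tr/// list below

# Hypothetical helper: map a fixed set of accented Latin letters to ASCII.
sub fold_accents {
    my ($text) = @_;
    # Single backslash, correct code point: \xE9 is é (U+00E9).
    $text =~ tr/\xE9/e/;
    # The same idea extended to a few more letters; both lists must line up.
    $text =~ tr/áíóúñÁÉÍÓÚÑ/aiounAEIOUN/;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print fold_accents("qu\x{E9}e"), "\n";   # prints "quee"
```

A hand-written list like this only covers the characters you remember to include, which is the practical argument for a CPAN module.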

score 3 · Accepted Answer

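After working and re-working, this is what I have now. It does everything I want, except that I would like to keep the spaces in the middle of the input string to differentiate between words.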

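use Unicode::Normalize;    # provides NFD()

# Read the file as UTF-8 so accented letters arrive as single characters
open FILE, "<:encoding(UTF-8)", "funnywords.txt" or die "Cannot open funnywords.txt: $!";
binmode STDOUT, ":encoding(UTF-8)";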

# Iterate through funnywords.txt
while ( <FILE> ) {
    chomp;

    # Show initial text from file
    print "In: '$_' -> ";

    my $inputString = $_;

    # Alias $inputString to $_ with a one-element for loop, then
    # decompose Unicode characters ( example: "é" splits into "e"
    # and a combining acute accent ) and throw away the accent marks.
    # Also deletes every run of non-word characters, which removes
    # punctuation but also the spaces between words.
    for ( $inputString ) {
        $_ = NFD( $_ );                     # decompose/dissect
        s/^\s+//; s/\s+$//;                 # strip begin/end spaces
        s/\pM//g;                           # strip combining marks
        s/\W+//g;                           # strip non-word chars
    }

    # Convert to lowercase 
    my $outputString = "\L$inputString";

    # Output final result
    print "$outputString\n";
}

I'm not entirely sure why some of the regular expressions and comments are colored red...

Here are a few example lines from "funnywords.txt":

quée

22.

?éÉíóñúÑ¿

[ .this? ]

aquí, aALLí
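One possible way to keep the spaces between words is to replace each run of non-word characters with a single space instead of deleting it, then trim the ends. This is a minimal sketch under that assumption; the helper name normalize_words is hypothetical and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: same steps as the loop above, but runs of non-word
# characters collapse to one space instead of disappearing.
sub normalize_words {
    my ($text) = @_;
    $text = NFD($text);          # split accented letters into base + mark
    $text =~ s/\pM//g;           # drop the combining marks
    $text =~ s/\W+/ /g;          # punctuation and whitespace -> single space
    $text =~ s/^\s+|\s+$//g;     # trim leading and trailing spaces
    return lc $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print normalize_words("aquí, aALLí"), "\n";   # prints "aqui aalli"
```

The marks are stripped before the \W+ substitution because combining marks count as word characters in Perl, so they would otherwise survive that step.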
