java - Replacing Unicode control characters

Question

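I need to replace all special control characters in a Java string.

I want to query the Google Maps API v3, and Google does not seem to accept these characters.

Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F

This URL contains the following character: http://www.fileformat.info/info/unicode/char/008f/index.htm

I receive data and need to geocode it. I know that some characters will not pass geocoding, but I don't know the exact list.

I couldn't find any documentation on this issue, so I assume the list of characters Google doesn't like is this one: http://www.fileformat.info/info/unicode/category/Cc/list.htm

Is there an existing function that removes these characters, or do I have to write a new one and replace them one by one?

Or is there a good regular expression that does the job?

And does anyone know the exact list of characters Google doesn't like?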

Edit: Google has created a web page for this:

https://developers.google.com/maps/documentation/webservices/?hl=fr#BuildingURLs

score 13 · Accepted Answer

If you want to delete all characters in the Unicode Other/Control category (Cc), you can do something like this:

    System.out.println(
        "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
    ); // abcd

Note that this removes the actual Unicode character '\u008f' (among others) from the string, not its percent-encoded form "%C2%8F", so it has to be applied before the string is URL-encoded.
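As a minimal sketch of how this fits the geocoding case in the question: strip the control characters from the raw address first, then percent-encode the result. The class name `GeocodeAddressCleaner` and its method names are hypothetical, the endpoint is simply the one from the question's example URL, and `URLEncoder.encode(String, Charset)` assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any Google or JDK API.
public class GeocodeAddressCleaner {

    // Endpoint copied from the question's example URL (an assumption for illustration).
    private static final String BASE_URL =
        "http://www.google.com/maps/api/geocode/json?sensor=false&address=";

    // Removes every character in the Unicode Cc (control) category.
    public static String stripControlChars(String raw) {
        return raw.replaceAll("\\p{Cc}", "");
    }

    // Cleans the raw address first, then percent-encodes it as UTF-8.
    public static String buildGeocodeUrl(String rawAddress) {
        String cleaned = stripControlChars(rawAddress);
        return BASE_URL + URLEncoder.encode(cleaned, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "NEW YORK\u008f";
        // Encoding without cleaning keeps the control character as %C2%8F.
        System.out.println(URLEncoder.encode(raw, StandardCharsets.UTF_8)); // NEW+YORK%C2%8F
        System.out.println(buildGeocodeUrl(raw));
    }
}
```

`URLEncoder` targets form encoding, so it writes a space as `+` rather than `%20`; whether that matters depends on the service being called.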

If the blacklist is not neatly captured by a single Unicode block or category, Java's regex engine supports character class arithmetic (union, intersection, subtraction) that you can use. Alternatively, you can use a negated whitelist: instead of listing which characters are illegal, you specify which are legal, and everything else is treated as illegal.

Examples

Here's a subtraction example:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[a-z&&[^aeiou]]", "_")
    );
    //   _e_u_a_ e___e__io__: _o_ _ou _a_e __o __o__e__!!

[…] is a character class: [aeiou] matches any one lowercase vowel. [^…] is a negated character class: [^aeiou] matches any one character except a lowercase vowel.

[a-z&&[^aeiou]] intersects [a-z] with [^aeiou], which amounts to [a-z] minus [aeiou], i.e. all lowercase consonants.

The next example shows the negated whitelist approach:

    System.out.println(
        "regular expressions: now you have two problems!!"
            .replaceAll("[^a-z]", "_")
    );
    //   regular_expressions__now_you_have_two_problems__

Only lowercase letters a-z are legal; every other character is illegal and is replaced with an underscore.
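Both techniques carry over to the control-character problem. The sketch below is illustrative only: the class name `ControlCharFilters`, its method names, and in particular the whitelist of allowed characters are assumptions, not a list Google documents.

```java
// Hypothetical helper showing both approaches on control characters.
public class ControlCharFilters {

    // Subtraction: remove Cc characters except tab, line feed and carriage return.
    public static String stripControlsKeepWhitespace(String s) {
        return s.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
    }

    // Negated whitelist: keep Unicode letters, digits, space, comma, period and hyphen.
    // This set of allowed characters is an assumption chosen for illustration.
    public static String keepAddressChars(String s) {
        return s.replaceAll("[^\\p{L}\\p{N} ,.-]", "");
    }

    public static void main(String[] args) {
        // The NUL and U+008F characters are dropped; the tab and newline stay.
        System.out.println(stripControlsKeepWhitespace("a\tb\u0000c\u008f\n"));
        // The control character and the exclamation mark are dropped.
        System.out.println(keepAddressChars("Z\u00fcrich 8001\u008f!"));
    }
}
```

The subtraction form is the safer default when the input is otherwise trusted, since a whitelist that is too narrow silently drops legitimate characters.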
