java - Escape Unicode surrogate characters?

Question

I have the following line of text (see also the code below):

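What I am trying to do is escape that emoticon (a phone icon) as two \u characters and then convert it back to the original phone icon. The first method below works, but I essentially want to escape by range, so that only characters like this one are escaped. I don't see how that is possible with the first method.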

How can I do this range-based escaping with UnicodeEscaper so that it produces the same output as StringEscapeUtils (i.e. escape to two \uXXXX\uXXXX units and then unescape back to the phone icon)?

import org.apache.commons.lang3.text.translate.UnicodeEscaper;
import org.apache.commons.lang3.text.translate.UnicodeUnescaper;

    String text = "Unicode surrogate here-> 📱<--here";
    // Escape the entire string... not what I want, because there could
    // be \n, \r or other escape chars that I want left intact (I just want a range)
    String text2 = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
    System.out.println(text2);   // "Unicode surrogate here-> \uD83D\uDCF1<--here"
    // Unescape it back to the phone emoticon (note: unescape text2, not text)
    text2 = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text2);
    System.out.println(text2); // "Unicode surrogate here-> 📱<--here"

    // How do I do the same as above, but only for a range of chars (i.e. any Unicode surrogate),
    // which is what I want, rather than escaping the entire string?
    text2 = UnicodeEscaper.between(0x10000, 0x10FFFF).translate(text);
    System.out.println(text2); // "Unicode surrogate here-> \u1F4F1<--here"
    // Unescape... (I need the phone emoticon here)
    text2 = (new UnicodeUnescaper().translate(text2));
    System.out.println(text2);// "Unicode surrogate here-> ὏1<--here"

score 3 · Accepted Answer

A very late answer, but I found that you need the

org.apache.commons.lang3.text.translate.JavaUnicodeEscaper

class instead of UnicodeEscaper.

Using it, the output is:

Unicode surrogate here-> \uD83D\uDCF1<--here

And unescaping works fine.
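For readers without Commons Lang at hand, here is a minimal standard-library sketch of the same idea: escape only supplementary code points (U+10000 to U+10FFFF) as two UTF-16 escapes and leave everything else, including \n and \r, untouched. The class and method names (`SurrogateEscapeSketch`, `escape`, `unescape`) are hypothetical, and this is an approximation of the behaviour described above, not the library's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Commons Lang.
class SurrogateEscapeSketch {

    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static final Pattern UNIT = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Escapes only supplementary code points, each as a high and a low surrogate escape.
    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append(String.format("\\u%04X\\u%04X",
                        (int) Character.highSurrogate(cp),
                        (int) Character.lowSurrogate(cp)));
            } else {
                out.appendCodePoint(cp); // BMP characters pass through unchanged
            }
        });
        return out.toString();
    }

    // Turns every four-digit escape back into one char. Two adjacent surrogate
    // escapes therefore become a surrogate pair again. Note that this converts
    // all such escapes, not only surrogates.
    static String unescape(String input) {
        Matcher m = UNIT.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Unicode surrogate here-> \uD83D\uDCF1<--here\n";
        String escaped = escape(text);
        System.out.print(escaped);
        System.out.println(unescape(escaped).equals(text));
    }
}
```

Run on the question's string, this prints the phone icon as `\uD83D\uDCF1`, keeps the trailing newline as a real newline, and the round trip restores the original text.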

score 2 · Accepted Answer

Your string:

"Unicode surrogate here-> \u1F4F1<--here"

does not do what you think it does.

A char is basically a UTF-16 code unit, so it is 16 bits. What you actually have here is \u1F4F followed by the character 1, and that explains your output.
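The same four-hex-digit rule applies to escapes in Java string literals, which makes this easy to check. The sketch below (the class name `EscapeWidthDemo` and the `describe` helper are hypothetical) compares the five-digit form with the surrogate-pair form.

```java
// Hypothetical demo class; only standard-library calls are used.
class EscapeWidthDemo {

    // Reports how many UTF-16 code units and how many code points a string holds.
    static String describe(String s) {
        return s.length() + " chars, " + s.codePointCount(0, s.length()) + " code point(s)";
    }

    public static void main(String[] args) {
        // Only four hex digits belong to the escape; the trailing '1' is a separate character.
        System.out.println(describe("\u1F4F1"));       // 2 chars, 2 code point(s)
        // A surrogate pair: two code units that together form one code point.
        System.out.println(describe("\uD83D\uDCF1"));  // 2 chars, 1 code point(s)
    }
}
```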

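I don't know what you mean by "escaping" here, but if it means replacing a surrogate pair with "\uXXXX\uXXXX", then have a look at Character.toChars(). It returns the char sequence needed to represent one Unicode code point, whether that code point is in the BMP (one char) or not (two chars).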

For the code point U+1F4F1, it returns a two-element char array containing the chars 0xD83D and 0xDCF1, in that order. And this is what you want.
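As a small illustration of that, the sketch below formats whatever Character.toChars() returns as escapes, one per code unit. The class name `ToCharsDemo` and the `toEscapes` helper are hypothetical names chosen for this example.

```java
// Hypothetical demo class built on Character.toChars().
class ToCharsDemo {

    // Formats each UTF-16 code unit of the given code point as a four-digit escape.
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(Character.toChars(0x1F4F1).length); // 2 (outside the BMP)
        System.out.println(Character.toChars(0x41).length);    // 1 (inside the BMP)
        System.out.println(toEscapes(0x1F4F1));
        System.out.println(toEscapes(0x41));
    }
}
```

For U+1F4F1 the helper yields the two escapes `\uD83D\uDCF1`; for U+0041 it yields the single escape `\u0041`.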
