java - Decoding a string in Java

Question

How do I correctly decode the following string in Java?

http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru

When I use URLDecoder.decode(), I get the following error:

java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"

Thanks, Dave

score 2 · Accepted Answer

The %uXXXX encoding is non-standard and was in fact rejected by the W3C, so it is only natural that URLDecoder does not understand it.

You can write a small function that fixes this by replacing every occurrence of %uXXYY in the encoded string with %XX%YY. You can then go on and decode the fixed string as usual.
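A minimal sketch of that replacement, not the answerer's own code. The class and method names are hypothetical, and it assumes the input contains only %uXXYY escapes: because the rewritten bytes are decoded as big-endian UTF-16, any ordinary %XX escape mixed in would misalign the byte pairs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentUFix {

    // Hypothetical helper: rewrites every %uXXYY escape as the byte pair %XX%YY.
    static String splitPercentU(String encoded) {
        return encoded.replaceAll("%u([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})", "%$1%$2");
    }

    // Decodes the rewritten string as big-endian UTF-16, so each %XX%YY pair
    // becomes one UTF-16 code unit.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(splitPercentU(encoded), "UTF-16BE");
        } catch (UnsupportedEncodingException e) {
            // UTF-16BE is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // %u0420 and %u0443 are two Cyrillic letters.
        System.out.println(decode("q=%u0420%u0443"));
    }
}
```

Naming UTF-16BE explicitly avoids relying on how a byte-order mark, or its absence, is interpreted at the start of each decoded run.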

score 2 · Accepted Answer

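According to Wikipedia, "there exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value". It continues: "This behavior is not specified by any RFC and has been rejected by the W3C".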

Your URL contains such tokens, and the Java URLDecoder implementation does not support them.
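The behaviour is easy to reproduce. The sketch below (class and method names are hypothetical) checks whether URLDecoder rejects an input with IllegalArgumentException, which is the exception reported in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentURejected {

    // Returns true if URLDecoder refuses the input with IllegalArgumentException.
    static boolean isRejected(String encoded) {
        try {
            URLDecoder.decode(encoded, "UTF-8");
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected("%3Fq%3Dtest"));   // standard escapes only
        System.out.println(isRejected("btnG%3D%u0420")); // contains a %uXXXX token
    }
}
```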

score 1 · Accepted Answer

I started from Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are all left in for clarity. For more details, see http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript.

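import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Vector;

public static String unescape(String escaped) throws UnsupportedEncodingException
{
    // Widen every standard %XX escape to %u00XX. Otherwise its single byte would
    // misalign the 2-byte UTF-16 units decoded below. Only uppercase hex digits
    // are handled here.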
    String str = escaped.replaceAll("%0", "%u000");
    str = str.replaceAll("%1", "%u001");
    str = str.replaceAll("%2", "%u002");
    str = str.replaceAll("%3", "%u003");
    str = str.replaceAll("%4", "%u004");
    str = str.replaceAll("%5", "%u005");
    str = str.replaceAll("%6", "%u006");
    str = str.replaceAll("%7", "%u007");
    str = str.replaceAll("%8", "%u008");
    str = str.replaceAll("%9", "%u009");
    str = str.replaceAll("%A", "%u00A");
    str = str.replaceAll("%B", "%u00B");
    str = str.replaceAll("%C", "%u00C");
    str = str.replaceAll("%D", "%u00D");
    str = str.replaceAll("%E", "%u00E");
    str = str.replaceAll("%F", "%u00F");

    // Split each %uXXYY escape into two single-byte escapes, %XX%YY, so that decode won't fail
    String [] arr = str.split("%u");
    Vector<String> vec = new Vector<String>();
    if(!arr[0].isEmpty())
    {
        vec.add(arr[0]);
    }
    for (int i = 1 ; i < arr.length  ; i++) {
        if(!arr[i].isEmpty())
        {
            vec.add("%"+arr[i].substring(0, 2));
            vec.add("%"+arr[i].substring(2));
        }
    }
    str = "";
    for (String string : vec) {
        str += string;
    }
    // Here we return the decoded string
    return URLDecoder.decode(str,"UTF-16");
}
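For the UTF-8 variant mentioned above, one option is to skip the UTF-16 detour entirely. The following is a sketch of a different technique, not a modification of the code above: it decodes each %uXXXX escape directly into one UTF-16 code unit and treats runs of ordinary %XX escapes as UTF-8 bytes. The class name and the choice to keep malformed escapes as literal text are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class DirectUnescape {

    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && isHex(s, i + 1, 2)) {
                // Standard escape: collect the byte, decode the whole run later.
                bytes.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
                continue;
            }
            flush(bytes, out); // anything else ends the current run of bytes
            if (c == '%' && i + 1 < s.length() && s.charAt(i + 1) == 'u' && isHex(s, i + 2, 4)) {
                // Non-standard escape: the four hex digits are one UTF-16 code unit.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                // '+' means space in form encoding; malformed escapes stay literal.
                out.append(c == '+' ? ' ' : c);
                i++;
            }
        }
        flush(bytes, out);
        return out.toString();
    }

    // True if s has exactly len ASCII hex digits starting at index start.
    static boolean isHex(String s, int start, int len) {
        if (start + len > s.length()) return false;
        for (int k = start; k < start + len; k++) {
            char h = s.charAt(k);
            boolean hex = (h >= '0' && h <= '9') || (h >= 'a' && h <= 'f') || (h >= 'A' && h <= 'F');
            if (!hex) return false;
        }
        return true;
    }

    // Decodes the pending bytes as UTF-8 and appends them to the output.
    static void flush(ByteArrayOutputStream bytes, StringBuilder out) {
        if (bytes.size() > 0) {
            out.append(new String(bytes.toByteArray(), StandardCharsets.UTF_8));
            bytes.reset();
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("http%3A//x/%3Fq%3Dla+mer%26b%3D%u0420%u0443"));
    }
}
```

Unlike the replaceAll approach, this accepts lowercase hex digits and does not throw on an escape that is cut short. Bytes that are not valid UTF-8 come out as the replacement character U+FFFD.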

score 1 · Accepted Answer

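After studying the solution presented by @ariy closely, I wrote a Java-based solution that is also resilient to encoded characters that have been chopped in two (that is, half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 characters. See "What is the maximum length of a URL in different browsers?"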

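import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class Utils {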

    private static Pattern validStandard      = Pattern.compile("%([0-9A-Fa-f]{2})");
    private static Pattern choppedStandard    = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static Pattern validNonStandard   = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
    private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    public static String resilientUrlDecode(String input) {
        String cookedInput = input;

        if (cookedInput.indexOf('%') > -1) {
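            // Widen each standard %XX escape to %00%XX so that it decodes as one UTF-16 code unit.
            // Each byte becomes a single character, so multi-byte UTF-8 sequences are not reassembled.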
            cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");

            // Discard chopped encoded char at the end of the line (there is no way to know what it was)
            cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");

            // Handle non standard (rejected by W3C) encoding that is used anyway by some
            // See: https://stackoverflow.com/a/5408655/114196
            if (cookedInput.contains("%u")) {
                // Rewrite each non-standard %uXXYY escape as the UTF-16 byte pair %XX%YY.
                cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");

                // Discard chopped encoded char at the end of the line
                cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
            }
        }

        try {
            return URLDecoder.decode(cookedInput,"UTF-16");
        } catch (UnsupportedEncodingException e) {
            // Will never happen because the encoding is hardcoded
            return null;
        }
    }
}
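To see what the two "chopped" patterns discard, here is a small standalone sketch that reuses the same regular expressions. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class ChoppedEscapeDemo {

    // Same expressions as choppedStandard and choppedNonStandard above.
    private static final Pattern CHOPPED_STANDARD = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static final Pattern CHOPPED_NON_STANDARD = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    // Removes an incomplete escape sequence from the end of the input, if there is one.
    static String dropChoppedTail(String input) {
        String s = CHOPPED_STANDARD.matcher(input).replaceAll("");
        return CHOPPED_NON_STANDARD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(dropChoppedTail("q%3Dla%2"));   // trailing %2 is incomplete
        System.out.println(dropChoppedTail("q%3D%u04"));   // trailing %u04 is incomplete
        System.out.println(dropChoppedTail("q%3D%u0420")); // complete, left untouched
    }
}
```

Both patterns are anchored with `$`, so only an escape cut off at the very end of the input is removed. Complete escapes earlier in the string are left for the decoder.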
