java - Detecting the encoding of a URL in Java

Question

I'm trying to find out whether this is a solvable problem: the data in our database may be in mixed form. What I have are partial URLs in one of the following three forms:

/some/path?ugly=häßlich // case 1, Encoding: UTF-8 (plain)
/some/path?ugly=h%C3%A4%C3%9Flich // case 2, Encoding: UTF-8 (URL-encoded)
/some/path?ugly=h%E4%DFlich // case 3: Encoding: ISO-8859-1 (URL-encoded)

What the application needs is the URL-encoded UTF-8 version:

/some/path?ugly=h%C3%A4%C3%9Flich // Encoding: UTF-8 (URL-encoded)

All strings in the DB are UTF-8, but the URL encoding may or may not be present, and when it is present it may be in either format.

I have a method a that encodes plain UTF-8 to URL-encoded UTF-8, and a method b that decodes URL-encoded ISO-8859-1 to plain UTF-8, so basically what I do is:

Case 1:

String output = a(input);

Case 2:

String output = input;

Case 3:

String output = a(b(input));

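If I know which is which, all of these cases work fine, but is there a safe way to detect whether such a string is case 2 or case 3? (I can restrict the languages used in the parameters to European languages: German, English, French, Dutch, Polish, Russian, Danish, Norwegian, Swedish and Turkish, if that is any help.)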

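The obvious solution would be to clean up the data, but unfortunately the data is not created by me, nor by anyone with the necessary technical understanding (and there is a lot of legacy data that needs to keep working).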

score 2 · Accepted Answer

If you can assume that only alphanumeric characters are encoded, the following will work for:

"häßlich"
"h%C3%A4%C3%9Flich"
"h%E4%DFlich"

// check in this order:

public static boolean isUtf8Encoded(String url) {
    return isAlphaNumeric(url);
}

public static boolean isUrlUtf8Encoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "UTF-8"));
}

public static boolean isUrlIsoEncoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "ISO-8859-1"));
}

private static boolean isAlphaNumeric(String decode) {
    for (char c : decode.toCharArray()) {
        if (!Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}
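
The check above depends on the decoded text being purely alphanumeric. A different way to tell case 2 from case 3, given here only as a minimal sketch, is to turn the percent escapes into raw bytes and ask a strict UTF-8 decoder whether those bytes are well-formed. The class and method names (Utf8Probe, percentDecodeToBytes, isValidUtf8, looksLikeUtf8) are made up for this illustration and do not come from the answer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class Utf8Probe {

    /** Turns %XX escapes into raw bytes; every other character is written as UTF-8. */
    static byte[] percentDecodeToBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                // A complete escape: emit the single byte it stands for.
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                // A literal character (case 1): emit its UTF-8 bytes.
                int cp = s.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return out.toByteArray();
    }

    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** True if the percent-decoded value is well-formed UTF-8 (cases 1 and 2). */
    static boolean looksLikeUtf8(String value) {
        return isValidUtf8(percentDecodeToBytes(value));
    }
}
```

With the strings from the question, looksLikeUtf8 accepts "h%C3%A4%C3%9Flich" and rejects "h%E4%DFlich", because the byte E4 announces a three-byte sequence and DF is not a continuation byte. This remains a heuristic: pure ASCII values pass in either encoding (which is harmless here), and an ISO-8859-1 sequence can happen to be valid UTF-8, although that is unlikely in ordinary European-language text.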

score 1 · Accepted Answer

As a workaround, you can decode first and then encode. If the URL is not encoded, decoding leaves it unchanged.

String url = "your url";
url = URIUtil.decode(url, "UTF-8");
url = URIUtil.encodeQuery(url, "UTF-8");
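
URIUtil is not part of the JDK (it appears to be the class from Apache Commons HttpClient 3.x). As a rough sketch of the same decode-then-encode idea using only java.net, the snippet below normalizes a single query value; the class name RoundTrip and the method normalizeValue are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original answer.
class RoundTrip {

    /** Decodes one query value as UTF-8, then re-encodes it as UTF-8. */
    static String normalizeValue(String value) {
        try {
            // Unescaped text passes through decode() unchanged (case 1);
            // UTF-8 escapes are turned back into characters (case 2).
            String plain = URLDecoder.decode(value, "UTF-8");
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Some caveats: URLEncoder applies form encoding (a space becomes '+') and would also escape '/', '?' and '=', so this is meant for one parameter value rather than a whole URL. A stray '%' that is not followed by two hex digits makes URLDecoder.decode throw an IllegalArgumentException. The round trip also does not repair case 3: ISO-8859-1 escapes such as %E4%DF are not valid UTF-8, so they decode to replacement characters instead of the intended letters.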

score 0 · Accepted Answer

Thanks for the accepted answer, but it does not work for full URLs, because URLs also contain control characters. Here is my solution.

/**
 * List of valid characters in URL.
 */
private static final List<Character> VALID_CHARACTERS = Arrays.asList(
        '-', '.', '_', '~', ':', '/', '?', '#', '[', ']', '@', '!',
        '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='
);

/**
 * Check whether decoding was successful.
 * @param url decoded URL to check
 * @return true if every character is a letter, a digit, or a valid URL character
 */
private static boolean isWellFormed(final String url) {
    for (char c : url.toCharArray()) {
        if (!VALID_CHARACTERS.contains(c) && !Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}

/**
 * Try to decode a URL with a specific encoding.
 * @param url URL
 * @param encoding encoding to try
 * @return decoded URL, or null if the encoding is not the right one
 * @throws IllegalArgumentException if the encoding is not supported on your system
 */
private static String _decodeUrl(final String url, final String encoding) {
    try {
        final String decoded = URLDecoder.decode(url, encoding);
        if (isWellFormed(decoded)) {
            return decoded;
        }
    } catch (UnsupportedEncodingException ex) {
        throw new IllegalArgumentException("Illegal encoding: " + encoding, ex);
    }
    return null;
}

/**
 * Decode URL with most popular encodings for URL.
 * @param url URL
 * @return Decoded URL, or the original one if none of the encodings yields a valid result.
 */
public static String decodeUrl(final String url) {
    final String[] mostPopularEncodings = new String[] {"iso-8859-1", "utf-8", "GB2312"};
    return decodeUrl(url, mostPopularEncodings);
}

/**
 * Decode URL with the given encodings, tried in order.
 * @param url URL
 * @param encoding Encodings to try
 * @return Decoded URL, or the original one if none of the encodings yields a valid result.
 */
public static String decodeUrl(final String url, final String... encoding) {
    for (String e : encoding) {
        final String decoded = _decodeUrl(url, e);
        if (decoded != null) {
            return decoded;
        }
    }
    return url;
}
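
The "try several encodings in order" idea can also be expressed with strict charset decoders, which report malformed input instead of silently substituting characters. The following is only a sketch under the assumption that the percent escapes have already been turned into raw bytes; the class StrictDecode and the method decodeWithFirstMatch are hypothetical names, not part of the answer above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not part of the original answer.
class StrictDecode {

    /** Returns the text from the first charset that decodes the bytes without errors, or null. */
    static String decodeWithFirstMatch(byte[] bytes, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                return cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; try the next candidate.
            }
        }
        return null;
    }
}
```

The order of the candidates matters in this sketch. ISO-8859-1 maps every possible byte to a character, so it never fails and only makes sense as the last fallback; UTF-8 is strict and should be tried before it.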
