java - Is there a way to fix an incorrectly encoded string?

Question

I'm receiving this string via a message broker (Stomp):

JoÃÂ£o

And it should look like this:

João

Is there a way to revert this in Java?! Thanks!

score 4 · Accepted Answer

U+00C3  Ã   c3 83   LATIN CAPITAL LETTER A WITH TILDE
U+00C2  Â   c3 82   LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A3  £   c2 a3   POUND SIGN
U+00E3  ã   c3 a3   LATIN SMALL LETTER A WITH TILDE
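The byte values in the table above can be reproduced in Java. This is only a minimal sketch: the class name `Utf8ByteTable` and the helper `utf8Hex` are made up for illustration and are not part of the answer. It encodes each code point as UTF-8 and prints the bytes in hex, followed by the Unicode name reported by `Character.getName`.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteTable {
    /** Returns the UTF-8 bytes of one code point as lowercase hex, e.g. "c3 a3". */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as ffffffc3.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The four code points listed in the table.
        int[] codePoints = {0x00C3, 0x00C2, 0x00A3, 0x00E3};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  %s  %s%n", cp, utf8Hex(cp), Character.getName(cp));
        }
    }
}
```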

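I'm having a hard time seeing how this could be a data (encoding) conversion problem. Is it possible that the data is simply bad?

If the data is not bad, then we have to assume the encoding is being misinterpreted. We don't know the original encoding, and unless you are doing something different, Java holds strings internally as UTF-16. I don't see how João, encoded in any common encoding, could end up being interpreted as JoÃÂ£o.

To be sure, I wrote this Python script, which found no matches. I'm not entirely sure that it covers every encoding or that I haven't missed a corner case, FWIW.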

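#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import encodings
import pkgutil
import sys

good = 'João'
bad = 'JoÃÂ£o'

# Modules in the encodings package that are not codecs.
false_positives = {"aliases"}

# Every codec module shipped in the standard encodings package.
found = {name for _, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg}
found.difference_update(false_positives)
print(found)

# Encode with x, decode with y, and stop at the first pair that reproduces the bad string.
for x in found:
    for y in found:
        res = None
        try:
            res = good.encode(x).decode(y)
            print(res, x, y)
        except Exception:
            # This pair cannot encode/decode the data (or is not a text codec); skip it.
            pass
        if res is not None and res == bad:
            print("FOUND")
            sys.exit(1)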
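Since the question is about Java, the same brute-force idea can be sketched against the charsets a JVM ships with. This is an illustrative port, not the answerer's code: the class `EncodingPairSearch` and the method `findPairs` are hypothetical names, and the set of charsets (and therefore the result) may differ from what the Python script sees. Note that `String.getBytes(Charset)` silently substitutes unmappable characters, so a hit would still need a manual check.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class EncodingPairSearch {
    /** Lists every "encoder -> decoder" charset pair that turns good into bad. */
    static List<String> findPairs(String good, String bad) {
        List<String> hits = new ArrayList<>();
        for (Charset enc : Charset.availableCharsets().values()) {
            if (!enc.canEncode()) {
                continue; // some charsets are decode-only
            }
            byte[] bytes;
            try {
                bytes = good.getBytes(enc);
            } catch (RuntimeException e) {
                continue; // defensive: skip charsets that fail to encode
            }
            for (Charset dec : Charset.availableCharsets().values()) {
                try {
                    if (new String(bytes, dec).equals(bad)) {
                        hits.add(enc.name() + " -> " + dec.name());
                    }
                } catch (RuntimeException e) {
                    // defensive: skip charsets that fail to decode
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String good = "Jo\u00e3o";            // J o U+00E3 o
        String bad = "Jo\u00c3\u00c2\u00a3o"; // J o U+00C3 U+00C2 U+00A3 o
        System.out.println(findPairs(good, bad));
    }
}
```

As a sanity check of the search itself, the single-step case `findPairs("Jo\u00e3o", "Jo\u00c3\u00a3o")` should include the pair `UTF-8 -> ISO-8859-1`.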

score 2 · Accepted Answer

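In some cases a hack works. But the best approach is to prevent this from ever happening.

I ran into this problem before with a servlet that printed the correct headers, HTTP content type and encoding on the page, yet IE submitted forms encoded as latin1 instead of the correct encoding. So I wrote a quick and dirty hack (including a request wrapper that detects whether the client really is IE and converts) to fix it for new data, which worked fine. And for the data in the database that was already messed up, I used the following hack.

Unfortunately, my hack doesn't quite work for your example string, but it looks very close (your broken string has an extra Â compared to the broken string reproduced by my "theoretical cause"). So perhaps my "latin1" guess is wrong and you should try others (such as those in the other link posted by Tomas).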
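One way such an extra character could arise, offered purely as an assumption that nothing in this thread confirms, is that the UTF-8-read-as-latin1 mistake happened twice and an invisible control character (U+0083) was lost when the text was displayed or copied. The sketch below uses made-up names (`DoubleMojibakeSketch`, `misread`, `stripC1Controls`) to show that this sequence yields exactly the characters J o Ã Â £ o.

```java
import java.nio.charset.StandardCharsets;

final class DoubleMojibakeSketch {
    /** Encodes as UTF-8, then wrongly decodes those bytes as ISO-8859-1 (latin1). */
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Drops C1 control characters (U+0080..U+009F), which are usually rendered as nothing. */
    static String stripC1Controls(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80 || c > 0x9f) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String good = "Jo\u00e3o";             // J o U+00E3 o
        String once = misread(good);           // J o U+00C3 U+00A3 o
        String twice = misread(once);          // J o U+00C3 U+0083 U+00C2 U+00A3 o
        String shown = stripC1Controls(twice); // J o U+00C3 U+00C2 U+00A3 o
        System.out.println(shown.equals("Jo\u00c3\u00c2\u00a3o")); // true
    }
}
```

If that is what happened, the string as received may still contain U+0083, in which case applying the latin1-to-UTF-8 fix twice would restore it; once the control character is gone, the bytes are no longer valid UTF-8 and the fix cannot succeed.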

package peter.test;

import java.nio.charset.StandardCharsets;

/**
* User: peter
* Date: 2012-04-12
* Time: 11:02 AM
*/
public class TestEncoding {
    public static void main(String[] args) {
        // In some cases a hack works. But best is to prevent it from ever happening.
        String good = "João";
        String bad = "JoÃÂ£o";

        // This line demonstrates what the "broken" string should look like if it is reversible.
        String broken = breakString(good, bad);

        // Here we show that it is fixable if broken like breakString() does it.
        fixString(good, broken);

        // This line attempts to fix the string, but it is not fixable
        // unless broken in the same way as breakString().
        fixString(good, bad);
    }

    private static String fixString(String good, String bad) {
        // Read the Java chars as if they were latin1. If this works, it results in the
        // same number of bytes as Java chars; with UTF-8 there would be more bytes.
        byte[] bytes = bad.getBytes(StandardCharsets.ISO_8859_1);
        // Take the raw bytes and try to convert them to a string as if they were UTF-8.
        String fixed = new String(bytes, StandardCharsets.UTF_8);

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("fixed: " + fixed);
        System.out.println();

        return fixed;
    }

    private static String breakString(String good, String bad) {
        // Encode correctly as UTF-8, then misread the bytes as latin1.
        byte[] bytes = good.getBytes(StandardCharsets.UTF_8);
        String broken = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("broken: " + broken);
        System.out.println();

        return broken;
    }
}

Output (using Sun JDK 1.7.0_03):

Good: João
Bad: JoÃÂ£o
bytes1.length: 5
broken: JoÃ£o

Good: João
Bad: JoÃ£o
bytes1.length: 5
fixed: João

Good: João
Bad: JoÃÂ£o
bytes1.length: 6
fixed: Jo�£o
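The last result shows the weak spot of this kind of hack: when the guess is wrong, the decoder quietly substitutes the replacement character (U+FFFD) and the string ends up damaged in a new way. When cleaning stored data it may be safer to reject such input instead. The following is a minimal sketch under that assumption; the class `StrictMojibakeFix` and the method `tryFix` are hypothetical names, and returning `null` for "not fixable" is just one possible convention.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictMojibakeFix {
    /**
     * Reinterprets the chars of suspect as latin1 bytes and decodes them as UTF-8.
     * Returns null if any char is outside latin1 or the bytes are not valid UTF-8.
     */
    static String tryFix(String suspect) {
        try {
            ByteBuffer raw = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(suspect));
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(raw)
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // not reversible with this charset pair; leave the data alone
        }
    }

    public static void main(String[] args) {
        // Broken once (J o U+00C3 U+00A3 o): reversible.
        System.out.println(tryFix("Jo\u00c3\u00a3o"));
        // The string from the question (J o U+00C3 U+00C2 U+00A3 o): rejected.
        System.out.println(tryFix("Jo\u00c3\u00c2\u00a3o"));
    }
}
```

A useful side effect of the strict decoder is that an already correct string such as João is also rejected rather than mangled, because its latin1 bytes do not form valid UTF-8.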
