java - How to detect the encoding when using BufferedReader

Question

I know this question has been asked many times, but I am stuck on this problem and nothing I have read has helped.

I have this code:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while((line = reader.readLine()) != null)content += line+"\r\n";
reader.close();

I am trying to fetch the content of this web page, http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/, but all non-Latin characters are displayed incorrectly.

I tried setting the encoding explicitly, like this:

BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "WINDOWS-1251"));

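With that change everything worked! However, I cannot hard-code a different encoding for every website I try to parse, so I need a general solution.

I know that detecting an encoding is not easy, but I really need it. If anyone has had this problem, please explain how you solved it!

Any help is appreciated!

Here is the complete code of the function I use to fetch the content: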

protected Map<String, String> getFromUrl(String url){
    Map<String, String> mp = new HashMap<String, String>();
    String newCookie = "", redirect = null;
    try{
        String host = this.getHostName(url), content = "", header = "", UA = this.getUA(), cookie = this.getCookie(host, UA), referer = "http://"+host+"/";
        URL U = new URL(url);
        URLConnection conn = U.openConnection();
        conn.setRequestProperty("Host", host);
        conn.setRequestProperty("User-Agent", UA);
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3");
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        conn.setRequestProperty("Accept-Charset", "utf-8;q=0.7,*;q=0.7");
        conn.setRequestProperty("Keep-Alive", "115");
        conn.setRequestProperty("Connection", "keep-alive");
        conn.setRequestProperty("Connection", "keep-alive");
        if(referer != null)conn.setRequestProperty("Referer", referer);
        if(cookie != null && !cookie.contentEquals(""))conn.setRequestProperty("Cookie", cookie);
        for(int i=0; ; i++){
            String name = conn.getHeaderFieldKey(i);
            String value = conn.getHeaderField(i);
            if(name == null && value == null)break; 
            else if(name != null)if(name.contentEquals("Set-Cookie"))newCookie += value + " ";
            else if(name.toLowerCase().trim().contentEquals("location"))redirect = value;
            header += name + ": " + value + "\r\n";
        }
        if(!newCookie.contentEquals("") && !newCookie.contentEquals(cookie))this.setCookie(host, UA, newCookie.trim());
        try{
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while((line = reader.readLine()) != null)content += line+"\r\n";
            reader.close();
        }
        catch(Exception e){/*System.out.println(url+"\r\n"+e);*/}
        mp.put("url", url);
        mp.put("header", header);
        mp.put("content", content);
    }
    catch(Exception e){
        mp.put("url", "");
        mp.put("header", "");
        mp.put("content", "");
    }
    if(redirect != null && this.redirectCount < 3){
        mp = getFromUrl(redirect);
        this.redirectCount++;
    }
    return mp;
}

score 1 · Accepted Answer

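Use jsoup, for example. Detecting the character encoding of arbitrary websites is a complicated problem because headers can be wrong or missing, and there are two different meta tags that may declare the encoding. For instance, the page you linked does not send a charset in its Content-Type header.

You will need an HTML parser anyway. You weren't planning to use regular expressions, were you?

Here is a usage example: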

Connection connection = Jsoup.connect("http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/");
connection
    .header("Host", host)
    .header("User-Agent", UA)
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .header("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3")
    .header("Accept-Encoding", "gzip,deflate")
    .header("Accept-Charset", "utf-8;q=0.7,*;q=0.7")
    .header("Keep-Alive", "115")
    .header("Connection", "keep-alive");

connection.followRedirects(true);

Document doc = connection.get();

Map<String, String> cookies = connection.response().cookies();

Elements titles = doc.select(".title");
for( Element title : titles ) {
    System.out.println(title.ownText());
}

Output:

Шины Marangoni E-COMM
Описание шины Marangoni E-COMM

score 0 · Accepted Answer

Look for the "Content-Type" header:

Content-Type: text/html; charset=utf-8

The "charset" part there is what you are looking for.
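
As a rough illustration of reading that parameter, here is a minimal sketch using only the standard library. The class name `CharsetFromHeader` and the helper `charsetOf` are hypothetical names made up for this example, and the fallback charset is an assumption you would choose yourself. It only covers the header case; as the other answer notes, some pages (including the one linked in the question) send no charset in the header at all, and then the fallback is returned.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFromHeader {

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the given fallback when the parameter is missing, empty, or unsupported.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type and are separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = part.substring(0, eq).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue;
            }
            String value = part.substring(eq + 1).trim();
            // The value may be quoted, e.g. charset="utf-8".
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names.
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1; // assumed default for this demo
        System.out.println(charsetOf("text/html; charset=utf-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

With the `URLConnection` from the question, the header value is available from `conn.getContentType()`, and the resulting `Charset` can be passed as the second argument to `new InputStreamReader(conn.getInputStream(), charset)`.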
