java - Need help getting the HTML of a website in Java

Question

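I took some code from the question "java httpurlconnection cutting off html", and I am using pretty much the same code to fetch HTML from websites in Java. It works, except for one particular website that I cannot get this code to work with: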

I am trying to get the HTML from this website:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting junk characters, even though the same code works fine with other websites such as http://www.google.com.

And this is the code that I am using:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static String PrintHTML() throws IOException {
    URL url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    // Pretend to be a desktop browser
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    System.out.println(connection.getResponseCode());

    StringBuilder builder = new StringBuilder();
    // Read the response body line by line, using the platform default charset
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it does not work with the URL above.

Any help would be appreciated.

score 7 · Accepted Answer

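That site incorrectly gzips the response regardless of the client's capabilities, so the junk characters are the compressed bytes being decoded as text. Normally, a server should gzip the response only when the client says it supports that (via the Accept-Encoding: gzip request header). You need to decompress the response using GZIPInputStream.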

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));

Note that I also added the right charset to the InputStreamReader constructor. Normally, you would extract it from the Content-Type header of the response.
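To make those two points concrete, here is a minimal sketch, not a definitive implementation. The class ResponseDecoding and its methods maybeGunzip, charsetOf and readAll are hypothetical names made up for illustration. It assumes the body is either plain or gzip (detected by the gzip magic bytes 0x1f 0x8b), and it assumes UTF-8 as the fallback when the Content-Type header carries no usable charset.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; the names are illustrative and not part of any library.
final class ResponseDecoding {

    // Wraps the stream in a GZIPInputStream only if it starts with the gzip magic bytes.
    static InputStream maybeGunzip(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(2);              // remember the start so the peeked bytes can be re-read
        int b1 = buffered.read();
        int b2 = buffered.read();
        buffered.reset();
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(buffered) : buffered;
    }

    // Extracts the charset from a Content-Type value such as "text/html; charset=UTF-8".
    // Assumption of this sketch: fall back to UTF-8 if the charset is missing or unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break;         // illegal or unsupported charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Reads the whole body as text, decompressing and decoding as needed.
    static String readAll(InputStream body, String contentType) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(body), charsetOf(contentType)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

With the connection from the question, this could be called as ResponseDecoding.readAll(connection.getInputStream(), connection.getContentType()). Sniffing the magic bytes instead of trusting the Content-Encoding response header is a design choice of this sketch, so that the same code also works for sites that do not compress.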

For more hints, see also "How to use URLConnection to fire and handle HTTP requests". And if all you ultimately want is to parse or extract information from the HTML, I strongly recommend using an HTML parser like Jsoup instead.
