java - What is the best way to read the contents of a web page into a string in Java?

Question

I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.

public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }

    return page.toString();
}

I can't help feeling that reading line by line is less than optimal. I know that I may be masking a MalformedURLException caused by the openConnection call, and I'm okay with that.

My function also has the side effect of giving the HTML string the correct line terminators for the current system. This is not a requirement.

I realize that network IO will probably dwarf the time it takes to read in the HTML, but I would still like to know whether this is optimal.

On a side note, it would be great if StringBuilder had a constructor that took an open InputStream and simply read all of the InputStream's contents into the StringBuilder.

score 10 · Accepted Answer

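As the other answers show, a robust solution has to account for many different edge cases (HTTP peculiarities, encoding, chunking, etc.). Therefore, for anything beyond a toy program, I suggest using the de facto standard Java HTTP library, Apache HttpComponents HttpClient.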

They provide many samples. "Just" getting the response content for a request looks like this:

HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://www.google.com/"); 
ResponseHandler<String> responseHandler = new BasicResponseHandler();    
String responseBody = httpclient.execute(httpget, responseHandler);
// responseBody now contains the contents of the page
System.out.println(responseBody);
httpclient.getConnectionManager().shutdown();
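
To make the "encoding" edge case mentioned above a little more concrete, here is a minimal sketch using only the JDK. It is not part of HttpClient; the class name ContentTypeCharset and the method charsetOf are hypothetical names chosen for illustration. It picks the charset out of a Content-Type header value such as "text/html; charset=ISO-8859-1" and falls back to a caller-supplied default when none is usable.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value, or the
     * given fallback if the header is null, has no charset parameter,
     * or names a charset this JVM does not support.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With a plain HttpURLConnection this could be called as ContentTypeCharset.charsetOf(conn.getContentType(), StandardCharsets.UTF_8), where UTF-8 as the default is itself an assumption. HTML pages can also declare their encoding in a meta tag, which this sketch does not look at.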

score 2 · Accepted Answer

OK, edited once again. Be sure to add your try-finally blocks, or catch the IOException.

 ...
 final static int BUFZ = 4096;
 StringBuilder page = new StringBuilder();
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
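 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[BUFZ];
 int nRead = 0;

 while ((nRead = is.read(buf, 0, BUFZ)) > 0) {
    // convert only the nRead bytes actually read, not the whole buffer
    page.append(new String(buf, 0, nRead /* , Charset charset */));
    // uses local default char encoding for now; a multi-byte character
    // split across two reads would be decoded incorrectly
 }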

Try this one here:

 ...
 final static int MAX_SIZE = 10000000;
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[MAX_SIZE];
 int nRead = 0;
 int total = 0;
 // you could also use ArrayList so that you could dynamically
 //  resize or there are other ways to resize an array also
 // read into the unused tail of buf so earlier data is not overwritten
 while (total < MAX_SIZE && (nRead = is.read(buf, total, MAX_SIZE - total)) > 0) {
      total += nRead;
 }
 ...
 // do something with buf array of length total

The code below was not working because, with HTTP/1.1 "chunking", the Content-Length header line is not sent up front.

 ...
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
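 InputStream is = conn.getInputStream();
 // getContentLength() returns -1 when the length is not known
 // (e.g. with chunked transfer encoding), so the allocation below fails
 int cLen = conn.getContentLength();
 byte[] buf = new byte[cLen];
 int nRead = 0;

 while (nRead < cLen) {
      nRead += is.read(buf, nRead, cLen - nRead);
 }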
 ...
 // do something with buf array
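
For comparison, a variant that does not rely on Content-Length at all is to read until the stream reports end-of-stream and let a ByteArrayOutputStream grow as needed. The following is only a sketch under assumptions: the class name StreamToString and the method readAll are hypothetical, and the caller is assumed to know the page's charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class StreamToString {

    /**
     * Reads the stream until end-of-stream (read() returns -1) and decodes
     * all collected bytes at once, so no multi-byte character is split.
     * The caller remains responsible for closing the stream.
     */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Copy only the n bytes that were actually read.
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }
}
```

A call might look like StreamToString.readAll(conn.getInputStream(), StandardCharsets.UTF_8), where UTF-8 is an assumption about the page. The whole body is held in memory, so an upper bound such as MAX_SIZE above may still be worth enforcing for untrusted URLs.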

score 1 · Accepted Answer

You could do your own buffering on top of the InputStreamReader by reading bigger chunks into a character array and appending the array contents to the StringBuilder.

But that would make your code slightly harder to understand, and I doubt it would be worth it.

Note that the proposal by Sean A.O. Harney reads raw bytes, so you would need to do the conversion to text on top of that.
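
The character-array buffering described at the start of this answer might look roughly like the following. This is a sketch, not a measured improvement: the class name CharChunkReader, the method readAll and the 8192-character chunk size are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class CharChunkReader {

    /**
     * Reads decoded characters in large chunks and appends them to a
     * StringBuilder. Line terminators are preserved exactly as sent,
     * unlike the readLine() approach. Closes the stream when done.
     */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] chunk = new char[8192];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                // Append only the n characters that were actually read.
                page.append(chunk, 0, n);
            }
        }
        return page.toString();
    }
}
```

Because the InputStreamReader does the decoding, a multi-byte character that straddles two network reads is still handled correctly, which is the conversion step the raw-byte approach has to add separately.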
