java - Extending a basic web crawler to filter status codes and HTML

Question

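I followed a tutorial on writing a basic web crawler in Java and now have something with basic functionality.

At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so that it can filter out specifics such as the HTML page title and the HTTP status code.

I found this library: http://htmlparser.sourceforge.net/ ... which I think could do the job, but can I do it without using an external library?

Here is what I have so far: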

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public static void main(String[] args) {

    // String representing the URL
    String input = "";

    // Check if an argument was added at the command line
    if (args.length >= 1) {
        input = args[0];
    }

    // If no argument at the command line, use the default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }

    // Open the test URL and read from its input stream
    try {
        URL testURL = new URL(input);

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()))) {

            // String variable to hold the returned content
            String line;

            // Print content to console until there are no more lines
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}

score 1 · Accepted Answer

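There are definitely tools out there for HTTP communication. However, if you prefer to implement one yourself, look into java.net.HttpURLConnection. It will give you more fine-grained control over HTTP communication. Here is a small sample for you: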

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static void main(String[] args) throws IOException
{
  URL url = new URL("http://www.google.com");
  HttpURLConnection connection = (HttpURLConnection) url.openConnection();

  connection.setRequestMethod("GET");

  // Reading the body sends the request. For 4xx/5xx responses
  // getInputStream() throws an IOException, so resp will be "".
  String resp = getResponseBody(connection);

  // The status code is still available after the body has been read
  System.out.println("RESPONSE CODE: " + connection.getResponseCode());
  System.out.println(resp);
}

private static String getResponseBody(HttpURLConnection connection)
    throws IOException
{
  // try-with-resources closes the reader even if reading fails
  try (BufferedReader reader = new BufferedReader(new InputStreamReader(
      connection.getInputStream())))
  {
    StringBuilder responseBody = new StringBuilder();
    String line;

    while ((line = reader.readLine()) != null)
    {
      responseBody.append(line).append("\n");
    }

    return responseBody.toString();
  }
  catch (IOException e)
  {
    e.printStackTrace();
    return "";
  }
}
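
The question also asks about pulling out the page title without an external library. One possible approach, sketched here under assumptions, is to run a regular expression over the response body once the status code has been checked. The class name `PageFilter` and the methods `isSuccess` and `extractTitle` are hypothetical helpers invented for this illustration; they are not part of the JDK or of the sample above. A regex is not a real HTML parser, so this is only likely to cope with reasonably well-formed pages.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class PageFilter {

    // Matches the contents of the first <title> element, ignoring case and
    // allowing attributes on the tag and line breaks inside the title text.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns true for status codes in the 2xx (success) range.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    // Returns the title text, trimmed and with runs of whitespace collapsed
    // to single spaces, or an empty Optional if no <title> element is found.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).trim().replaceAll("\\s+", " "));
    }

    public static void main(String[] args) {
        // Offline stand-ins for connection.getResponseCode() and the body text.
        int status = 200;
        String html = "<html><head><TITLE>\n  Example   Page </TITLE></head>"
                + "<body></body></html>";

        if (isSuccess(status)) {
            // prints "Example Page"
            System.out.println(extractTitle(html).orElse("(no title)"));
        }
    }
}
```

In the sample above, the same helpers could be fed `connection.getResponseCode()` and `resp` in place of the hard-coded values, so that only successful responses are searched for a title.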
