java - Java web crawler downloads too many GB of data

Question

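I have written a web crawler, but when it crawls it downloads many gigabytes of data.

I only want to read the text (and avoid images and other non-text content).

I use Boilerpipe to extract the content from the HTML.

This is how I find the final redirected URL: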

public String getFinalRedirectedUrl(String url) throws IOException {
    String finalUrl = url;
    int redirectCount = 0;
    while (true) {
        HttpURLConnection connection = (HttpURLConnection) new URL(finalUrl)
                .openConnection();
        try {
            connection.setConnectTimeout(Config.HTTP_CONNECTION_TIMEOUT_TIME);
            connection.setReadTimeout(Config.HTTP_READ_TIMEOUT_TIME);
            // Follow redirects manually so that each hop can be counted.
            connection.setInstanceFollowRedirects(false);
            connection.setUseCaches(false);
            connection.setRequestMethod("GET");
            int responseCode = connection.getResponseCode();
            if (responseCode < 300 || responseCode >= 400) {
                break; // not a redirect, so finalUrl is the final URL
            }
            String redirectedUrl = connection.getHeaderField("Location");
            if (redirectedUrl == null) {
                break;
            }
            // Location may be relative; resolve it against the current URL.
            finalUrl = new URL(new URL(finalUrl), redirectedUrl).toString();
            redirectCount++;
            if (redirectCount > Config.MAX_REDIRECT_COUNT) {
                throw new java.net.ProtocolException("Server redirected too many times (" + Config.MAX_REDIRECT_COUNT + ")");
            }
        } finally {
            // Release the connection on every hop, not only the last one.
            connection.disconnect();
        }
    }
    return finalUrl;
}
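
The redirect check above only needs the status code and the `Location` header, so one option is to send `HEAD` instead of `GET` and let the server omit the body. The following is a minimal sketch under that assumption, not a drop-in replacement: `RedirectSketch`, `resolveLocation` and `nextHop` are hypothetical names, and some servers answer `HEAD` incorrectly or reject it, in which case falling back to `GET` would be needed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

class RedirectSketch {

    /**
     * Resolves a possibly relative Location value against the URL that
     * returned it. Uses java.net.URI, so a Location containing characters
     * that are illegal in a URI raises IllegalArgumentException.
     */
    static String resolveLocation(String currentUrl, String location) {
        return URI.create(currentUrl).resolve(location).toString();
    }

    /**
     * Sends a HEAD request (headers only, no body) and returns the next URL
     * to follow, or null when the response is not a redirect.
     */
    static String nextHop(String currentUrl, int timeoutMillis) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(currentUrl).openConnection();
        try {
            con.setRequestMethod("HEAD");
            con.setInstanceFollowRedirects(false);
            con.setConnectTimeout(timeoutMillis);
            con.setReadTimeout(timeoutMillis);
            int code = con.getResponseCode();
            String location = con.getHeaderField("Location");
            if (code >= 300 && code < 400 && location != null) {
                return resolveLocation(currentUrl, location);
            }
            return null;
        } finally {
            con.disconnect();
        }
    }
}
```

Calling `nextHop` in a loop until it returns `null`, with the same redirect counter as above, would give the final URL without transferring any response bodies.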

This is how I fetch a URL:

private HTMLDocument fetch(URL url) throws IOException{
    final HttpURLConnection httpcon = (HttpURLConnection) url.openConnection();
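    // setFollowRedirects is static and would change the setting for every connection;
    // the per-connection setter is the one intended here.
    httpcon.setInstanceFollowRedirects(true);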
    httpcon.setConnectTimeout(Config.HTTP_CONNECTION_TIMEOUT_TIME);
    httpcon.setReadTimeout(Config.HTTP_READ_TIMEOUT_TIME);
    httpcon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2");
    final String ct = httpcon.getContentType();

    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        if(!ct.contains("text/html")){
            System.err.println("Content type is:"+ct);
            return new HTMLDocument("");
        }

        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
                // keep default
            }
        }
    }

    InputStream in = httpcon.getInputStream();

    final String encoding = httpcon.getContentEncoding();
    if(encoding != null) {
        if("gzip".equalsIgnoreCase(encoding)) {
            in = new GZIPInputStream(in);
        } else {
            System.err.println("WARN: unsupported Content-Encoding: "+encoding);
        }
    }

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int r;
    while ((r = in.read(buf)) != -1) {
        bos.write(buf, 0, r);
    }
    in.close();

    final byte[] data = bos.toByteArray();

    return new HTMLDocument(data, cs);
}
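
The read loop in `fetch` keeps reading until the stream ends, so a single very large response is downloaded in full. One way to bound this, sketched here as an assumption rather than a confirmed fix, is to stop reading once a byte limit is exceeded. `CappedReader`, `readCapped` and the limit value are hypothetical and not part of the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class CappedReader {

    /**
     * Reads the stream into memory, but gives up as soon as more than
     * maxBytes have arrived. Returns null in that case so the caller can
     * skip the page. The caller remains responsible for closing the stream.
     */
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) != -1) {
            if (bos.size() + r > maxBytes) {
                return null; // over the limit: stop reading
            }
            bos.write(buf, 0, r);
        }
        return bos.toByteArray();
    }
}
```

In `fetch`, the `while` loop could be replaced by a call such as `readCapped(in, 2 * 1024 * 1024)`. If the stream has been wrapped in `GZIPInputStream`, the limit applies to the decompressed size. When the server sends a `Content-Length` header, `httpcon.getContentLengthLong()` could also be checked before opening the stream at all.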

And this is how I get the body using Boilerpipe:

HTMLDocument htmlDoc = fetch(new URL(url));
String body = ArticleExtractor.INSTANCE.getText(htmlDoc.toInputSource());

How can I reduce the amount of data downloaded?
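
One cheap measure that fits the goal of avoiding images is to skip URLs whose path ends in an obviously non-text extension before making any request. This is only a heuristic sketch: `UrlFilter`, `looksBinary` and the extension list are hypothetical and deliberately incomplete, and the `Content-Type` check in `fetch` is still needed because many binary resources have no telltale extension.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UrlFilter {

    // Hypothetical, non-exhaustive list of extensions to skip.
    private static final Set<String> SKIPPED = new HashSet<>(Arrays.asList(
            "jpg", "jpeg", "png", "gif", "ico", "pdf", "zip", "mp3", "mp4", "css", "js"));

    /** Returns true when the last path segment ends in a skipped extension. */
    static boolean looksBinary(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return false; // opaque URI such as mailto:, nothing to inspect
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false; // no extension in the last path segment
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SKIPPED.contains(extension);
    }
}
```

The crawler could call `looksBinary(url)` on each discovered link and drop the ones that return `true` before they reach the fetch queue. The query string is ignored because only the path component is inspected.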
