multithreading - Java 8 CompletableFuture web crawler doesn't crawl past one URL

Question

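I'm experimenting with the concurrency features newly introduced in Java 8 while working through exercises from the book "Java SE 8 for the Really Impatient" by Cay S. Horstmann. I wrote the following web crawler using the new CompletableFuture and jsoup. The basic idea is that, given a URL, it finds the first m URLs on that page and repeats this process n times. m and n are parameters, of course. The problem is that the program fetches the URLs of the starting page but doesn't recurse. What am I missing?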

static class WebCrawler {
    CompletableFuture<Void> crawl(final String startingUrl,
        final int depth, final int breadth) {
        if (depth <= 0) {
            return completedFuture(startingUrl, depth);
        }

        final CompletableFuture<Void> allDoneFuture = allOf((CompletableFuture[]) of(
            startingUrl)
            .map(url -> supplyAsync(getContent(url)))
            .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
            .map(urlsFuture -> urlsFuture.thenApply(doForEach(
                depth, breadth)))
            .toArray(size -> new CompletableFuture[size]));

        allDoneFuture.join();

        return allDoneFuture;
    }

    private CompletableFuture<Void> completedFuture(
        final String startingUrl, final int depth) {
        LOGGER.info("Link: {}, depth: {}.", startingUrl, depth);

        CompletableFuture<Void> future = new CompletableFuture<>();
        future.complete(null);

        return future;
    }

    private Supplier<Document> getContent(final String url) {
        return () -> {
            try {
                return connect(url).get();
            } catch (IOException e) {
                throw new UncheckedIOException(
                    " Something went wrong trying to fetch the contents of the URL: "
                        + url, e);
            }
        };
    }

    private Function<Document, Set<String>> getURLs(final int limit) {
        return doc -> {
            LOGGER.info("Getting URLs for document: {}.", doc.baseUri());

            return doc.select("a[href]").stream()
                .map(link -> link.attr("abs:href")).limit(limit)
                .peek(LOGGER::info).collect(toSet());
        };
    }

    private Function<Set<String>, Stream<CompletableFuture<Void>>> doForEach(
          final int depth, final int breadth) {
        return urls -> urls.stream().map(
            url -> crawl(url, depth - 1, breadth));
    }
}

Test case:

@Test
public void testCrawl() {
    new WebCrawler().crawl(
        "http://en.wikipedia.org/wiki/Java_%28programming_language%29",
        2, 10);
}
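
One likely cause, judging only from the code above: doForEach returns a Stream built with map(...) but no terminal operation, so the lambda that calls crawl(...) never runs, and thenApply merely wraps that unconsumed Stream in a future. The following is a minimal sketch of one way around this, not a definitive fix. It assumes a hypothetical in-memory link graph (LINKS) in place of jsoup, a hypothetical class name CrawlSketch, and an extra visited parameter added purely so the result can be inspected. It collects the child futures into a list (a terminal operation) and uses thenCompose with allOf so the returned future completes only after the nested crawls finish.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the crawler; an in-memory graph replaces jsoup.
class CrawlSketch {
    // Assumed sample data: page -> outgoing links.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("a", Arrays.asList("b", "c"));
        LINKS.put("b", Arrays.asList("d"));
    }

    // Shows the laziness: map(...) alone never invokes its lambda.
    static int lazyMapCalls() {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pending = Stream.of(1, 2, 3).map(x -> calls.incrementAndGet());
        return calls.get(); // no terminal operation was called on 'pending'
    }

    static CompletableFuture<Void> crawl(String url, int depth, int breadth,
                                         Set<String> visited) {
        visited.add(url);
        if (depth <= 0) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture
            .supplyAsync(() -> LINKS.getOrDefault(url, Collections.emptyList()))
            .thenCompose(urls -> {
                // collect(...) is a terminal operation, so crawl(...) is
                // actually invoked for each of the first 'breadth' links.
                List<CompletableFuture<Void>> children = urls.stream()
                    .limit(breadth)
                    .map(child -> crawl(child, depth - 1, breadth, visited))
                    .collect(Collectors.toList());
                // Completes once every child crawl has completed.
                return CompletableFuture.allOf(
                    children.toArray(new CompletableFuture[0]));
            });
    }

    public static void main(String[] args) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        crawl("a", 2, 2, visited).join(); // block only at the top level
        System.out.println("visited: " + visited);
        System.out.println("lambda calls without terminal op: " + lazyMapCalls());
    }
}
```

With thenCompose the nested futures are flattened into the one that crawl returns, so the caller can join() once at the top level instead of inside every recursive call.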
