java - URL harvester concurrency issue, ConcurrentModificationException

Question

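Hello, I'm trying to run a recursive harvest of .pdf URLs, but I'm getting a ConcurrentModificationException. I don't understand how this can happen, and I don't know much about concurrency either. Any insight into how it occurs and how to fix it would be appreciated.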

public class urlHarvester {
    private URL rootURL;
    private String fileExt;
    private int depth;
    private HashSet<String> targets;
    private HashMap<Integer, LinkedList<String>> toVisit;

public urlHarvester(URL rootURL, String fileExt, int depth) {
    this.rootURL = rootURL;
    this.fileExt = fileExt;
    this.depth = depth;
    targets = new HashSet<String>();
    toVisit = new HashMap<Integer, LinkedList<String>>();
    for (int i = 1; i < depth + 1; i++) {
        toVisit.put(i, new LinkedList<String>());
    }
    doHarvest();
}

private void doHarvest() {
    try {
        harvest(rootURL, depth);
        while (depth > 0) {
            for (String s : toVisit.get(depth)) {
                toVisit.get(depth).remove(s);
                harvest(new URL(s),depth-1);
            }
            depth--;
        }   
    } catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
    }   
    for (String s : targets) {
        System.out.println(s);
    }

}

private void harvest(URL url, int depth) {
    try {
        URLConnection urlConnection = url.openConnection();
        InputStream inputStream = urlConnection.getInputStream();
        Scanner scanner = new Scanner(new BufferedInputStream(inputStream));
        java.lang.String source = "";
        while (scanner.hasNext()) {
            source = source + scanner.next();
        }   
        inputStream.close();
        scanner.close();

        Matcher matcher = Pattern.compile("ahref=\"(.+?)\"").matcher(source);
        while(matcher.find()) {
            java.lang.String matched = matcher.group(1);
            if (!matched.startsWith("http")) {
                if (matched.startsWith("/") && url.toString().endsWith("/")) {
                    matched = url.toString() + matched.substring(1);
                } else if ((matched.startsWith("/") && !url.toString().endsWith("/"))
                        || (!matched.startsWith("/") && url.toString().endsWith("/"))) {
                    matched = url.toString() + matched;
                } else if (!matched.startsWith("/") && !url.toString().endsWith("/")) {
                    matched = url.toString() + "/" + matched;
                }
            }
            if (matched.endsWith(".pdf") && !targets.contains(matched)) {
                targets.add(matched);
                System.out.println("ADDED");
            }
            if (!toVisit.get(depth).contains(matched)) {
                toVisit.get(depth).add(matched);
            }
        }
    } catch (Exception e) {
        System.err.println(e);
    }
}
} // end of class urlHarvester

The class containing the main call:

urlHarvester harvester = new urlHarvester(new URL("http://anyasdf.com"), ".pdf", 5);

score 5 · Accepted Answer

The error probably has nothing to do with concurrency. It is caused by this loop:

for (String s : toVisit.get(depth)) {
    toVisit.get(depth).remove(s);
    harvest(new URL(s),depth-1);
}

To remove an item from a collection while iterating over it, you need to use the iterator's remove method:

List<String> list = toVisit.get(depth); //I assume list is not null
for (Iterator<String> it = list.iterator(); it.hasNext();) {
    String s = it.next();
    it.remove();
    harvest(new URL(s),depth-1);
}
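
To see the difference in isolation, here is a minimal single-threaded sketch. The class name IteratorRemoveDemo, the method names, and the sample strings are made up for illustration and are not part of the question's code. It assumes a LinkedList with three entries. With only one or two entries the direct removal may go unnoticed, because the loop can end before the iterator checks for modification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class IteratorRemoveDemo {

    // Removes through the list itself inside a for-each loop, as in the question.
    // Returns true if a ConcurrentModificationException was thrown.
    public static boolean removeDirectly(List<String> urls) {
        try {
            for (String s : urls) {
                urls.remove(s); // structural change the hidden iterator does not know about
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Removes through the iterator, which keeps its own bookkeeping in sync.
    // Returns the elements in the order they were visited.
    public static List<String> removeViaIterator(List<String> urls) {
        List<String> visited = new ArrayList<>();
        for (Iterator<String> it = urls.iterator(); it.hasNext();) {
            String s = it.next();
            it.remove();
            visited.add(s);
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> first = new LinkedList<>(Arrays.asList("a.html", "b.html", "c.html"));
        System.out.println("direct removal threw: " + removeDirectly(first));

        List<String> second = new LinkedList<>(Arrays.asList("a.html", "b.html", "c.html"));
        System.out.println("visited: " + removeViaIterator(second) + ", left: " + second);
    }
}
```

No second thread is involved: the exception only signals that the list changed while an iterator over it was still in use.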

score 1 · Accepted Answer

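A ConcurrentModificationException is thrown when you try to remove an object directly from a collection while iterating over it.

Here it happens when you try to remove an entry from the list stored in the toVisit HashMap: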

for (String s : toVisit.get(depth)) {
   toVisit.get(depth).remove(s); <----
   ...

Instead of removing from the collection directly, you can use an iterator:

Iterator<String> iterator = toVisit.get(depth).iterator();
while (iterator.hasNext()) {
   String s = iterator.next();
   iterator.remove();
   harvest(new URL(s),depth-1);
}
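
An alternative that neither answer proposes, shown here only as a sketch: treat the pending list as a queue and take entries off the front until it is empty. LinkedList implements Deque, so pollFirst is available on the lists the question already uses. Because no iterator is open while the queue changes, this also tolerates new links being appended to the same queue during processing. The class DrainQueueDemo and the helper discoverLinks are hypothetical stand-ins for the question's harvest method, and the sketch leaves out depth limits and duplicate checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;

public class DrainQueueDemo {

    // Hypothetical stand-in for harvest(): visiting a page may discover more links.
    // Here the page named "root" yields two child links; every other page yields none.
    public static List<String> discoverLinks(String page) {
        if (page.equals("root")) {
            return Arrays.asList("child1", "child2");
        }
        return Collections.emptyList();
    }

    // Processes the queue until it is empty, taking one entry off the front each time.
    // No iterator is open while the queue is modified, so appending is safe.
    public static List<String> drain(Deque<String> pending) {
        List<String> visited = new ArrayList<>();
        String page;
        while ((page = pending.pollFirst()) != null) {
            visited.add(page);
            pending.addAll(discoverLinks(page)); // new links go to the tail
        }
        return visited;
    }

    public static void main(String[] args) {
        Deque<String> pending = new LinkedList<>(Arrays.asList("root", "other"));
        System.out.println(drain(pending)); // [root, other, child1, child2]
    }
}
```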
