apache - Using the new bin/crawl script - skipping URLs with different batch id (null)

Question

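I want to crawl many websites listed in seed.txt using bin/crawl, the new crawl script in Nutch 2.1.

The problem is that every time I run the script, I get the message "Skipping [a specific URL here]; different batch id (null)" and nothing is fetched or parsed (no URLs at all).

Here is the output from the log: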

Start old crawling linked TV:
InjectorJob: starting
InjectorJob: urlDir: /opt/ir/nutch/urls
InjectorJob: finished

Injecting the URLs seems to have worked fine.

Sun Jun 30 19:45:10 CEST 2013 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: topN: 50000
GeneratorJob: done
GeneratorJob: generated batch id: 1372614310-1071860715
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1372614310-24672
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1372614928303
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0

.... the same line repeats here for every thread up to FetcherThread48, then the log continues

Fetcher: throughput threshold: -1
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1372614310-24672
Skipping http://www.brugge.be/internet/en/musea/bruggemuseum/stadhuis/index.htm; different batch id (null)
Skipping http://www.galloromeinsmuseum.be/; different batch id (null)
Skipping http://www.museumdrguislain.be/; different batch id (null)
Skipping http://www.muzee.be/; different batch id (null)
Skipping http://musea.sint-niklaas.be/; different batch id (null)

... and it goes on skipping many more URLs from the seed list ...

ParserJob: success
CrawlDB update
DbUpdaterJob: starting
Limit reached, skipping further inlinks for de.ard.www:http/
Limit reached, skipping further inlinks for de.rbb-online.mediathek:http/
Limit reached, skipping further inlinks for de.rbb-online.www:http/
DbUpdaterJob: done

Do you have any idea where the problem is? I'm completely worn out from configuring this tool and trying to get it to work...

score 0 · Accepted Answer

OK, the solution: I was using an old version of Nutch (2.1). After updating to 2.2.1, the problem went away.
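
One detail is visible in the log from the question: GeneratorJob reports batch id 1372614310-1071860715, while FetcherJob and ParserJob both run with 1372614310-24672. The log alone does not confirm that this is the cause of the skipping, but it is an easy thing to check when the same symptom shows up. Below is a minimal Python sketch that extracts those ids from a log and lists the jobs whose id differs from the generated one. The function names are hypothetical, and the patterns assume log lines shaped exactly like the ones quoted in the question.

```python
import re

# Patterns follow the log line shapes quoted in the question; other Nutch
# versions may format these lines differently (assumption).
GENERATED = re.compile(r"GeneratorJob: generated batch id:\s*(\S+)")
USED = re.compile(r"(FetcherJob|ParserJob): batchId:\s*(\S+)")


def find_batch_ids(log_text):
    """Return (generated_id, {job_name: batch_id}) found in the log text."""
    generated = None
    used = {}
    for line in log_text.splitlines():
        match = GENERATED.search(line)
        if match:
            generated = match.group(1)
            continue
        match = USED.search(line)
        if match:
            used[match.group(1)] = match.group(2)
    return generated, used


def mismatched_jobs(log_text):
    """List jobs whose batchId differs from the id GeneratorJob reported."""
    generated, used = find_batch_ids(log_text)
    return sorted(job for job, batch_id in used.items() if batch_id != generated)


# Three lines copied from the log in the question.
sample = """\
GeneratorJob: generated batch id: 1372614310-1071860715
FetcherJob: batchId: 1372614310-24672
ParserJob: batchId:     1372614310-24672
"""

if __name__ == "__main__":
    print(mismatched_jobs(sample))
```

Run against the three sample lines, this flags both FetcherJob and ParserJob. If the log contains no "generated batch id" line at all, every job is reported as mismatched, so an empty result only means something when the generator line was present.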
