I have been trying to get a crawl running with Nutch for quite a while now, and it does not seem to work. I am building Solr search for a website and am using Nutch to crawl and index into Solr.
There were some permission problems initially, but those are fixed now. The URL I am trying to crawl is http://172.30.162.202:10200/. It is not open to the public: it is an internal URL that is reachable from the Solr server, which I verified by browsing it with Lynx.
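For reference, the check I ran from the Solr server was essentially the following (exact flags reproduced from memory):

# Render the page as plain text from the Solr server to confirm the URL is reachable
lynx -dump http://172.30.162.202:10200/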
Here is the output from the Nutch command:
[abgu01@app01 local]$ ./bin/nutch crawl /home/abgu01/urls/url1.txt -dir /home/abgu01/crawl -depth 5 -topN 100
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /opt/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:136)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)
at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: /home/abgu01/crawl
rootUrlDir = /home/abgu01/urls/url1.txt
threads = 10
depth = 5
solrUrl=null
topN = 100
Injector: starting at 2012-07-27 15:47:00
Injector: crawlDb: /home/abgu01/crawl/crawldb
Injector: urlDir: /home/abgu01/urls/url1.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-07-27 15:47:03, elapsed: 00:00:02
Generator: starting at 2012-07-27 15:47:03
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/abgu01/crawl/segments/20120727154705
Generator: finished at 2012-07-27 15:47:06, elapsed: 00:00:03
Fetcher: starting at 2012-07-27 15:47:06
Fetcher: segment: /home/abgu01/crawl/segments/20120727154705
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://172.30.162.202:10200/
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-07-27 15:47:08, elapsed: 00:00:02
ParseSegment: starting at 2012-07-27 15:47:08
ParseSegment: segment: /home/abgu01/crawl/segments/20120727154705
ParseSegment: finished at 2012-07-27 15:47:09, elapsed: 00:00:01
CrawlDb update: starting at 2012-07-27 15:47:09
CrawlDb update: db: /home/abgu01/crawl/crawldb
CrawlDb update: segments: [/home/abgu01/crawl/segments/20120727154705]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-07-27 15:47:10, elapsed: 00:00:01
Generator: starting at 2012-07-27 15:47:10
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-07-27 15:47:11
LinkDb: linkdb: /home/abgu01/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/abgu01/crawl/segments/20120727154705
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
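For reference, the seed file passed on the command line holds a single URL; I believe the one injected record ("total 1 records" in the output above) corresponds to this line:

$ cat /home/abgu01/urls/url1.txt
http://172.30.162.202:10200/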
Can anyone tell me why the crawl is not running? No matter what values I pass for the depth or topN parameters, it always ends with "Stopping at depth=1 - no more URLs to fetch." Judging from the output above, I suspect the reason is that the Fetcher is unable to retrieve any content from the URL.
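If it helps the diagnosis, this is how I would confirm whether the fetch actually stored any content (a sketch: the crawldb and segment paths are taken from the output above, and the dump directory name is my own):

# Print fetch-status counts from the crawl db (db_fetched vs. db_unfetched, etc.)
./bin/nutch readdb /home/abgu01/crawl/crawldb -stats

# Dump the fetched segment to plain text to inspect what the Fetcher stored
./bin/nutch readseg -dump /home/abgu01/crawl/segments/20120727154705 /home/abgu01/segdump
less /home/abgu01/segdump/dump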
Any input is much appreciated!
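One more thing I noticed: the log4j error at the top occurs because the logs directory does not exist, so hadoop.log (which should hold the Fetcher's error details) is never written. Recreating the directory before re-running should capture those errors (path taken from the exception above):

# Recreate the missing logs directory so log4j can write hadoop.log
mkdir -p /opt/apache-nutch-1.4-bin/runtime/local/logs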