nutch - Crawling short URLs with Nutch

Question

I am using the Nutch crawler for an application that needs to crawl a set of URLs listed in the urls directory and fetch only the contents of those URLs. I am not interested in the contents of internal or external links. So I ran the crawl command with a depth of 1.

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me the contents of the given URLs.

I read the content using the readseg utility.

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

With this I get the content of the web pages.

The problem I am facing is this. If I give direct URLs such as

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I can get the contents of the web pages. But when I give the set of URLs as short URLs such as

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZf

I cannot get the contents.

When I read the segments, no content is shown. Below is the content of the dump file read from the segments.

Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407
Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
Recno:: 1
URL:: http://is.gd/1tpKaN
Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0

I have also tried setting the max.redirects property in nutch-default.xml to 4, but it made no difference. Please suggest a solution to this problem.

Thanks and regards, Arjun Kumar Reddy

score 2 · Accepted Answer

With Nutch 1.2, try editing the file conf/nutch-default.xml: find http.redirect.max and change the value from the default of 0 to at least 1.

<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

Good luck

score 0 · Accepted Answer

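The first fetch returns a 301 (or 302) code, so you need to set the depth to 2 or more. The redirect target is only fetched in the next iteration, which is why a depth of 1 is not enough.

Also make sure that regex-urlfilter.txt allows every URL that has to be followed, including the redirect targets.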
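
As a rough way to check that last point, here is a minimal Python sketch that evaluates a URL against `+regex` / `-regex` rules in order, with the first matching rule deciding. The two rules shown are an assumption modelled on the defaults commonly shipped with Nutch 1.x, and the helper names `parse_rules` and `is_accepted` are made up for this illustration; compare against the filter file your own setup actually loads.

```python
import re

# Assumed rule set, modelled on a stock Nutch 1.x URL filter file.
# Replace it with the contents of your own filter file.
RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def is_accepted(url, rules):
    """The first rule whose pattern is found in the URL decides.

    A URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
for url in (
    "http://is.gd/0yKjO6",
    "http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1",
):
    print(url, "->", "accepted" if is_accepted(url, rules) else "rejected")
```

Under these assumed rules the short URL passes, but the redirect target taken from the Location header in the dump is rejected because it contains `?` and `=`. If your filter file has a similar rule, it would need to be relaxed before the redirected page can be fetched.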
