nutch - How to speed up crawling in Nutch

Question

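I am trying to develop an application that feeds a restricted set of URLs into Nutch's urls file. I can crawl these URLs and get their content by reading the data from the segments.

I crawled with depth 1 because I don't care about the outlinks or inlinks of the web pages at all. I only need the content of the pages listed in the urls file.

However, this crawl takes a long time to run. Please suggest ways to reduce the crawl time and increase the crawl speed. I don't care about the search part, so I don't need indexing either.

Does anyone have suggestions for speeding up the crawl?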

score 7 · Accepted Answer

The main thing for increasing speed is configuring nutch-site.xml:

<property>
  <name>fetcher.threads.per.queue</name>
  <value>50</value>
  <description></description>
</property>

score 6 · Accepted Answer

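You can scale up the threads in nutch-site.xml. Increasing fetcher.threads.per.host and fetcher.threads.fetch will both increase the crawl speed. I saw dramatic improvements. Be careful when raising these, though: if you don't have the hardware or the connection to support the increased traffic, the number of crawl errors can increase significantly.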

score 4 · Accepted Answer

For me, this property helped a lot, because a slow domain can slow down the whole fetch phase:

 <property>
  <name>generate.max.count</name>
  <value>50</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
 </property>

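For example, if you respect robots.txt (the default behaviour) and a domain is very slow to crawl, the delay between requests can be as long as fetcher.max.crawl.delay. If many URLs from that domain are in the queue, the whole fetch phase slows down, so it is better to limit generate.max.count.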
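
To see why one slow domain dominates, here is a rough back-of-the-envelope sketch in Python. It is not Nutch code: the function name `slowest_queue_seconds`, the example hosts, and the 5-second delay are made up for illustration, and it assumes one thread per host, URLs counted per host, and a fixed delay between successive requests to the same host.

```python
from collections import Counter
from urllib.parse import urlsplit


def slowest_queue_seconds(urls, delay_seconds, max_per_host=-1):
    """Rough lower bound on fetch time: the largest per-host queue times the delay.

    max_per_host mimics a per-host cap on the fetch list (-1 means no cap).
    This is an illustrative model, not how Nutch computes anything.
    """
    per_host = Counter(urlsplit(u).hostname for u in urls)
    if max_per_host >= 0:
        per_host = {host: min(n, max_per_host) for host, n in per_host.items()}
    return max(per_host.values(), default=0) * delay_seconds


# Hypothetical fetch list: one slow host with many URLs, one small host.
urls = [f"http://slow.example/page{i}" for i in range(500)]
urls += [f"http://fast.example/page{i}" for i in range(20)]

uncapped = slowest_queue_seconds(urls, delay_seconds=5.0)
capped = slowest_queue_seconds(urls, delay_seconds=5.0, max_per_host=50)
```

Under these assumptions the uncapped fetch list cannot finish in much less than 500 × 5 s, while capping the host at 50 URLs brings the bound down to 50 × 5 s, with the remaining URLs left for later fetch rounds.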

In the same way, to limit the duration of the fetch phase, you can add this property:

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads fewer
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

However, please do not touch the fetcher.threads.per.queue property: you will end up on a blacklist. It is not a good solution for improving crawl speed.

score 2 · Accepted Answer

Hi, I am also new to crawling, but I tried a few approaches and got some good results. I changed nutch-site.xml with these properties:

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
  <description>The number of seconds the fetcher will delay between
    successive requests to the same server. Note that this might get
    overridden by a Crawl-Delay from a robots.txt and is used ONLY if
    fetcher.threads.per.queue is set to 1.
  </description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value>
  <description>The number of FetcherThreads the fetcher should use.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>25</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

Kindly suggest some more options. Thanks.

score 0 · Accepted Answer

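I had a similar problem and was able to improve the speed with the help of https://wiki.apache.org/nutch/OptimizingCrawls

It has useful information about what can make your crawls slow and what you can do to improve each of those issues.

Unfortunately, in my case the queues are very unbalanced and I cannot send requests to the larger queues any faster, otherwise I get blocked. So I will probably need to move to a cluster solution or Tor before I can speed things up further by adding threads.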

score -1 · Accepted Answer

If you don't need to follow links, I don't see any reason to use Nutch. You can take your list of URLs and fetch them with a simple script using an HTTP client library or curl.
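
As a minimal sketch of that idea, assuming Python's standard library is acceptable as the HTTP client: the helper names `fetch` and `fetch_all`, the worker count, and the timeout are illustrative choices, not anything taken from Nutch.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the raw body of one URL, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs.
        return None


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL on a small thread pool and map each URL to its body."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch_one, urls)))


# Example use (commented out so that nothing is fetched on import):
# with open("urls.txt") as f:
#     pages = fetch_all(line.strip() for line in f if line.strip())
```

Note that, unlike Nutch, this sketch does not read robots.txt and does not limit requests per host, so keep `max_workers` low or add your own delay if many URLs point at the same server.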
