indexing - Solr DataImportHandler chunked UrlDataSource

Question

I'm looking into chunking my data source for optimal data import into Solr, and I was wondering whether it is possible to use a master URL that splits the data into sections.

For example, file 1 might contain something like this:

<chunks>
  <chunk url="http://localhost/chunker?start=0&amp;stop=100" />
  <chunk url="http://localhost/chunker?start=100&amp;stop=200" />
  <chunk url="http://localhost/chunker?start=200&amp;stop=300" />
  <chunk url="http://localhost/chunker?start=300&amp;stop=400" />
  <chunk url="http://localhost/chunker?start=400&amp;stop=500" />
  <chunk url="http://localhost/chunker?start=500&amp;stop=600" />
</chunks>

Each chunk URL would then lead to something like this:

<items>
   <item data1="info1" />
   <item data1="info2" />
   <item data1="info3" />
   <item data1="info4" />
</items>

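I'm working with more than 500 million records, so I think the data will need to be chunked to avoid memory issues (I ran into them when using the SQLEntityProcessor). I would also like to avoid making 500+ million web requests, since that could get expensive.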

score 7 · Accepted Answer

Since there are no examples of this on the internet, I thought I would post what I ended up using.

<?xml version="1.0" encoding="utf-8"?>
<result>
  <dataCollection func="chunked">
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <data hasmore="true" nexturl="http://server.domain.com/handler?start=0&amp;end=1000000000&amp;page=1&amp;pagesize=10" />
  </dataCollection>
</result>

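It's important to note that the feed states that there is more data on the next page and provides the URL of that next page. This is consistent with the Solr documentation for the DataImportHandler, which specifies that a paginated feed should tell the system that it has more data and where to fetch the next batch.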
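
For illustration, here is a minimal sketch of what the last page of such a feed might look like. It assumes the handler signals the end by setting hasmore to something other than "true" (as far as I know, DIH only requests another page when $hasMore is "true", so omitting the marker element should have the same effect). The attribute values are placeholders, not real output.

```xml
<?xml version="1.0" encoding="utf-8"?>
<result>
  <dataCollection func="chunked">
    <data info="test" info2="test" />
    <data info="test" info2="test" />
    <!-- Last page: hasmore is not "true" and there is no nexturl,
         so the importer has nowhere further to go and stops paging. -->
    <data hasmore="false" />
  </dataCollection>
</result>
```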

<dataConfig>
    <dataSource name="b" type="URLDataSource" baseUrl="http://server/" encoding="UTF-8" />
    <document>
        <entity name="continue"
                dataSource="b"
                url="handler?start=${dataimport.request.startrecord}&amp;end=${dataimport.request.stoprecord}&amp;pagesize=100000"
                stream="true"
                processor="XPathEntityProcessor"
                forEach="/result/dataCollection/data"
                transformer="DateFormatTransformer"
                connectionTimeout="120000"
                readTimeout="300000"
                >
            <field column="id"  xpath="/result/dataCollection/data/@info" />
            <field column="$hasMore" xpath="/result/dataCollection/data/@hasmore" />
            <field column="$nextUrl" xpath="/result/dataCollection/data/@nexturl" />
        </entity>
    </document>
</dataConfig>

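Notice the $hasMore and $nextUrl fields. You will probably want to set timeouts. I also recommend making the page size configurable (it helps when tweaking settings to get the best processing speed). I'm indexing about 12.5K records per second using a multicore (3 cores) Solr instance on a single server with a quad-core Xeon processor and 32GB of RAM.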

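The app that paginates the results runs on the same system as the SQL server that stores the data. I also pass in the start and stop positions to minimize configuration changes when I eventually load-balance the Solr server.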
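
The ${dataimport.request.startrecord} and ${dataimport.request.stoprecord} placeholders in the entity URL are resolved from parameters on the import request itself. As a rough sketch, assuming the config above is saved as data-config.xml (a hypothetical file name) and the handler is registered under /dataimport in solrconfig.xml, the registration could look like this:

```xml
<!-- solrconfig.xml: registers the DataImportHandler.
     The handler name and config file name are assumptions for this sketch. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Points at the dataConfig file containing the paged entity -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

An import for one range would then be started with a request along the lines of /dataimport?command=full-import&startrecord=0&stoprecord=1000000, where the host, core, and range values depend on your setup.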

score 1 · Accepted Answer

Entities can be nested to do what was originally asked. The inner entity can refer to a field of the outer entity like this: url="${chunk.link}", where chunk is the outer entity's name and link is the field name.

<?xml version="1.0" encoding="windows-1250"?>
<dataConfig>
  <dataSource name="b" type="URLDataSource" baseUrl="http://server/" encoding="UTF-8" />
  <document>
    <entity name="chunk"
      dataSource="b"
      url="path/to/chunk.xml"
      stream="true"
      processor="XPathEntityProcessor"
      forEach="/chunks/chunk"
      transformer="DateFormatTransformer"
      connectionTimeout="120000"
      readTimeout="300000" >
      <field column="link" xpath="/chunks/chunk/@url" />
      <entity name="item"
        dataSource="b"
        url="${chunk.link}"
        stream="true"
        processor="XPathEntityProcessor"
        forEach="/items/item"
        transformer="DateFormatTransformer"
        connectionTimeout="120000"
        readTimeout="300000" >
        <field column="info"  xpath="/items/item/@info" />
      </entity>
    </entity>
  </document>
</dataConfig>
