xml - Feeding arbitrary XML into Solr

Question

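I have a question about Apache Solr. Given an arbitrary XML file and the XSD it conforms to, how do I feed it into Solr? Can I get a code sample? I know I have to parse the XML and put the relevant data into a Solr input document, but I don't know how to do that.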

score 7 · Accepted Answer

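DataImportHandler (DIH) lets you pass the incoming XML through an XSL stylesheet, and also parse and transform the XML with DIH transformers. You can convert the arbitrary XML into Solr's standard input XML format via XSL, map/transform the arbitrary XML onto Solr schema fields right in the DIH config file, or combine the two. DIH is flexible.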
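For orientation, Solr's standard update XML format, which is the target of the XSL conversion, looks roughly like the sketch below. The field names and values are made up for illustration; real field names have to exist in your own schema.

```xml
<!-- Minimal sketch of a standard Solr update message.
     Field names and values are hypothetical examples. -->
<add>
    <doc>
        <field name="id">example-001</field>
        <field name="title_t">An example title</field>
        <!-- A pipe-delimited value like this could later be split by a
             DIH RegexTransformer field with splitBy="\|" -->
        <field name="creator_t">First Author|Second Author</field>
    </doc>
</add>
```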

Sample dih-config.xml

This is a sample dih-config.xml from a real working site (no pseudo-samples here). Note that it picks up XML files from a local directory on a LAMP server. If you prefer to post XML files directly over HTTP, you need to configure a ContentStreamDataSource instead.
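A minimal sketch of what the ContentStreamDataSource variant might look like is shown here. It is not taken from the working site; the data source name "stream" is an arbitrary choice, and the sketch assumes the posted XML is already in standard Solr update format.

```xml
<dataConfig>
    <!-- ContentStreamDataSource reads the XML from the body of the HTTP POST
         sent to the DIH request handler. The name "stream" is arbitrary. -->
    <dataSource type="ContentStreamDataSource" name="stream" />
    <document>
        <!-- No url attribute: the content comes from the request itself.
             useSolrAddSchema assumes standard Solr update XML is posted. -->
        <entity
            name="xml"
            processor="XPathEntityProcessor"
            dataSource="stream"
            stream="true"
            useSolrAddSchema="true"
        >
        </entity>
    </document>
</dataConfig>
```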

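In this sample, the incoming XML is already in standard Solr update XML format, and all the XSL does is remove empty field nodes. The real transforms, such as building the content of "ispartof_t" from "ignored_seriestitle", "ignored_seriesvolume", and "ignored_seriesissue", are done with DIH Regex transformers. (The XSLT is performed first, and its output is then given to the DIH transformers.) The attribute "useSolrAddSchema" tells DIH that the XML is already in standard Solr XML format. If that were not the case, another attribute, "xpath", on the XPathEntityProcessor fields would be required to select content from the incoming XML document.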
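The xslt/dih.xsl file itself is not shown in the answer. A stylesheet that only removes empty field nodes could be sketched as follows; this is a plausible reconstruction using a standard identity transform, not the author's actual file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: copies the Solr update XML through unchanged,
     except that empty <field> elements are dropped. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" indent="yes" />

    <!-- Identity template: copy every node and attribute as-is -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <!-- Empty template: <field> elements whose content is empty or
         whitespace-only produce no output -->
    <xsl:template match="field[not(normalize-space())]" />
</xsl:stylesheet>
```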

<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" />
    <document>
        <!--
            Pickupdir fetches all files matching the filename regex in the supplied directory
            and passes them to other entities which parse the file contents. 
        -->
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/"
            recursive="true"
            newerThan="${dataimporter.last_index_time}"
        >

        <!--
            Pickupxmlfile parses standard Solr update XML.
            Incoming values are split into multiple tokens when given a splitBy attribute.
            Dates are transformed into valid Solr dates when given a dateTimeFormat to parse.
        -->
        <entity 
            name="xml"
            processor="XPathEntityProcessor"
            transformer="RegexTransformer,TemplateTransformer"
            datasource="pickupdir"
            stream="true"
            useSolrAddSchema="true"
            url="${pickupdir.fileAbsolutePath}"
            xsl="xslt/dih.xsl"
        >

            <field column="abstract_t" splitBy="\|" />
            <field column="coverage_t" splitBy="\|" />
            <field column="creator_t" splitBy="\|" />
            <field column="creator_facet" template="${xml.creator_t}" />
            <field column="description_t" splitBy="\|" />
            <field column="format_t" splitBy="\|" />
            <field column="identifier_t" splitBy="\|" />
            <field column="ispartof_t" sourceColName="ignored_seriestitle" regex="(.+)" replaceWith="$1" />
            <field column="ispartof_t" sourceColName="ignored_seriesvolume" regex="(.+)" replaceWith="${xml.ispartof_t}; vol. $1" />
            <field column="ispartof_t" sourceColName="ignored_seriesissue" regex="(.+)" replaceWith="${xml.ispartof_t}; no. $1" />
            <field column="ispartof_t" regex="\|" replaceWith=" " />
            <field column="language_t" splitBy="\|" />
            <field column="language_facet" template="${xml.language_t}" />
            <field column="location_display" sourceColName="ignored_class" regex="(.+)" replaceWith="$1" />
            <field column="location_display" sourceColName="ignored_location" regex="(.+)" replaceWith="${xml.location_display} $1" />
            <field column="location_display" regex="\|" replaceWith=" " />
            <field column="othertitles_display" splitBy="\|" />
            <field column="publisher_t" splitBy="\|" />
            <field column="responsibility_display" splitBy="\|" />
            <field column="source_t" splitBy="\|" />
            <field column="sourceissue_display" sourceColName="ignored_volume" regex="(.+)" replaceWith="vol. $1" />
            <field column="sourceissue_display" sourceColName="ignored_issue" regex="(.+)" replaceWith="${xml.sourceissue_display}, no. $1" />
            <field column="sourceissue_display" sourceColName="ignored_year" regex="(.+)" replaceWith="${xml.sourceissue_display} ($1)" />
            <field column="src_facet" template="${xml.src}" />
            <field column="subject_t" splitBy="\|" />
            <field column="subject_facet" template="${xml.subject_t}" />
            <field column="title_t" sourceColName="ignored_title" regex="(.+)" replaceWith="$1" />
            <field column="title_t" sourceColName="ignored_subtitle" regex="(.+)" replaceWith="${xml.title_t} : $1" />
            <field column="title_sort" template="${xml.title_t}" />
            <field column="toc_t" splitBy="\|" />
            <field column="type_t" splitBy="\|" />
            <field column="type_facet" template="${xml.type_t}" />
        </entity>
        </entity>
    </document>
</dataConfig>
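When the incoming XML is not in Solr update format, the inner entity drops useSolrAddSchema and selects content with forEach and per-field xpath attributes instead. The sketch below assumes a hypothetical source structure of records/record with an id attribute, a title element, and repeated creator elements; none of these names come from the original site.

```xml
<dataConfig>
    <dataSource encoding="UTF-8" type="FileDataSource" />
    <document>
        <entity
            name="pickupdir"
            processor="FileListEntityProcessor"
            rootEntity="false"
            dataSource="null"
            fileName="^[\w\d-]+\.xml$"
            baseDir="/var/lib/tomcat6/solr/cci/import/"
        >
            <!-- forEach names the element that becomes one Solr document.
                 The /records/record structure is a hypothetical example. -->
            <entity
                name="xml"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/records/record"
                url="${pickupdir.fileAbsolutePath}"
            >
                <!-- Each field maps an XPath in the source to a schema field -->
                <field column="id" xpath="/records/record/@id" />
                <field column="title_t" xpath="/records/record/title" />
                <field column="creator_t" xpath="/records/record/creators/creator" />
            </entity>
        </entity>
    </document>
</dataConfig>
```

Note that XPathEntityProcessor supports only a limited subset of XPath, so more complex selections are usually better handled in the XSL step.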

To set up DIH:

Make sure the DIH jars are referenced from solrconfig.xml, since they are not included in the Solr WAR file by default. One easy way is to create a lib folder in the Solr instance directory and put the DIH jars in it, because solrconfig.xml looks in the lib folder for references by default. You will find the DIH jars in the apache-solr-xxx/dist folder when you download the Solr package.

(Image: the dist folder, showing the location of the Solr DIH jars)

Create your dih-config.xml (as above) in the Solr "conf" directory.
Add a DIH request handler to solrconfig.xml if it is not there already.

Request handler:

<requestHandler name="/update/dih" startup="lazy" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">dih-config.xml</str>
</lst>
</requestHandler>
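As an alternative to copying the jars into a lib folder, solrconfig.xml can reference them in place with a lib directive. The sketch below assumes the directory layout of the Solr example distribution; both the relative path and the jar name pattern depend on your Solr version and install location, so treat them as placeholders to adjust.

```xml
<!-- Goes inside the <config> element of solrconfig.xml.
     The dir path is resolved relative to the Solr instance directory.
     Path and regex are assumptions based on the example distribution. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
```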

To trigger DIH:

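The wiki description of the Data Import Handler commands has more information about full-import vs. delta-import and about whether to commit, optimize, and so on. The following request triggers the DIH operation without deleting the existing index first, and commits the changes after all the files have been processed. The sample given above collects all the files found in the pickup directory, transforms them, indexes them, and finally commits the updates to the index (making them searchable the instant the commit finishes).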

http://localhost:8983/solr/update/dih?command=full-import&clean=false&commit=true

score 1 · Accepted Answer

The simplest way is to use the DataImportHandler. It lets you apply an XSL first to transform your XML into Solr input XML.
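Such an XSL could be sketched as follows. The source structure (records/record with id and title children) and the target field names are hypothetical; they stand in for whatever your XSD actually defines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: converts a hypothetical <records><record>...</record></records>
     document into Solr update XML (<add><doc>...</doc></add>). -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" indent="yes" />

    <!-- The root element becomes a single <add> wrapper -->
    <xsl:template match="/records">
        <add>
            <xsl:apply-templates select="record" />
        </add>
    </xsl:template>

    <!-- Each record becomes one Solr <doc> -->
    <xsl:template match="record">
        <doc>
            <field name="id"><xsl:value-of select="id" /></field>
            <!-- Emit one field per non-empty title element -->
            <xsl:for-each select="title[normalize-space()]">
                <field name="title_t"><xsl:value-of select="." /></field>
            </xsl:for-each>
        </doc>
    </xsl:template>
</xsl:stylesheet>
```

With the output in this shape, the DIH entity can set useSolrAddSchema="true" and needs no per-field xpath attributes.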
