solr - Error extracting PDF metadata using Solr

Question

I am using Solr 3.3 and trying to extract and index metadata from PDF files. I use the DataImportHandler with the TikaEntityProcessor to add the documents. These are the fields defined in my schema.xml file:

<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
<field name="imgName" type="string" indexed="false" stored="true" multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>

So I would expect the metadata to be indexed and stored in fields prefixed with "attr_".

Here is what my data-config file looks like. It takes a source directory path from a database and passes it to a FileListEntityProcessor, which hands each PDF file found in that directory to the TikaEntityProcessor to extract and index the content.

<entity onError="skip" name="fileSourcePaths" rootEntity="false" dataSource="dbSource" fileName=".*pdf" query="select path from file_sources">
      <entity name="fileSource" processor="FileListEntityProcessor" transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}" recursive="true" rootEntity="false">
        <field name="link" column="fileAbsolutePath" thumbnail="true"/>
        <field name="imgName" column="imgName"/>
        <entity rootEntity="true" onError="abort" name="file" processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}" dataSource="fileSource" format="text">
          <field column="resourceName" name="title" meta="true"/>
          <field column="Creation-Date" name="date_published" meta="true"/>
          <field column="text" name="description"/>
        </entity>
      </entity>

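The description and Creation-Date are extracted without problems, but resourceName does not seem to be extracted, so when I query the index the documents have no title field. This is strange, because both Creation-Date and resourceName are metadata. Also, none of the other available metadata was stored under attr_ fields. I came across a thread saying there are known problems with Tika 0.8, so I downloaded Tika 0.9 and replaced 0.8 with it. I also downloaded pdfbox, jempbox, and fontbox 1.4 and replaced the 1.3 versions.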

To see what metadata is stored in the file, I tested one of the PDFs separately with Tika alone. This is what I found:

Content-Length: 546459 
Content-Type: application/pdf 
Creation-Date: 2010-06-09T12:11:12Z 
Last-Modified: 2010-06-09T14:53:38Z 
created: Wed Jun 09 08:11:12 EDT 2010 
creator: XSL Formatter V4.3 MR9a (4,3,2009,1022) for Windows 
producer: Antenna House PDF Output Library 2.6.0 (Windows) 
resourceName: Argentina.pdf 
trapped: False 
xmpTPg:NPages: 2

As you can see, the resourceName metadata is there. I tried indexing again and got the same result: Creation-Date is extracted and indexed fine, but resourceName is not. The remaining attributes are also not indexed under attr_ fields.

What could be going wrong?
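
One possible direction, offered as an untested sketch rather than a confirmed diagnosis. It rests on two assumptions about Solr 3.x that are not established anywhere in this question: (1) TikaEntityProcessor only returns metadata keys that are named by a field with meta="true" and does no automatic attr_ prefixing (that behavior appears to belong to the uprefix option of ExtractingRequestHandler), and (2) resourceName is set by the Tika command-line tool from the file name rather than read from the PDF itself, so it may be absent when DataImportHandler hands Tika a bare stream. Under those assumptions, the title could come from the file column that FileListEntityProcessor emits, and each wanted metadata key would be mapped explicitly. The names attr_creator and attr_producer are illustrative only. The two inner entities of the config above might then look like this:

      <entity name="fileSource" processor="FileListEntityProcessor" transformer="ThumbnailTransformer" baseDir="${fileSourcePaths.path}" recursive="true" rootEntity="false">
        <field name="link" column="fileAbsolutePath" thumbnail="true"/>
        <field name="imgName" column="imgName"/>
        <!-- Assumption: use the file name column from FileListEntityProcessor as the title -->
        <field name="title" column="file"/>
        <entity rootEntity="true" onError="abort" name="file" processor="TikaEntityProcessor" url="${fileSource.fileAbsolutePath}" dataSource="fileSource" format="text">
          <field column="Creation-Date" name="date_published" meta="true"/>
          <field column="text" name="description"/>
          <!-- Assumption: every extra metadata key needs its own meta="true" mapping; the attr_ names are hypothetical and rely on the attr_* dynamicField -->
          <field column="creator" name="attr_creator" meta="true"/>
          <field column="producer" name="attr_producer" meta="true"/>
        </entity>
      </entity>

If the assumptions hold, title would carry the file name (Argentina.pdf for the file tested above), and other keys from the Tika output could be mapped the same way.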
