solr - Indexing files in a folder

Question

How can I index all the doc files in a particular folder? Say I have a folder mydocuments that contains doc and docx files. I need to index every file in that folder so that it can be searched efficiently. What would you suggest for indexing a folder of doc files? Note: I looked at Sphinx, but it seems to index only xml and mssql.

score 1 · Accepted Answer

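I think your problem is indexing a list of text files in a particular folder, so here is sample code that indexes them. If you want to index Word documents, however, you will need to change the getDocument method so that it parses the Word file and populates the Lucene document from it.

The key points are:

Create an IndexWriter.
Get the list of files in the folder with the dir.listFiles() method.
Iterate over the files and create a Lucene document for each one, one at a time.
Add the Lucene document to the index.
When you have finished adding documents, commit the changes and close the IndexWriter.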
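
The folder-listing step can be narrowed to just the Word files from the question. Below is a minimal sketch using only the JDK; the class name DocFileLister and the method listDocFiles are hypothetical names chosen for illustration, and it looks only at the top level of the folder (no recursion into subfolders).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical helper: not part of Lucene or of the answer's sample code.
public class DocFileLister {

    /** Returns the regular files directly inside dir whose names end in .doc or .docx. */
    public static List<Path> listDocFiles(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        // Files.list throws an exception for a missing or unreadable folder,
        // whereas File.listFiles() returns null in that case.
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile)
                   .filter(DocFileLister::isWordFile)
                   .forEach(result::add);
        }
        Collections.sort(result);
        return result;
    }

    // Case-insensitive check on the file extension only; the content is not inspected.
    private static boolean isWordFile(Path p) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".doc") || name.endsWith(".docx");
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the current directory when no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listDocFiles(dir)) {
            System.out.println(p);
        }
    }
}
```

Each returned path would then be handed to whatever parsing step builds the Lucene document for that file.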

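If you need to parse and read Word documents or PDF files, you will need the Apache POI and PDFBox libraries.

Note that the RAMDirectory class is used here for demonstration only: it keeps the index in memory, so the index is lost when the program exits. For a real index you should use FSDirectory instead.

I hope that solves your problem.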

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;


public class IndexFolders {

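    public static void main(String[] args) throws FileNotFoundException, IOException{
        String path = args[0];
        File dir = new File(path);

        // listFiles() returns null if the path is not a readable directory.
        File[] files = dir.listFiles();
        if (files == null) {
            throw new FileNotFoundException("Not a readable directory: " + path);
        }

        // In-memory index, for demonstration only.
        Directory indexDir = new RAMDirectory();
        Version version = Version.LUCENE_40;
        Analyzer analyzer = new StandardAnalyzer(version);
        IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
        IndexWriter indexWriter = new IndexWriter(indexDir, config);

        for (File file : files){
            // Skip subdirectories; only regular files can be read as text.
            if (file.isFile()) {
                indexWriter.addDocument(getDocument(file));
            }
        }

        // Make the added documents durable, then release the writer.
        indexWriter.commit();
        indexWriter.close();
    }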


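    public static Document getDocument(File file) throws FileNotFoundException
    {
        StringBuilder builder = new StringBuilder();

        // Read the file line by line. The line break is re-appended so that the
        // last word of one line is not glued to the first word of the next.
        Scanner input = new Scanner(file);
        try {
            while (input.hasNextLine()) {
                builder.append(input.nextLine()).append('\n');
            }
        } finally {
            input.close();
        }

        // One Lucene document per file, with the whole text in a single stored field.
        Document document = new Document();
        document.add(new Field("text", builder.toString(),org.apache.lucene.document.TextField.TYPE_STORED));
        return document;
    }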


}

score 1 · Accepted Answer

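My answer applies to Lucene.

Lucene does not "directly" provide an API for indexing the contents of a file or a folder. What we have to do is:

Parse the file. You can use Apache Tika, which supports parsing a wide variety of file types.
Populate a Lucene Document object with that information.
Pass that document to IndexWriter.addDocument().
Repeat the steps above for each file, that is, for each distinct entry in the index.

The problem with direct indexing, even if it were available, is that you would lose flexibility in creating fields and in choosing which content goes into each field for a given document.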
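
To make the point about field flexibility concrete, here is a rough sketch of choosing the values to index for a single file. It uses only the JDK and a plain Map as a stand-in for a Lucene Document; the class name FileFields, the method extractFields, and the field names ("path", "filename", "modified", "contents") are all assumptions made for illustration, not names that Lucene or Tika define.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a Map stands in for a Lucene Document here.
public class FileFields {

    /**
     * Collects the values we choose to index for one file.
     * Which fields exist, and what goes into each, is entirely our decision.
     */
    public static Map<String, String> extractFields(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.toAbsolutePath().toString());
        fields.put("filename", file.getFileName().toString());
        fields.put("modified",
                String.valueOf(Files.getLastModifiedTime(file).toMillis()));
        // Plain-text read only (assumes UTF-8); a .doc or .docx file would
        // need a real parser such as Tika or POI at this point.
        fields.put("contents",
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        return fields;
    }
}
```

In a real indexer each map entry would become one field on the Lucene Document that is passed to IndexWriter.addDocument(), and you could decide per field whether it should be analyzed, stored, or both.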

Here is a good tutorial where you can find sample code: Lucene in 5 minutes
