java - Lucene indexing - many documents/phrases

Question

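Which approach should I use to index the following set of files?

Each file contains about 500K lines of characters (400MB). The characters are not words; for the sake of the question, let's say they are random characters without whitespace.

I need to be able to find each line that contains a given 12-character string, for example: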

Line: AXXXXXXXXXXXXJJJJKJIDJUD.... up to 200 characters

Interesting part: XXXXXXXXXXXX

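When searching, I am only interested in characters 1 to 13 (end exclusive, so XXXXXXXXXXXX). After the search, I would like to be able to read the line containing XXXXXXXXXXXX without looping through the file.

I wrote the following PoC (simplified for the sake of the question):

Indexing: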

while ( (line = br.readLine()) != null ) {
    doc = new Document();
    Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
    doc.add(fileNameField);
    Field characterOffset = new IntField(CHARACTER_OFFSET, charsRead, Field.Store.YES);
    doc.add(characterOffset);
    String id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        writer.addDocument(doc);
    } catch ( IndexOutOfBoundsException ior ) {
        // cut off for the sake of the question
    } finally {
        // simplified snippet for the sake of the question. characterOffset is the
        // number of chars to skip while reading a file (ultimately bytes read)
        charsRead += line.length() + 2;
    }
}
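
The `+ 2` in the loop above assumes that every line ends in `\r\n` and that each character occupies one byte. As a minimal sketch of an alternative, assuming the files use a single-byte encoding such as ASCII and that lines end in `\n` or `\r\n`, the byte offset of each line start can be recorded by scanning the raw bytes. `LineOffsets` and `lineStarts` are hypothetical names that are not part of the PoC.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    /** Returns the byte offset at which each line of the file starts. */
    static List<Long> lineStarts(Path file) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;               // offset of the byte about to be read
            boolean atLineStart = true; // true right after a '\n' and at the start of the file
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    starts.add(pos);
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') {
                    // Covers both "\n" and "\r\n"; a lone "\r" is not treated as a terminator.
                    atLineStart = true;
                }
            }
        }
        return starts;
    }
}
```

The offsets produced this way could be stored in place of `charsRead`. A trailing newline at the end of the file does not produce an extra empty line.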

Searching:

RegexpQuery q = new RegexpQuery(new Term(CONTENTS, id), RegExp.NONE); // because id can be a regexp over the 12-character string

TopDocs results = searcher.search(q, Integer.MAX_VALUE);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
Map<String, Set<Integer>> fileToOffsets = new HashMap<String, Set<Integer>>();

for ( int i = 0; i < numTotalHits; i++ ) {
    Document doc = searcher.doc(hits[i].doc);
    String fileName = doc.get(FILE_NAME);
    if ( fileName != null ) {
        String foundIds = doc.get(CONTENTS);
        Set<Integer> offsets = fileToOffsets.get(fileName);
        if ( offsets == null ) {
            offsets = new HashSet<Integer>();
            fileToOffsets.put(fileName, offsets);
        }
        String offset = doc.get(CHARACTER_OFFSET);
        offsets.add(Integer.parseInt(offset));
    }
}
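
Once the offsets for a file have been collected as above, the matching lines can be read by seeking straight to them, which avoids looping through the file. The following is a sketch only, assuming each stored value is the byte offset of a line start and that the file uses a single-byte encoding; `OffsetReader` and `readLinesAt` are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class OffsetReader {
    /**
     * Reads the line starting at each byte offset, in ascending offset order.
     * The file is opened once and every line is reached with a seek.
     */
    static List<String> readLinesAt(String fileName, Collection<Integer> offsets) throws IOException {
        List<Integer> sorted = new ArrayList<>(offsets);
        Collections.sort(sorted); // ascending order keeps disk access mostly sequential

        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            for (int offset : sorted) {
                raf.seek(offset);
                // readLine() maps each byte to one char, so it is only correct
                // for single-byte encodings such as ASCII or ISO-8859-1.
                lines.add(raf.readLine());
            }
        }
        return lines;
    }
}
```

With the `fileToOffsets` map from the search code, this would be called once per entry, for example `OffsetReader.readLinesAt(entry.getKey(), entry.getValue())`.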

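The problem with this approach is that it creates one document per line.

Could you give me some hints on how to approach this problem with Lucene, and on whether Lucene is the right tool here at all?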

score 0 · Accepted Answer

Instead of adding a new document on each iteration, use the same document and keep adding fields with the same name to it, like this:

Document doc = new Document();
Field fileNameField = new StringField(FILE_NAME, file.getName(), Field.Store.YES);
doc.add(fileNameField);
String id;
while ( (line = br.readLine()) != null ) {
    id = "";
    try {
        id = line.substring(1, 13);
        doc.add(new TextField(CONTENTS, id, Field.Store.YES));
        //What is this (characteroffset) field for?
        Field characterOffset = new IntField(CHARACTER_OFFSET, bytesRead, Field.Store.YES);
        doc.add(characterOffset);
    } catch ( IndexOutOfBoundsException ior ) {
        //cut off
    } finally {
        if ( "".equals(line) ) {
            bytesRead += 1;
        } else {
            bytesRead += line.length() + 2;
        }
    }
}
writer.addDocument(doc);

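This adds the id of each line as a new term in the same field. The same query should continue to work.

I'm not really sure how you are making use of the CharacterOffset field, though. Each value will, as with the ids, be appended to the end of the field as another term. It won't be directly associated with a particular id, other than being, presumably, the same number of tokens into the field. If you need to retrieve a specific line rather than the contents of the whole file, your current approach of indexing line by line may be the most reasonable one.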
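
To make that caveat concrete: with one document per file, a hit only says that some id in the file matched. As far as I know, the stored values of a multi-valued field come back in the order they were added (for example through `Document.getValues`), so ids and offsets could only be paired by array index, and the ids would have to be re-checked on the client. The sketch below models this with plain arrays rather than Lucene calls. `PairByIndex` and `offsetsOfMatches` are hypothetical names, and `java.util.regex` syntax is not identical to Lucene's `RegExp` syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PairByIndex {
    /**
     * ids[i] and offsets[i] are assumed to describe the same line.
     * Returns the offsets of all ids that fully match the regex.
     */
    static List<Integer> offsetsOfMatches(String[] ids, String[] offsets, String regex) {
        if (ids.length != offsets.length) {
            // The pairing is purely positional, so a length mismatch makes it meaningless.
            throw new IllegalArgumentException("ids and offsets are not aligned");
        }
        Pattern pattern = Pattern.compile(regex);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            if (pattern.matcher(ids[i]).matches()) {
                result.add(Integer.parseInt(offsets[i]));
            }
        }
        return result;
    }
}
```

This re-scans every id of a matching file on the client, which is the extra work that the one-document-per-line layout avoids.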
