java - Searching sentences with Lucene and getting the matching terms

Question

I have an application that needs to index several gigabytes of sentences (about 16 million lines).

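Currently, my search works as follows.

My search terms usually revolve around a phrase, for example "running in the park". I want to be able to find sentences that are similar to it or that contain parts of the phrase. I do this by constructing smaller phrases:

"running in", "in the park", and so on.

Each one is given a weight (the longer the phrase, the larger the weight).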
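
For illustration, the sub-phrase construction described above might look roughly like the sketch below. It is not the asker's actual code: the class name `SubPhraseBuilder`, the method `weightedSubPhrases`, the `minWords` cutoff, and the choice of "weight = number of words" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits a phrase into contiguous word n-grams weighted by length. */
class SubPhraseBuilder {

    /**
     * Returns every contiguous sub-phrase with at least minWords words,
     * mapped to a weight equal to its word count (an assumed weighting scheme).
     */
    static Map<String, Integer> weightedSubPhrases(String phrase, int minWords) {
        Map<String, Integer> result = new LinkedHashMap<>();
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return result; // nothing to split
        }
        String[] words = trimmed.split("\\s+");
        // Longest sub-phrases first, so the heaviest entries come first in iteration order.
        for (int len = words.length; len >= minWords; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                String sub = String.join(" ", Arrays.copyOfRange(words, start, start + len));
                result.put(sub, len);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each sub-phrase of "running in the park" with its weight.
        weightedSubPhrases("running in the park", 2)
                .forEach((sub, weight) -> System.out.println(weight + "  \"" + sub + "\""));
    }
}
```

Each entry could then be fed to the query parser as a boosted phrase, for example `"in the park"^3`, which is the standard Lucene query syntax for a phrase with a boost.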

At the moment I treat each line as one document. A typical search takes a few seconds, and I am wondering whether there is a way to speed it up.

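On top of that, I also need to obtain what was matched.

For example, "I was jogging in the park this morning" matches "in the park", and I would like to know how it matched. I know about Lucene's Explainer, but is there a simpler way, or a resource where I can learn how to extract the information I need from the Explainer?

Right now I use a regular expression to get the matching terms. It is fast but inaccurate: Lucene sometimes ignores punctuation and other characters, and I cannot handle all the special cases.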

score 3 · Accepted Answer

Highlighter is a better fit than Explainer, and it is faster. After highlighting, you can extract the matched phrases from between the tags; see "Java regex to extract text between tags".

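// Written against the Lucene 3.6 API (see Version.LUCENE_36) and JUnit 4.
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;

import static org.junit.Assert.assertTrue;

public class HighlightDemo {
    Directory directory;
    Analyzer analyzer;
    String[] contents = {"running in the park",
            "I was jogging in the park this morning",
            "running on the road",
            "The famous New York Marathon has its final miles in Central park every year and it's easy to understand why: the park, with a variety of terrain and excellent scenery, is the ultimate runner's dream. With its many paths that range in level of difficulty, Central Park allows a runner to experience clarity and freedom in this picturesque urban oasis."};

    @Before
    public void setUp() throws IOException {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // Index one document per sentence.
        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++) {
            Document doc = new Document();
            // Indexed but not stored: the text is looked up in contents[] via the id below.
            doc.add(new Field("content", contents[i], Field.Store.NO, Field.Index.ANALYZED));
            // Stored and indexed: the position of the sentence in contents[].
            doc.add(new NumericField("id", Field.Store.YES, true).setIntValue(i));
            writer.addDocument(doc);
        }
        writer.close();
    }

    @Test
    public void test() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexSearcher s = new IndexSearcher(directory);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
        Query query = parser.parse("park");

        TopDocs hits = s.search(query, 10);
        // SimpleHTMLFormatter wraps each matched term in <B>...</B> by default.
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            // Recover the original sentence through the stored id field.
            Document doc = s.doc(hits.scoreDocs[i].doc);
            String text = contents[Integer.parseInt(doc.get("id"))];

            // Re-analyze the text so the highlighter can locate the matching tokens.
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    assertTrue(frag[j].toString().contains("<B>"));
                    assertTrue(frag[j].toString().contains("</B>"));

                    System.out.println(frag[j].toString());
                }
            }
        }
        s.close();
    }
}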
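
The extraction step mentioned above (pulling the matched text out from between the tags) could be sketched with a plain regular expression. This is a minimal sketch, not part of Lucene: the class `HighlightTagExtractor` and the method `extractMatches` are hypothetical names, and it assumes the default `<B>`/`</B>` tags that the test above asserts on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects the highlighted terms from a highlighted fragment. */
class HighlightTagExtractor {

    // Assumes <B>...</B> tags; the reluctant group captures one tagged span at a time.
    private static final Pattern HIGHLIGHT = Pattern.compile("<B>(.*?)</B>");

    /** Returns the text of every <B>...</B> span, in order of appearance. */
    static List<String> extractMatches(String fragment) {
        List<String> matches = new ArrayList<>();
        Matcher m = HIGHLIGHT.matcher(fragment);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Prints the list of highlighted terms found in the fragment.
        System.out.println(extractMatches("I was jogging in the <B>park</B> this morning"));
    }
}
```

Because the matched terms come from Lucene's own analysis rather than from a regex over the raw sentence, punctuation handling stays consistent with the index. Note that the words of a phrase are likely to be tagged individually, so consecutive matches may need to be joined to recover a phrase such as "in the park".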

score 2 · Accepted Answer

You can use Highlighter, one of Lucene's "contrib" modules, to extract what Lucene matched.
