lucene - Lucene SpanNearQuery partial matching

Question

Given a document {'foo', 'bar', 'baz'}, I want to match it using a SpanNearQuery with the tokens {'baz', 'extra'}.

But this fails.

How do I get around this?

Sample test (using lucene 2.9.1) with the following results:

givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}

score 6 · Accepted Answer

SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters). You could use the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

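(source: lucidimagination.com)

In this sample text, Lucene is within 3 of Doug.

But for your example, the only match I can see is that both your query and the target document contain "cd" (I am assuming that all of those terms are in a single field). In that case, you don't need any special query type. With the standard mechanisms, you will get a non-zero weighting based on the fact that both contain the same term in the same field.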

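Edit 3 - in response to the latest comment: SpanNearQuery cannot be used for anything other than its intended purpose, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case or expected results are (feel free to post them), but in the last case, if you only want to know whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.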
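To make the difference concrete, here is a minimal plain-Java sketch of the two matching rules. It is a simplified model over a list of token positions, not Lucene's implementation; the class and method names (SpanNearSketch, nearMatch, anyTermMatch) are hypothetical, and it assumes the query terms are distinct. It shows that a near match needs every term to be present, while an "any term" match (what a BooleanQuery with SHOULD clauses gives you) does not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this does not use or reproduce Lucene's classes.
class SpanNearSketch {

    // Simplified unordered "near" rule: ALL query terms must occur, and some
    // window containing all of them may hold at most 'slop' other positions.
    static boolean nearMatch(List<String> docTokens, List<String> queryTerms, int slop) {
        Set<String> wanted = new HashSet<>(queryTerms);
        for (int start = 0; start < docTokens.size(); start++) {
            Set<String> seen = new HashSet<>();
            for (int end = start; end < docTokens.size(); end++) {
                String token = docTokens.get(end);
                if (wanted.contains(token)) {
                    seen.add(token);
                }
                if (seen.size() == wanted.size()) {
                    // Positions in the window that are not accounted for by the query terms
                    long gaps = (long) (end - start + 1) - wanted.size();
                    if (gaps <= slop) {
                        return true;
                    }
                    break; // widening the window from this start only adds gaps
                }
            }
        }
        return false; // at least one term is missing, or the terms are too far apart
    }

    // Simplified "any term" rule: one matching term is enough.
    static boolean anyTermMatch(List<String> docTokens, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("foo", "bar", "baz");
        System.out.println(nearMatch(doc, Arrays.asList("foo", "bar", "baz"), Integer.MAX_VALUE)); // true
        System.out.println(nearMatch(doc, Arrays.asList("baz", "extra"), Integer.MAX_VALUE));      // false
        System.out.println(anyTermMatch(doc, Arrays.asList("baz", "extra")));                      // true
    }
}
```

Even with a slop of Integer.MAX_VALUE, the {'baz', 'extra'} case fails under the near rule because 'extra' never occurs, which mirrors the failing givenSingleMatch_andExtraTerm test.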

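Edit 4 - now that you have posted your use case, I understand what you want to do. Here is how you can do it: use a BooleanQuery, as mentioned above, to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.

So, the query in text form would look like this: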

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

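(As an example, this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA" occur within 100 places of each other; adjust the distance and boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may not give the expected results. That's fine, because the next section shows how to build this using the API.)

Programmatically, you would construct this as follows: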

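import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Declared as BooleanQuery (not Query) so that add() is available
BooleanQuery top = new BooleanQuery();

// Construct the terms since they will be used more than once.
// TermQuery and SpanTermQuery are not analyzed, so the term text must match the
// indexed form exactly (e.g. lowercase if the field was indexed with StandardAnalyzer).
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm),
                                                new SpanTermQuery(extraTerm) },
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);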
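
To see how the three "should" clauses are meant to affect ranking, here is a toy plain-Java model. It is only a sketch: it is not Lucene's scoring formula (which also involves term frequency, document frequency and normalization), and the names BoostedShouldSketch, orderedNear and toyScore are hypothetical. It assumes each matching term clause contributes 1 and the ordered proximity clause contributes its boost.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this does not use or reproduce Lucene's scoring.
class BoostedShouldSketch {

    // True if 'first' occurs before 'second' with at most 'slop' tokens between them.
    static boolean orderedNear(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    // Toy score: 1 per matching term clause, plus 'boost' if the proximity clause
    // matches. A score of 0 means no clause matched, so the document is not a hit.
    static float toyScore(List<String> tokens, String a, String b, int slop, float boost) {
        float score = 0f;
        if (tokens.contains(a)) {
            score += 1f;
        }
        if (tokens.contains(b)) {
            score += 1f;
        }
        if (orderedNear(tokens, a, b, slop)) {
            score += boost;
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> onlyBaz = Arrays.asList("foo", "bar", "baz");
        List<String> both = Arrays.asList("baz", "foo", "extra");
        List<String> neither = Arrays.asList("foo", "bar");
        System.out.println(toyScore(onlyBaz, "baz", "extra", 100, 5f)); // 1.0
        System.out.println(toyScore(both, "baz", "extra", 100, 5f));    // 7.0
        System.out.println(toyScore(neither, "baz", "extra", 100, 5f)); // 0.0
    }
}
```

The document that contains only "baz" still matches, and the document where both terms occur close together ranks well above it, which is the behaviour the boosted SpanNearQuery clause is intended to produce.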

Hope that helps! In the future, try to start by posting exactly what results you expect. Even if it is obvious to you, it may not be to the reader.
