java - Lucene index: single term and phrase queries

Question

I read in some documents and created a Lucene index like the following.

Documents:

id        1
keyword   foo bar
keyword   john

id        2
keyword   foo

id        3
keyword   john doe
keyword   bar foo
keyword   what the hell

I want to query Lucene in a way that lets me combine single terms and phrases.

If my query is

foo bar

it should return document IDs 1, 2, and 3.

The query

"foo bar"

should return document ID 1.

The query

john

should return document IDs 1 and 3.

The query

john "foo bar"

should return document ID 1.

My Java implementation does not work, and reading a lot of documentation did not help either.

When I query the index with

"foo bar"

I get 0 hits.

When I query the index with

foo "john doe"

I get document IDs 1, 2, and 3 back. I expect only document ID 3, since the query is meant as foo AND "john doe". The problem is that "john doe" returns 0 hits, while foo returns 3 hits.

My goal is to combine single terms and phrase terms. What am I doing wrong? I also played around with the analyzers, without any luck.

My implementation looks like this.

Indexer:

import ...

public class Indexer
{
  private static final Logger LOG = LoggerFactory.getLogger(Indexer.class);

  private final File indexDir;

  private IndexWriter writer;

  public Indexer(File indexDir)
  {
    this.indexDir = indexDir;
    this.writer = null;
  }

  private IndexWriter createIndexWriter()
  {
    try
    {
      Directory dir = FSDirectory.open(indexDir);
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
      iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      iwc.setRAMBufferSizeMB(256.0);
      IndexWriter idx = new IndexWriter(dir, iwc);
      idx.deleteAll();
      return idx;
    } catch (IOException e)
    {
      throw new RuntimeException(String.format("Could not create indexer on directory [%s]", indexDir.getAbsolutePath()), e);
    }
  }

  public void index(TestCaseDescription desc)
  {
    if (writer == null)
      writer = createIndexWriter();

    Document doc = new Document();
    addPathToDoc(desc, doc);
    addLastModifiedToDoc(desc, doc);
    addIdToDoc(desc, doc);
    for (String keyword : desc.getKeywords())
      addKeywordToDoc(doc, keyword);

    updateIndex(doc, desc);
  }

  private void addIdToDoc(TestCaseDescription desc, Document doc)
  {
    Field idField = new Field(LuceneConstants.FIELD_ID, desc.getId(), Field.Store.YES, Field.Index.ANALYZED);
    idField.setIndexOptions(IndexOptions.DOCS_ONLY);
    doc.add(idField);
  }

  private void addKeywordToDoc(Document doc, String keyword)
  {
    Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
    keywordField.setIndexOptions(IndexOptions.DOCS_ONLY);
    doc.add(keywordField);
  }

  private void addLastModifiedToDoc(TestCaseDescription desc, Document doc)
  {
    NumericField modifiedField = new NumericField(LuceneConstants.FIELD_LAST_MODIFIED);
    modifiedField.setLongValue(desc.getLastModified());
    doc.add(modifiedField);
  }

  private void addPathToDoc(TestCaseDescription desc, Document doc)
  {
    Field pathField = new Field(LuceneConstants.FIELD_PATH, desc.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
    pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
    doc.add(pathField);
  }

  private void updateIndex(Document doc, TestCaseDescription desc)
  {
    try
    {
      if (writer.getConfig().getOpenMode() == OpenMode.CREATE)
      {
        // New index, so we just add the document (no old document can be there):
        LOG.debug(String.format("Adding testcase [%s] (%s)", desc.getId(), desc.getPath()));
        writer.addDocument(doc);
      } else
      {
        // Existing index (an old copy of this document may have been indexed) so
        // we use updateDocument instead to replace the old one matching the exact
        // path, if present:
        LOG.debug(String.format("Updating testcase [%s] (%s)", desc.getId(), desc.getPath()));
        writer.updateDocument(new Term(LuceneConstants.FIELD_PATH, desc.getPath()), doc);
      }
    } catch (IOException e)
    {
      throw new RuntimeException(String.format("Could not create or update index for testcase [%s] (%s)", desc.getId(),
          desc.getPath()), e);
    }
  }

  public void store()
  {
    try
    {
      writer.close();
    } catch (IOException e)
    {
      throw new RuntimeException(String.format("Could not write index [%s]", writer.getDirectory().toString()), e);
    }
    writer = null;
  }
}

Searcher:

import ...

public class Searcher
{
  private static final Logger LOG = LoggerFactory.getLogger(Searcher.class);

  private final Analyzer analyzer;

  private final QueryParser parser;

  private final File indexDir;

  public Searcher(File indexDir)
  {
    this.indexDir = indexDir;
    analyzer = new StandardAnalyzer(Version.LUCENE_34);
    parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
    parser.setAllowLeadingWildcard(true);
  }

  public List<String> search(String searchString)
  {
    List<String> testCaseIds = new ArrayList<String>();
    try
    {
      IndexSearcher searcher = getIndexSearcher(indexDir);

      Query query = parser.parse(searchString);
      LOG.info("Searching for: " + query.toString(parser.getField()));
      AllDocCollector results = new AllDocCollector();
      searcher.search(query, results);

      LOG.info("Found [{}] hit", results.getHits().size());

      for (ScoreDoc scoreDoc : results.getHits())
      {
        Document doc = searcher.doc(scoreDoc.doc);
        String id = doc.get(LuceneConstants.FIELD_ID);
        testCaseIds.add(id);
      }

      searcher.close();
      return testCaseIds;
    } catch (Exception e)
    {
      throw new RuntimeException(String.format("Could not search index [%s]", indexDir.getAbsolutePath()), e);
    }

  }

  private IndexSearcher getIndexSearcher(File indexDir)
  {
    try
    {
      FSDirectory dir = FSDirectory.open(indexDir);
      return new IndexSearcher(dir);
    } catch (IOException e)
    {
      LOG.error(String.format("Could not open index directory [%s]", indexDir.getAbsolutePath()), e);
      throw new RuntimeException(e);
    }
  }
}

score 3 · Accepted Answer

Why are you using DOCS_ONLY?! If you index only doc IDs, all you have is a basic inverted index with a term -> document mapping and no proximity (position) information. That is why your phrase queries do not work.
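To illustrate why positions matter, here is a minimal plain-Java sketch of an inverted index that does store them. It is a toy under stated assumptions, not Lucene's implementation: the class ToyPositionalIndex and its methods are made up for this example, and tokenization is a simple lowercase whitespace split. A term lookup needs only the term -> document mapping, while the phrase lookup has to check that the words sit at adjacent positions, which is exactly the information DOCS_ONLY leaves out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index for illustration only; this is not how Lucene stores postings. */
class ToyPositionalIndex {
    // term -> (docId -> positions at which the term occurs in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
    // next free position per document, so several values of one field do not overlap
    private final Map<Integer, Integer> nextPosition = new HashMap<>();

    /** Adds one field value, lowercased and split on whitespace. */
    void add(int docId, String value) {
        String[] tokens = value.toLowerCase().split("\\s+");
        int start = nextPosition.getOrDefault(docId, 0);
        for (int i = 0; i < tokens.length; i++) {
            postings.computeIfAbsent(tokens[i], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(start + i);
        }
        // leave a gap of one position so a phrase cannot match across two values
        nextPosition.put(docId, start + tokens.length + 1);
    }

    private Map<Integer, List<Integer>> docsFor(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs != null ? docs : new HashMap<Integer, List<Integer>>();
    }

    /** Term lookup: the term -> document mapping is enough. */
    TreeSet<Integer> term(String term) {
        return new TreeSet<>(docsFor(term).keySet());
    }

    /** Exact phrase lookup: requires positions to verify that the words are adjacent. */
    TreeSet<Integer> phrase(String... words) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (Map.Entry<Integer, List<Integer>> first : docsFor(words[0]).entrySet()) {
            int docId = first.getKey();
            for (int start : first.getValue()) {
                boolean adjacent = true;
                for (int i = 1; i < words.length && adjacent; i++) {
                    List<Integer> positions = docsFor(words[i]).get(docId);
                    adjacent = positions != null && positions.contains(start + i);
                }
                if (adjacent) {
                    hits.add(docId);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyPositionalIndex index = new ToyPositionalIndex();
        // the three documents from the question
        index.add(1, "foo bar");
        index.add(1, "john");
        index.add(2, "foo");
        index.add(3, "john doe");
        index.add(3, "bar foo");
        index.add(3, "what the hell");

        System.out.println(index.term("foo"));           // [1, 2, 3]
        System.out.println(index.phrase("foo", "bar"));  // [1]
        System.out.println(index.phrase("john", "doe")); // [3]
    }
}
```

For the code in the question, this presumably means dropping the setIndexOptions(IndexOptions.DOCS_ONLY) call on the keyword field, or setting IndexOptions.DOCS_AND_FREQS_AND_POSITIONS explicitly, and then reindexing.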

score 0 · Accepted Answer

I think what you roughly want is:

keyword:"foo bar"~1^2 OR keyword:"foo" OR keyword:"bar"

That is: match the phrase "foo bar" with a boost (so the full phrase is preferred), or match "foo", or match "bar".

The full query syntax is documented here: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html

Edit:

What you seem to be missing is that the default operator is OR. So you probably want something like this:

+keyword:john AND +keyword:"foo bar"

The plus sign means "must contain". I put the AND in explicitly so that a document has to contain both (by default the query is read as: must contain john OR must contain "foo bar").
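To make the OR versus AND difference concrete, here is a minimal plain-Java sketch with no Lucene involved. The class name ToyBooleanCombination and the hard-coded hit sets are illustrative assumptions: the sets are the hits one would expect for foo and for "john doe" on the question's three documents once phrase queries work.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy illustration of how OR and AND combine the hits of two query clauses. */
class ToyBooleanCombination {
    /** OR: a document matches if it matches at least one clause (set union). */
    static TreeSet<Integer> or(TreeSet<Integer> a, TreeSet<Integer> b) {
        TreeSet<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** AND: a document matches only if it matches both clauses (set intersection). */
    static TreeSet<Integer> and(TreeSet<Integer> a, TreeSet<Integer> b) {
        TreeSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // assumed hits for the clause: foo
        TreeSet<Integer> foo = new TreeSet<>(Arrays.asList(1, 2, 3));
        // assumed hits for the clause: "john doe"
        TreeSet<Integer> johnDoe = new TreeSet<>(Arrays.asList(3));

        System.out.println(or(foo, johnDoe));  // [1, 2, 3]
        System.out.println(and(foo, johnDoe)); // [3]
    }
}
```

If I remember correctly, the classic QueryParser also offers setDefaultOperator(QueryParser.Operator.AND) to make AND the default, but check that against the javadoc of your Lucene version.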

score 0 · Accepted Answer

I solved the problem by replacing

StandardAnalyzer

with

KeywordAnalyzer

in both the indexer and the searcher.

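I was able to pin down that StandardAnalyzer splits the input text into several words, so I replaced it with KeywordAnalyzer, which leaves the input (which may consist of one or more words) unchanged. It recognizes a term such as

bla foo

as a single keyword.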
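The difference between the two behaviors can be sketched in plain Java. This is only a rough model under my own assumptions, not the analyzers' actual code: ToyTokenization and its two methods are hypothetical names, and the word-splitting variative here only lowercases and splits on whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of the two tokenization styles; not the real Lucene analyzers. */
class ToyTokenization {
    /** Word-splitting style: one lowercased token per whitespace-separated word. */
    static List<String> splitIntoWords(String value) {
        return Arrays.asList(value.toLowerCase().split("\\s+"));
    }

    /** Keyword style: the whole value becomes a single, unchanged token. */
    static List<String> wholeValue(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println(splitIntoWords("bla foo")); // [bla, foo]
        System.out.println(wholeValue("bla foo"));     // [bla foo]
    }
}
```

One consequence, if this picture is right: with whole-value tokens, a query for foo alone would no longer match the keyword foo bar, since the only indexed token is the complete value.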
