java - Why doesn't the Lucene algorithm work for exact strings in Java?

Question

I am working with the Lucene algorithm in Java. We have 100K stop names in a MySQL database. The stop names look like this:

NEW YORK PENN STATION, 
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc

When the user enters a search input such as NEW YORK, the NEW YORK PENN STATION stop appears in the results, but when the user enters the exact string NEW YORK PENN STATION, zero results are returned.

My code is:

public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
  {
      ArrayList<String> arResult = new ArrayList<String>();

        try
        {
            // 0. Specify the analyzer for tokenizing text.
            //    The same analyzer should be used for indexing and searching
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

            // 1. create the index
            Directory index = new RAMDirectory();

            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

            IndexWriter w = new IndexWriter(index, config);

            for(int i = 0; i < source.size(); i++)
            {
                addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
            }

            w.close();

            // 2. query
            // the "title" arg specifies the default field to use
            // when no field is explicitly specified in the query.
            Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

            // 3. search
            int hitsPerPage = 20;
            IndexReader reader = DirectoryReader.open(index);
            IndexSearcher searcher = new IndexSearcher(reader);
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            // 4. Get results
            for(int i = 0; i < hits.length; ++i) 
            {
                  int docId = hits[i].doc;
                  Document d = searcher.doc(docId);
                  arResult.add(d.get("title"));
            }

            // reader can only be closed when there
            // is no need to access the documents any more.
            reader.close();

        }
        catch(Exception e)
        {
            System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
        }

        return arResult;

  }

  private static void addDoc(IndexWriter w, String title, String isbn) throws IOException 
  {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));

        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
  }

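In this code, source is the list of stop names and the query (querystr) is the search input supplied by the user.

Does the Lucene algorithm work on large strings?

Why doesn't the Lucene algorithm work on exact strings?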

score 2 · Accepted Answer

Instead of

1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

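Example: "new york station" is parsed into "title:new title:york title:station*" (the appended * applies only to the last term). This query returns every document that contains any one of those terms.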
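To make the "any one of those terms" behaviour concrete, here is a minimal plain-Java sketch. It does not use Lucene: tokenize is a rough stand-in for an analyzer, matchAny models OR matching, and both names (and the class name) are hypothetical. The trailing wildcard is ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java model of OR matching over tokenized titles; this is not Lucene code.
class OrMatchSketch {

    // Rough stand-in for an analyzer: lowercase, then split on
    // anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // OR semantics: a title is a hit if it shares at least one term with the query.
    static List<String> matchAny(List<String> titles, String query) {
        List<String> queryTerms = tokenize(query);
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            List<String> titleTerms = tokenize(title);
            for (String q : queryTerms) {
                if (titleTerms.contains(q)) {
                    hits.add(title);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList(
            "NEW YORK PENN STATION", "NEWARK PENN STATION",
            "NEWARK BROAD ST", "NEW PROVIDENCE");
        // Every stop sharing "new", "york" or "station" is returned.
        System.out.println(matchAny(stops, "new york station"));
    }
}
```

In this model NEWARK PENN STATION and NEW PROVIDENCE are hits as well, because each shares a single term with the query.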

try this:

2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");

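Example 1: "new york" is parsed into "+(title:new title:york)".

The "+" above marks a "must" occurrence of the terms in the result documents. It matches documents containing "new york" as well as those containing "new york station".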

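Example 2: "new york station" is parsed into "+(title:new title:york title:station)". The intent is to match only "new york station" and not documents that contain just "new york", since "station" is missing from those. Note, however, that under the parser's default OR operator the terms inside the parentheses remain optional relative to one another; to require all of them, prefix each term with "+" or set the parser's default operator to AND.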
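The difference between "any term" and "every term must occur" can be sketched in plain Java. This is a model rather than Lucene code: matchAll and tokenize are hypothetical helpers, and matchAll corresponds roughly to a query in which each term carries its own "+", such as "+new +york +station".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java model of "all terms required" matching; this is not Lucene code.
class AllTermsSketch {

    // Rough stand-in for an analyzer: lowercase and split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // AND semantics: a title is a hit only if it contains every query term.
    static List<String> matchAll(List<String> titles, String query) {
        List<String> queryTerms = tokenize(query);
        List<String> hits = new ArrayList<>();
        if (queryTerms.isEmpty()) {
            return hits; // an empty query matches nothing in this sketch
        }
        for (String title : titles) {
            if (tokenize(title).containsAll(queryTerms)) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList(
            "NEW YORK PENN STATION", "NEWARK PENN STATION",
            "NEWARK BROAD ST", "NEW PROVIDENCE");
        System.out.println(matchAll(stops, "new york station"));
        System.out.println(matchAll(stops, "penn station"));
    }
}
```

With the sample stops, only NEW YORK PENN STATION contains all of "new", "york" and "station", while "penn station" matches both Penn Station stops.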

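Make sure the field name "title" is the one you intend to search.

Your questions:

Does the Lucene algorithm work on large strings?

You need to define what a large string is. Are you actually looking for a phrase search? In general, yes, Lucene works on large strings.

Why doesn't the Lucene algorithm work on exact strings?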

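Because parsing (querystr + "*") produces individual term queries joined by the OR operator. Example: 'new york*' is parsed as "title:new OR title:york*".

If you want to find "new york station", the wildcard query above is not what you should be using. The StandardAnalyzer you passed in tokenizes "new york station" (breaks it into terms) into three separate terms at indexing time.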

So the query "york*" finds "york station" only because the document contains the term "york", not because of the wildcard. "york" and "station" are different terms, that is, separate entries in the index.
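The idea that each term is its own index entry can be illustrated with a toy inverted index in plain Java. This is only a model of the concept, not Lucene's actual data structure; buildIndex and prefixLookup are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: each term maps to the ids of the titles containing it.
class InvertedIndexSketch {

    static Map<String, Set<Integer>> buildIndex(List<String> titles) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < titles.size(); docId++) {
            String lower = titles.get(docId).toLowerCase(Locale.ROOT);
            for (String term : lower.split("[^a-z0-9]+")) {
                if (term.isEmpty()) {
                    continue;
                }
                Set<Integer> postings = index.get(term);
                if (postings == null) {
                    postings = new TreeSet<>();
                    index.put(term, postings);
                }
                postings.add(docId);
            }
        }
        return index;
    }

    // Model of a prefix query such as york*: collect the documents of
    // every indexed term that starts with the given prefix.
    static Set<Integer> prefixLookup(Map<String, Set<Integer>> index, String prefix) {
        Set<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : index.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                docs.addAll(e.getValue());
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex(
            Arrays.asList("NEW YORK PENN STATION", "NEWARK PENN STATION"));
        // prints {new=[0], newark=[1], penn=[0, 1], station=[0, 1], york=[0]}
        System.out.println(index);
        System.out.println(prefixLookup(index, "york")); // prints [0]
    }
}
```

The prefix "york" is matched against single terms only, so it can never extend across the gap into "station"; a prefix like "new" would pick up both "new" and "newark".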

What you actually need is a PhraseQuery to search for the exact string. For that, the query string should be enclosed in double quotes, for example "new york".
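What a phrase search adds over term matching is the requirement that the terms appear next to each other and in order. A minimal plain-Java sketch of that idea follows; it is a model and not Lucene's PhraseQuery implementation, and matchPhrase and tokenize are hypothetical helpers. With the question's QueryParser, the quoted form would presumably be built as "\"" + querystr + "\"", which has not been run against the index in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java model of phrase matching; this is not Lucene code.
class PhraseMatchSketch {

    // Rough stand-in for an analyzer: lowercase and split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // A title is a hit only if the phrase terms occur consecutively
    // and in the same order within the title's terms.
    static List<String> matchPhrase(List<String> titles, String phrase) {
        List<String> phraseTerms = tokenize(phrase);
        List<String> hits = new ArrayList<>();
        if (phraseTerms.isEmpty()) {
            return hits; // an empty phrase matches nothing in this sketch
        }
        for (String title : titles) {
            if (Collections.indexOfSubList(tokenize(title), phraseTerms) >= 0) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList(
            "NEW YORK PENN STATION", "NEWARK PENN STATION",
            "NEWARK BROAD ST", "NEW PROVIDENCE");
        System.out.println(matchPhrase(stops, "new york penn station"));
        System.out.println(matchPhrase(stops, "york new")); // wrong order
    }
}
```

In this model the exact input "new york penn station" returns only NEW YORK PENN STATION, and the reversed phrase "york new" returns nothing.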
