lucene - What is the difference between Lucene StandardAnalyzer and EnglishAnalyzer?

Question

I'm working on indexing English tweets with Lucene 4.3, but I'm not sure which analyzer to use. What is the difference between Lucene StandardAnalyzer and EnglishAnalyzer?

I also tried testing StandardAnalyzer with the text "XY&Z Corporation - xyz@example.com". The output is [xy] [z] [corporation] [xyz] [example.com], but I expected [XY&Z] [Corporation] [xyz@example.com].

Am I doing something wrong?

score 16 · Accepted Answer

Take a look at the source. Analyzers are generally quite readable; you only need to look at the createComponents method to see which tokenizer and filters are used. Here is EnglishAnalyzer's:

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopwords);
    if(!stemExclusionSet.isEmpty())
      result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
}

StandardAnalyzer, by contrast, is just a StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. EnglishAnalyzer rolls in an EnglishPossessiveFilter, KeywordMarkerFilter, and PorterStemFilter on top of that.
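To see the effect of those extra filters, you can run the same text through both analyzers and print the terms each one emits. This is only a minimal sketch, assuming Lucene 4.3 with lucene-core and lucene-analyzers-common on the classpath; the class name AnalyzerComparison and the field name "body" are placeholders of mine, not anything from Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class for comparing analyzer output.
class AnalyzerComparison {

    // Runs the text through the analyzer and collects the emitted terms.
    // "body" is an arbitrary field name chosen for this sketch.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
        try {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // must be called before the first incrementToken()
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println("StandardAnalyzer: "
                + tokens(new StandardAnalyzer(Version.LUCENE_43), text));
        System.out.println("EnglishAnalyzer:  "
                + tokens(new EnglishAnalyzer(Version.LUCENE_43), text));
    }
}
```

For StandardAnalyzer this should reproduce the [xy] [z] [corporation] [xyz] [example.com] output reported in the question. The EnglishAnalyzer line will additionally show stemmed forms, so the terms may no longer be whole words.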

Mainly, EnglishAnalyzer rolls in some English stemming enhancements, which should work well for plain English text.

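With StandardAnalyzer, the only assumption I'm aware of that ties it directly to English analysis is the default stop word set, which is of course just a default and can be changed. StandardAnalyzer now implements Unicode Standard Annex #29, which attempts to provide non-language-specific text segmentation.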
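Changing the stop word set is a matter of passing your own CharArraySet to the constructor. The following is a rough sketch, again assuming Lucene 4.3; the class StopWordConfig and its two factory methods are illustrative names I made up, not part of Lucene.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

// Hypothetical factory methods showing how the default stop words can be replaced.
class StopWordConfig {

    // StandardAnalyzer that keeps every token (empty stop word set).
    static StandardAnalyzer withoutStopWords() {
        return new StandardAnalyzer(Version.LUCENE_43, CharArraySet.EMPTY_SET);
    }

    // StandardAnalyzer with a caller-supplied stop word list.
    // The 'true' flag makes the set itself match case-insensitively.
    static StandardAnalyzer withStopWords(String... words) {
        CharArraySet stopWords =
                new CharArraySet(Version.LUCENE_43, Arrays.asList(words), true);
        return new StandardAnalyzer(Version.LUCENE_43, stopWords);
    }
}
```

For tweets you might call something like StopWordConfig.withStopWords("rt", "via"), though that particular word list is just an example of mine. EnglishAnalyzer has a constructor that accepts a CharArraySet of stop words in the same way.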
