solr - Spell checker using Lucene

Question

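I'm trying to build a spell corrector with the Lucene spell checker. I'd like to feed it a single text file containing the text content of a blog. The problem is that it only works when the dictionary file has one sentence/word per line. Also, the suggest API returns results without weighting them by number of occurrences. The source code is below.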

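    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SpellCorrector {

        private SpellChecker spellChecker = null;

        public SpellCorrector() {
            try {
                // Directory that holds the spell-check index
                File file = new File("/home/ubuntu/spellCheckIndex");
                Directory directory = FSDirectory.open(file);

                spellChecker = new SpellChecker(directory);

                StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
                spellChecker.indexDictionary(
                        new PlainTextDictionary(new File("/home/ubuntu/main.dictionary")), config, true);
                // Should I format this file with one sentence/word per line?

            } catch (IOException e) {
                // Report the failure instead of silently swallowing it
                e.printStackTrace();
            }
        }

        public String correct(String query) {
            if (spellChecker != null) {
                try {
                    String[] suggestions = spellChecker.suggestSimilar(query, 5);
                    // This returns the suggestion not based on occurrence but based on when it occurred

                    if (suggestions != null && suggestions.length != 0) {
                        return suggestions[0];
                    }
                } catch (IOException e) {
                    return null;
                }
            }
            return null;
        }
    }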

Do I need to make any changes?

score 2 · Accepted Answer

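As for your first problem, that sounds like the expected, documented dictionary format; see the PlainTextDictionary API documentation. If you want to pass in arbitrary text, you may want to index it and use a LuceneDictionary instead, or possibly a HighFrequencyDictionary, depending on your needs.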
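If indexing the text first is more than you need, a simpler alternative (not the LuceneDictionary route described above) is to preprocess the blog text into the one-word-per-line layout that PlainTextDictionary reads. The following is a minimal sketch in plain Java; `DictionaryFileBuilder` and `toWordPerLine` are hypothetical names, not part of Lucene, and the tokenization rule (split on anything that is not a letter or digit) is an assumption you may need to adjust.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: turns free text into
// dictionary content with one lower-cased word per line.
final class DictionaryFileBuilder {

    static String toWordPerLine(String text) {
        // TreeSet removes duplicates and keeps the words sorted.
        Set<String> words = new TreeSet<String>();
        // Split on every run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        StringBuilder out = new StringBuilder();
        for (String word : words) {
            out.append(word).append('\n');
        }
        return out.toString();
    }
}
```

Writing the returned string to the dictionary file should give PlainTextDictionary the layout it expects. Note that this discards occurrence counts, so it does nothing for the weighting problem; that part still needs an index the spell checker can look frequencies up in.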

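The spell checker suggests replacements based on the similarity between words (Levenshtein distance) before any other concern. If you want it to recommend only more popular terms as suggestions, you should pass a SuggestMode to SpellChecker.suggestSimilar. This ensures that the suggested matches are at least as popular as the word they are intended to replace.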
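To make "at least as popular" concrete, here is a simplified model of that filtering rule in plain Java. It is not Lucene's code: `PopularityFilter`, `atLeastAsPopular`, and the `docFreq` map are hypothetical stand-ins for the frequency lookups the spell checker performs against an index. In Lucene 3.6 itself, as far as I can tell, the mode is passed through the `suggestSimilar` overload that also takes an `IndexReader` and a field name, with `SuggestMode.SUGGEST_MORE_POPULAR` as the mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model, not Lucene code: keeps only candidates that occur
// at least as often as the word being corrected.
final class PopularityFilter {

    static List<String> atLeastAsPopular(Map<String, Integer> docFreq,
                                         String query,
                                         List<String> candidates) {
        Integer q = docFreq.get(query);
        int queryFreq = (q == null) ? 0 : q; // unknown word counts as frequency 0
        List<String> kept = new ArrayList<String>();
        for (String candidate : candidates) {
            Integer c = docFreq.get(candidate);
            int freq = (c == null) ? 0 : c;
            // Drop candidates missing from the corpus or rarer than the query.
            if (freq >= 1 && freq >= queryFreq) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

The point of the sketch is that frequencies have to come from somewhere: without an index to read them from, the spell checker has only similarity to rank by.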

If you need to override how Lucene decides on the best matches, you can do that with SpellChecker.setComparator, creating your own Comparator on SuggestWord. SuggestWord exposes freq to you, so it should be easy to order the found matches by popularity.
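A minimal sketch of such a comparator follows. To keep it runnable without Lucene on the classpath, it uses a hypothetical stand-in class, `Suggestion`, that declares the same three fields SuggestWord exposes (`string`, `score`, `freq`); in real code you would implement `Comparator<SuggestWord>` instead. It assumes that the "greater" element is treated as the better suggestion, which is worth verifying against your Lucene version. Lucene's spell package also ships a ready-made `SuggestWordFrequencyComparator`, which may already do what you need.

```java
import java.util.Comparator;

// Hypothetical stand-in for Lucene's SuggestWord, declaring only the
// three fields the comparator needs.
final class Suggestion {
    final String string; // suggested word
    final float score;   // similarity to the misspelled word
    final int freq;      // how often the word occurs in the corpus

    Suggestion(String string, float score, int freq) {
        this.string = string;
        this.score = score;
        this.freq = freq;
    }
}

// Ranks by frequency first, then by similarity, then alphabetically.
// Assumption: the "greater" element is the better suggestion.
final class FrequencyFirst implements Comparator<Suggestion> {
    @Override
    public int compare(Suggestion a, Suggestion b) {
        int byFreq = Integer.compare(a.freq, b.freq);
        if (byFreq != 0) {
            return byFreq;
        }
        int byScore = Float.compare(a.score, b.score);
        if (byScore != 0) {
            return byScore;
        }
        return a.string.compareTo(b.string);
    }
}
```

Ranking purely by frequency can push a common but dissimilar word ahead of a near-exact match, so keeping the score as a tie-breaker (or combining the two) is a reasonable default.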
