lucene - Word co-occurrence: finding term co-occurrence in a set of n-grams

Question

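How would I go about writing a co-occurrence class, in something like Java, that takes a file full of n-grams and computes word co-occurrence for a given input term?

Is there a library or package that works with Lucene (an index), or with something like map-reduce over the list of n-grams in Hadoop?

Thanks.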

score 2 · Accepted Answer

OK, so let's assume you want to find the co-occurrence of two different words within a file of n-grams.

Here is some pseudocode-ish Java:

// Co-occurrence matrix: first word -> (second word -> count)
Map<String, Map<String, Integer>> map = new HashMap<>();

// List of n-grams
List<List<String>> ngrams = ..... // assume we've loaded them into here already

// Build the matrix
for (List<String> ngram : ngrams) {
  // Calculate word co-occurrence within this n-gram for all words.
  // The result maps a word pair -> count, with the two words of each
  // pair in alphabetical order.
  Map<List<String>, Integer> wordCooccurrence = cooccurrence(ngram); // assume we have this

  // Then merge these counts into the matrix
  for (Map.Entry<List<String>, Integer> e : wordCooccurrence.entrySet()) {
    map.computeIfAbsent(e.getKey().get(0), k -> new HashMap<>())
       .merge(e.getKey().get(1), e.getValue(), Integer::sum);
  }
}

// To query, look up the two words in alphabetical order

Counts like this are probably cleaner to do in Pig, though you are probably more familiar with it than I am.
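
As a rough illustration of the outline above, here is a minimal self-contained sketch. It assumes that two words "co-occur" when they both appear in the same n-gram, counted once per n-gram; that definition is an assumption, since the answer leaves it open. The class name `CooccurrenceCounter` and the helpers `pairs`, `build`, and `count` are hypothetical names chosen for this example, and the two sample n-grams are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class CooccurrenceCounter {

    // Returns the distinct unordered word pairs of one n-gram. Each pair is
    // stored in alphabetical order, so ("york", "city") and ("city", "york")
    // end up under the same key.
    static List<String[]> pairs(List<String> ngram) {
        List<String> words = new ArrayList<>(new TreeSet<>(ngram)); // sorted, deduplicated
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                result.add(new String[] {words.get(i), words.get(j)});
            }
        }
        return result;
    }

    // Builds the nested map: first word -> (second word -> number of
    // n-grams that contain both words).
    static Map<String, Map<String, Integer>> build(List<List<String>> ngrams) {
        Map<String, Map<String, Integer>> matrix = new HashMap<>();
        for (List<String> ngram : ngrams) {
            for (String[] pair : pairs(ngram)) {
                matrix.computeIfAbsent(pair[0], k -> new HashMap<>())
                      .merge(pair[1], 1, Integer::sum);
            }
        }
        return matrix;
    }

    // Looks up a pair given in either order by sorting the two words first.
    static int count(Map<String, Map<String, Integer>> matrix, String a, String b) {
        String first = a.compareTo(b) <= 0 ? a : b;
        String second = a.compareTo(b) <= 0 ? b : a;
        return matrix.getOrDefault(first, Map.of()).getOrDefault(second, 0);
    }

    public static void main(String[] args) {
        // Made-up sample data; a real run would load n-grams from a file.
        List<List<String>> ngrams = List.of(
            List.of("new", "york", "city"),
            List.of("york", "city", "hall"));
        Map<String, Map<String, Integer>> matrix = build(ngrams);
        System.out.println(count(matrix, "york", "city")); // 2
        System.out.println(count(matrix, "new", "hall"));  // 0
    }
}
```

Storing each pair in alphabetical order halves the size of the matrix and means a query never has to check both orderings. If the n-gram file carries frequency counts, the `1` passed to `merge` could be replaced with that frequency.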
