tm - R: Find the most frequently used groups of words in a corpus

Question

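Is there an easy way to find not only the most frequent terms in a text corpus in R, but also expressions (more than one word, i.e. groups of words)?

With the tm package, I can find the most frequent terms like this: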

tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=3, highfreq=Inf)

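Since I can use the findAssocs() function to find words associated with the most frequent words, I could group those words manually. But how can I find the number of occurrences of these word groups in the corpus?

Thanks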

score 4 · Accepted Answer

If I recall correctly, you can use weka to build a TermDocumentMatrix of bigrams (two words that occur next to each other), and then process it as needed.

library("tm") #text mining
library("RWeka") # for tokenization algorithms more complicated than single-word


BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# process tdm 
# findFreqTerms(tdm, lowfreq=3, highfreq=Inf)
# ...

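# drop very sparse terms (sparsity above 0.99)
tdm <- removeSparseTerms(tdm, 0.99)
print("----")
print("tdm properties")
str(tdm)
# topN_percentage_wanted was not defined in the original; 10 is an assumed example value
topN_percentage_wanted <- 10
tdm_top_N_percent <- tdm$nrow / 100 * topN_percentage_wanted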
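
To get the actual number of occurrences of each bigram, one option is to sum the rows of the term-document matrix. The following is a minimal sketch, not necessarily what the answer had in mind: the toy documents are made up, and it assumes the corpus is a VCorpus (as far as I know, recent tm versions do not apply custom tokenizers to a SimpleCorpus) and that the slam package, which tm depends on, is installed.

```r
library(tm)     # text mining
library(RWeka)  # NGramTokenizer (requires Java)

# Toy documents, made up for illustration
docs <- c("the quick brown fox", "the quick brown dog", "a quick brown fox")
corpus <- VCorpus(VectorSource(docs))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Sum each term's row across all documents to get corpus-wide counts
bigram_counts <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(bigram_counts, 10)
```

Using slam::row_sums() avoids converting the sparse matrix to a dense one, which can matter for a large corpus.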

Or alternatively,

# n-grams of 1 to 5 consecutive words (min and max are n-gram lengths, not occurrence counts)
wmin <- 1
wmax <- 5

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = wmin, max = wmax))
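
If RWeka (and therefore Java) is not available, a similar range tokenizer can probably be built from ngrams() and words() in the NLP package, which is attached together with tm. This is only a sketch: NGramRangeTokenizer and its n_min/n_max parameters are hypothetical names introduced here, and the input document is toy data.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical helper: all n-grams whose length is between n_min and n_max words
NGramRangeTokenizer <- function(x, n_min = 1, n_max = 3) {
  w <- words(x)
  sizes <- seq(n_min, n_max)
  sizes <- sizes[sizes <= length(w)]  # skip lengths longer than the document
  unlist(lapply(sizes, function(n) {
    vapply(ngrams(w, n), paste, character(1), collapse = " ")
  }), use.names = FALSE)
}

doc <- PlainTextDocument("the quick brown fox")  # toy input
tokens <- NGramRangeTokenizer(doc, n_min = 1, n_max = 2)
print(tokens)
```

It should be usable in the same way as the Weka-based tokenizer, i.e. passed as control = list(tokenize = NGramRangeTokenizer) when building the TermDocumentMatrix from a VCorpus.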

To get "better" word groups, it may help to run word stemming first.
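
As a rough illustration of stemming before tokenizing, the sketch below uses tm's stemDocument() transformation, which relies on the SnowballC package. The documents are made-up toy data, and the corpus is assumed to be a VCorpus so that the custom tokenizer is applied.

```r
library(tm)
library(SnowballC)  # stemmer used by stemDocument()
library(RWeka)      # NGramTokenizer (requires Java)

# Toy documents, made up for illustration
docs <- c("connecting connected connection", "connects connect")
corpus <- VCorpus(VectorSource(docs))

# Lower-case and stem every document before building n-grams
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)
stemmed_words <- strsplit(as.character(corpus[[1]]), " ")[[1]]

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm)
```

After stemming, inflected variants collapse onto the same stem, so bigrams that differ only in word endings are counted together.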
