r - Trying to remove words from a DocumentTermMatrix in order to use topicmodels

Question

I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, each about 1000 words long). The process runs and then dies, I think because it is running out of memory.

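So I tried to shrink the document-term matrix that the lda() function takes as input. I figured I could do that with the minDocFreq option when generating the document-term matrix. But when I use it, it does not seem to make any difference. Here is the relevant code: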

> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1]  6423 41613
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1]  6423 41613

Same dimensions, same number of columns (that is, the same number of terms).

What am I doing wrong? Thanks.

score 15 · Accepted Answer

The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

In brief, recent versions of the tm package no longer include minDocFreq; bounds is used instead. For example, your

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now be

library(tm)
data("crude")

smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(5,Inf))))
dim(smaller) # after terms that appear in fewer than 5 documents are discarded
[1] 20 67
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(10,Inf))))
dim(smaller) # after terms that appear in fewer than 10 documents are discarded
[1] 20 17
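
If it helps, here is a minimal sketch of how you might check that the trimming did what you expected, and apply the same kind of filter to a matrix you have already built. It uses the crude corpus that ships with tm and the slam package (which tm depends on). The threshold min_docs and the variable names are only illustrative assumptions. The last step, dropping documents left with no terms, is something I would expect to be needed before topic modelling, but treat it as an assumption and check it against your own data.

```r
library(tm)    # DocumentTermMatrix, weightBin, the bundled "crude" corpus
library(slam)  # col_sums / row_sums for sparse matrices

data("crude")

# Untrimmed matrix, for comparison.
full <- DocumentTermMatrix(crude)

# Illustrative threshold (an assumption, not a recommendation).
min_docs <- 5

# Trim at construction time: keep terms found in at least min_docs documents.
trimmed <- DocumentTermMatrix(
  crude,
  control = list(bounds = list(global = c(min_docs, Inf)))
)

# Document frequency of each remaining term: binarise the counts,
# then sum each column to count the documents containing the term.
doc_freq <- col_sums(weightBin(trimmed))

# The same kind of filter applied to the already-built matrix.
full_doc_freq <- col_sums(weightBin(full))
filtered <- full[, full_doc_freq >= min_docs]

# Trimming can leave some documents with no terms at all; drop them here.
filtered <- filtered[row_sums(filtered) > 0, ]

dim(full)
dim(trimmed)
dim(filtered)
```

Comparing dim(full) with dim(trimmed) shows directly whether the bounds setting took effect, which is the check that minDocFreq silently failed in the question.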
