r - Numeric matrix is taking up far more memory than it should - R

Question

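I'm creating a document-term matrix (dtm for short) for a Naive Bayes implementation (I know there is a function for this, but I have to code it myself for homework). I wrote a function that successfully creates the dtm; the problem is that the resulting matrix takes up a lot of memory. For example, a 100 x 32000 matrix (of 0s and 1s) is 24MB in size! This causes R to crash when I try to work with the full 10k documents. The functions follow, and a toy example is in the last 3 lines. Can anyone spot why the "sparser" function in particular returns such memory-intensive results?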

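#function to build the dictionary: every distinct non-stop word across all docs
#note: stopWords must already exist as a character vector; its definition is not shown in this post
listAllWords <- function(docs)
{
  str1 <- strsplit(x=docs, split="\\s", fixed=FALSE)
  dictDupl <- unlist(str1)[!(unlist(str1) %in% stopWords)]
  dictionary <- unique(dictDupl)
  return(dictionary)
}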

#function to create the sparse matrix of words as they appear in each article segment
sparser <- function (docs, dictionary) 
{
  num.docs <- length(docs) #dtm rows
  num.words <- length(dictionary) #dtm columns
  dtm <- mat.or.vec(num.docs,num.words) # Instantiate dtm of zeroes
  for (i in 1:num.docs)
  {
    doc.temp <- unlist(strsplit(x=docs[i], split="\\s", fixed=FALSE)) #vectorize words
    num.words.doc <- length(doc.temp)
    for (j in 1:num.words.doc)
    {
      ind <- which(dictionary == doc.temp[j]) #loop over words and find index in dict.
      dtm[i,ind] <- 1 #indicate this word is in this document
    }
  }
  return(dtm)
}


docs <- c("the first document contains words", "the second document is also made of words", "the third document is words and a number 4")
dictionary <- listAllWords(docs)
dtm <- sparser(docs,dictionary)

In case it makes a difference, I'm running this in RStudio on Mac OS X, 64-bit.

score 1 · Accepted Answer

Surely part of the problem is that you're actually storing doubles rather than integers. Note:

m <- mat.or.vec(100,32000)
m1 <- matrix(0L,100,32000)

> object.size(m)
25600200 bytes
> object.size(m1)
12800200 bytes
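
The factor of two in those sizes follows from element width: R stores a double in 8 bytes and an integer in 4. A minimal sketch of the arithmetic (the variable names here are illustrative, and the small fixed header overhead can differ between R versions):

```r
n_cells <- 100 * 32000            # number of matrix elements

bytes_double  <- n_cells * 8      # payload if every cell is a double
bytes_integer <- n_cells * 4      # payload if every cell is an integer

m  <- matrix(0,  100, 32000)      # double matrix, as mat.or.vec would build
m1 <- matrix(0L, 100, 32000)      # integer matrix

# The difference between the two objects is exactly the payload difference,
# because both carry the same header and dim attribute.
size_diff <- as.numeric(object.size(m)) - as.numeric(object.size(m1))
print(size_diff == bytes_double - bytes_integer)
```

By the same arithmetic, a 10000 x 32000 double matrix needs about 2.56 GB for its payload alone, which is consistent with the crashes described in the question.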

Note the lack of an "L" in the code for mat.or.vec:

> mat.or.vec
function (nr, nc) 
if (nc == 1L) numeric(nr) else matrix(0, nr, nc)
<bytecode: 0x1089984d8>
<environment: namespace:base>

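You'll also want to assign 1L explicitly; otherwise, I think, R will convert everything to doubles on the first assignment. You can verify this by assigning a double such as 1 to a single element of m1 and checking the object size again.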
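
The check described above can be sketched as follows. The variable names are illustrative; the point is only that assigning a double literal (1) into an integer matrix changes its type, while an integer literal (1L) does not:

```r
m1 <- matrix(0L, 100, 32000)
type_before <- typeof(m1)                 # "integer"
size_before <- as.numeric(object.size(m1))

m1[1, 1] <- 1                             # 1 is a double literal
type_after <- typeof(m1)                  # "double": the whole matrix was coerced
size_after <- as.numeric(object.size(m1))

m2 <- matrix(0L, 100, 32000)
m2[1, 1] <- 1L                            # 1L is an integer literal
type_kept <- typeof(m2)                   # "integer": no coercion

print(c(type_before, type_after, type_kept))
print(size_after / size_before)           # roughly 2
```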

I should probably also mention the function storage.mode, which can help you verify that you're using integers.
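
Putting these points together, an integer-only variant of the question's function might look like the sketch below. This is not the asker's code: `sparser_int` is a hypothetical name, the inner loop is replaced by `match()`, and the toy dictionary is built without stop-word filtering because `stopWords` is not shown in the question.

```r
# Hypothetical integer-only variant of sparser (a sketch, not the original)
sparser_int <- function(docs, dictionary) {
  # Allocate with 0L so the matrix starts out as integer, not double
  dtm <- matrix(0L, nrow = length(docs), ncol = length(dictionary))
  for (i in seq_along(docs)) {
    words <- unlist(strsplit(docs[i], split = "\\s"))
    ind <- match(words, dictionary)   # NA for words absent from the dictionary
    ind <- ind[!is.na(ind)]
    dtm[i, ind] <- 1L                 # 1L keeps the storage mode integer
  }
  dtm
}

docs <- c("the first document contains words",
          "the second document is also made of words",
          "the third document is words and a number 4")
# Assumption: no stop words are removed in this toy dictionary
dictionary <- unique(unlist(strsplit(docs, split = "\\s")))

dtm <- sparser_int(docs, dictionary)
print(storage.mode(dtm))              # "integer"

# An existing double matrix can also be converted in place
m <- mat.or.vec(3, 4)
storage.mode(m) <- "integer"
print(storage.mode(m))                # "integer"
```

This halves the footprint relative to doubles, but the matrix is still dense, so memory continues to grow with documents times dictionary size.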
