r - R: Out of memory. How can I loop over rows?

Question

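I have a data frame (myDF) with more than 700,000 rows; each row has two columns, id and text. The text column holds 140-character texts (tweets), and I want to run a sentiment analysis on them that I found on the web. However, whatever I try, I run into memory problems on a MacBook with 4 GB of RAM.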

I was thinking that I could loop over the rows, e.g. process the first 10, then the second 10, and so on. (I get problems even with batches of 100.) Would that solve the problem? What is the best way to loop like that?

I am posting my code here:

library(plyr)
library(stringr)

# function score.sentiment
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
   # Parameters
   # sentences: vector of text to score
   # pos.words: vector of words of positive sentiment
   # neg.words: vector of words of negative sentiment
   # .progress: passed to laply() to control the progress bar

   # create simple array of scores with laply
   scores = laply(sentences,
   function(sentence, pos.words, neg.words)
   {

      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)

      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)

      # get the position of the matched term or NA
      # we just want a TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)

      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
      }, pos.words, neg.words, .progress=.progress )

   # data frame with scores for each sentence
   scores.df = data.frame(text=sentences, score=scores)
   return(scores.df)
}

# import positive and negative words
pos = readLines("positive_words.txt")
neg = readLines("negative_words.txt")

# apply function score.sentiment


myDF$scores = score.sentiment(myDF$text, pos, neg, .progress='text')

score 5 · Accepted Answer

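4 GB sounds like enough memory to store 700,000 sentences of 140 characters each. A different way of computing the sentiment scores may be more efficient in both memory and time, and easier to break into chunks. Instead of processing each sentence separately, split the entire set of sentences into words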

words <- str_split(sentences, "\\s+")

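Then determine the number of words in each sentence, keep its running total (the index of each sentence's last word), and create a single vector of words

len <- cumsum(sapply(words, length))  # running total: position of each sentence's last word
words <- unlist(words, use.names=FALSE)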

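By re-using the variable words, the memory it used previously is freed for re-use (contrary to the advice of @cryo111, there is no need to call the garbage collector explicitly!). You can find out whether each word is in pos.words, without worrying about NAs, with words %in% pos.words. But we can be a little clever: calculate the cumulative sum of this logical vector, then subset the cumulative sum at the last word of each sentence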

cumsum(words %in% pos.words)[len]

and compute the number of matching words in each sentence as the difference between successive values

pos.match <- diff(c(0, cumsum(words %in% pos.words)[len]))
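The cumulative-sum trick is easier to follow on a tiny input. The sketch below is only an illustration: the three sentences and the two-word dictionary are made up, base strsplit() stands in for str_split() so that no package is needed, and the index of each sentence's last word is stored under the hypothetical name end.

```r
# Toy data; pos.words here is a hypothetical two-word dictionary.
sentences <- c("good day", "bad sad day", "good good fun")
pos.words <- c("good", "fun")

# Base-R strsplit() stands in for stringr::str_split().
words <- strsplit(sentences, "\\s+")
end   <- cumsum(lengths(words))           # index of each sentence's last word
words <- unlist(words, use.names = FALSE)

hits      <- words %in% pos.words         # TRUE/FALSE per word, never NA
running   <- cumsum(hits)                 # running count of positive words
at_end    <- running[end]                 # running count at each sentence end
pos.match <- diff(c(0, at_end))           # positive words per sentence

print(rbind(end = end, at_end = at_end, pos.match = pos.match))
```

Each entry of pos.match is the running count at the end of a sentence minus the running count at the end of the previous one, which is why the leading 0 is needed for the first sentence.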

This is the pos.match portion of the score. So

scores <- diff(c(0, cumsum(words %in% pos.words)[len])) - 
          diff(c(0, cumsum(words %in% neg.words)[len]))

That's it.

score_sentiment <-
    function(sentences, pos.words, neg.words)
{
    words <- str_split(sentences, "\\s+")
    len <- cumsum(sapply(words, length))  # index of each sentence's last word
    words <- unlist(words, use.names=FALSE)
    diff(c(0, cumsum(words %in% pos.words)[len])) - 
      diff(c(0, cumsum(words %in% neg.words)[len]))
}

The intention here is that this processes all of the sentences in a single pass

myDF$scores <- score_sentiment(myDF$text, pos, neg)
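Before running this on 700,000 rows, it may be worth checking the single-pass version against a plain per-sentence calculation on a few rows. The following is a sketch under assumptions: score_vec and score_ref are hypothetical names, the dictionaries are toy examples, and base strsplit() replaces str_split() to keep the block free of package dependencies.

```r
# Vectorised scorer, restated so this block is self-contained.
score_vec <- function(sentences, pos.words, neg.words) {
    words <- strsplit(sentences, "\\s+")
    end   <- cumsum(lengths(words))       # index of each sentence's last word
    words <- unlist(words, use.names = FALSE)
    diff(c(0, cumsum(words %in% pos.words)[end])) -
        diff(c(0, cumsum(words %in% neg.words)[end]))
}

# Reference implementation: score one sentence at a time.
score_ref <- function(sentences, pos.words, neg.words) {
    vapply(strsplit(sentences, "\\s+"), function(w) {
        as.numeric(sum(w %in% pos.words) - sum(w %in% neg.words))
    }, numeric(1), USE.NAMES = FALSE)
}

toy <- c("good good bad", "sad", "fun good", "nothing here")
pos <- c("good", "fun")
neg <- c("bad", "sad")

vec_scores <- score_vec(toy, pos, neg)
ref_scores <- score_ref(toy, pos, neg)
print(data.frame(text = toy, vectorised = vec_scores, reference = ref_scores))
```

If the two columns agree on a sample of real rows, the vectorised version can be used on the full column with more confidence.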

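This avoids for loops which, although not inherently inefficient compared to lapply and friends when implemented correctly (as @joran shows), are very inefficient compared to a vectorized solution. sentences is probably not copied here, and returning (just) the scores does not waste memory by returning information (the sentences) that we already know. The largest memory consumers will be sentences and words.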
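A rough way to check how much memory the sentences themselves need is to measure a small sample with object.size() and scale up. This is only a back-of-the-envelope sketch: the sample is random lowercase text rather than real tweets, and the names sample_text and est_mb are made up for the illustration.

```r
set.seed(42)
n_sample <- 1000

# Random 140-character strings as a stand-in for tweets.
sample_text <- vapply(seq_len(n_sample), function(i) {
    paste(sample(letters, 140, replace = TRUE), collapse = "")
}, character(1))

# Average bytes per string, scaled to 700,000 rows and reported in MB.
bytes_per_string <- as.numeric(object.size(sample_text)) / n_sample
est_mb <- bytes_per_string * 700000 / 2^20
print(round(est_mb))
```

The words vector adds a similar order of magnitude on top of this, and real tweets with multi-byte characters will take somewhat more, so the figure should be read as a lower bound rather than a measurement.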

If memory is still a problem, I would create an index that can be used to split the text into smaller groups, and calculate the scores for each group

nGroups <- 10 ## i.e., about 70k sentences / group
idx <- seq_along(myDF$text)
grp <- split(idx, cut(idx, nGroups, labels=FALSE))
scorel <- lapply(grp, function(i) score_sentiment(myDF$text[i], pos, neg))
myDF$scores <- unlist(scorel, use.names=FALSE)
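The chunked version should give exactly the same scores as the single pass, because cut() on a sequence of row indices produces contiguous groups and unlist() concatenates them in order. Here is a small check of that claim, assuming a hypothetical six-word vocabulary, 25 random sentences, and a scorer restated with base strsplit() so the block runs on its own.

```r
# Scorer restated for self-containment (hypothetical name score_vec).
score_vec <- function(sentences, pos.words, neg.words) {
    words <- strsplit(sentences, "\\s+")
    end   <- cumsum(lengths(words))
    words <- unlist(words, use.names = FALSE)
    diff(c(0, cumsum(words %in% pos.words)[end])) -
        diff(c(0, cumsum(words %in% neg.words)[end]))
}

set.seed(1)
vocab <- c("good", "fun", "bad", "sad", "day", "cat")
text  <- replicate(25, paste(sample(vocab, 4, replace = TRUE), collapse = " "))
pos   <- c("good", "fun")
neg   <- c("bad", "sad")

# Split the row indices into contiguous groups and score each group.
nGroups <- 4
idx     <- seq_along(text)
grp     <- split(idx, cut(idx, nGroups, labels = FALSE))
chunked <- unlist(lapply(grp, function(i) score_vec(text[i], pos, neg)),
                  use.names = FALSE)

whole <- score_vec(text, pos, neg)
print(all.equal(chunked, whole))
```

Only one group's words vector exists at a time, so peak memory drops roughly in proportion to nGroups while the result stays in the original row order.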

First make sure that myDF$text is actually character, e.g., myDF$text <- as.character(myDF$text)
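A text column can silently end up as a factor, for example when the data frame was built with stringsAsFactors = TRUE. The sketch below uses a made-up three-row data frame to show the check and the conversion; it also shows that base strsplit() refuses a factor outright.

```r
# Hypothetical data frame whose text column was created as a factor.
df <- data.frame(id = 1:3,
                 text = c("good day", "bad day", "good day"),
                 stringsAsFactors = TRUE)

was_factor <- is.factor(df$text)

# Base strsplit() raises an error for a factor; capture it instead of stopping.
split_on_factor <- tryCatch(strsplit(df$text, "\\s+"), error = function(e) e)

# Convert once, up front, then split as usual.
df$text  <- as.character(df$text)
split_ok <- strsplit(df$text, "\\s+")

print(c(was_factor = was_factor, is_character_now = is.character(df$text)))
```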

score 0 · Accepted Answer

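If I understand correctly, you want to apply a function to sets of 10 rows using a loop. Here is a general way to do that. First, create a list of sets of 10 rows using split. The sets are not in order, but that does not matter because you can re-order at the end if needed. Then apply your function inside a loop and append the results to the "out" object using rbind.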

x <- matrix(1:100, ncol=1)
parts.start <- split(1:100, 1:10)  # creates list: divide in 10 sets of 10 lines

out <- NULL
for (i in seq_along(parts.start)) {
    res <- x[parts.start[[i]], , drop=FALSE] * 2  # your function applied to elements of the list
    out <- rbind(out, res)
}
head(out)

     [,1]
[1,]    2
[2,]   22
[3,]   42
[4,]   62
[5,]   82
[6,]  102
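As the output shows, the rows come back interleaved (1, 11, 21, ...) rather than in their original order. One possible way to restore the order, and to avoid growing out with rbind on every iteration, is sketched below; the names parts, res, and out_sorted are made up for the illustration.

```r
x     <- matrix(1:100, ncol = 1)
parts <- split(1:100, 1:10)               # same interleaved sets of 10 rows

# Collect each chunk's result in a pre-allocated list, then bind once.
res <- vector("list", length(parts))
for (i in seq_along(parts)) {
    res[[i]] <- x[parts[[i]], , drop = FALSE] * 2
}
out <- do.call(rbind, res)

# Row k of out came from original row row_order[k]; sort back by that index.
row_order  <- unlist(parts, use.names = FALSE)
out_sorted <- out[order(row_order), , drop = FALSE]
head(out_sorted)
```

If contiguous blocks are preferred from the start, split(1:100, rep(1:10, each = 10)) gives rows 1 to 10, 11 to 20, and so on, and no re-ordering is needed.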
