r - Getting random internal selfref errors in data.table for R

Question

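I love data.table: it's fast and intuitive, what could be better? Alas, here's my problem: when referencing a data.table within a foreach() loop (using the doMC implementation), I occasionally get the following error (see the example in the appendix):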

Error in { : 
  Internal error: .internal.selfref prot is not itself an extptr

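One of the annoying things here is that I can't reproduce it with any consistency, but it happens during long (several-hour) tasks, so I'd like to make sure it never happens, if possible.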

Since I'm referencing the same data.table, DT, in each loop, I tried running the following command at the beginning of each loop:

setattr(DT,".internal.selfref",NULL)

...to remove the invalid/corrupt self-reference attribute. This works, and the internal selfref error is no longer thrown. It's a workaround, though.
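For reference, a minimal sketch of that workaround on a toy table. The helper name `reset_selfref` and the toy data are made up for illustration; only `setattr()` and the attribute name come from the command above.

```r
library(data.table)

# Toy table standing in for the DT used inside the loop (illustrative data)
DT <- data.table(id = 1:3, value = c(0.1, 0.5, 0.9))

# Hypothetical helper: drop the self-reference attribute by reference.
# setattr() with a NULL value removes the attribute without copying DT.
reset_selfref <- function(dt) {
  setattr(dt, ".internal.selfref", NULL)
  invisible(dt)
}

reset_selfref(DT)

# The attribute is gone, and read-only queries still work as usual
selfref_removed <- is.null(attr(DT, ".internal.selfref"))
total <- DT[, sum(value)]
```

This only hides the symptom at the start of each iteration; it says nothing about why the attribute became invalid in the first place.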

Any ideas for addressing the root problem?

Many thanks for any help!

Eric

Appendix: abbreviated R session info to confirm the latest versions:

R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
other attached packages:
 [1] data.table_1.8.8  doMC_1.3.0

Example using simulated data. You may need to run the history() function many times (several hundred, say) to get the error:

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Load packages and Prepare Data
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
require(data.table)
##this is the package we use for multicore
require(doMC)
##register n-2 of your machine's cores
##(parallel::detectCores() replaces the retired multicore package; never go below 1 core)
registerDoMC(max(1, parallel::detectCores() - 2))

## Build simulated data
value.a <- runif(500,0,1)
value.b <- 1-value.a
value <- c(value.a,value.b)
answer.opt <- c(rep("a",500),rep("b",500))
answer.id <- rep( 6000:6499 , 2)
question.id <- rep( sample(c(1001,1010,1041,1121,1124),500,replace=TRUE) ,2)
date <- rep( (Sys.Date() - sample.int(150, size=500, replace=TRUE)) , 2)
user.id <- rep( sample(250:350, size=500, replace=TRUE) ,2)
condition <- substr(as.character(user.id),1,1)
condition[which(condition=="2")] <- "x"
condition[which(condition=="3")] <- "y"

##Put everything in a data.table
DT.full <- data.table(user.id = user.id,
                      answer.opt = answer.opt,
                      question.id = question.id,
                      date = date,
                      answer.id = answer.id,
                      condition = condition,
                      value = value)

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Daily Aggregation Function
##
##a basic function that aggregates all the values from
##all users for every question on a given day:
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
each.day <- function(val.date){
  DT <- DT.full[ date < val.date ]

  #count the number of updates per user (for weighting)
  setkey(DT, question.id, user.id)
  DT <- DT[ DT[answer.opt=="a",length(value),by="question.id,user.id"] ]
  setnames(DT, "V1", "freq")

  #retain only the most recent value from each user on each question
  setkey(DT, question.id, user.id, answer.id)
  DT <- DT[ DT[ ,answer.id == max(answer.id), by="question.id,user.id", ][[3]] ]

  #now get a weighted mean (with freq) of the value for each question
  records <- lapply(unique(DT$question.id), function(q.id) {
    DT <- DT[ question.id == q.id ]
    probs <- DT[ ,weighted.mean(value,freq), by="answer.opt" ]
    return(data.table(q.id = rep(q.id,nrow(probs)),
                      ans.opt = probs$answer.opt,
                      date = rep(val.date,nrow(probs)),
                      value = probs$V1))
  })
  return(do.call("rbind",records))
}

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## foreach History Function 
##
##to aggregate across many days quickly
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
history <- function(start, end){
  #define a sequence of dates
  date.seq <- seq(as.Date(start),as.Date(end),by="day")

  #now run a foreach to get the history for each date
  hist <- foreach(day = date.seq,  .combine = "rbind") %dopar% {
    #setattr(DT,".internal.selfref",NULL) #resolves occasional internal selfref error
    each.day(val.date = day)
  }
}

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Examples
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

##aggregate only one day
each.day(val.date = "2012-12-13")

##generate a history
hist.example <- history (start = "2012-11-01", end = Sys.Date())

score 4 · Accepted Answer

Thanks for reporting. This is now fixed in v1.8.11. From NEWS:

In long-running computations where data.table is called many times repetitively, the following error could sometimes occur (#2647):
Internal error: .internal.selfref prot is not itself an extptr
Fixed. Thanks to EricStone, StevieP and JasonB for (difficult) reproducible examples.

A memory leak in grouping may have been related; it is now fixed as well.

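Long outstanding (usually small) memory leak in grouping fixed (#2648). When the last group was smaller than the largest group, the difference in those sizes was not being released. The same applied to non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed these, but anyone looping calls to grouping (such as when running in parallel, or benchmarking) may have suffered. Tests added. Thanks to many including vc273 and Y T.
Memory leak in data.table grouped assignment by reference
Slow memory leak in data.table when returning named lists in j (trying to reshape a data.table)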

score 2 · Accepted Answer

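A similar problem has been plaguing me for months. Perhaps we can see a pattern by putting our experiences together.

I've been waiting to post until I could create a reproducible example. So far that hasn't been possible. The bug doesn't occur at the same place in the code. In the past I've often been able to avoid the error simply by rerunning the exact same code. Other times I've reformulated an expression and rerun it with success. Either way, I'm pretty sure these errors are truly internal to data.table.

I saved the last 4 error messages in an attempt to detect a pattern (pasted below).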

---------------------------------------------------
[1] "err msg: location 1"
Error in selfrefok(x) : 
  Internal error: .internal.selfref prot is not itself an extptr
Calls: my.fun1 ... $<- -> $<-.data.table -> [<-.data.table -> selfrefok
Execution halted


---------------------------------------------------
[1] "err msg: location 1"
Error in alloc.col(newx) : 
  Internal error: .internal.selfref prot is not itself an extptr
Calls: my.fun1 -> $<- -> $<-.data.table -> copy -> alloc.col
Execution halted


---------------------------------------------------
[1] "err msg: location 2"
Error in shallow(x) : 
  Internal error: .internal.selfref prot is not itself an extptr
Calls: print ... do.call -> lapply -> as.list -> as.list.data.table -> shallow
Execution halted

---------------------------------------------------
[1] "err msg: location 3"
Error in shallow(x) : 
  Internal error: .internal.selfref prot is not itself an extptr
Calls: calc.book.summ ... .rbind.data.table -> as.list -> as.list.data.table -> shallow
Execution halted

Another similarity with the example above: I'm passing data.tables around among parallel threads, so they're being serialized/unserialized.
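To make the serialize/unserialize step concrete, here is a small sketch of the round trip a data.table goes through when it is shipped between processes. The toy table is illustrative, and taking a `copy()` afterwards is just one cautious way to get a table with a freshly allocated self-reference before modifying it by reference; it is an assumption on my part, not a confirmed fix for this bug.

```r
library(data.table)

# Illustrative table
DT <- data.table(id = 1:3, value = c(0.1, 0.5, 0.9))

# Round trip through R's serialization, as happens when an object is
# sent to or returned from a worker process
DT_roundtrip <- unserialize(serialize(DT, NULL))

# copy() makes a deep copy whose self-reference is allocated in this session
DT_fresh <- copy(DT_roundtrip)

# Modify the fresh copy by reference; the round-tripped table is untouched
DT_fresh[, doubled := value * 2]
```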

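I'll try the "setattr" fix mentioned above.

Hope this helps, and thanks, Jason

Here is a simplification of one of the code segments that seems to generate this error about once in every 50-100k runs.

Thanks, @MatthewDowle, by the way. data.table has been most useful. Here is a stripped-down bit of the code: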

require(data.table)
require(xts)

book <- data.frame(name='',
                   s=0,
                   Value=0.0,
                   x=0.0,
                   Qty=0)[0, ]

for (thing in list(1,2,3,4,5)) {

  tmp <- xts(1:5, order.by= make.index.unique(rep(Sys.time(), 5)))
  colnames(tmp) <- 'A'
  tmp <- cbind(coredata(tmp[nrow(tmp), 'A']),
               coredata(colSums(tmp[, 'A'])),
               coredata(tmp[nrow(tmp), 'A']))

  book <- rbind(book,
                data.table(name='ALPHA',
                           s=0*NA,
                           Value=tmp[1],
                           x=tmp[2],
                           Qty=tmp[3]))

}

Something like this seems to be the cause of this error:

Error in shallow(x) : 
  Internal error: .internal.selfref prot is not itself an extptr
Calls: my.function ... .rbind.data.table -> as.list -> as.list.data.table -> shallow
Execution halted

score 1 · Accepted Answer

To help reproduce the error, I have a script that can be used to track down where this bug is coming from. The error is:

Error in { : 
task 96 failed - "Internal error: .internal.selfref prot is not itself an extptr"
Calls: apply ... system.time -> apply -> FUN -> %dopar% -> <Anonymous>
Execution halted

I'm using doParallel to register my backend for foreach.
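For readers unfamiliar with that setup, a minimal sketch of a doParallel-backed foreach loop over a data.table looks roughly like this. The table, the column names and the choice of two cores are illustrative assumptions and have nothing to do with the MNIST script itself.

```r
library(data.table)
library(foreach)
library(doParallel)

# Register a parallel backend for foreach (2 workers, chosen arbitrarily)
registerDoParallel(cores = 2)

# Illustrative data: four groups of five values each
DT <- data.table(g = rep(1:4, each = 5), value = 1:20)

# Each task subsets and aggregates DT; the per-task results are
# data.tables that are sent back to the master process and row-bound there
res <- foreach(k = 1:4, .combine = rbind, .packages = "data.table") %dopar% {
  DT[g == k, .(g = k, total = sum(value))]
}

# Release any implicitly created cluster
stopImplicitCluster()
```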

Context: I'm testing classifiers on the MNIST handwritten digit dataset. You can get the data from me via

wget -nc https://www.dropbox.com/s/xr4i8gy11ed8bsh/digit_id_data_and_benchmarks.zip

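Be sure to modify the script (above) so that it correctly points to load_data.R, and modify load_data.R so that it correctly points to the MNIST data. Then run dt_centric_random_gov.R.

Sorry I couldn't create a minimal reproducible example, but as in @JasonB's answer, this error doesn't seem to show up until you've done a ton of calculations.

Edit: I re-ran my script using the workaround suggested above, and it seemed to run without issue.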
