r - 関数の高速化: 平均を計算する前に NA カウントをチェックする

Question

以下の関数は、ベクトルの平均を計算します。ただし、最初にベクトルに存在するの割合をチェックしNA、指定されたしきい値を超えている場合は、平均ではなく NA を返します。

私の問題は、現在の実装がかなり非効率的であることです。単純に実行するよりも 7 倍以上の時間がかかりますmean(vec, na.rm=TRUE)

を使用して別の方法を試しましたna.omitが、それはさらに遅くなります。

私のデータのサイズを考えると、シングルの実行にlapplyは 40 分以上かかります。

同じタスクをより迅速に達成する方法について何か提案はありますか?

更新 - RE: @thelatemail のソリューションと @Arun のコメント:

私はこの関数を数百のグループにわたって実行していますが、各グループのサイズはさまざまです。この質問で (元々) 提供されたサンプルデータは、人工データを簡単に作成できるように、適切なデータフレームとして提供されました。

混乱を避けるための代替サンプルデータ

# Sample Data 
# ------------
  set.seed(1)
  # slightly different sizes for each group
  N1 <- 5e3
  N2 <- N1 + as.integer(rnorm(1, 0, 100))

  # One group has only a moderate amount of NA's
  SAMP1 <- rnorm(N1)
  SAMP1[sample(N1, .25 * N1, FALSE)] <- NA  # add in NA's

  # Another group has many NA's
  SAMP2 <- rnorm(N2)
  SAMP2[sample(N2, .95 * N2, FALSE)] <- NA  # add in large number of NA's

  # put them all in a list
  SAMP.NEW <- list(SAMP1, SAMP2)

  # keep it clean
  rm(SAMP1, SAMP2)

# Execute 
# -------    
  lapply(SAMP.NEW, meanIfThresh)

オリジナルサンプルデータ、機能など

# Sample Data 
# ------------
  set.seed(1)
  rows <- 20000  # actual data has more than 7M rows
  cols <-  1000  

  SAMP <- replicate(cols, rnorm(rows))
  SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA  # add in NA's

  # Select 5 random rows, and have them be 90% NA
  tooSparse <- sample(rows, 5)
  for (r in tooSparse)
    SAMP[r, sample(cols, cols * .9, FALSE)] <- NA

# Function 
# ------------
    meanIfThresh <- function(vec, thresh=12/15) { 
      # Calculates the mean of vec, however, 
      #   if the number of non-NA values of vec is less than thresh, returns NA 

      # thresh : represents how much data must be PRSENT. 
      #          ie, if thresh is 80%, then there must be at least 


      len <- length(vec)

      if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
      # if the proportion of NA's is greater than the threshold, return NA
      # example:  if I'm looking at 14 days, and I have 12 NA's,
      #            my proportion is 85.7 % = (12 / 14)
      #            default thesh is  80.0 % = (12 / 15)
      #            Thus, 12 NAs in a group of 14 would be rejected


    # else, calculate the mean, removing NA's
    return(mean(vec, na.rm=TRUE))       
  }


  # Execute
  # -----------------
  apply(SAMP, 1, meanIfThresh)

  # Compare with `mean`
  #----------------
  plain    <- apply(SAMP, 1, mean, na.rm=TRUE)
  modified <- apply(SAMP, 1, meanIfThresh)

  # obviously different
  identical(plain, modified)
  plain[tooSparse]
  modified[tooSparse]


  microbenchmark( "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
                , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
                , times = 15L)

 #  With the actual data, the penalty is sevenfold
 #  Unit: seconds
 #           expr      min       lq   median       uq      max neval
 #   meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871    15
 # mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450    15

score 1 · Accepted Answer

関数内でベクトルvecを 2 回処理する必要がありますか? 最初に保存できればNA、計算が少し高速化される可能性があります。

meanIfThresh2 <- function(vec, thresh=12/15) { 

  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)

  return(sum(vec[!nas])/(len-Nna))
}

編集:この変更への影響を確認するために、同様のベンチマークを実行しました。

> microbenchmark(  "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
+                 , "meanIfThresh2"   = apply(SAMP, 1, meanIfThresh2)
+                 , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+                 , times = 15L)
Unit: seconds
           expr      min       lq   median       uq      max neval
   meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493    15
  meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028    15
 mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495    15

r - 関数の高速化: 平均を計算する前に NA カウントをチェックする

更新 - RE: @thelatemail のソリューションと @Arun のコメント:

混乱を避けるための代替サンプル データ

オリジナルサンプルデータ、機能など

2 に答える 2

Related

Reference

混乱を避けるための代替サンプルデータ