r - Applying a custom function row-wise to a data.table returns the wrong number of values

Question

I am new to data.table and have a table containing DNA genomic coordinates like this:

       chrom   pause strand coverage
    1:     1 3025794      +        1
    2:     1 3102057      +        2
    3:     1 3102058      +        2
    4:     1 3102078      +        1
    5:     1 3108840      -        1
    6:     1 3133041      +        1

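I wrote a custom function that I want to apply to each row of my roughly 2-million-row table. It uses mapToTranscripts from GenomicFeatures to retrieve two related values: a string (the transcript ID) and a new coordinate. I want to add them to the table as two new columns, like this: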

       chrom   pause strand coverage       transcriptID CDS
    1:     1 3025794      +        1 ENSMUST00000116652 196
    2:     1 3102057      +        2 ENSMUST00000116652  35
    3:     1 3102058      +        2 ENSMUST00000156816 888
    4:     1 3102078      +        1 ENSMUST00000156816 883
    5:     1 3108840      -        1 ENSMUST00000156816 882
    6:     1 3133041      +        1 ENSMUST00000156816 880

The function is the following:

    get_feature <- function(dt){

      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) 
      hit <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) 
      tx_id <- tx_names[as.character(seqnames(hit))] 
      cds_coordinate <- sapply(ranges(hit), '[[', 1)

      if(length(tx_id) == 0 || length(cds_coordinate) == 0) {  
        out <- list('NaN', 0)
      } else {
        out <- list(tx_id, cds_coordinate)
      }

      return(out)
    }

Then I run:

    counts[, c("transcriptID", "CDS"):=get_feature(.SD), by = .I]

and I get the warnings below, which indicate that the function is returning two lists shorter than the original table instead of one new element per row:

    Warning messages:
    1: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'transcriptID' (recycled leaving remainder of 774162 items).
    2: In `[.data.table`(counts, , `:=`(c("transcriptID", "CDS"),  ... :
      Supplied 1112452 items to be assigned to 1886614 items of column 'CDS' (recycled leaving remainder of 774162 items).

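I assumed that using the .I operator would apply the function row by row and return one value per row. I also used the if statement to make sure the function never returns an empty value.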

I then tried this mock version of the function:

    get_feature <- function(dt) {

      return('I should be returned once for each row')

    }

and called it like this:

    new.table <- counts[, get_feature(.SD), by = .I]

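This creates a data.table with a single row instead of one with the original length. So I concluded that my function, or perhaps the way I am calling it, is somehow collapsing the elements of the resulting vector. What am I doing wrong?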

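Update (with solution): As @StatLearner pointed out, it is explained in this answer that, as documented in ?data.table, .I is only intended for use in j (as in DT[i, j, by=]). Therefore by=.I is equivalent to by=NULL, and the proper syntax for grouping by row number and applying the function row by row is by=1:nrow(dt).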
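
To illustrate the difference between no grouping and grouping by row number, here is a minimal sketch. The table dt and the function mock_feature are made up for illustration and are not the data or function from the question; seq_len(nrow(dt)) is used as an equivalent of 1:nrow(dt) that also behaves correctly on an empty table.

```r
library(data.table)

# Toy table with made-up values (not the asker's data)
dt <- data.table(chrom = c(1L, 1L, 2L),
                 pause = c(3025794L, 3102057L, 3108840L))

# Hypothetical mock function: returns a single string per call
mock_feature <- function(sd) "returned once per call"

# No grouping: the function is called once, on the whole table
whole <- dt[, .(value = mock_feature(.SD))]

# Grouping by row number: the function is called once per row
per_row <- dt[, .(value = mock_feature(.SD)), by = seq_len(nrow(dt))]

nrow(whole)    # 1
nrow(per_row)  # 3
```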

Unfortunately, for my particular case this is far too slow: I measured a run time of 20 seconds for 100 rows. For my 36-million-row dataset it would take about 3 months to complete.
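
As a quick check of that estimate, the arithmetic can be written out as follows. It assumes the run time scales linearly with the number of rows, which is a simplification.

```r
# Measured: 20 seconds for 100 rows
secs_per_row <- 20 / 100

# Size of the full dataset as stated above
total_rows <- 36e6

# Projected run time in days, assuming linear scaling
total_days <- secs_per_row * total_rows / 86400
round(total_days, 1)  # 83.3, i.e. close to three months
```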

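In my case I had to give up on the row-wise approach and call mapToTranscripts on the entire table, as shown below. This takes a couple of seconds and is evidently the intended usage.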

    get_features <- function(dt){
      coordinate <- GRanges(dt$chrom, IRanges(dt$pause, width = 1), dt$strand) # define coordinate
      hits <- mapToTranscripts(coordinate, cds_canonical, ignore.strand = FALSE) # map it to a transcript
      tx_hit <- as.character(seqnames(hits)) # get transcript number
      tx_id <- tx_names[tx_hit] # get transcript name from translation table

      return(data.table('transcriptID' = tx_id,
                        'CDS_coordinate' = start(hits)))
    }

    density <- counts[, get_features(.SD)]

I then used mapFromTranscripts from the GenomicFeatures package to map back to the genome, so that I could use a data.table join to retrieve the information from the original table.
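
The join step might look roughly like the sketch below. It uses only data.table; the hits table stands in for the result of mapping back to genomic coordinates, and its values and transcript names (tx_A, tx_B) are made up for illustration. Rows of counts without a hit simply keep NA in the new columns, which avoids the length mismatch seen in the warnings above.

```r
library(data.table)

# Hypothetical input table (a few rows in the shape of the question's data)
counts <- data.table(chrom    = c(1L, 1L, 1L, 1L),
                     pause    = c(3025794L, 3102057L, 3102058L, 3108840L),
                     strand   = c("+", "+", "+", "-"),
                     coverage = c(1L, 2L, 2L, 1L))

# Hypothetical mapping result: only three of the four positions produced a hit
hits <- data.table(chrom        = c(1L, 1L, 1L),
                   pause        = c(3025794L, 3102057L, 3108840L),
                   strand       = c("+", "+", "-"),
                   transcriptID = c("tx_A", "tx_A", "tx_B"),
                   CDS          = c(196L, 35L, 882L))

# Update join: add the hit columns to counts by reference,
# matching on the genomic coordinate columns
counts[hits, on = c("chrom", "pause", "strand"),
       `:=`(transcriptID = i.transcriptID, CDS = i.CDS)]

counts
```

The third row has no match in hits, so its transcriptID and CDS are NA.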

score 4 · Accepted Answer

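If you need to apply a function to each row of a data.table, the way to do it is to group by row number:

    counts[, get_feature(.SD), by = 1:nrow(counts)]

As explained in this answer, .I is not intended for use in by, because it is meant to return the sequence of row indices produced by the grouping. The reason by = .I does not throw an error is that data.table creates an object .I equal to NULL in the data.table namespace, so by = .I is equivalent to by = NULL.

Note that when grouping by row number with by=1:nrow(dt), your function can access only one row of the data.table at a time: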

    require(data.table)
    counts <- data.table(chrom = sample.int(10, size = 100, replace = TRUE),
                         pause = sample((3 * 10^6):(3.2 * 10^6), size = 100),
                         strand = sample(c('-','+'), size = 100, replace = TRUE),
                         coverage = sample.int(3, size = 100, replace = TRUE))

    get_feature <- function(dt){
        coordinate <- data.frame(dt$chrom, dt$pause, dt$strand)
        rowNum <- nrow(coordinate)
        return(list(text = 'Number of rows in dt', rowNum = rowNum))
    }

    counts[, get_feature(.SD), by = 1:nrow(counts)]

produces a data.table with the same number of rows as counts, but coordinate contains only a single row of counts on each call:

       nrow                 text rowNum
    1:    1 Number of rows in dt      1
    2:    2 Number of rows in dt      1
    3:    3 Number of rows in dt      1
    4:    4 Number of rows in dt      1
    5:    5 Number of rows in dt      1

while by = NULL passes the entire data.table to the function:

    counts[, get_feature(.SD), by = NULL]

                       text rowNum
    1: Number of rows in dt    100

which is the way by is intended to work.
