r - Efficiently merging two data frames on a non-trivial criterion

Question

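While answering this question last night, I spent quite a while trying to find a solution that didn't grow a data.frame in a for loop, without success. I'd like to know whether there is a better way to approach this problem.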

The general case of the problem looks like this:

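Merge two data.frames
Entries in either data.frame can have 0 or more matching entries in the other.
Only entries that have 1 or more matches across both are of interest.
The match function is complex and involves multiple columns in both data.frames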

For a concrete example, I will use data similar to that in the linked question:

genes <- data.frame(gene       = letters[1:5], 
                    chromosome = c(2,1,2,1,3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                   chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                   position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

And a complex matching function:

# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome & 
    marker$position >= (gene$start - 10) &
    marker$position <= (gene$end + 10)
  )
}

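The output should look like an SQL INNER JOIN of the two data.frames, restricted to the entries for which isMatch is TRUE. I've tried to construct the two data.frames so that an entry in one can have 0 or more matches in the other.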
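
To make the "INNER JOIN where isMatch is TRUE" semantics concrete, here is a minimal sketch that evaluates the predicate on every (gene, marker) pair. The names `pairs`, `keep` and `joined_pairs` are made up for this illustration, and `isMatch` is written with the `position` column spelled out. It builds nrow(genes) * nrow(markers) candidate pairs, so it is only meant to pin down the expected result, not to be efficient.

```r
# Example data, copied from the question.
genes <- data.frame(gene       = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker     = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Matching criterion. It only uses vectorised operators, so it also works
# when `marker` and `gene` are equally long sets of rows.
isMatch <- function(marker, gene) {
  marker$chromosome == gene$chromosome &
    marker$position >= (gene$start - 10) &
    marker$position <= (gene$end + 10)
}

# Enumerate every (gene row, marker row) index pair, i.e. a cross join.
pairs <- expand.grid(g = seq_len(nrow(genes)), m = seq_len(nrow(markers)))

# Evaluate the predicate on all pairs at once and keep the matching ones.
keep <- isMatch(markers[pairs$m, ], genes[pairs$g, ])
joined_pairs <- cbind(genes[pairs$g[keep], ],
                      markers[pairs$m[keep], c("marker", "position")])
rownames(joined_pairs) <- NULL
joined_pairs
```

For the example data this keeps six (gene, marker) pairs, ordered by marker rather than by gene.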

The solution I came up with is as follows:

joined <- data.frame()
for (i in 1:nrow(genes)) {
   # This repeated subsetting returns the same results as `isMatch` applied across
   # the `markers` data.frame for each entry in `genes`.
   matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
   matches <- matches[which(matches$position >= (genes[i, "start"] - 10)),]
   matches <- matches[which(matches$position <= (genes[i, "end"] + 10)),]
   # matches may now be 0 or more rows, which we want to repeat the gene for:
   if(nrow(matches) != 0) {
     joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
   }
}

This gives the result:

   gene chromosome start end marker position
1     a          2   100 200      3       96
2     a          2   100 200      4      206
3     b          1   100 200      1      105
4     b          1   100 200      5      150
5     b          1   100 200      9      120
51    e          3   321 567      6      400

This is a very ugly and clunky solution, but everything else I tried ended in failure:

Using apply gave me a list in which each element was a matrix, with no way to rbind them.
I can't specify the dimensions of joined up front, because I don't know how many rows I will need in the end.
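
One common way around both obstacles is to build one small data.frame per gene with lapply and bind them all in a single do.call(rbind, ...) call. The sketch below assumes the example data from the question; `per_gene` and `joined_list` are names invented for the illustration, and the criterion is written inline instead of calling isMatch.

```r
# Example data, copied from the question.
genes <- data.frame(gene       = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker     = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# For each gene row, return the joined rows, or NULL if no marker matches.
per_gene <- lapply(seq_len(nrow(genes)), function(i) {
  gene <- genes[i, ]
  hit <- markers$chromosome == gene$chromosome &
    markers$position >= gene$start - 10 &
    markers$position <= gene$end + 10
  if (!any(hit)) return(NULL)  # NULL elements are dropped by rbind()
  # The single gene row is recycled across all matching marker rows.
  cbind(gene, markers[hit, c("marker", "position")])
})

# Bind all pieces in one call: nothing is grown inside a loop, and the
# final number of rows never has to be known in advance.
joined_list <- do.call(rbind, per_gene)
rownames(joined_list) <- NULL
joined_list
```

This still visits every gene once, so it mainly replaces the repeated rbind inside the loop; it does not reduce the number of comparisons.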

I'm sure I will run into problems of this general form again in the future. So what is the correct way to solve this kind of problem?

score 4 · Accepted Answer

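A data.table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second. The join on the first inequality will have more rows than the final result (and may therefore run into memory issues), but it will be smaller than the straight-up merge used in the merge-based answer.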

require(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
    ##join genes to markers
    markers, 
    ##rolling the last key column of genes_start (start_bound) forward
    ##to match the last key column of markers (position)
    roll = Inf, 
    ##inner join
    nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
#    chromosome gene start end marker position
# 1:          1    b   100 200      1      105
# 2:          1    b   100 200      9      120
# 3:          1    b   100 200      5      150
# 4:          2    a   100 200      3       96
# 5:          2    a   100 200      4      206
# 6:          3    e   321 567      6      400

You could also do a double join, but since it requires re-keying the data table before the second join, I don't think it will be faster than the vector scan solution above.

##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
    setkey(
        genes_start[
        markers, 
        roll = Inf, 
        nomatch = 0
    ], chromosome, gene, start_bound), 
    roll = -Inf, 
    nomatch = 0
]
setnames(new_2, "end_bound", "position")
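
As an alternative to the rolling joins (not part of the original answer), more recent data.table releases support non-equi joins through the `on` argument, which can express both inequalities in one step. The sketch below assumes data.table 1.9.8 or later and rebuilds the example data as data.tables; `end_bound` and `joined_dt` are names chosen for the illustration.

```r
library(data.table)

# Example data from the question, built directly as data.tables.
genes <- data.table(gene       = letters[1:5],
                    chromosome = c(2, 1, 2, 1, 3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.table(marker     = 1:10,
                      chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                      position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

# Pre-compute the padded bounds as separate columns to join on.
genes[, `:=`(start_bound = start - 10, end_bound = end + 10)]

# Non-equi inner join: same chromosome, start_bound <= position <= end_bound.
# i.position refers to the position column of `markers` (the i table).
joined_dt <- genes[markers,
                   .(chromosome, gene, start, end, marker, position = i.position),
                   on = .(chromosome, start_bound <= position, end_bound >= position),
                   nomatch = 0L]
joined_dt
```

Unlike a rolling join, which pairs each marker with at most one gene row, this form should return every gene whose padded interval contains the marker, which matters if genes on the same chromosome overlap.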

score 4 · Accepted Answer

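I have dealt with a very similar problem myself by doing the merge first and then sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution: if you are dealing with large datasets in which few entries match the condition, it will probably be inefficient. But to adapt it to your data: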

joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start -10) & joined.raw$position <= (joined.raw$end + 10),]
joined
#    chromosome gene start end marker position
# 1           1    b   100 200      1      105
# 2           1    b   100 200      5      150
# 4           1    b   100 200      9      120
# 10          2    a   100 200      4      206
# 11          2    a   100 200      3       96
# 16          3    e   321 567      6      400
