sql - R または SQL の間隔を使用したグループ参照ごとの最大 n

Question

以下で説明する（自明ではない）問題について説明しました。これは私の最初の投稿で、現在は修正版です。入力または提案された解決策は役に立ちます。

これにはいくつかの側面があります:小規模な問題の最適解の決定(既に以下にいくつかの提案があります)、時間(以下の data.table ソリューションはボックスをチェックしているようです)、およびメモリ管理です。問題は、あるテーブルで列挙され、別のテーブルのクラスターによって表されるタグに関するものです (同じストランドで 30bp 以内の場合は同じクラスター)。

課題は、特定のタグを適切な間隔に割り当てる効率的な手順を決定することです。つまり、タグ座標は開始位置、終了位置 (= 開始位置 + 1)、染色体 (完全なデータセットで 25 の値)、および鎖 (位置はプラス鎖またはマイナス鎖のいずれかにある) によって決定されます。二本鎖DNA）。したがって、クラスターは同じストランド上でオーバーラップしませんが、クラスターの間隔が異なるストランド上にある場合、クラスター座標がオーバーラップする可能性があり、事態が複雑になります。

これは 1 月 9 日の私の投稿の修正版であり、問題の本質的な難しさをよりよくカプセル化しています。小規模な問題を解決する高速なソリューションを後で示します。誰かが完全なデータセットで作業したい場合は、私に知らせてください. よろしくお願いします！

よろしく、

ニック・クラーク

背景この問題には、グループごとの間隔と最大 n が含まれます。クラスター化された遺伝子座標 (クラスター) と入力データ (タグ) を含む 2 つのテーブルがあります。クラスタテーブルには、タグテーブルからの同じストランドでカバーされた各重複しない間隔から合計されたタグが含まれます。完全なクラスターテーブルには 160 万行あります。タグテーブルは約 400 万行なので、ソリューションは理想的にはベクトル化する必要があります。サンプルデータについては、以下を参照してください。設定は、ヒトの転写開始点（CAGE）に関するデータセットです。

私は R で作業しているので、R または SQL に基づくソリューションを探しています。Rのplyrパッケージとsqldfパッケージの両方を使用して、以前に失敗した試みを行いました.

課題私が見逃しているのは、クラスター化されたテーブル内の行で、最大のタグの貢献度に関連付けられた入力データテーブルから開始座標を特定することです。

1) 異なるストランドからのクラスターは重複する座標を持つことができる、2) chr / chr_clst は 25 の異なる値を取ることができる (例には示されていません)、3) ソリューションはストランドと chr / chr_clst の両方を考慮する必要があることに注意してください。

私の理想的な解決策: ベクトル化された R コードまたは以下の SQL ステートメントの改善。メモリの問題を説明する以下のソリューションのバージョンは、うまくいくでしょう。クラスタテーブルから適切な行を効率的に決定する改善された SQL ステートメントと同様です。

これまでの状況これ は、これまでのところ明らかに最良の解決策です。コードについては user1935457 を、その後の修正案については mnel を参照してください。ここでの問題は、おもちゃの例からフルスケールテーブルに移動すると、メモリに対する過剰な要求が原因で R (および R Studio) の両方がクラッシュすることです。

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
       list(IDX=.I[which.max(tag_count)]),
       by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

最初に、R の sqldf パッケージを使用して基本的な SQL クエリを作成しました (このバージョンでは、最大値に関連付けられた座標ではなく、最大値が検出されます)。両方のテーブルに (できれば) 適切なインデックスを配置しているにもかかわらず、クエリの実行には永遠に時間がかかります。

output_tablename <- sqldf(c(
"create index ai1 on clusters(chr_clst, start_clst, end_clst, strand_clst)",
"create index ai2 on tags(chr, start, end, strand)",
"select a.chr_clst, a.start_clst, a.end_clst, a.strand_clst, sum(b.tags)
from main.clusters a
inner join main.tags b on a.chr_clst=b.chr and a.strand_clst = b.strand 
and b.end between a.start_clst and a.end_clst
group by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst
order by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst"
))

テーブル構造
クラスター: chr_clst、start_clst、end_clst、strand_clst、tags_clst。
タグ: chr、start、end、strand、tag_count。

R 形式のサンプルデータ 完全なデータセットで作業したい人がいたら教えてください。

クラスター:

chr_clst <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
start_clst <- c(568911, 569233, 569454, 569793, 569877, 569926, 569972, 570048, 570166, 713987)
end_clst <- c(568941, 569256, 569484, 569803, 569926, 569952, 569973, 570095, 570167, 714049)
strand_clst <- c("+", "+", "+", "+", "+", "-", "+", "+", "+", "-")
tags_clst <- c(37, 4, 6, 3, 80, 25, 1, 4, 1, 46)

clusters <- data.frame(cbind(chr_clst, start_clst, end_clst, strand_clst, tags_clst))
clusters$start_clst <- as.numeric(as.character(clusters$start_clst))
clusters$end_clst <- as.numeric(as.character(clusters$end_clst))
clusters$tags_clst <- as.numeric(as.character(clusters$tags_clst))
rm(chr_clst, start_clst, end_clst, start_clst, strand_clst, tags_clst)

タグ:

chr <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1")

start <- c(568911, 568912, 568913, 568913, 568914, 568916, 568917, 568918, 568929, 
568929, 568932, 568933, 568935, 568937, 568939, 568940, 568940, 569233, 569247, 
569255, 569454, 569469, 569471, 569475, 569483, 569793, 569802, 569877, 569880, 
569887, 569889, 569890, 569891, 569893, 569894, 569895, 569895, 569896, 569897, 
569898, 569898, 569899, 569900, 569900, 569901, 569901, 569903, 569905, 569906, 
569907, 569907, 569908, 569908, 569909, 569910, 569910, 569911, 569911, 569912, 
569914, 569914, 569915, 569916, 569917, 569918, 569919, 569920, 569920, 569925, 
569926, 569936, 569938, 569939, 569939, 569940, 569941, 569941, 569942, 569942, 
569943, 569944, 569948, 569948, 569951, 569972, 570048, 570057, 570078, 570094, 
570166, 713987, 713989, 713995, 714001, 714001, 714007, 714008, 714010, 714011, 
714011, 714011, 714013, 714015, 714015, 714017, 714018, 714019, 714023, 714025, 
714029, 714034, 714034, 714037, 714038, 714039, 714039, 714040, 714042, 714048, 
714048)

end <- c(568912, 568913, 568914, 568914, 568915, 568917, 568918, 568919, 568930, 
568930, 568933, 568934, 568936, 568938, 568940, 568941, 568941, 569234, 569248,
569256, 569455, 569470, 569472, 569476, 569484, 569794, 569803, 569878, 569881, 
569888, 569890, 569891, 569892, 569894, 569895, 569896, 569896, 569897, 569898, 
569899, 569899, 569900, 569901, 569901, 569902, 569902, 569904, 569906, 569907, 
569908, 569908, 569909, 569909, 569910, 569911, 569911, 569912, 569912, 569913, 
569915, 569915, 569916, 569917, 569918, 569919, 569920, 569921, 569921, 569926, 
569927, 569937, 569939, 569940, 569940, 569941, 569942, 569942, 569943, 569943, 
569944, 569945, 569949, 569949, 569952, 569973, 570049, 570058, 570079, 570095, 
570167, 713988, 713990, 713996, 714002, 714002, 714008, 714009, 714011, 714012, 
714012, 714012, 714014, 714016, 714016, 714018, 714019, 714020, 714024, 714026, 
714030, 714035, 714035, 714038, 714039, 714040, 714040, 714041, 714043, 714049, 
714049)

strand <- c("+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "+", "+", "+", "+", "+", "+", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-")

tag_count <- c(1, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 1, 2, 1, 6, 2, 8, 1, 1, 2, 1, 1, 2, 
1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 4, 4, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 4, 2, 4, 2, 4, 
1, 1, 1, 1, 3, 2, 1, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 1, 5, 1, 2, 1, 2, 1, 1, 2, 2, 
4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 3, 2, 4, 2, 1, 1, 1, 
2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 2)

tags <- data.frame(cbind(chr, start, end, strand, tag_count))    
tags$start <- as.numeric(as.character(tags$start))
tags$end <- as.numeric(as.character(tags$end))
tags$tag_count <- as.numeric(as.character(tags$tag_count))
rm(chr, start, end, strand, tag_count)

score 2 · Accepted Answer

data.tableパッケージでの試みは次のとおりです。

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
           list(IDX=.I[which.max(tag_count)]),
           by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

最後の行の出力:

     chr strand start_clst end_clst tags_clst  start    end tag_count
 1: chr1      -     569926   569952        25 569942 569943         4
 2: chr1      -     713987   714049        46 714011 714012         4
 3: chr1      +     568911   568941        37 568940 568941         8
 4: chr1      +     569233   569256         4 569255 569256         2
 5: chr1      +     569454   569484         6 569471 569472         2
 6: chr1      +     569793   569803         3 569793 569794         2
 7: chr1      +     569877   569926        80 569925 569926         5
 8: chr1      +     569972   569973         1 569972 569973         1
 9: chr1      +     570048   570095         4 570048 570049         1
10: chr1      +     570166   570167         1 570166 570167         1

編集

以下のコメントで説明されているメモリの問題に基づいて、別の試みがあります。パッケージを使用してintervals、2 つのテーブル間の重複する間隔を見つけます。forループを並列化して速度を上げることもできます。

require(data.table)
require(intervals)
clusters <- data.table(clusters)
tags <- data.table(tags)

#  Find all unique combinations of chr and strand...
setkey(clusters, chr_clst, strand_clst)
setkey(tags, chr, strand)

unique.keys <- unique(rbind(clusters[, key(clusters), with=FALSE],
                            tags[, key(tags), with=FALSE], use.names=FALSE))

# ... and then work on each pair individually to avoid creating
# enormous objects in memory
result.list <- vector("list", nrow(unique.keys))
for(i in seq_len(nrow(unique.keys))) {
  tmp.clst <- clusters[unique.keys[i]]
  tmp.tags <- tags[unique.keys[i]]

  # Keep track of each row for later
  tmp.clst[, row.id := seq_len(nrow(tmp.clst))]
  tmp.tags[, row.id := seq_len(nrow(tmp.tags))]

  # Use intervals package to find all overlapping [start, end] 
  # intervals between the two tables
  clst.intervals <- Intervals(tmp.clst[, list(start_clst, end_clst)],
                              type = "Z")
  tags.intervals <- Intervals(tmp.tags[, list(start, end)],
                              type = "Z")
  rownames(tags.intervals) <- tmp.tags$row.id

  # This goes to C++ code in intervals package; 
  # I didn't spend too much time looking over how it works
  overlaps <- interval_overlap(tags.intervals,
                               clst.intervals,
                               check_valid = FALSE)

  # Retrieve rows from clusters table with overlaps and add a column
  # indicating which intervals in tags table they overlapped with
  matches <- lapply(as.integer(names(overlaps)), function(n) {
    ans <- tmp.clst[overlaps[[n]]]
    ans[, match.in.tags := n]
  })

  # List back to one table...
  matches <- rbindlist(matches)

  # ... and join each match from tags to its relevant row from tags
  setkey(matches, match.in.tags)
  setkey(tmp.tags, row.id)

  # add the rows for max of tag_count by start_clst and
  # end_clst from this particular unique key to master list...
  result.list[[i]] <- tmp.tags[matches][, .SD[which.max(tag_count)],
                                        by = list(start_clst, end_clst)]
}

# and concatenate master list into none table,
# getting rid of the helper columns
rbindlist(result.list)[, c("row.id", "row.id.1") := NULL][]

最後の行は次のとおりです。

    start_clst end_clst  chr strand  start    end tag_count chr_clst strand_clst tags_clst
 1:     569926   569952 chr1      - 569942 569943         4     chr1           -        25
 2:     713987   714049 chr1      - 714011 714012         4     chr1           -        46
 3:     568911   568941 chr1      + 568940 568941         8     chr1           +        37
 4:     569233   569256 chr1      + 569255 569256         2     chr1           +         4
 5:     569454   569484 chr1      + 569471 569472         2     chr1           +         6
 6:     569793   569803 chr1      + 569793 569794         2     chr1           +         3
 7:     569877   569926 chr1      + 569925 569926         5     chr1           +        80
 8:     569972   569973 chr1      + 569972 569973         1     chr1           +         1
 9:     570048   570095 chr1      + 570048 570049         1     chr1           +         4
10:     570166   570167 chr1      + 570166 570167         1     chr1           +         1

score 2 · Accepted Answer

他の回答やコメントに基づいて、いくつかのヒントを提供するための簡単な回答です。

X[Y](または) が(たとえば)merge(X,Y)より大きい多数の行を返す場合(つまり、サブセットが続く) は役に立ちません。最終結果ははるかに小さくなりますが、最初に大きなものを作成する必要があります。max(nrow(X),nrow(Y))nrow(X)*nrow(Y)X[Y][where]X[Y]X[Y]

範囲が必要な場合、1 つの方法はw = X[Y,roll=TRUE,which=TRUE]またはw=X[Y,mult="first",which=TRUE]そのようなもので、おそらく最初と最後で 2 回です。w各範囲の行の位置 ( ) を取得したら、最初と最後の間で、結果を選択できますseq。vecseqこのタグの他の SO の質問にはいくつかの例があります。もちろん、これを data.table に組み込むとよいでしょう。また、結合列自体を、列ごとの行ごとの範囲クエリの境界を含む 2 つの列リスト列にすることを提案する機能要求があります。

または、by-without-byを使用できます。これは、句がない場合jの行ごとに評価される場所です。by-without-by を検索して、例を参照してください。これが、最初にデカルト結果全体を実際に作成することなく、デカルトからサブセットの考え方に固執できる方法です。次のようなもの: . ただし、 3 つのベクトルスキャン ( 、および) があるため、おそらくorアプローチよりも遅くなります。しかし、少なくとも大きなメモリ割り当てを回避できます。プレフィックスを使用して、非結合列を明示的に参照できることに注意してください。data.tableの関数は、C でコーディングして、より効率的に行うことができます。iby?data.tableX[Y,.SD[start<=i.value & i.value<=end]]rollmult&<=<=i.ibetweenclampRcpp で機能します。しかし、現在between()は R ベクトルスキャンとして記述されているため、同じように低速です。

それが役立つことを願っています。私は現在の考えを説明しようとしましたが、それが正しいか間違っているかはわかりません。

そして、data.table を改善して、コメントに記載されているようにいくつかのヒントを提供する優雅なエラーでデカルト割り当てをキャッチします [編集: allow.cartesian=FALSEv1.8.7 で追加されました]。ありがとう！

パラグラフ 2 を拡張するには:

setkey(clusters,chr,strand,end_clst)
setkey(tags,chr,strand,end)

begin = tags[clusters[,list(chr,strand,start_clst)],roll=-Inf,mult="first",which=TRUE]
end = tags[clusters[,list(chr,strand,end_clst)],roll=+Inf,mult="last",which=TRUE]

idx = mapply(function(x,y){.i=seq.int(x,y); .i[ which.max(tags$tag_count[.i]) ]}, begin, end)
cbind(clusters, tags[idx])
     chr start_clst end_clst strand tags_clst  chr  start    end strand tag_count
 1: chr1     569926   569952      -        25 chr1 569942 569943      -         4
 2: chr1     713987   714049      -        46 chr1 714011 714012      -         4
 3: chr1     568911   568941      +        37 chr1 568940 568941      +         8
 4: chr1     569233   569256      +         4 chr1 569255 569256      +         2
 5: chr1     569454   569484      +         6 chr1 569471 569472      +         2
 6: chr1     569793   569803      +         3 chr1 569793 569794      +         2
 7: chr1     569877   569926      +        80 chr1 569925 569926      +         5
 8: chr1     569972   569973      +         1 chr1 569972 569973      +         1
 9: chr1     570048   570095      +         4 chr1 570048 570049      +         1
10: chr1     570166   570167      +         1 chr1 570166 570167      +         1

これにより、他の回答やコメントで言及されているカルトのメモリ割り当ての問題が回避されます。v1.8.7 で次の新機能を使用します。

TRUEo /FALSEに加えてroll、正の数 (ロールフォワード/LOCF) または負の数 (ロールバック/NOCB) を指定できるようになりました。有限数は、値が転がる距離を制限します (限られた古さ)。roll=TRUEとroll=+Inf同等です。
rollendsは、2 つの論理を保持する新しいパラメーターです。がの場合、最初の観測が逆方向にロールバックされrollends[1]ますTRUE。がの場合、最後の観測がロールフォワードされrollends[2]ますTRUE。rollが有限数の場合、同じ制限が両端に適用されます。

score 1 · Accepted Answer

を使用した 1 つの提案を次に示しますapply。

transform(
  clusters,
  start = apply(clusters[c("chr_clst", "start_clst", "end_clst", "strand_clst")],
                1, function(x) {
                     tmp <- tags[tags$start >= as.numeric(x[2]) &
                                 tags$end <= as.numeric(x[3]) & 
                                 tags$chr == x[1] & 
                                 tags$strand == x[4], c("tag_count", "start")]
                     tmp$start[which.max(tmp$tag_count)]}))

基本的に、関数の各行について、関連するのサブセット内のclustersの最大値を探します。の適切な値が選択されます。これらの値の新しいベクトルは、の新しい列として使用されます。tag_counttagsstartstartclusters

結果：

   chr_clst start_clst end_clst strand_clst tags_clst  start
1      chr1     568911   568941           +        37 568940
2      chr1     569233   569256           +         4 569255
3      chr1     569454   569484           +         6 569471
4      chr1     569793   569803           +         3 569793
5      chr1     569877   569926           +        80 569925
6      chr1     569926   569952           -        25 569942
7      chr1     569972   569973           +         1 569972
8      chr1     570048   570095           +         4 570048
9      chr1     570166   570167           +         1 570166
10     chr1     713987   714049           -        46 714011

score 1 · Accepted Answer

これは、以来利用可能なfoverlaps()関数を使用して非常に効率的に行うことができます:data.tablev1.9.4

require(data.table) #v1.9.4+
setDT(clusters, key=c("chr_clst", "strand_clst", "start_clst", "end_clst"))
setDT(tags, key=c("chr", "strand", "start", "end"))

ans = foverlaps(clusters, tags)[, .SD[which.max(tag_count)], by=.(chr_clst, strand_clst, start_clst, end_clst)]

#     chr_clst strand_clst start_clst end_clst  start    end tag_count tags_clst
#  1:     chr1           -     569926   569952 569942 569943         4        25
#  2:     chr1           -     713987   714049 714011 714012         4        46
#  3:     chr1           +     568911   568941 568940 568941         8        37
#  4:     chr1           +     569233   569256 569255 569256         2         4
#  5:     chr1           +     569454   569484 569471 569472         2         6
#  6:     chr1           +     569793   569803 569793 569794         2         3
#  7:     chr1           +     569877   569926 569925 569926         5        80
#  8:     chr1           +     569972   569973 569972 569973         1         1
#  9:     chr1           +     570048   570095 570048 570049         1         4
# 10:     chr1           +     570166   570167 570166 570167         1         1

foverlaps()また、新しい(v1.9.6+) 引数と同様に、最初にキーを設定しなくても、最終的に重複する範囲結合を実行できるようになります。on=

sql - R または SQL の間隔を使用したグループ参照ごとの最大 n

4 に答える 4

編集

Related

Reference