r - How to split a data.table by group and subset based on occurrences within a column?

Question

I have a large 287046 x 18 dataset that looks like this (partial representation only):

tdf
         geneSymbol     peaks
16         AK056486 Pol2_only
13         AK310751   no_peak
7          BC036251   no_peak
10         DQ575786   no_peak
4          DQ597235   no_peak
5          DQ599768   no_peak
11         DQ599872   no_peak
12         DQ599872   no_peak
2           FAM138F   no_peak
15           FAM41C   no_peak
34116         GAPDH      both
283034        GAPDH Pol2_only
6      LOC100132062   no_peak
9      LOC100133331   no_peak
14     LOC100288069      both
8            M37726   no_peak
3             OR4F5   no_peak
17           SAMD11      both
18           SAMD11      both
19           SAMD11      both
20           SAMD11      both
21           SAMD11      both
22           SAMD11      both
23           SAMD11      both
24           SAMD11      both
25           SAMD11      both
1            WASH7P Pol2_only

What I want to do is extract the geneSymbols that are (1) either "Pol2_only" or "both", and (2) only "Pol2_only" and never "both". For example, GAPDH satisfies condition 1 but not condition 2.

I tried plyr with something like this (there are extra conditions in there; please ignore them):

## grab genes with both peaks 
pol2.peaks <- ddply(filem, .(geneSymbol), function(dfrm) subset(dfrm, peaks == "both" | (peaks == "Pol2_only" & peaks == "CBP20_only")), .parallel=TRUE)

## grab genes pol2 only peaks 
pol2.only.peaks <- ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"), .parallel=TRUE)

But it takes a long time and still returns the wrong answer. For example, the result for 2 is:

pol2.only.peaks
  geneSymbol     peaks
1   AK056486 Pol2_only
2      GAPDH Pol2_only
3     WASH7P Pol2_only

As you can see, GAPDH should not be there. My data.table implementation (the approach I would much prefer) gives the same result:

filem.dt <- as.data.table(tdf)
setkey(filem.dt, "geneSymbol")
test.dt <- filem.dt[ , .SD[ peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only"]]
test.dt
   geneSymbol     peaks
1:   AK056486 Pol2_only
2:      GAPDH Pol2_only
3:     WASH7P Pol2_only

The problem seems to be that the subsetting works row by row, whereas it needs to be applied to each geneSymbol group as a whole.
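To make the row-wise versus group-wise distinction concrete, here is a minimal base R sketch. The `toy` data frame is a hypothetical three-row excerpt, not the real data, and `ave()` with `FUN = all` is just one way to spread a per-group verdict back onto the rows.

```r
# Hypothetical three-row excerpt, for illustration only
toy <- data.frame(
  geneSymbol = c("GAPDH", "GAPDH", "WASH7P"),
  peaks      = c("both", "Pol2_only", "Pol2_only"),
  stringsAsFactors = FALSE
)

# Row-wise test: each row is judged on its own, so GAPDH's second row passes
row_wise <- toy$peaks == "Pol2_only"

# Group-wise test: a row passes only if every row of its gene passes
group_wise <- ave(toy$peaks == "Pol2_only", toy$geneSymbol, FUN = all)

toy[row_wise, ]    # still contains a GAPDH row
toy[group_wise, ]  # only WASH7P remains
```

The first subset reproduces the unwanted behaviour described above; the second keeps a gene only when all of its rows meet the condition.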

Can you help me subset by group? A data.table solution would be fastest, but plyr (or base R) is fine too and welcome. A solution that adds an extra column noting the nature of the peak would be perfect. Here is what I mean:

tdf
         geneSymbol     peaks      newCol
16         AK056486 Pol2_only   Pol2_only
13         AK310751   no_peak     no_peak
7          BC036251   no_peak     no_peak
10         DQ575786   no_peak     no_peak
4          DQ597235   no_peak     no_peak
5          DQ599768   no_peak     no_peak
11         DQ599872   no_peak     no_peak
12         DQ599872   no_peak     no_peak
2           FAM138F   no_peak     no_peak
15           FAM41C   no_peak     no_peak
34116         GAPDH      both        both
283034        GAPDH Pol2_only        both
6      LOC100132062   no_peak     no_peak
9      LOC100133331   no_peak     no_peak
14     LOC100288069      both        both
8            M37726   no_peak     no_peak
3             OR4F5   no_peak     no_peak
17           SAMD11      both        both
18           SAMD11      both        both
19           SAMD11      both        both
20           SAMD11      both        both
21           SAMD11      both        both
22           SAMD11      both        both
23           SAMD11      both        both
24           SAMD11      both        both
25           SAMD11      both        both
1            WASH7P Pol2_only   Pol2_only

Notice that GAPDH is now "both" in both of its rows. Here is the data:

dput(tdf)
structure(list(geneSymbol = c("AK056486", "AK310751", "BC036251", 
"DQ575786", "DQ597235", "DQ599768", "DQ599872", "DQ599872", "FAM138F", 
"FAM41C", "GAPDH", "GAPDH", "LOC100132062", "LOC100133331", "LOC100288069", 
"M37726", "OR4F5", "SAMD11", "SAMD11", "SAMD11", "SAMD11", "SAMD11", 
"SAMD11", "SAMD11", "SAMD11", "SAMD11", "WASH7P"), peaks = c("Pol2_only", 
"no_peak", "no_peak", "no_peak", "no_peak", "no_peak", "no_peak", 
"no_peak", "no_peak", "no_peak", "both", "Pol2_only", "no_peak", 
"no_peak", "both", "no_peak", "no_peak", "both", "both", "both", 
"both", "both", "both", "both", "both", "both", "Pol2_only")), .Names = c("geneSymbol", 
"peaks"), row.names = c(16L, 13L, 7L, 10L, 4L, 5L, 11L, 12L, 
2L, 15L, 34116L, 283034L, 6L, 9L, 14L, 8L, 3L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 1L), class = "data.frame")

Thank you!

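EDIT: I found a workaround. The selection was being done row by row, and all it needed was a small hack: requiring that every value in the returned logical vector be TRUE. So here is what I did in the plyr function: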

ddply(tdf, .(geneSymbol), function(dfrm) subset(dfrm, all(peaks != "both" & peaks == "Pol2_only" & peaks != "CBP20_only")), .parallel=TRUE)
  geneSymbol     peaks
1   AK056486 Pol2_only
2     WASH7P Pol2_only

Note the use of all() around the condition. The result is now as expected, i.e. the "Pol2_only"-only (redundancy alert) genes :) What remains is the implementation in data.table. Any help?

I am not posting this as an answer to my own question, in the hope that someone will come up with a better solution in data.table.
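One possible data.table counterpart of the all() workaround is to return `.SD` from `j` only when the group-level condition holds. The sketch below assumes a reduced copy of the question's data; the names `tdf_small`, `dt_small`, `pol2_only` and `pol2_any` are made up for illustration.

```r
library(data.table)

# Reduced copy of the question's tdf (a few representative rows only)
tdf_small <- data.frame(
  geneSymbol = c("AK056486", "AK310751", "GAPDH", "GAPDH",
                 "SAMD11", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "no_peak", "both", "Pol2_only",
                 "both", "both", "Pol2_only"),
  stringsAsFactors = FALSE
)
dt_small <- as.data.table(tdf_small)

# Condition 2: keep a gene only when every one of its rows is "Pol2_only".
# Groups for which the if() is FALSE return NULL and are dropped.
pol2_only <- dt_small[, if (all(peaks == "Pol2_only")) .SD, by = geneSymbol]

# Condition 1: keep a gene when any of its rows is "Pol2_only" or "both"
pol2_any <- dt_small[, if (any(peaks %in% c("Pol2_only", "both"))) .SD,
                     by = geneSymbol]
```

On this reduced data `pol2_only` holds AK056486 and WASH7P, while `pol2_any` also keeps the GAPDH and SAMD11 rows.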

score 3 · Accepted Answer

A data.table solution, as you requested.

# set the key to geneSymbol and peaks
TDF <- data.table(tdf, key = c('geneSymbol','peaks'))

# use unique to get unique combinations, then for each geneSymbol get the first
# match (we have keyed by peaks, so both < Pol2_only < no_peak within each
# geneSymbol),
# then exclude those with peaks == "no_peak"

unique(TDF)[.(unique(geneSymbol)), mult = 'first'][!peaks =='no_peak']

#      geneSymbol     peaks
# 1:     AK056486 Pol2_only
# 2:        GAPDH      both
# 3: LOC100288069      both
# 4:       SAMD11      both
# 5:       WASH7P Pol2_only
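The keyed approach above depends on how the key sorts the values of peaks. As far as I know, current data.table versions sort keys in C-locale, where uppercase letters come before lowercase ones, so it is worth checking on your installation that "both" really sorts ahead of "Pol2_only". As a hedged alternative that does not rely on sort order, the sketch below states the priority explicitly and also adds the newCol column the question asked for. The names `dt_small`, `priority` and `gene_class` are hypothetical, and the data is a reduced copy of the question's tdf.

```r
library(data.table)

# Reduced copy of the question's data (a few representative rows only)
dt_small <- data.table(
  geneSymbol = c("AK056486", "AK310751", "GAPDH", "GAPDH",
                 "SAMD11", "SAMD11", "WASH7P"),
  peaks      = c("Pol2_only", "no_peak", "both", "Pol2_only",
                 "both", "both", "Pol2_only")
)

# Explicit priority: the earliest entry found within a gene wins.
# Assumption: peaks values outside this vector would yield NA in newCol.
priority <- c("both", "Pol2_only", "no_peak")

# Label every row with the highest-priority peak type of its gene
dt_small[, newCol := priority[min(match(peaks, priority))], by = geneSymbol]

# One row per gene, dropping genes that only ever have "no_peak"
gene_class <- unique(dt_small[newCol != "no_peak", .(geneSymbol, newCol)])
```

Here both GAPDH rows get newCol equal to "both", and `gene_class` lists AK056486 and WASH7P as "Pol2_only" alongside GAPDH and SAMD11 as "both".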
