r - Downsampling a dataset

Question

I have a dataset that is a large character vector (1,024,459 elements) made up of gene IDs. It looks like this:

> length(allres)
[1] 1024459
> allres[1:10]
[1] "1"   "1"   "1"   "1"   "1"   "1"   "1"   "10"  "10"  "100"

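Here, each gene ID is repeated once for every time it was seen in the RNA-seq run (so in this example there were 7 reads for gene "1" and 2 reads for gene "10"). I want to plot the number of genes identified against the number of reads, in steps of 10,000 reads, to see how many genes are identified when I randomly sample 10,000 reads, 20,000, 30,000, and so on. My seq() call looks like this: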

> gaps <- seq(10000, length(allres), by=10000)

However, I don't know how to apply it to the allres vector and plot the result. Any help is much appreciated.

score 1 · Accepted Answer

So what you probably want is something like this:

gaps <- seq(10000, length(allres), by = 10000)

lapply(gaps, function(x) {
    # Count how often each gene ID appears in a random sample of
    # x reads drawn from allres (x is the sample size itself, not an index).
    aggregated_sample <- table(sample(allres, size = x))

    # Plotting code for the sample goes here. x is the number of reads,
    # so you can even use it in the title.
    # Remember to include code to save the plot to disk if you want to keep it.
    return(TRUE)
})

Of course, if you are using ggplot2 for plotting, you can also store the plot as an object and use return(plot) instead of return(TRUE), so that you can tweak or inspect it further afterwards.
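If the goal is a single curve of genes identified versus reads sampled, rather than one plot per sample, the same sampling idea can be reduced to one number per depth. The following is only a minimal sketch under stated assumptions: the toy allres vector is made-up stand-in data, genes_detected is a name introduced here for illustration, and a gene is counted as "identified" if it has at least one read in the sample.

```r
# Toy stand-in for the real `allres` (hypothetical data, not from the question):
# one gene ID per read, stored as a character vector.
set.seed(42)  # make the random draws reproducible
allres <- as.character(sample(1:500, size = 50000, replace = TRUE))

# Sampling depths: 10,000, 20,000, ... up to the length of allres.
gaps <- seq(10000, length(allres), by = 10000)

# For each depth, draw that many reads without replacement and
# count how many distinct gene IDs appear in the sample.
genes_detected <- vapply(
  gaps,
  function(n) length(unique(sample(allres, size = n))),
  integer(1)
)

# One point per sampling depth.
plot(gaps, genes_detected, type = "b",
     xlab = "Number of reads sampled",
     ylab = "Number of genes identified")
```

Each depth is sampled independently here, so the curve can wobble slightly from run to run; repeating the sampling several times per depth and averaging would smooth it out.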
