optimization - Rを使用して文字列を分割し、文字をカウントするより速い方法は?

Question

FASTA ファイルから読み込まれた DNA 文字列の GC コンテンツを計算するより高速な方法を探しています。これは、文字列を取得して、文字「G」または「C」が出現する回数を数えることに要約されます。考慮する文字の範囲も指定したいと思います。

かなり遅い作業関数があり、コードでボトルネックを引き起こしています。次のようになります。

##
## count the number of GCs in the characters between start and stop
##
gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}

Rprof を実行すると、次の出力が得られます。

> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")

                   self.time self.pct total.time total.pct
"gcCount"          77.36     76.8     100.74     100.0
"=="               18.30     18.2      18.30      18.2
"strsplit"          3.58      3.6       3.64       3.6
"+"                 1.14      1.1       1.14       1.1
":"                 0.30      0.3       0.30       0.3
"as.logical"        0.04      0.0       0.04       0.0
"as.character"      0.02      0.0       0.02       0.0

$by.total
               total.time total.pct self.time self.pct
"gcCount"          100.74     100.0     77.36     76.8
"=="                18.30      18.2     18.30     18.2
"strsplit"           3.64       3.6      3.58      3.6
"+"                  1.14       1.1      1.14      1.1
":"                  0.30       0.3      0.30      0.3
"as.logical"         0.04       0.0      0.04      0.0
"as.character"       0.02       0.0      0.02      0.0

$sampling.time
[1] 100.74

このコードを高速化するためのアドバイスはありますか?

score 14 · Accepted Answer

まったく分割しない方が良いです。マッチを数えるだけです:

gcCount2 <-  function(line, st, sp){
  sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}

それは桁違いに高速です。

文字を繰り返すだけの小さな C 関数は、さらに桁違いに高速になります。

score 6 · Accepted Answer

6

ワンライナー：

table(strsplit(toupper(a), '')[[1]])

于 2010-03-15T17:37:32.003 に答える

score 4 · Accepted Answer

それよりも速いかどうかはわかりませんが、R パッケージの seqinR を確認することをお勧めします - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng。これは、配列分析のための多くの方法を備えた優れた一般的なバイオインフォマティクスパッケージです。それはCRANにあります（これを書いているときはダウンしているようです）。

GC コンテンツは次のようになります。

mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
    GC(mysequence)  # 0.4761905

これは文字列からのもので、"read.fasta()" を使用して fasta ファイルを読み取ることもできます。

score 3 · Accepted Answer

ここでループを使用する必要はありません。

これを試して：

gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}

optimization - Rを使用して文字列を分割し、文字をカウントするより速い方法は?

6 に答える 6

Related

Reference