bioinformatics - GC含量によるビニングシーケンスの読み取り

Question

マルチファスタヌクレオチド配列ファイルを「ビン」（別々のファイルに分割）したいと思います（たとえば、Roche-454の実行で約500,000回の読み取りが平均読み取り長250bp）。各読み取りのGC含量に基づいたビンが欲しいのですが。結果の出力は、8つのmulti-fastaファイルになります。

<20％のGC含量

21〜30％のGC含量

31〜40％のGC含量

41-50％のGC含量

51〜60％のGC含量

61〜70％のGC含量

71〜80％のGC含量

> 80％のGC含量

誰かがこれをすでに行っているスクリプトやプログラムを知っていますか？そうでない場合、誰かがGCコンテンツに基づいてmulti-fastaファイルをソートする方法を提案できますか（それを関連するビンに分割できます）？

score 2 · Accepted Answer

R/Bioconductor では、タスクは (a) 適切なライブラリをロードする (b) fasta ファイルを読み取る (c) ヌクレオチドの使用を計算し、% を gc する (d) データをビンに分割する (e) 元のデータを個別に出力するファイル。の線に沿って

## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
## cut gc content into bins, with breaks at seq(0, 1, .2)
bin = cut(gc, seq(0, 1, .2))
## output, [bin==lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
    fileName = sprintf("%s.fasta", lvl)
    writeFasta(fa[bin==lvl], file=fileName)
}

R/Bioconductor を使用するには、http://bioconductor.org/install を参照してください。示されたサイズの 454 データのメモリ要件はそれほど悪くなく、ここでのスクリプトはかなり高速です (たとえば、260k の読み取りで 7 秒)。

score 1 · Accepted Answer

Python とBiopythonまたは Perl とBioperlを使用して FASTA ファイルを読み込むことをお勧めします。ここには Bioperl の配列の C 含有量を計算するスクリプトがあり、Biopython にはそのための機能があります。次に、各シーケンスの GC コンテンツを辞書またはハッシュに保存し、それぞれを調べて、GC コンテンツの高さに応じてファイルに書き込みます。

さらに具体的なヘルプが必要ですか?

score 0 · Accepted Answer

問題を正しく理解している場合は、次のようなものが必要です（Python）：

def GC(seq): # determine the GC content
    s = seq.upper()
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin(gc): # get the number of the 'bin' for this value of GC content
    if gc < 20: return 1
    elif gc > 80: return 8
    else:
        return int(gc/10)

次に、ファイルからエントリを読み取り、GC含量を計算し、適切なビンを見つけて、適切なファイルにエントリを書き込む必要があります。次の例では、ラボで使用するPythonパッケージを使用してこれを実装しています。

from pyteomics import fasta

def split_to_bin_files(multifile):
"""Reads a file and writes the entries to appropriate 'bin' files.
`multifile` must be a filename (str)"""

    for entry in fasta.read(multifile):
        fasta.write((entry,), (multifile+'_bin_'+
                    str(bin(GC(entry[1])))))

次に、それをのように呼び出しますsplit_to_bin_files('mybig.fasta')。

bioinformatics - GC含量によるビニングシーケンスの読み取り

3 に答える 3

Related

Reference