python-2.7 - Pythonを使用して、タブ区切り形式のSNPマトリックスを数値に変換する方法は?

Question

タブ区切りファイルのベースを整数に変換しようとしています。入力ファイルは複製する必要があり、新しいコピーには参照ベースの「1」、その後の最も一般的な代替対立遺伝子の「2」、次の対立遺伝子の「3」などが含まれている必要があります。ファイルが非常に大きいため、パンダを使用しようとしています。データのサンプルを次に示します。

dfmolecule      refpos  syn?    refbase 0.1288  1304    09BKT076207
NC_011353       68      NA            A       A    A        A
NC_011353       255     NSYN          A       A    A        A
NC_011353       493     NSYN          T       T    T        C
NC_011353       514     NSYN          C       C    C        C
NC_011353       1790    SYN           G       G    G        G
NC_011353       1798    NSYN          A       A    A        T
NC_011353       2015    SYN           C       C    C        C
NC_011353       2345    SYN           T       T    T        T
NC_011353       2655    NSYN          C       C    C        C
NC_011353       2906    NSYN          C       C    C        C

理論的には、出力は次のようになります。

dfmolecule      refpos  syn?    refbase 0.1288  1304    09BKT076207
NC_011353       68      NA            1       1    1        1
NC_011353       255     NSYN          1       1    1        1
NC_011353       493     NSYN          1       1    1        2
NC_011353       514     NSYN          1       1    1        1
NC_011353       1790    SYN           1       1    1        1
NC_011353       1798    NSYN          1       1    1        2
NC_011353       2015    SYN           1       1    1        1
NC_011353       2345    SYN           1       1    1        1
NC_011353       2655    NSYN          1       1    1        1
NC_011353       2906    NSYN          1       1    1        1

これにより、SNP をよりよく視覚化し、行ごとに最も一般的な対立遺伝子の変化をランク付けすることができます。どこから始めればいいのかわからないので、投稿しています。私が持っているコードは、基数を数値に変換するだけです。「refbase」は常に「1」である必要があり、python が行全体の参照ベースとは異なるベースを読み取ると、その行で 2 番目に一般的な対立遺伝子のベースが「2」に置き換えられます。それがもう少し明確になることを願っています。

私のコード新しいコード, 対立遺伝子の変化を頻度でランク付けする方法を理解する必要がありますか?:

import csv
import pdb
import os
import sys


if len(sys.argv) != 2:

        exit("Need arg <snp file>")

snp_file = sys.argv[1]

wtf = csv.writer(open('/users/new_snp.txt' , 'w'), delimiter='\t')




newf = list(csv.reader(open(snp_file,'rU'), delimiter='\t'))



#--------------------------------------------------------------
# Returns an array of tuples with ('A', 8)
# where the letter is the nucleotide and the number
# is the amount of times a letter is present in a row
#--------------------------------------------------------------

def refbase_count(r):

# This is a blank hash to keep count of occurances
# of the alleles

    rep = {'A':0, 'T':0, 'G':0,'C': 0}


    for i in r:

        rep[i] += 1


    # sort before returning
    import operator

    sorted_rep = sorted(rep.iteritems(), key=operator.itemgetter(1))

    # Want them with the most frequent first
    sorted_rep.reverse()

    return sorted_rep

 #--------------------------------------------------------------

 # print the top row outside of the loop
 print newf[0]
 wtf.writerow(newf[0])

 for row in newf[1:]:

    #rep = refbase_count(row[4:])

 for index, val in enumerate( row[4:] ):

        # If the refbase (in index 3) is equal to the
        # value at a given spot, then we give it a new value of 1
        # otherwise, it's a 2
        if row[3] == val:

            row[index+4] = 1

        else:

             row[index+4] = 2



print row
wtf.writerow(row)

score 0 · Accepted Answer

試したことと、出力がどのように見えるべきだと思うかを投稿する必要があります。たぶん、あなたの質問を明確にするのにも時間を費やしてください。

私の理解が正しければ、列でランク付けしますrefbaseか? その使用を行うには

In [38]: rnk = df.groupby('refbase').apply(lambda x: np.size(x)).rank(ascending=False)

In [39]: rnk
Out[39]: 
refbase
A          2
C          1
G          4
T          3
dtype: float64

次に、それに基づくランクで新しい列を作成できます。

In [40]: df['refpos_rank'] = df.refbase.replace(rnk.to_dict())

In [41]: df
Out[41]: 
  dfmolecule  refpos  syn? refbase 0.1288 1304 09BKT076207 refpos_rank
0  NC_011353      68   NaN       A      A    A           A           2
1  NC_011353     255  NSYN       A      A    A           A           2
2  NC_011353     493  NSYN       T      T    T           C           3
3  NC_011353     514  NSYN       C      C    C           C           1
4  NC_011353    1790   SYN       G      G    G           G           4
5  NC_011353    1798  NSYN       A      A    A           T           2
6  NC_011353    2015   SYN       C      C    C           C           1
7  NC_011353    2345   SYN       T      T    T           T           3
8  NC_011353    2655  NSYN       C      C    C           C           1
9  NC_011353    2906  NSYN       C      C    C           C           1

これは、タブ区切りファイルに書き込むことができますdf.to_csv(path, sep='\t')

python-2.7 - Pythonを使用して、タブ区切り形式のSNPマトリックスを数値に変換する方法は?

1 に答える 1

Related

Reference