python - For loop skipping over most of the dataset

Question

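I have another problem with my dataset. Basically, it is a list of genes with associated features, including position numbers (columns 3 and 4) and strand orientation (+ or -). I am trying to recalculate the positions so that they are relative to the start codon TYPE (second column) of each gene, rather than to the whole genome (as they are now). The problem is that the calculation is only performed on the + STRAND sequences; the - STRAND sequences do not show up in the output. Below are a sample of the dataset, my code, the output, and what I have tried.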

Here is the dataset:

    GENE_ID TYPE    POS1    POS2    STRAND
PITG_00002  start_codon 10520   10522   -
PITG_00002  stop_codon  10097   10099   -
PITG_00002  exon    10474   10522   -
PITG_00002  CDS 10474   10522   -
PITG_00002  exon    10171   10433   -
PITG_00002  CDS 10171   10433   -
PITG_00002  exon    10097   10114   -
PITG_00002  CDS 10100   10114   -
PITG_00003  start_codon 38775   38777   +
PITG_00003  stop_codon  39069   39071   +
PITG_00003  exon    38775   39071   +
PITG_00003  CDS 38775   39068   +

Here is the code:

import numpy
import pandas
import pandas as pd
import sys

sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')
groups = data.groupby(['STRAND', 'GENE_ID'])

corrected = []

for (direction, gene_name), group in groups:
    ##print direction,gene_name
    if group.index[group.TYPE=='start_codon']:
        start_exon = group.index[group.TYPE=='exon'][0]
    if direction == '+':
        group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
        group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
    else:
        group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
        group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
    ##print group
    corrected.append(group)

Here is a sample of the output:

     + PITG_00003
    GENE_ID     TYPE         POS1   POS2   STRAND  POSA  POSB
8   PITG_00003  start_codon  38775  38777  +       1     3   
9   PITG_00003  stop_codon   39069  39071  +       295   297 
10  PITG_00003  exon         38775  39071  +       1     297 
11  PITG_00003  CDS          38775  39068  +       1     294

Previously I was getting an array value error (Tab delimited dataset ValueError Truth of array with multiple elements is ambiguous error), but that has been taken care of. Next, I tried running just this part:

import numpy
import pandas
import pandas as pd
import sys

##sys.stdout = open("outtry2.txt", "w")
data = pd.read_csv('pinfestans-edited2.csv', sep='\t')#,
              #converters={'STRAND': lambda s: s[0]})
groups = data.groupby(['STRAND', 'GENE_ID'])

corrected = []

for (direction, gene_name), group in groups:
    print direction,gene_name

The output listed every GENE_ID with its STRAND symbol (+ or -), for both + and - sequences. So somewhere below that point, the code is not picking up the sequences that have - in the STRAND column.

So I tried adding this to the original code:

if direction == '+':
    group['POSA'] = 1 + abs(group.POS1 - group.POS1[start_exon])
    group['POSB'] = 1 + abs(group.POS2 - group.POS1[start_exon])
elif direction == '-':
    group['POSA'] = 1 - abs(group.POS2 - group.POS2[start_exon])
    group['POSB'] = 1 - abs(group.POS1 - group.POS2[start_exon])
else:
    break
print group
# put into the result array
corrected.append(group)

And here is the end of the output. It printed the first - and then froze for a while before finishing.

+
        GENE_ID     TYPE         POS1    POS2    STRAND  POSA  POSB
134991  PITG_23350  start_codon  161694  161696  +       516   518 
134992  PITG_23350  stop_codon   162135  162137  +       957   959 
134993  PITG_23350  exon         161179  162484  +       1     1306
134994  PITG_23350  CDS          161694  162134  +       516   956 
-

score 2 · Accepted Answer

These lines look strange to me:

if group.index[group.TYPE=='start_codon']:
    start_exon = group.index[group.TYPE=='exon'][0]

I guess the first line is just trying to check whether the group has a start codon marker. But it doesn't make sense, for two reasons.

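(1) If there is only one start_codon entry and it is the first row (index label 0), the condition is actually false, because a one-element index holding 0 evaluates like the number 0: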

In [8]: group.TYPE == 'start_codon'
Out[8]: 
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: TYPE

In [9]: group.index[group.TYPE == 'start_codon']
Out[9]: Int64Index([0], dtype=int64)

In [10]: bool(group.index[group.TYPE == 'start_codon'])
Out[10]: False

Maybe you want any(group.TYPE == 'start_codon'), or (group.TYPE == 'start_codon').any(), or sum(group.TYPE == 'start_codon') == 1, or something like that? But that doesn't seem right either.
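To make point (1) concrete, here is a small sketch on a made-up four-row group (the frame below is illustrative, not the asker's file). It shows that the label of the matching row is 0, which is falsy, and that asking the boolean mask directly avoids the problem. Recent pandas versions, as far as I know, refuse to evaluate an index as a condition at all and raise a ValueError, so the mask-based checks are the safer spelling either way.

```python
import pandas as pd

# Illustrative group whose start_codon sits in the row labelled 0.
group = pd.DataFrame({
    "TYPE": ["start_codon", "stop_codon", "exon", "CDS"],
    "POS1": [10520, 10097, 10474, 10474],
})

mask = group.TYPE == "start_codon"   # boolean Series, one entry per row
matching = group.index[mask]         # index labels of the matching rows

# The single matching label is 0, and 0 is falsy. This is why a truth
# test on the labels can fail even though a match exists.
label_is_truthy = bool(matching[0])

# Asking the mask itself gives the intended answers.
has_start = bool(mask.any())         # is there at least one start_codon?
n_start = int(mask.sum())            # how many start_codon rows?

print(list(matching), label_is_truthy, has_start, n_start)
```

Checking `len(matching) > 0` would work as well; the point is to test for the presence of a match rather than the truthiness of its label.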

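(2) Your code only works if start_exon has been set. If it hasn't, it will either raise a NameError or fall back on whatever value it happened to have from the previous iteration, and there is no guarantee that the groups arrive in an order where that value is sensible.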
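The carry-over described in (2) can be reproduced without pandas at all. The sketch below uses hypothetical tuples of (group name, whether the condition held, first exon label) to show how a variable assigned only inside an `if` silently keeps its previous value, and how resetting it at the top of each iteration makes the gap visible.

```python
# Hypothetical stand-ins for the groups: (name, condition_holds, exon_label).
groups = [("A", True, 10), ("B", False, 2), ("C", True, 20)]

# Pattern from the question: start_exon is only assigned when the check passes.
stale = []
for name, condition_holds, exon_label in groups:
    if condition_holds:
        start_exon = exon_label
    stale.append((name, start_exon))   # "B" silently reuses the value from "A"

# Safer pattern: reset on every iteration so a missing value is explicit.
explicit = []
for name, condition_holds, exon_label in groups:
    start_exon = exon_label if condition_holds else None
    explicit.append((name, start_exon))

print(stale)
print(explicit)
```

Had the failing group come first, the first loop would have raised a NameError instead, which is the other outcome mentioned above.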

If I simply use start_exon = group.index[group.TYPE=='exon'][0] on its own, without the if, I get:

In [28]: for c in corrected:
   ....:     print c
   ....:     
       GENE_ID         TYPE   POS1   POS2 STRAND  POSA  POSB
8   PITG_00003  start_codon  38775  38777      +     1     3
9   PITG_00003   stop_codon  39069  39071      +   295   297
10  PITG_00003         exon  38775  39071      +     1   297
11  PITG_00003          CDS  38775  39068      +     1   294
      GENE_ID         TYPE   POS1   POS2 STRAND  POSA  POSB
0  PITG_00002  start_codon  10520  10522      -     1    -1
1  PITG_00002   stop_codon  10097  10099      -  -422  -424
2  PITG_00002         exon  10474  10522      -     1   -47
3  PITG_00002          CDS  10474  10522      -     1   -47
4  PITG_00002         exon  10171  10433      -   -88  -350
5  PITG_00002          CDS  10171  10433      -   -88  -350
6  PITG_00002         exon  10097  10114      -  -407  -424
7  PITG_00002          CDS  10100  10114      -  -407  -421

I don't know whether those values make any sense, but it doesn't seem to be skipping anything.
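For reference, here is one way the whole loop might look with the unconditional assignment, written as a sketch for Python 3 and a current pandas. Several things are my assumptions rather than part of the question: the sample rows are inlined and read as whitespace-separated text instead of from pinfestans-edited2.csv, the function name relative_positions is made up, and groups without any exon row are skipped. The arithmetic is the same as in the question.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained.
SAMPLE = """GENE_ID TYPE POS1 POS2 STRAND
PITG_00002 start_codon 10520 10522 -
PITG_00002 stop_codon 10097 10099 -
PITG_00002 exon 10474 10522 -
PITG_00002 CDS 10474 10522 -
PITG_00002 exon 10171 10433 -
PITG_00002 CDS 10171 10433 -
PITG_00002 exon 10097 10114 -
PITG_00002 CDS 10100 10114 -
PITG_00003 start_codon 38775 38777 +
PITG_00003 stop_codon 39069 39071 +
PITG_00003 exon 38775 39071 +
PITG_00003 CDS 38775 39068 +
"""


def relative_positions(data):
    """Add POSA/POSB columns relative to the first exon row of each gene."""
    corrected = []
    for (direction, gene_name), group in data.groupby(["STRAND", "GENE_ID"]):
        exon_labels = group.index[group.TYPE == "exon"]
        if len(exon_labels) == 0:
            continue                      # assumption: skip genes with no exon
        start_exon = exon_labels[0]       # always set, never carried over
        group = group.copy()              # do not write into a groupby slice
        if direction == "+":
            ref = group.loc[start_exon, "POS1"]
            group["POSA"] = 1 + (group.POS1 - ref).abs()
            group["POSB"] = 1 + (group.POS2 - ref).abs()
        else:
            ref = group.loc[start_exon, "POS2"]
            group["POSA"] = 1 - (group.POS2 - ref).abs()
            group["POSB"] = 1 - (group.POS1 - ref).abs()
        corrected.append(group)
    return pd.concat(corrected)


data = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
result = relative_positions(data)
print(result)
```

On the sample rows this produces the same POSA and POSB values as the listing above, for both strands. Whether those numbers are biologically meaningful is a separate question that this sketch does not address.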
