sorting - Error writing to file

Question

I wrote a Python script that calls UNIX sort through the subprocess module. I am trying to sort a table on two columns (2 and 6). This is what I have done:

sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)

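However, the output file contains an incomplete line, which causes an error when I parse the table. When I check the corresponding entry in the input file given to sort, it looks complete. I suspect something goes wrong when sort tries to write the result to the specified file, but I don't know how to fix it.

A line in the input file looks like this: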

gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0

However, the output file contains only gi|19125 for this line. How can I fix this?

Any help is appreciated.

RAM

score 0 · Accepted Answer

What you see is probably the result of trying to write to the file from multiple processes at the same time.

To emulate the shell command sort -k2,2 -k6,6n ${tabname} > sort_blast.txt in Python:

from subprocess import check_call

with open("sort_blast.txt",'wb') as output_file:
     check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
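As a self-contained sketch of the call above: it assumes a UNIX `sort` on the PATH and uses made-up six-column rows. The helper name `sort_table` and the demo data are hypothetical and do not come from the question. The input file is closed (and therefore flushed) before `sort` runs, so `sort` reads everything that was written; whether that matters for the original script is an assumption, since the question does not show how `tab` was created.

```python
import os
import subprocess
import tempfile

def sort_table(input_path, output_path):
    """Run `sort -k2,2 -k6,6n input_path` and write its stdout to output_path."""
    with open(output_path, "wb") as output_file:
        subprocess.check_call(
            ["sort", "-k2,2", "-k6,6n", input_path], stdout=output_file
        )

# Hypothetical demo data: six whitespace-separated columns per row.
rows = [
    "q1 b x x x 20\n",
    "q2 a x x x 3\n",
    "q3 b x x x 5\n",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    in_path = os.path.join(tmp_dir, "table.txt")
    out_path = os.path.join(tmp_dir, "sort_blast.txt")
    with open(in_path, "w") as tab:
        tab.writelines(rows)
    # Leaving the with-block closes tab, so all rows are on disk before sort reads them.
    sort_table(in_path, out_path)
    with open(out_path) as sorted_file:
        sorted_rows = sorted_file.read().splitlines()
```

Passing the command as a list avoids `shell=True`, so a file name containing spaces or shell metacharacters is handed to `sort` unchanged.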

Or, for a small input file for example, you could write it in pure Python:

def custom_key(line):
    fields = line.split() # split line on any whitespace
    return fields[1], float(fields[5]) # Python uses zero-based indexing

with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
     L = input_file.read().splitlines() # read from the input file
     L.sort(key=custom_key)             # sort it
     output_file.write("\n".join(L))    # write to the output file

If you need to sort a file that does not fit in memory, see "Sorting text file by using Python".
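For the larger-than-memory case, one common approach is to sort fixed-size chunks, write each to a temporary file, and merge them with `heapq.merge`. The following is only a sketch of that idea, not the method from the linked question: the function name `external_sort` and the `chunk_lines` parameter are hypothetical, and it assumes Python 3.5 or later (for the `key` argument of `heapq.merge`) and that every line has at least six whitespace-separated fields.

```python
import heapq
import itertools
import os
import tempfile

def custom_key(line):
    fields = line.split()  # split line on any whitespace
    return fields[1], float(fields[5])  # columns 2 and 6, zero-based indexing

def external_sort(input_path, output_path, chunk_lines=100000):
    """Sort a text file by custom_key while holding only chunk_lines lines in memory."""
    chunk_paths = []
    try:
        with open(input_path) as input_file:
            while True:
                chunk = list(itertools.islice(input_file, chunk_lines))
                if not chunk:
                    break
                # Make sure every line ends with a newline so merged lines stay separate.
                chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
                chunk.sort(key=custom_key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as chunk_file:
                    chunk_file.writelines(chunk)
                chunk_paths.append(path)

        chunk_files = [open(path) for path in chunk_paths]
        try:
            with open(output_path, "w") as output_file:
                # heapq.merge lazily merges the already-sorted chunk files.
                output_file.writelines(heapq.merge(*chunk_files, key=custom_key))
        finally:
            for chunk_file in chunk_files:
                chunk_file.close()
    finally:
        for path in chunk_paths:
            os.remove(path)
```

A call such as `external_sort(tab.name, "sort_blast.txt")` would mirror the in-memory version above. In practice UNIX `sort` already spills to temporary files on its own, so the subprocess approach may be simpler for very large inputs.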

score 0 · Accepted Answer

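Using subprocess to call an external sorting tool seems rather silly, given that Python has built-in methods for sorting items.

Looking at your sample data, it appears to be structured data with a | delimiter. Here is how you could open that file and iterate over the results in sorted order in Python: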

def custom_sorter(first, second):
    """ A Custom Sort function which compares items
    based on the value in the 2nd and 6th columns. """
    # First, we break the line into a list
    first_items, second_items = first.split(u'|'), second.split(u'|')  # Split on the pipe character.
    if len(first_items) >= 6 and len(second_items) >= 6:
        # We have enough items to compare
        if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
            return 1
        elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
            return -1
        else:  # They are the same
            return 0  # Order doesn't matter then
    else:
        return 0

from functools import cmp_to_key  # sorted() no longer accepts cmp= in Python 3

with open(src_file_path, 'r') as src_file:
    data = src_file.read()  # Read in the src file all at once. Hope the file isn't too big!
    with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
        for line in sorted(data.splitlines(), key=cmp_to_key(custom_sorter)):  # Sort the data on the fly
            dst_sorted_file.write(line + '\n')  # splitlines() strips the newline, so add it back

FYI, this code may need some tweaking. I didn't test it very thoroughly.
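The same ordering can also be written as a key function instead of a comparison function. This is a minimal sketch, assuming every line has at least six pipe-separated fields (the comparison function above treats shorter lines as equal, which a key function cannot reproduce exactly). The name `pipe_key` and the demo lines are hypothetical. As in the code above, the fields are compared as strings, so "10" sorts before "9", unlike the numeric `-k6,6n` in the question.

```python
def pipe_key(line):
    """Return the 2nd and 6th pipe-separated fields as a sort key."""
    items = line.split("|")
    return items[1], items[5]

# Hypothetical demo lines with six pipe-separated fields each.
lines = ["a|y|c|d|e|9", "a|x|c|d|e|2", "a|y|c|d|e|10"]
ordered = sorted(lines, key=pipe_key)
```

If numeric ordering on the sixth field is wanted, the second element of the key could be converted with `float`, provided that field always holds a number.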
