python - RAM gzip で圧縮された csv ファイルよりも大きな値を集計するにはどうすればよいですか?

Question

手始めに、私はバイオインフォマティクス、特にプログラミングは初めてですが、いわゆるVCFファイル（個人のみが含まれ、1つの塊= 1つの個人）を通過するスクリプトを作成し、検索文字列を使用して見つけます個体がホモ接合体であるかヘテロ接合体であるかにかかわらず、すべてのバリアント (系統) について。

このスクリプトは、少なくとも小さなサブセットでは機能しますが、すべてをメモリに保存することはわかっています。非常に大きな zip ファイル (ゲノム全体であっても) でこれを実行したいのですが、このスクリプトを行ごとにすべてを実行するスクリプトに変換する方法がわかりません (列全体をカウントしたいので、それを解決する方法を参照してください）。

したがって、出力は個人ごとに 5 つです (バリアントの総数、ホモ接合体の数、ヘテロ接合体の数、およびホモ接合体とヘテロ接合体の割合)。以下のコードを参照してください。

#!usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'

gz_infile = gzip.GzipFile(subset_cols, "r")  
#gz_outfile = gzip.GzipFile(nuc_div, "w") 

# make a dictionary of the header line for easy retrieval of elements later on

headers = gz_infile.readline().rstrip().split('\t')             
print headers                                                   

column_dict = {}                                        
for header in headers:
        column_dict[header] = []                        
for line in gz_infile:                                     
        columns = line.rstrip().split('\t')             
        for i in range(len(columns)):                   
                c_header=headers[i]                     
                column_dict[c_header].append(columns[i])
#print column_dict

for key in column_dict:                         
        number_homozygotes = 0          
        number_heterozygotes = 0        

        for values in column_dict[key]: 
                SearchStr = '(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'     
#this search string contains the regexp (this regexp was tested)
                Result = re.search(SearchStr,values)                    
                if Result:
#here, it will skip the missing genoytypes ./.
                        variant_one = int(Result.group(1))              
                        variant_two = int(Result.group(2))              

                        if variant_one == 0 and variant_two == 0:
                                continue
                        elif variant_one == variant_two:                  
#count +1 in case variant one and two are equal (so 0/0, 1/1, etc.)
                                number_homozygotes += 1
                        elif variant_one != variant_two:
#count +1 in case variant one is not equal to variant two (so 1/0, 0/1, etc.)
                                number_heterozygotes += 1

        print "%s homozygotes %s" % (number_homozygotes, key) 
        print "%s heterozygotes %s" % (number_heterozygotes,key)

        variants = number_homozygotes + number_heterozygotes
        print "%s variants" % variants

        prop_homozygotes = (1.0*number_homozygotes/variants)*100
        prop_heterozygotes = (1.0*number_heterozygotes/variants)*100

        print "%s %% homozygous %s" % (prop_homozygotes, key)
        print "%s %% heterozygous %s" % (prop_heterozygotes, key)

大規模なデータセットの調査を続けることができるように、どんな助けも大歓迎です。ありがとう:)

ちなみに、VCF ファイルは次のようになります。 0,13:13:33:347,33,0

これは、個々の ID 名を含むヘッダー行です (より複雑な ID タグを持つ合計 33 人の個人がいます。ここでは簡略化しています)。次に、同じ特定のパターンを持つこれらの情報行が多数あります。私はスラッシュの最初の部分だけに興味があるので、通常の表現です。

score 1 · Accepted Answer

RAM データセットよりも大きなデータセットを処理できるようにするには、すべての列を処理している現在、アルゴリズムを作り直してデータを 1 行ずつ処理する必要があります。

しかしその前に、gzip されたファイルから行をストリーミングする方法が必要です。

次の Python 3 コードはそれを行います。

"""https://stackoverflow.com/a/40548567/140837"""
#!/usr/bin/env python3
import zlib
from mmap import PAGESIZE


CHUNKSIZE = PAGESIZE


# This is a generator that yields *decompressed* chunks from
# a gzip file. This is also called a stream or lazy list.
# It's done like so to avoid to have the whole file into memory
# Read more about Python generators to understand how it works.
# cf. `yield` keyword.
def gzip_to_chunks(filename):
    decompressor = zlib.decompressobj(zlib.MAX_WBITS + 16)
    with open(filename, 'rb') as f:
        chunk = f.read(CHUNKSIZE)

        while chunk:
            out = decompressor.decompress(chunk)
            yield out
            chunk = f.read(CHUNKSIZE)

        out = decompressor.flush()

        yield out


# Again the following is a generator (see the `yield` keyword).
# What id does is iterate over an *iterable* of strings and yields
# rows from the file

# (hint: `gzip_to_chunks(filename)` returns a generator of strings)
# (hint: a generator is also an iterable)

# You can verify that by calling `chunks_to_rows` with a list of
# strings, where every strings is a chunk of the VCF file.
# (hint: a list is also an iterable)

# inline doc follows
def chunks_to_rows(chunks):
    row = b''  # we will add the chars making a single row to this variable
    for chunk in chunks:  # iterate over the strings/chuncks yielded by gzip_to_chunks
        for char in chunk:  # iterate over all chars from the string
            if char == b'\n'[0]:  # hey! this is the end of the row!
                yield row.decode('utf8').split('\t')  # the row is complete, yield!
                row = b''  # start a new row
            else:
                row += int.to_bytes(char, 1, byteorder='big')  # Otherwise we are in the middle of the row
        # at this point the program has read all the chunk
    # at this point the program has read all the file without loading it fully in memory at once
    # That said, there's maybe still something in row
    if row:
        yield row.decode('utf-8').split('\t')  # yield the very last row if any


for e in chunks_to_rows(gzip_to_chunks('conceptnet-assertions-5.6.0.csv.gz')):
    uid, relation, start, end, metadata = e
    print(start, relation, end)

編集：答えを作り直し、 gzipされたconcetpnetのtsvファイルで機能するようにします

python - RAM gzip で圧縮された csv ファイルよりも大きな値を集計するにはどうすればよいですか?

2 に答える 2

Related

Reference