python - Python MemoryError - Is there a more efficient way to work with a huge CSV file?

Question

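[Using Python 3.3] I have one huge CSV file that contains XX million rows and a few columns. I want to read that file, add a few calculated columns, and write out several "segmented" CSV files. I tried the code below on a smaller test file and it does exactly what I want. But when I load the original CSV file (about 3.2 GB), I get a memory error. Is there a more memory-efficient way to write the code below?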

Please note that I am quite new to Python, so there is a lot I am not fully aware of.

Example input data:

email               cc  nr_of_transactions  last_transaction_date   timebucket  total_basket
email1@email.com    us  2                   datetime value          1           20.29
email2@email.com    gb  3                   datetime value          2           50.84
email3@email.com    ca  5                   datetime value          3           119.12
...                 ... ...                 ...                     ...         ...

This is my code:

import csv
import scipy.stats as stats
import itertools
from operator import itemgetter


def add_rankperc(filename):
    '''
    Function that calculates percentile rank of total basket value of a user (i.e. email) within a country. Next, it assigns the user to a rankbucket based on its percentile rank, using the following rules:
     Percentage rank between 75 and 100 -> top25
     Percentage rank between 25 and 74  -> mid50
     Percentage rank between 0 and 24   -> bottom25
    '''

    # Defining headers for ease of use/DictReader
    headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date', 'timebucket', 'total_basket']
    groups = []

    with open(filename, encoding='utf-8', mode='r') as f_in:
        # Input file is tab-separated, hence dialect='excel-tab'
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=headers)
        # DictReader reads all dict values as strings, converting total_basket to a float
        dict_list = []
        for row in r:
            row['total_basket'] = float(row['total_basket'])
            # Append row to a list (of dictionaries) for further processing
            dict_list.append(row)

    # Groupby function on cc and total_basket
    for key, group in itertools.groupby(sorted(dict_list, key=itemgetter('cc', 'total_basket')), key=itemgetter('cc')):
        rows = list(group)
        for row in rows:
            # Calculates the percentile rank for each value for each country
            row['rankperc'] = stats.percentileofscore([row['total_basket'] for row in rows], row['total_basket'])
            # Percentage rank between 75 and 100 -> top25
            if 75 <= row['rankperc'] <= 100:
                row['rankbucket'] = 'top25'
            # Percentage rank between 25 and 74 -> mid50
            elif 25 <= row['rankperc'] < 75:
                row['rankbucket'] = 'mid50'
            # Percentage rank between 0 and 24 -> bottom25
            else:
                row['rankbucket'] = 'bottom25'
            # Appending all rows to a list to be able to return it and use it in another function
            groups.append(row)
    return groups


def filter_n_write(data):
    '''
    Function takes input data, groups by specified keys and outputs only the e-mail addresses to csv files as per the respective grouping.
    '''

    # Creating group iterator based on keys
    for key, group in itertools.groupby(sorted(data, key=itemgetter('timebucket', 'rankbucket')), key=itemgetter('timebucket', 'rankbucket')):
        # List comprehension to create a list of lists of email addresses. One row corresponds to the respective combination of grouping keys.
        emails = list([row['email'] for row in group])
        # Dynamically naming output file based on grouping keys
        f_out = 'output-{}-{}.csv'.format(key[0], key[1])
        with open(f_out, encoding='utf-8', mode='w') as fout:
            w = csv.writer(fout, dialect='excel', lineterminator='\n')
            # Writerows using list comprehension to write each email in emails iterator (i.e. one address per row). Wrapping email in brackets to write full address in one cell.
            w.writerows([email] for email in emails)

filter_n_write(add_rankperc('infile.tsv'))

Thanks in advance!

score 4 · Accepted Answer

The pandas library ( http://pandas.pydata.org/ ) has a very good, fast CSV reader ( http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table ). As an added bonus, it holds the data as numpy arrays, which makes computing percentiles very easy. There is also a question that discusses reading large CSVs in chunks with pandas.
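A minimal sketch of what this could look like, assuming the tab-separated, header-less layout from the question. The function name `rank_with_pandas` and the chunk size are illustrative assumptions, not part of the answer. It reads only the columns it needs, in chunks, then ranks `total_basket` within each country. Whether the trimmed table fits in memory still depends on the data.

```python
import io
import pandas as pd

# Column names taken from the question's code
HEADERS = ["email", "cc", "nr_transactions",
           "last_transaction_date", "timebucket", "total_basket"]


def rank_with_pandas(source, chunksize=100000):
    """Hypothetical helper: chunked read, then per-country percentile rank."""
    chunks = pd.read_csv(
        source, sep="\t", header=None, names=HEADERS,
        # Skip the columns that are never used, to save memory
        usecols=["email", "cc", "timebucket", "total_basket"],
        dtype={"total_basket": "float64"},
        chunksize=chunksize,
    )
    df = pd.concat(chunks, ignore_index=True)
    # rank(pct=True) averages tied ranks and returns a fraction in (0, 1]
    df["rankperc"] = df.groupby("cc")["total_basket"].rank(pct=True) * 100
    # Same thresholds as the question: <25 bottom25, <75 mid50, else top25
    df["rankbucket"] = "mid50"
    df.loc[df["rankperc"] < 25, "rankbucket"] = "bottom25"
    df.loc[df["rankperc"] >= 75, "rankbucket"] = "top25"
    return df
```

Calling `rank_with_pandas('infile.tsv')` would return a DataFrame that can then be grouped by `timebucket` and `rankbucket` to write the per-segment files.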

score 3 · Accepted Answer

I agree with Inbar Rose that it would be better to use database functions to attack this problem. But let's say we need to answer the question as you asked it. I think we can, at the expense of speed.

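You are probably running out of memory while building the list of dictionaries for all the rows. You can get around this by considering only a subset of the rows at a time.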
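To illustrate the difference, here is a small sketch that streams rows one at a time and keeps only a running total per country, instead of appending every row to a list. The function name `basket_totals_by_country` is a hypothetical helper, and the column positions (1 for `cc`, 5 for `total_basket`) are assumed from the question's sample data.

```python
import csv
import io
from collections import defaultdict


def basket_totals_by_country(lines):
    """Stream tab-separated rows; memory use grows with the number of
    countries, not with the number of rows."""
    totals = defaultdict(float)
    for row in csv.reader(lines, dialect="excel-tab"):
        # Only the current row is held in memory at this point
        totals[row[1]] += float(row[5])
    return dict(totals)
```

Any open file object can be passed as `lines`, for example `open('infile.tsv', encoding='utf-8', newline='')`.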

Here is the code for a first step, roughly your add_rankperc function:

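import csv
from scipy.stats import percentileofscore

# Placeholder paths (assumptions, not given in the answer): adjust to your files
input_path = "infile.tsv"
output_path = "ranked.csv"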

# Run through the whole file once, saving each row to a file corresponding to
# its 'cc' column
cc_dict = {}
with open(input_path, encoding="utf-8", mode='r') as infile:
  csv_reader = csv.reader(infile, dialect="excel-tab")
  for row in csv_reader:
    cc = row[1]
    if cc not in cc_dict:
      intermediate_path = "intermediate_cc_{}.txt".format(cc)
      outfile = open(intermediate_path, mode='w', newline='')
      csv_writer = csv.writer(outfile)
      cc_dict[cc] = (intermediate_path, outfile, csv_writer)
    _ = cc_dict[cc][2].writerow(row)

# Close the output files
for cc in cc_dict.keys():
  cc_dict[cc][1].close()

# Process each per-'cc' intermediate file in turn
for cc in cc_dict.keys():
  intermediate_path = cc_dict[cc][0]
  with open(intermediate_path, mode='r', newline='') as infile:
    csv_reader = csv.reader(infile)
    # Pick out all of the rows with the 'cc' value under consideration
    group = [row for row in csv_reader if row[1] == cc]
    # Get the 'total_basket' values for the group
    A_scores = [float(row[5]) for row in group]
    for row in group:
      # Compute this row's 'total_basket' score based on the rest of the
      # group's
      p = percentileofscore(A_scores, float(row[5]))
      row.append(p)
      # Categorize the score
      bucket = ("bottom25" if p < 25 else ("mid50" if p < 75 else "top25"))
      row.append(bucket)
  # Append this group's augmented rows to the combined output file
  with open(output_path, mode='a', newline='') as outfile:
    csv_writer = csv.writer(outfile)
    csv_writer.writerows(group)

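46 million rows is a lot, so this will probably be slow. I avoided the DictReader functionality of the csv module and indexed the rows directly to avoid that overhead. I also computed the first argument to percentileofscore once per group, rather than once for every row in the group.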
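If calling percentileofscore once per row turns out to be the bottleneck, one alternative (not part of the answer above) is to sort each group's scores once and look up every score with `bisect`. The sketch below is intended to mirror the default `kind='rank'` behaviour, where tied scores share the average rank; it is worth checking against percentileofscore on a sample before relying on it. The function name `percentile_ranks` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(scores):
    """Percentile rank of every score within its own list.

    Sorting once and using binary search costs O(n log n) for the whole
    group, instead of one full scan of the group per row.
    """
    ordered = sorted(scores)
    n = len(ordered)
    ranks = []
    for s in scores:
        left = bisect_left(ordered, s)    # number of values strictly below s
        right = bisect_right(ordered, s)  # number of values <= s
        # Average of the tied ranks, expressed as a percentage
        ranks.append((left + right + 1) * 50.0 / n)
    return ranks
```

In the loop over `group`, the list returned by `percentile_ranks(A_scores)` could be zipped with the rows in place of the per-row percentileofscore call.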

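If this works, I think you can follow the same idea for the filter_n_write function: run through the generated intermediate file once, picking out the (timebucket, rank) pairs. Then go through the intermediate file again, once for each pair.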
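A sketch of that second step, written as a single-pass variant: because the number of (timebucket, rankbucket) pairs is small, it keeps one open output file per pair rather than re-reading the file once per pair. It assumes the ranked file has the layout produced above, with the email in column 0, the timebucket in column 4, and the bucket label in the last column. The function name `split_emails` and the `out_dir` parameter are hypothetical.

```python
import csv
import os


def split_emails(ranked_path, out_dir):
    """Write each email to output-<timebucket>-<rankbucket>.csv in one pass."""
    handles = {}
    try:
        with open(ranked_path, encoding="utf-8", newline="") as infile:
            for row in csv.reader(infile):
                key = (row[4], row[-1])
                if key not in handles:
                    # First time this pair is seen: open its output file
                    name = "output-{}-{}.csv".format(*key)
                    f = open(os.path.join(out_dir, name), mode="w",
                             encoding="utf-8", newline="")
                    handles[key] = (f, csv.writer(f, lineterminator="\n"))
                # One address per row, as in the original filter_n_write
                handles[key][1].writerow([row[0]])
    finally:
        for f, _ in handles.values():
            f.close()
    return sorted(handles)
```

The return value is the sorted list of pairs that were written, which makes it easy to check which output files exist.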
