[Using Python 3.3] I have one huge CSV file containing XX million rows and a number of columns. I want to read that file, add a few computed columns, and spit out several "segmented" CSV files. I tried the code below on a smaller test file and it does exactly what I want, but when I read in the original CSV file (roughly 3.2 GB) I run into a memory error. Is there a more memory-efficient way to write the code below?

Please note that I'm very new to Python, so there is a lot I'm simply not aware of yet.
Example input data:
email             cc   nr_of_transactions   last_transaction_date   timebucket   total_basket
email1@email.com  us   2                    datetime value          1            20.29
email2@email.com  gb   3                    datetime value          2            50.84
email3@email.com  ca   5                    datetime value          3            119.12
...               ...  ...                  ...                     ...          ...
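To illustrate what "segmented" means here: the code below writes one output file per combination of timebucket and rankbucket (e.g. output-1-top25.csv), and each file contains nothing but e-mail addresses, one per line. For example (the addresses shown are placeholders):

emailA@email.com
emailB@email.com
...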
This is my code:
import csv
import scipy.stats as stats
import itertools
from operator import itemgetter


def add_rankperc(filename):
    '''
    Calculates the percentile rank of a user's (i.e. email's) total basket
    value within a country, then assigns the user to a rank bucket based on
    that percentile rank, using the following rules:
        percentile rank between 75 and 100 -> top25
        percentile rank between 25 and 74  -> mid50
        percentile rank between 0 and 24   -> bottom25
    '''
    # Defining headers for ease of use/DictReader
    headers = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
               'timebucket', 'total_basket']
    groups = []
    with open(filename, encoding='utf-8', mode='r') as f_in:
        # Input file is tab-separated, hence dialect='excel-tab'
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=headers)
        # DictReader reads all values as strings, so convert total_basket to a float
        dict_list = []
        for row in r:
            row['total_basket'] = float(row['total_basket'])
            # Append the row to a list (of dictionaries) for further processing
            dict_list.append(row)
    # Group by cc after sorting on cc and total_basket
    for key, group in itertools.groupby(
            sorted(dict_list, key=itemgetter('cc', 'total_basket')),
            key=itemgetter('cc')):
        rows = list(group)
        for row in rows:
            # Calculate the percentile rank of each value within its country
            row['rankperc'] = stats.percentileofscore(
                [row['total_basket'] for row in rows], row['total_basket'])
            # Percentile rank between 75 and 100 -> top25
            if 75 <= row['rankperc'] <= 100:
                row['rankbucket'] = 'top25'
            # Percentile rank between 25 and 74 -> mid50
            elif 25 <= row['rankperc'] < 75:
                row['rankbucket'] = 'mid50'
            # Percentile rank between 0 and 24 -> bottom25
            else:
                row['rankbucket'] = 'bottom25'
            # Append every row to a list so it can be returned and used in another function
            groups.append(row)
    return groups


def filter_n_write(data):
    '''
    Takes the input data, groups it by the specified keys, and writes only
    the e-mail addresses of each group to its own CSV file.
    '''
    # Create a group iterator based on the grouping keys
    for key, group in itertools.groupby(
            sorted(data, key=itemgetter('timebucket', 'rankbucket')),
            key=itemgetter('timebucket', 'rankbucket')):
        # List of email addresses, one list per combination of grouping keys
        emails = [row['email'] for row in group]
        # Dynamically name the output file based on the grouping keys
        f_out = 'output-{}-{}.csv'.format(key[0], key[1])
        with open(f_out, encoding='utf-8', mode='w') as fout:
            w = csv.writer(fout, dialect='excel', lineterminator='\n')
            # Write one address per row; wrapping each email in a list puts
            # the full address in a single cell
            w.writerows([email] for email in emails)


filter_n_write(add_rankperc('infile.tsv'))
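One direction I've been wondering about, though I'm not sure it's sound, is to pre-sort the file by country outside Python (something like: sort -t$'\t' -k2,2 infile.tsv > infile-sorted-by-cc.tsv on a Unix box) and then stream it one country at a time, writing each e-mail straight to its segment file instead of collecting everything in lists first. A rough sketch of what I mean (the pre-sorted input file name is just a placeholder):

import csv
import itertools
import scipy.stats as stats
from operator import itemgetter

HEADERS = ['email', 'cc', 'nr_transactions', 'last_transaction_date',
           'timebucket', 'total_basket']


def stream_rankperc(filename):
    '''
    Yields rows one at a time, assuming the input file has already been
    sorted by the cc column, so groupby only ever holds a single country's
    rows in memory.
    '''
    with open(filename, encoding='utf-8', mode='r') as f_in:
        r = csv.DictReader(f_in, dialect='excel-tab', fieldnames=HEADERS)
        for cc, group in itertools.groupby(r, key=itemgetter('cc')):
            rows = list(group)  # only one country's rows at a time
            baskets = [float(row['total_basket']) for row in rows]
            for row in rows:
                row['rankperc'] = stats.percentileofscore(
                    baskets, float(row['total_basket']))
                if row['rankperc'] >= 75:
                    row['rankbucket'] = 'top25'
                elif row['rankperc'] >= 25:
                    row['rankbucket'] = 'mid50'
                else:
                    row['rankbucket'] = 'bottom25'
                yield row


def write_segments(rows):
    '''
    Writes each email to its segment file as the rows stream past; the
    files are opened once and kept open, so the full data set never has
    to be sorted or held in memory.
    '''
    writers, files = {}, []
    try:
        for row in rows:
            key = (row['timebucket'], row['rankbucket'])
            if key not in writers:
                f = open('output-{}-{}.csv'.format(*key),
                         encoding='utf-8', mode='w')
                files.append(f)
                writers[key] = csv.writer(f, dialect='excel',
                                          lineterminator='\n')
            writers[key].writerow([row['email']])
    finally:
        for f in files:
            f.close()


write_segments(stream_rankperc('infile-sorted-by-cc.tsv'))

If I understand groupby correctly, this should only ever keep one country's rows in memory at once, but I'd appreciate a sanity check on whether that reasoning holds.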
Thanks in advance!