python - Python CSV - Need to sum the values of a column grouped by the values of another column

Question

I have data in a CSV file that I need to parse. It looks like this:

Date, Name, Subject, SId, Mark
2/2/2013, Andy Cole, History, 216351, 98
2/2/2013, Andy Cole, Maths, 216351, 87
2/2/2013, Andy Cole, Science, 217387, 21
2/2/2013, Bryan Carr, Maths, 216757, 89
2/2/2013, Carl Jon, Botany, 218382, 78
2/2/2013, Bryan Carr, Biology, 216757, 27

I need to use SId as the key and sum all the values in the Mark column for each key. The output should look like this:

Sid     Mark
216351  185
217387   21
216757  116
218382   78

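I don't need to write the output to a file; I just need it printed when I run the Python file. This is a similar question. How should that solution be changed to skip the columns in between?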

score 2 · Accepted Answer

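This is the histogram concept. Use a defaultdict(int) from collections and iterate over the rows. Use the SId value as the dict key and add the Mark value to that key's running total.

A defaultdict with int as its factory ensures that a key that does not exist yet is initialized to 0.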
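A small illustration of that default behavior, using two of the marks from the question (the variable name `totals` is only for this demo):

```python
from collections import defaultdict

totals = defaultdict(int)

# Reading a missing key creates it with int() == 0, so += works right away.
totals["216351"] += 98
totals["216351"] += 87

print(totals["216351"])   # 185
print(totals["unknown"])  # 0, and "unknown" is now stored as a key
```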

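from collections import defaultdict

# Maps each SId to the running total of its marks; missing keys start at 0.
d = defaultdict(int)

with open("data.txt") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            sid = int(tokens[3])
            mark = int(tokens[4])
        except (ValueError, IndexError):
            # Skip the header row and any blank or malformed lines.
            continue
        d[sid] += mark

print(d)

Output (Python 3.7+):

defaultdict(<class 'int'>, {216351: 185, 217387: 21, 216757: 116, 218382: 78})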

You can swap the parsing part for something else (for example, use csv.reader or add other validation). The key point is to use a defaultdict(int) and update it like this:

d[sid] += mark
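As one possible take on the csv.reader variant mentioned above, here is a minimal sketch. It assumes the column layout from the question (SId in the fourth field, Mark in the fifth); `sum_marks` and `SAMPLE` are hypothetical names, and the data is fed from a string so the example runs without a file.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """Date, Name, Subject, SId, Mark
2/2/2013, Andy Cole, History, 216351, 98
2/2/2013, Andy Cole, Maths, 216351, 87
2/2/2013, Andy Cole, Science, 217387, 21
2/2/2013, Bryan Carr, Maths, 216757, 89
2/2/2013, Carl Jon, Botany, 218382, 78
2/2/2013, Bryan Carr, Biology, 216757, 27
"""

def sum_marks(lines):
    """Return {SId: total Mark} for rows laid out like the question's data."""
    totals = defaultdict(int)
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 5:
            continue  # ignore blank or short rows
        totals[int(row[3])] += int(row[4])
    return totals

for sid, total in sum_marks(io.StringIO(SAMPLE)).items():
    print(sid, total)
```

To read a real file instead, pass an open file object, for example `sum_marks(open("data.csv", newline=""))`.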

score 1 · Accepted Answer

If you want to adapt the solution from the link you provided, you can change the line that unpacks the row.

Here is an idea (adapted from samplebias's solution in the OP's link):

import csv
from collections import defaultdict

# a dictionary whose value defaults to a list.
data = defaultdict(list)
# open the csv file and iterate over its rows. the enumerate()
# function gives us an incrementing row number.
# skipinitialspace drops the blank that follows each comma in the data.
with open('data.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f, skipinitialspace=True)):
        # skip the header line and any empty rows
        # we take advantage of the first row being indexed at 0
        # i=0 which evaluates as false, as does an empty row
        if not i or not row:
            continue
        # unpack the columns into local variables
        _, _, _, SID, mark = row  # <--- HERE, change what you unpack
        # for each SID, add the mark to the list
        data[SID].append(float(mark))

# loop over each SID and its list of marks and calculate the sum
for SID, marks in data.items():
    print(SID, sum(marks))
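If the number of leading columns might change, Python 3's starred unpacking can skip them without writing one underscore per column. A minimal sketch, assuming SId and Mark are always the last two fields; `last_two` is a hypothetical helper, not part of the linked solution:

```python
def last_two(row):
    """Return the final two fields of a row, ignoring everything before them."""
    *_, sid, mark = row
    return sid, mark

row = ["2/2/2013", "Andy Cole", "History", "216351", "98"]
sid, mark = last_two(row)
print(sid, float(mark))  # 216351 98.0
```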

score -1 · Accepted Answer

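First, to parse the CSV, use the csv module:

import csv
import itertools
import operator

with open('data.csv', newline='') as f:
    # skipinitialspace keeps the header names free of leading blanks
    data = csv.DictReader(f, skipinitialspace=True)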

Next, group the rows by SId. To do this, sort them and then use groupby. (If equal values are always consecutive, the sort is unnecessary.)

    siddata = sorted(data, key=operator.itemgetter('SId'))
    sidgroups = itertools.groupby(siddata, operator.itemgetter('SId'))

Now sum the values in each group:

    for key, group in sidgroups:
        print('{}\t{}'.format(key, sum(int(value['Mark']) for value in group)))
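Why the sort matters: groupby only merges adjacent items with equal keys. A small sketch using the SId/Mark pairs from the question in their original order, where 216757 appears in two separate places (`pairs` and `group_sums` are names made up for this demo):

```python
import itertools
import operator

pairs = [("216351", 98), ("216351", 87), ("217387", 21),
         ("216757", 89), ("218382", 78), ("216757", 27)]
key = operator.itemgetter(0)

def group_sums(items):
    """Sum the second field of each run of consecutive items sharing a key."""
    return [(k, sum(mark for _, mark in grp))
            for k, grp in itertools.groupby(items, key)]

# Unsorted input: 216757 shows up twice, once with 89 and once with 27.
print(group_sums(pairs))

# Sorted input: one entry per SId, with 216757 totalling 116.
print(group_sums(sorted(pairs, key=key)))
```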

Alternatively, you can write everything into a database and let SQLite figure out how to do it:

import csv
import sqlite3

with open('data.csv', newline='') as f, sqlite3.connect(':memory:') as db:
    # INTEGER affinity makes SUM() return integers instead of floats
    db.execute('CREATE TABLE data (SId INTEGER, Mark INTEGER)')
    rows = csv.DictReader(f, skipinitialspace=True)
    db.executemany('INSERT INTO data VALUES (:SId, :Mark)', rows)
    cursor = db.execute('SELECT SId, SUM(Mark) AS Mark FROM data GROUP BY SId')
    for row in cursor:
        print('{}\t{}'.format(*row))
