python - How to pull X amount of previous data into a row in a CSV

Question

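I have a very large CSV of data, and to each row I need to append previous data for that row's name (column 2, counting from zero), taken from dates before the row's own date (column 1). I think the easiest way to describe the problem is a detailed example that resembles my real data but is scaled down significantly: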

Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,01/09/13,fred,219,datab,text
data,01/09/13,tom,219,datac,text
data,02/09/13,george,229,datad,text
data,02/09/13,fred,239,datae,text
data,02/09/13,tom,219,dataf,text
data,03/09/13,george,209,datag,text
data,03/09/13,fred,217,datah,text
data,03/09/13,tom,213,datai,text
data,04/09/13,george,219,dataj,text
data,04/09/13,fred,212,datak,text
data,04/09/13,tom,222,datal,text
data,05/09/13,george,319,datam,text
data,05/09/13,fred,225,datan,text
data,05/09/13,tom,220,datao,text
data,06/09/13,george,202,datap,text
data,06/09/13,fred,226,dataq,text
data,06/09/13,tom,223,datar,text
data,06/09/13,george,219,dataae,text

The first 3 rows of this csv therefore have no previous data. So if we wanted to pull columns 3 and 4 for the last 3 occurrences of george (row 1) on dates before the current date, it would yield:

data,01/09/13,george,219,dataa,text,x,y,x,y,x,y

However, once previous data starts to become available, we would hope to produce a csv such as this:

Datatitle,Date,Name,Score,Parameter,LTscore,LTParameter,LTscore+1,LTParameter+1,LTscore+2,LTParameter+3,
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
data,01/09/13,fred,219,datab,text,x,y,x,y,x,y
data,01/09/13,tom,219,datac,text,x,y,x,y,x,y
data,02/09/13,george,229,datad,text,219,dataa,x,y,x,y
data,02/09/13,fred,239,datae,text,219,datab,x,y,x,y
data,02/09/13,tom,219,dataf,text,219,datac,x,y,x,y
data,03/09/13,george,209,datag,text,229,datad,219,dataa,x,y
data,03/09/13,fred,217,datah,text,239,datae,219,datab,x,y
data,03/09/13,tom,213,datai,text,219,dataf,219,datac,x,y
data,04/09/13,george,219,dataj,text,209,datag,229,datad,219,dataa
data,04/09/13,fred,212,datak,text,217,datah,239,datae,219,datab
data,04/09/13,tom,222,datal,text,213,datai,219,dataf,219,datac
data,05/09/13,george,319,datam,text,219,dataj,209,datag,229,datad
data,05/09/13,fred,225,datan,text,212,datak,217,datah,239,datae
data,05/09/13,tom,220,datao,text,222,datal,213,datai,219,dataf
data,06/09/13,george,202,datap,text,319,datam,219,dataj,209,datag
data,06/09/13,fred,226,dataq,text,225,datan,212,datak,217,datah
data,06/09/13,tom,223,datar,text,220,datao,222,datal,213,datai
data,06/09/13,george,219,datas,text,319,datam,219,dataj,209,datag

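You will notice that george occurs twice on 06/09/13, and both times the same string 319,datam,219,dataj,209,datag is appended to his row. The second time george appears he gets the same string because the george 3 lines above is on the same date. (This is just to emphasise "on a date before the current date".)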

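As you can see from the column titles, we are collecting the last 3 scores and the associated 3 parameters and appending them to each row. Please note that this is a very simplified example. In reality each date will contain a couple of thousand rows, and the real data has no pattern to the names, so we cannot expect fred, tom and george to appear next to each other in a repeating pattern. If anyone can help me work out the best (most efficient) way to achieve this, I would be very grateful. If anything is unclear please let me know and I will add more detail. Any constructive comments are appreciated. Thanks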

score 11 · Accepted Answer

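It appears your file is in date order. If we take the last entry per name per date, and add it to a fixed-size deque for each name while writing out each row, that should do the trick.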

import csv
from collections import deque, defaultdict
from itertools import chain, islice, groupby
from operator import itemgetter

# defaultdict whose first access of a key will create a deque of size 3
# defaulting to [['x', 'y'], ['x', 'y'], ['x' ,'y']]
# Since deques are efficient at head/tail manipulation, then an insert to
# the start is efficient, and when the size is fixed it will cause extra
# elements to "fall off" the end... 
names_previous = defaultdict(lambda: deque([['x', 'y']] * 3, 3))
with open('sample.csv', 'rb') as fin, open('sample_new.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    # Use groupby to detect changes in the date column. Since the data is always
    # ascending, the items within the same date are contiguous in the data. We use
    # this to identify the rows within the *same* date.
    # date=date we're looking at, rows=an iterable of rows that are in that date...
    for date, rows in groupby(islice(csvin, 1, None), itemgetter(1)):
        # After we've processed entries in this date, we need to know what items of data should
        # be considered for the names we've seen inside this date. Currently the data
        # is taken from the last occurring row for the name.
        to_add = {}
        for row in rows:
            # Output the row present in the file with a *flattened* version of the extra data
            # (previous items) that we wish to apply. eg:
            # [['x', 'y'], ['x', 'y'], ['x', 'y']] becomes ['x', 'y', 'x', 'y', 'x', 'y']
            # So we're easily able to store 3 pairs of data, but flatten it into one long
            # list of 6 items...
            # If the name (row[2]) doesn't exist yet, then by trying to do this, defaultdict
            # will automatically create the default key as above.
            csvout.writerow(row + list(chain.from_iterable(names_previous[row[2]])))
            # Here, we store for the name any additional data that should be included for the name
            # on the next date group. In this instance we store the information seen for the last
            # occurrence of that name in this date. eg: If we've seen it more than once, then
            # we only include data from the last occurrence. 
            # NB: If you wanted to include more than one item of data for the name, then you could
            # utilise a deque again by building it within this date group
            to_add[row[2]] = row[3:5]            
        for key, val in to_add.iteritems():
            # We've finished the date, so before processing the next one, update the previous data
            # for the names. In this case, we push a single item of data to the front of the deque.
            # If we were storing multiple items in the data loop, then we could .extendleft() instead
            # to insert > 1 set of data from above.
            names_previous[key].appendleft(val)

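This keeps only the names and their last 3 values in memory at any time during the run.

You may wish to adjust it to include the correct headers or write new headers, rather than just skipping them on input.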
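A minimal sketch of that header adjustment, assuming Python 3 (the code above is Python 2) and an input header that names every column. The function name `add_history`, the `depth` parameter and the generated column names are illustrative assumptions, not part of the original answer:

```python
import csv
import io
from collections import defaultdict, deque
from itertools import chain, groupby
from operator import itemgetter


def add_history(reader, depth=3):
    """Yield an extended header, then each row plus `depth` previous pairs."""
    header = next(reader)
    extra = []
    for i in range(depth):
        suffix = "" if i == 0 else "+%d" % i
        extra += ["LTscore" + suffix, "LTParameter" + suffix]
    yield header + extra

    # One fixed-size deque per name, pre-filled with placeholder pairs.
    previous = defaultdict(lambda: deque([["x", "y"]] * depth, depth))
    # Assumption: rows are sorted by date, so equal dates are contiguous.
    for _date, rows in groupby(reader, key=itemgetter(1)):
        latest = {}
        for row in rows:
            yield row + list(chain.from_iterable(previous[row[2]]))
            latest[row[2]] = row[3:5]  # last occurrence within this date wins
        for name, pair in latest.items():
            previous[name].appendleft(pair)


sample = "\n".join([
    "Datatitle,Date,Name,Score,Parameter,Text",
    "data,01/09/13,george,219,dataa,text",
    "data,02/09/13,george,229,datad,text",
    "data,02/09/13,george,230,datax,text",
    "data,03/09/13,george,209,datag,text",
])
result = list(add_history(csv.reader(io.StringIO(sample))))
```

With real files, open both the input and the output with `newline=''`, as the csv module documentation recommends for Python 3, and pass `csv.reader(fin)` to the function.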

score 3 · Accepted Answer

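My 2 cents:
- Python 2.7.5
- I used a defaultdict to hold the previous rows for each Name.
- I used bounded-length deques to hold the previous rows because I like the fifo behaviour of a full deque. It made it easier for me to think about: just keep stuffing things into it.
- I used operator.itemgetter() for the indexing and slicing.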

from collections import deque, defaultdict
import csv
from functools import partial
from operator import itemgetter

# use a 3 item deque to hold the 
# previous three rows for each name
deck3 = partial(deque, maxlen = 3)
data = defaultdict(deck3)


name = itemgetter(2)
date = itemgetter(1)
sixplus = itemgetter(slice(6,None))

fields = ['Datatitle', 'Date', 'Name', 'Score', 'Parameter',
          'LTscore', 'LTParameter', 'LTscore+1', 'LTParameter+1',
          'LTscore+2', 'LTParameter+3']
with open('data.txt') as infile, open('processed.txt', 'wb') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(fields)
    # comment out the next line if the data file does not have a header row
    reader.next()
    for row in reader:
        default = deque(['x', 'y', 'x', 'y', 'x', 'y'], maxlen = 6)
        try:
            previous_row = data[name(row)][-1]
            previous_date = date(previous_row)
        except IndexError:
            previous_date = None
        if  previous_date == date(row):
            # use the xtra stuff from last time
            row.extend(sixplus(previous_row))
            # discard the previous row because
            # there is a new row with the same date
            data[name(row)].pop()
        else:
            # add columns 3 and 4 from each previous row
            for deck in data[name(row)]:
                # adding new items to a full deque causes
                # items to drop off the other end
                default.appendleft(deck[4])
                default.appendleft(deck[3])
            row.extend(default)
        writer.writerow(row)
        data[name(row)].append(row)

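After thinking about that solution for a while over a glass of port, I realized it was far too complicated. I'm not sure about the protocol, so I'll leave it up. It does have the advantage of maintaining the previous 3 rows for each name.

Here is a solution using slices and a regular dictionary. It only retains the previously processed row for each name. Much simpler. I kept the itemgetters for readability.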

import csv
from operator import itemgetter

fields = ['Datatitle', 'Date', 'Name', 'Score', 'Parameter',
          'LTscore', 'LTParameter', 'LTscore+1', 'LTParameter+1',
          'LTscore+2', 'LTParameter+3']

name = itemgetter(2)
date = itemgetter(1)
cols_sixplus = itemgetter(slice(6,None))
cols34 = itemgetter(slice(3, 5))
cols6_9 = itemgetter(slice(6, 10))
data_alt = {}

with open('data.txt') as infile, open('processed_alt.txt', 'wb') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(fields)
    # comment out the next line if the data file does not have a header row
    reader.next()
    for row in reader:
        try:
            previous_row = data_alt[name(row)]
        except KeyError:
            # first time this name encountered
            row.extend(['x', 'y', 'x', 'y', 'x', 'y'])
            data_alt[name(row)] = row
            writer.writerow(row)
            continue
        if  date(previous_row) == date(row):
            # use the xtra stuff from last time
            row.extend(cols_sixplus(previous_row))
        else:
            row.extend(cols34(previous_row))
            row.extend(cols6_9(previous_row))
        data_alt[name(row)] = row
        writer.writerow(row)

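For similar types of processing, I have found that accumulating rows and writing them in chunks, rather than one at a time, can improve performance considerably. Reading the entire data file at once, if possible, also helps.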
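The chunked-writing idea can be sketched as follows. This is only an illustration in Python 3; the helper name `write_in_chunks` and the `chunk_size` parameter are assumptions, and because file objects are already buffered, any gain should be measured on the real data rather than taken for granted:

```python
import csv
import io


def write_in_chunks(writer, rows, chunk_size=1000):
    """Buffer rows and pass them to writer.writerows() chunk_size at a time."""
    buffer = []
    written = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= chunk_size:
            writer.writerows(buffer)
            written += len(buffer)
            buffer = []
    if buffer:  # flush whatever is left after the last full chunk
        writer.writerows(buffer)
        written += len(buffer)
    return written


# Small in-memory demonstration; a real run would wrap an open file instead.
out = io.StringIO()
rows = ([str(i), "x"] for i in range(5))
count = write_in_chunks(csv.writer(out), rows, chunk_size=2)
lines = out.getvalue().splitlines()
```

Either of the loops above could feed its finished rows to such a helper instead of calling `writer.writerow(row)` directly.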

score 3 · Accepted Answer

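Here is a code sample that should demonstrate what you are looking for, using the sample data provided in the question. I named the input file "input.csv" and read and write from the working directory, so "output.csv" goes to the same folder. I used comments in the code to explain it. Previous records are stored in a dictionary keyed by name, with a list of scores for each name; records for the current date are stored in a new buffer dictionary, which is merged into the main dictionary each time the date changes in the input. Please let me know if you have any questions. The code is a bit rough; it's a quick example. The [:6] slice gives the most recent six list items (the previous three score/parameter pairs) for the current name.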

import csv

myInput = open('input.csv','rb')
myOutput = open('output.csv','wb')
myFields = ['Datatitle','Date','Name','Score','Parameter','Text',
            'LTscore','LTParameter','LTscore+1','LTParameter+1',
            'LTscore+2','LTParameter+2']
inCsv = csv.DictReader(myInput,myFields)
outCsv = csv.writer(myOutput)
outCsv.writerow(myFields) # Write header row

previous_dict = dict() # store scores from previous dates
new_dict = dict() # buffer for records on current-date only

def add_new():
    # merge new_dict into previous_dict
    global new_dict, previous_dict
    for k in new_dict:
        if not previous_dict.has_key(k):
            previous_dict[k] = list()
        # put new items first
        previous_dict[k] = new_dict[k] + previous_dict[k]
    new_dict = dict() # reset buffer

old_date = '00/00/00' # start with bogus *oldest* date string
inCsv.next() # skip header row
for row in inCsv:
    myTitle = row['Datatitle']
    myDate = row['Date']
    myName = row['Name']
    myScore = row['Score']
    myParameter = row['Parameter']
    myText = row['Text']
    if old_date != myDate:
        add_new() # store new_dict buffer with previous data
        old_date = myDate
    if not new_dict.has_key(myName):
        new_dict[myName] = []
    # put new scores first
    new_dict[myName] = [myScore,myParameter] + new_dict[myName]
    if not previous_dict.has_key(myName):
        previous_dict[myName] = []
    outCsv.writerow([myTitle,myDate,myName,myScore,myParameter,myText] \
                     + previous_dict[myName][:6])
# end loop for each row

myInput.close()
myOutput.close()

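My solution should work fine for large data sets. If memory consumption is a concern, the length of each per-name list can be limited to 3 scores; currently I keep all previous scores and only show three, in case you need more in the future. If the data becomes unwieldy, you can always use a sqlite file database instead of the dict, so that the lookup data lives temporarily on disk rather than all in memory. With 8G of RAM and 2G of data, the in-memory Python dictionaries used here should be fine. Make sure you are using a 64-bit release of Python on a 64-bit OS. My example does not print anything to the screen, but for a large file you may want to add a print statement that shows progress every N rows (every 100, every 1000, and so on, chosen according to the speed of your system). Note that screen output will slow down the processing of the file data.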
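The cap on per-name history mentioned above could look something like this. It is a minimal Python 3 sketch; the names `DEPTH`, `history`, `remember` and `recall` are hypothetical, and it deliberately leaves out the per-date buffering that the code above does with new_dict:

```python
from collections import defaultdict, deque

DEPTH = 3  # number of previous (score, parameter) pairs kept per name

# deque(maxlen=DEPTH) silently discards the oldest pair once it is full.
history = defaultdict(lambda: deque(maxlen=DEPTH))


def remember(name, score, parameter):
    """Record the newest pair for a name at the front of its deque."""
    history[name].appendleft((score, parameter))


def recall(name):
    """Return the stored pairs flattened, newest first, padded with 'x', 'y'."""
    pairs = history[name]
    flat = [value for pair in pairs for value in pair]
    return flat + ["x", "y"] * (DEPTH - len(pairs))


for score, parameter in [("219", "dataa"), ("229", "datad"),
                         ("209", "datag"), ("219", "dataj")]:
    remember("george", score, parameter)

george = recall("george")  # only the three newest pairs survive
tom = recall("tom")        # a name with no history yet
```

Memory then grows with the number of distinct names rather than with the number of rows.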

score 0 · Accepted Answer

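Here is one approach. The exact implementation will depend on your data, but this should give you a good starting point.

Make two passes over the input CSV data.

- In the first pass over the input, scan the rows and build a dictionary. The name can be used as the key, e.g. {'Tom' : [(date1, values),(date2, values)], 'George' : [(date1, values), (date2,values)]}. It may turn out to be easier to use nested dictionaries: {'Tom' : {date1: values, date2: values}, 'George' : {date1: values, date2: values}}. More on the data structure below.
- In the second pass over the input, join the original input data with the historical data from the dictionary to create the output data.

How you select the historical data depends on how regular your input data is. For example, if the dates are sorted in ascending order and you have implemented a dictionary of lists, it may be as simple as taking a slice from the relevant list: dataDict['Tom'][i-3:i]. However, since you mention that there can be multiple records on the same date, some additional work is probably needed. Some possibilities are:

- With the dictionary-of-lists approach, maintain the values as a list so that there are no duplicate date entries: {'Tom' : [(date1, [val1, val2, val3]),(date2, values)], 'George' : [(date1, values),(date2,values)]}.
- With the dictionary-of-dictionaries approach, look up the specific range of dates you need. In this case you will probably need to check for KeyError exceptions unless every date is consecutively available. You could also maintain an additional sorted index of the available dates.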
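The two passes can be sketched roughly as below. This is one possible reading in Python 3, not a definitive implementation: it assumes day/month/two-digit-year dates as in the question's sample, keeps only the last record per name and date, and uses a sorted date list with `bisect` in place of KeyError checks. The names `parse_date`, `build_index` and `previous_values` are made up for illustration:

```python
import csv
import io
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime


def parse_date(text):
    # Assumption: dates look like 01/09/13 (day/month/year).
    return datetime.strptime(text, "%d/%m/%y").date()


def build_index(rows):
    """First pass: name -> (sorted dates, matching [score, parameter] pairs)."""
    by_name = defaultdict(dict)
    for row in rows:
        # A later row on the same date overwrites the earlier one.
        by_name[row[2]][parse_date(row[1])] = row[3:5]
    index = {}
    for name, per_date in by_name.items():
        dates = sorted(per_date)
        index[name] = (dates, [per_date[d] for d in dates])
    return index


def previous_values(index, name, date_text, depth=3):
    """Second pass helper: newest-first pairs strictly before the given date."""
    dates, pairs = index.get(name, ([], []))
    i = bisect_left(dates, parse_date(date_text))  # first date >= current
    recent = pairs[max(0, i - depth):i][::-1]
    flat = [value for pair in recent for value in pair]
    return flat + ["x", "y"] * (depth - len(recent))


sample = "\n".join([
    "data,01/09/13,george,219,dataa,text",
    "data,02/09/13,george,229,datad,text",
    "data,03/09/13,george,209,datag,text",
    "data,03/09/13,george,210,datah,text",
    "data,04/09/13,george,219,dataj,text",
])
rows = list(csv.reader(io.StringIO(sample)))
index = build_index(rows)
output = [row + previous_values(index, row[2], row[1]) for row in rows]
```

Unlike the single-pass answers, this holds one entry per name and date in memory for the whole file, so it trades memory for not depending on the input being sorted.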
