python - Grouping lines in a text file and then processing them

Question

I have an input file in the following format. This is just a sample; the actual file has many more entries of the same kind.

0.0 aa:bb:cc dd:ee:ff 100  000 ---------->line1
0.2 aa:bb:cc dd:ee:ff 101  011 ---------->line2
0.5 dd:ee:ff aa:bb:cc 230  001 ---------->line3
0.9 dd:ee:ff aa:bb:cc 231  110 ---------->line4
1.2 dd:ee:ff aa:bb:cc 232  101 ---------->line5
1.4 aa:bb:cc dd:ee:ff 102  1111 ---------->line6
1.6 aa:bb:cc dd:ee:ff 103  1101 ---------->line7
1.7 aa:bb:cc dd:ee:ff 108  1001 ---------->line8
2.4 dd:ee:ff aa:bb:cc 233  1000 ---------->line9  
2.8 gg:hh:ii jj:kk:ll 450  1110 ---------->line10
3.2 jj:kk:ll gg:hh:ii 600  010 ---------->line11

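The first column is the timestamp, the second the source address, the third the destination address, and the fourth the sequence number; the fifth column is not needed.

For this problem, a group is defined as follows:

i. The lines should be consecutive (e.g., lines 1 and 2).  
ii. They should have the same second and third columns, and the fourth column should differ by 1 from one line to the next.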
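To make the two rules concrete, here is a minimal sketch that splits lines into groups. The helper name `split_into_groups` and the shortened sample are assumptions made for illustration, not part of the original program; blank lines are simply skipped.

```python
def split_into_groups(lines):
    """Split lines into runs that share (source, destination) and whose
    sequence numbers (4th column) rise by exactly 1 from line to line."""
    groups = []
    prev = None  # (source, destination, sequence) of the previous line
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines do not break a group
        cur = (fields[1], fields[2], int(fields[3]))
        if prev is not None and cur[:2] == prev[:2] and cur[2] == prev[2] + 1:
            groups[-1].append(fields)   # rule ii holds for adjacent lines
        else:
            groups.append([fields])     # otherwise a new group starts here
        prev = cur
    return groups

# Hypothetical shortened sample in the same column layout as the question.
sample = """\
0.0 aa:bb:cc dd:ee:ff 100 000
0.2 aa:bb:cc dd:ee:ff 101 011
0.5 dd:ee:ff aa:bb:cc 230 001
1.4 aa:bb:cc dd:ee:ff 102 1111
""".splitlines()

for group in split_into_groups(sample):
    print([row[0] for row in group])
```

Note that the last line forms its own group even though its sequence number follows 101, because it is not adjacent to the line carrying 101.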

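For all groups that correspond to the same (column 2, column 3) pair, I need to compute the difference between the timestamp of the first line of one group and that of the first line of the next group.
For example, the groups corresponding to (aa:bb:cc dd:ee:ff) are (line1, line2), (line6, line7), and (line8). The final output would look like (aa:bb:cc dd:ee:ff) = [1.4 0.3].
Here 1.4 is the timestamp difference between line6 and line1, and 0.3 is the time difference between line 8 and line 6 for the (aa:bb:cc dd:ee:ff) entry.
This has to be computed for every (column2 column3) pair.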
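The arithmetic in this example can be sketched as follows. The function name `first_line_gaps` is hypothetical, and the rounding is an assumption added only to hide floating-point noise (1.7 - 1.4 is not exactly 0.3 in binary floating point).

```python
def first_line_gaps(group_starts):
    """Differences between the first-line timestamps of successive groups."""
    return [round(b - a, 6) for a, b in zip(group_starts, group_starts[1:])]

# First-line timestamps of (line1, line2), (line6, line7) and (line8).
starts = [0.0, 1.4, 1.7]
print(first_line_gaps(starts))  # [1.4, 0.3]
```

A pair with only one group yields an empty list, since there is no following group to compare against.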

I wrote the program below to count the number of members in each group.

#!/usr/bin/python

with open("luawrite") as f:
    # read the first line and use its sequence number (4th column) as `prev`
    prev = int(next(f).rsplit(None, 2)[-2])  # use `str.rsplit` for minimum splits
    count = 1                                # initialize `count` to 1
    for lin in f:
        num = int(lin.rsplit(None, 2)[-2])
        if num - prev == 1:                  # if current `num` - `prev` == 1
            count += 1                       # increment `count`
        else:
            print(count)                     # else print `count` or write it to a file
            count = 1                        # reset `count` to 1
        prev = num                           # set `prev` = `num`
    print(count)                             # print the count of the last group
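I tried various things using the 2nd and 3rd columns as a dictionary key, but there are multiple groups corresponding to the same key. This seems like a daunting task to me. Please help me solve this tricky problem.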

score 2 · Accepted Answer

from collections import defaultdict

# Bucket rows by "source destination", remembering each row's 1-based line number.
groups = defaultdict(list)
with open('input') as f:
    for i, line in enumerate(f, 1):
        row = line.split() + [i]
        groups[" ".join(row[1:3])].append(row)

output = defaultdict(list)
for gname, group in groups.items():
    gr = []                                   # rows of the group being built
    last_col4, last_idx = -1, -1
    for row in group:
        col4, idx = int(row[3]), row[-1]
        seq_follows = last_col4 + 1 == col4   # sequence number increased by 1
        consecutive = last_idx + 1 == idx     # adjacent lines in the file
        if gr and seq_follows and consecutive:
            gr.append(row)
        else:
            # A new group starts: record the time since the previous group's first line.
            if gr:
                output[gname].append(float(row[0]) - float(gr[0][0]))
            gr = [row]
        last_col4, last_idx = col4, idx

for k, v in output.items():
    print(k, ' --> ', v)

score 1 · Accepted Answer

You could use itertools.groupby() to extract the groups defined by:

i. The lines should be consecutive (lines 1 and 2)

ii. They should have the same second and third columns, but the fourth column should differ by 1
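One way to express both rules as a single groupby() key is to note that, within a group, the line index and the sequence number grow together, so their difference stays constant. The sketch below illustrates that idea on hand-written tuples; the `rows` data and the `run_key` name are assumptions for illustration.

```python
from itertools import groupby

# (source, destination, sequence number) for a few consecutive lines
rows = [
    ("aa:bb:cc", "dd:ee:ff", 100),
    ("aa:bb:cc", "dd:ee:ff", 101),
    ("dd:ee:ff", "aa:bb:cc", 230),
    ("aa:bb:cc", "dd:ee:ff", 102),
    ("aa:bb:cc", "dd:ee:ff", 103),
    ("aa:bb:cc", "dd:ee:ff", 108),
]

def run_key(indexed_row):
    """Same addresses plus a constant (index - sequence) identify one run."""
    i, (source, dest, seq) = indexed_row
    return (source, dest, i - seq)

runs = [[seq for _, (_, _, seq) in group]
        for _, group in groupby(enumerate(rows), key=run_key)]
print(runs)  # [[100, 101], [230], [102, 103], [108]]
```

Because groupby() only merges adjacent items with equal keys, the "consecutive lines" rule is enforced without any extra bookkeeping.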

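Then you could use collections.defaultdict() to collect the timestamps and find the differences:

For all groups that correspond to the same (column 2, column 3) pair, the difference between the timestamp of the first line of one group and that of the first line of the next group has to be computed.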

from collections import defaultdict
from itertools import groupby

import sys
file = sys.stdin # could be anything that yields lines e.g., a regular file

rows = (line.split() for line in file if line.strip())

# get timestamps map: (source, destination) -> timestamps of 1st lines
timestamps = defaultdict(list)
# within a group, line index minus sequence number stays constant
for ((source, dest), _), group in groupby(
        enumerate(rows),
        key=lambda pair: (pair[1][1:3], pair[0] - int(pair[1][3]))):
    ts = float(next(group)[1][0]) # a timestamp from the 1st line in a group
    timestamps[source, dest].append(ts)

# find differences
for (source, dest), t in sorted(timestamps.items(), key=lambda item: item[0]):
    diffs = [b - a for a, b in zip(t, t[1:])] # pairwise differences
    # "{:g}" hides float noise such as 0.30000000000000004
    info = ", ".join("{:g}".format(d) for d in diffs) if diffs else t # support unique
    print("{source} {dest}: {info}".format(**vars()))

Output

aa:bb:cc dd:ee:ff: 1.4, 0.3
dd:ee:ff aa:bb:cc: 1.9
gg:hh:ii jj:kk:ll: [2.8]
jj:kk:ll gg:hh:ii: [3.2]

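[] means that there is a single group for the corresponding (source address, destination address) pair in the input, i.e., there is nothing to build a difference with. To handle all cases uniformly, you could prepend a dummy 0.0 timestamp to the timestamp list.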
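A minimal sketch of that dummy-timestamp idea, assuming the per-pair timestamps are already collected in a list; the function name `gaps_with_dummy_start` is hypothetical, and the rounding is only there to hide floating-point noise.

```python
def gaps_with_dummy_start(timestamps):
    """Pairwise differences after prepending a dummy 0.0 timestamp."""
    padded = [0.0] + list(timestamps)
    return [round(b - a, 6) for a, b in zip(padded, padded[1:])]

print(gaps_with_dummy_start([2.8]))            # single group: [2.8]
print(gaps_with_dummy_start([0.0, 1.4, 1.7]))  # [0.0, 1.4, 0.3]
```

With this approach every pair produces a list of numbers, but the first entry is the first group's own start time rather than a gap between two groups, so it may need to be interpreted or dropped accordingly.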
