python - Read every 4th line from two files simultaneously

Question

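I am processing large text files (10 MB gzipped). Two files of the same length and structure always belong together, with 4 lines per dataset.

I need to process the data on line 2 of every block of 4 lines, from both files simultaneously.

My question: what is the most time-efficient approach for this?

Right now I am doing this: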

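import gzip
import itertools

def read_groupwise(iterable, n, fillvalue=None):
    # n references to the same iterator, so each tuple takes n consecutive items
    args = [iter(iterable)] * n
    # izip_longest is the Python 2 name; it is itertools.zip_longest in Python 3
    return itertools.izip_longest(fillvalue=fillvalue, *args)

f1 = gzip.open(file1, "r")
f2 = gzip.open(file2, "r")
for (fline1, fline2, fline3, fline4), (rline1, rline2, rline3, rline4) in zip(read_groupwise(f1, 4), read_groupwise(f2, 4)):
    # process fline2, rline2
    ...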

But since I only need line 2 of each block, I suspect there is a much more efficient way to do this?

score 1 · Accepted Answer

You can do this by building your own generator:

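import gzip

def get_nth(iterable, n, after=1):
    # Skip ahead so that the next item read is the first one we want.
    if after > 1:
        consume(iterable, after-1)
    while True:
        try:
            item = next(iterable)
        except StopIteration:
            # Input exhausted: end the generator cleanly. Letting StopIteration
            # escape a generator raises RuntimeError on Python 3.7+.
            return
        yield item
        # Skip the n-1 items between this one and the next one we want.
        consume(iterable, n-1)

with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
    every = (4, 2)  # (block size, wanted line within the block)
    for line_f1, line_f2 in zip(get_nth(f1, *every), get_nth(f2, *every)):
        ...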

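The generator first advances to the first item it should yield (here we want the second item, so it skips one to position the iterator just before the second item), then yields one value and positions itself just before the next item to yield. It is a fairly simple way to accomplish the task at hand.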
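
To make the skipping pattern concrete, here is a minimal, self-contained sketch that runs the same idea over an in-memory list instead of gzip files. The sample data and the name `lines` are made up for illustration, and the skipping is done inline with `islice` so that the snippet runs on its own (Python 3 assumed); it is not the exact code from the answer.

```python
from itertools import islice

def get_nth(iterable, n, after=1):
    """Yield item number `after` (1-based) out of every group of `n` items."""
    it = iter(iterable)
    # Skip the items in front of the first one we want.
    next(islice(it, after - 1, after - 1), None)
    for item in it:
        yield item
        # Skip the n-1 items between this one and the next wanted one.
        next(islice(it, n - 1, n - 1), None)

# Hypothetical sample data: two records of four lines each.
lines = ["id1", "data1", "sep1", "extra1",
         "id2", "data2", "sep2", "extra2"]

second_lines = list(get_nth(lines, 4, 2))
print(second_lines)
```

With this sample input, only the second line of each four-line record is kept.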

Here I am using consume() from the itertools recipes:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
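
As a quick check of what consume() does to an iterator, the following sketch repeats the recipe and applies it to a plain `range` iterator. The variable names `it`, `first_after_skip`, and `remaining` are only for illustration.

```python
import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        collections.deque(iterator, maxlen=0)
    else:
        next(islice(iterator, n, n), None)

it = iter(range(10))

consume(it, 3)                 # discards 0, 1, 2
first_after_skip = next(it)    # the iterator now continues at 3

consume(it, None)              # discards everything that is left
remaining = list(it)           # nothing remains

print(first_after_skip, remaining)
```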

Finally, I am not sure whether gzip.open() provides a context manager. If it does not, you will want to use contextlib.closing().
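
For reference, a small sketch of the contextlib.closing() fallback follows. As far as I know, file objects returned by gzip.open() can be used directly in a `with` statement on Python 2.7 and Python 3, so the wrapper is mainly useful on older versions or for other objects that only offer close(). The temporary file and its contents are invented so that the snippet is runnable on its own (Python 3 assumed).

```python
import contextlib
import gzip
import os
import tempfile

# Write a small throwaway gzip file (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with contextlib.closing(gzip.open(path, "wt")) as out:
    out.write("id1\ndata1\nsep1\nextra1\n")

# closing() calls f.close() when the block exits, whether or not
# the wrapped object supports the `with` statement itself.
with contextlib.closing(gzip.open(path, "rt")) as f:
    lines = f.read().splitlines()

print(lines)
```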

score 0 · Accepted Answer

If you have the memory, try this:

# index 1 is the second line of each 4-line block
ln1 = f1.readlines()[1::4]
ln2 = f2.readlines()[1::4]
for fline, rline in zip(ln1, ln2):
    ...

But only if you have the memory.
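
If memory is a concern, the same slice can be taken lazily. The sketch below swaps the list slice for itertools.islice with a step, which is a different technique from the readlines() call above: it still reads and decompresses every line, but never holds the whole file in memory. The helper `write_sample` and the file names and contents are hypothetical and exist only to make the snippet runnable (Python 3 assumed).

```python
import gzip
import os
import tempfile
from itertools import islice

def write_sample(path, prefix):
    # Hypothetical test data: three records of four lines each.
    with gzip.open(path, "wt") as out:
        for i in range(3):
            out.write(f"{prefix}-id{i}\n{prefix}-data{i}\n+\n{prefix}-extra{i}\n")

tmp = tempfile.mkdtemp()
file1 = os.path.join(tmp, "f.txt.gz")
file2 = os.path.join(tmp, "r.txt.gz")
write_sample(file1, "f")
write_sample(file2, "r")

pairs = []
with gzip.open(file1, "rt") as f1, gzip.open(file2, "rt") as f2:
    # islice(f, 1, None, 4) yields the lines at indices 1, 5, 9, ...
    # i.e. the second line of every 4-line block, one at a time.
    for fline, rline in zip(islice(f1, 1, None, 4), islice(f2, 1, None, 4)):
        pairs.append((fline.rstrip("\n"), rline.rstrip("\n")))

print(pairs)
```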
