python - A fast method in Python to split a large text file using the number of lines as an input variable

Question

I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines, except the last file.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

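The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows machine with Python 2.7.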

score 6 · Accepted Answer

            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

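This reopens the output file for every single line in the group and writes just one line each time. That is slow.

Instead, open the file and write once per group.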

            with open(output_name, 'a') as outfile:
                outfile.write(''.join(group))
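
For context, here is a minimal sketch of how the whole function might look with that change applied. It is not the asker's or answerer's exact code: it assumes `writelines` is an acceptable substitute for `''.join(group)` (it avoids building one large string per chunk), it opens each output with `'w'` rather than `'a'`, and the returned list of paths is an addition for illustration.

```python
import os
from itertools import count, groupby


def tempfile_split(filename, temp_dir, chunk=4000000):
    """Split filename into files of at most `chunk` lines inside temp_dir."""
    output_names = []  # illustrative addition: paths of the files written
    counter = count()
    with open(filename, 'r') as datafile:
        # Lines 0..chunk-1 get key 0, the next `chunk` lines get key 1, etc.
        groups = groupby(datafile, key=lambda _line: next(counter) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # Open each output file once and stream the whole group into it.
            with open(output_name, 'w') as outfile:
                outfile.writelines(group)
            output_names.append(output_name)
    return output_names
```

Because `groupby` consumes the input lazily, only the current line is held in memory at any time, while each output file is still opened just once.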

score 1 · Accepted Answer

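I did a quick test using an 8 million line file (uptime lines): I ran through the length of the file and split it in half. Basically, one pass gets the line count, then a second pass does the split writing.

On my system, the first pass took about 2-3 seconds. The total time to complete the run and write the split files was under 21 seconds.

I did not implement the lambda function from the OP's post. The code I used is below: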

#!/usr/bin/env python

import sys
import math

infile = open("input","r")

linecount=0

for line in infile:
    linecount=linecount+1

splitpoint=linecount/2

infile.close()

infile = open("input","r")
outfile1 = open("output1","w")
outfile2 = open("output2","w")

print linecount , splitpoint

linecount=0

for line in infile:
    linecount=linecount+1
    if ( linecount <= splitpoint ):
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()

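No, it's not going to win any performance or code elegance tests. :) But unless something else is the performance bottleneck, the lambda function is causing the file to be cached in memory and forcing a swap issue, or the lines in the file are extremely long, I don't see why it would take 30 minutes to read and split an 8 million line file.

EDIT:

My environment: Mac OS X, with storage on a single hard drive connected via FW800. The file was freshly created to avoid any benefit from filesystem caching.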
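
The same two-pass idea can be sketched more compactly with context managers, so the files are closed even if an error occurs. This is only an illustrative rewrite: the function name `split_in_half` and its parameters are hypothetical and do not appear in the answer above.

```python
def split_in_half(in_path, out_path1, out_path2):
    """Write the first half of the lines to out_path1, the rest to out_path2."""
    # First pass: count the lines.
    with open(in_path, "r") as infile:
        linecount = sum(1 for _ in infile)
    splitpoint = linecount // 2

    # Second pass: route each line to one of the two output files.
    with open(in_path, "r") as infile, \
            open(out_path1, "w") as out1, \
            open(out_path2, "w") as out2:
        for lineno, line in enumerate(infile, start=1):
            if lineno <= splitpoint:
                out1.write(line)
            else:
                out2.write(line)
    return linecount, splitpoint
```

With an odd number of lines, the second file receives the extra line, which matches the integer division used in the original script.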

score 1 · Accepted Answer

You can use tempfile.NamedTemporaryFile directly as a context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns={}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False,
                           dir=temp_dir,prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k]=outfile.name   
    return fns                     

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))

    return fn.name        

fn=make_test()
t0=time.time()
print tempfile_split(fn,tempfile.mkdtemp()),time.time()-t0

On my computer, the tempfile_split part runs in 3.6 seconds. It is OS X.
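
The code above targets Python 2 (`xrange`, the `print` statement). As a hedged sketch, a Python 3 variant might look like the following. The main difference is that `NamedTemporaryFile` opens in binary mode (`'w+b'`) by default, so a text mode has to be passed explicitly. Using `writelines` instead of `''.join(group)` is my substitution, not part of the answer.

```python
import tempfile
from itertools import count, groupby


def tempfile_split(filename, temp_dir, chunk=4 * 10**6):
    """Return {chunk_index: path} for the temporary files written."""
    fns = {}
    counter = count()
    with open(filename, "r") as datafile:
        groups = groupby(datafile, key=lambda _line: next(counter) // chunk)
        for k, group in groups:
            # mode="w" is needed on Python 3, where the default is "w+b".
            with tempfile.NamedTemporaryFile(mode="w", delete=False,
                                             dir=temp_dir,
                                             prefix="{}_".format(k)) as outfile:
                outfile.writelines(group)
                fns[k] = outfile.name
    return fns
```

Since `delete=False` is set, the caller is responsible for removing the files (or the whole temporary directory) afterwards.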

score 0 · Accepted Answer

If you are in a Linux or Unix environment, you can cheat a little and call the split command from inside Python. It does the trick for me, and it is very fast too:

import os
import subprocess

def split_file(file_path, chunk=4000):

    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
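
A variation on the same idea, sketched under a few assumptions: Python 3.5+ for `subprocess.run`, a `split` binary on the `PATH`, and a hypothetical `prefix` parameter that is not in the answer above. Unlike the original, this version raises if `split` fails, leaves the source file in place, and returns the paths of the pieces it finds.

```python
import os
import subprocess


def split_file(file_path, chunk=4000, prefix=None):
    """Split file_path into pieces of `chunk` lines using the split command."""
    if prefix is None:
        # Hypothetical default: pieces are named <file_path>.aa, .ab, ...
        prefix = file_path + "."
    # check=True raises CalledProcessError if split exits non-zero.
    subprocess.run(["split", "-a", "2", "-l", str(chunk), file_path, prefix],
                   check=True)
    directory, base = os.path.split(prefix)
    # Collect files named <base> plus the two-character suffix from -a 2.
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory or ".")
        if name.startswith(base) and len(name) == len(base) + 2
    )
```

Note that the directory listing would also pick up any pre-existing files that happen to match the prefix, so a dedicated output directory is the safer choice.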
