python - Read lines from a file, process them, then delete them

Question

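I have a 22 MB text file containing a list of numbers (one number per line). I'm trying to have Python read a number, process it, and write the result to another file. All of this works, but if I have to stop the program, it starts over from the beginning. I tried using a MySQL database at first, but it was too slow; I'm getting about four times as many numbers processed this way. I'd like to be able to delete each line after its number has been processed.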

import os

with open('list.txt', 'r') as file:
    for line in file:
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print("File", filename, "exists, skipping!")
        else:
            pass  # process number and write file
        # (need code to delete current line here)

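As you can see, every time I restart, it has to look up each file name on the hard drive to find the place where it left off. With 1.5 million numbers, this can take a while. I found an example that uses truncate, but it didn't work.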

Is there a command for Python, similar to array_shift (PHP), that works with text files?

score 7 · Accepted Answer

Instead of rewriting the input file, use a marker file to keep the number of the last line processed.

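import os

start_from = 0

try:
    with open('last_line.txt', 'r') as llf:
        start_from = int(llf.read())
except (OSError, ValueError):
    pass  # no usable marker file yet: start from the first line

with open('list.txt', 'r') as file:
    for i, line in enumerate(file):
        if i < start_from:
            continue  # already processed in an earlier run

        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print("File", filename, "exists, skipping!")
        else:
            pass  # process number and write file
        # Store the count of finished lines so a restart resumes at the next one.
        with open('last_line.txt', 'w') as outfile:
            outfile.write(str(i + 1))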

This code first checks for the file last_line.txt and tries to read a number from it. That number is the count of lines processed during the previous attempt. It then simply skips that many lines.
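The marker-file idea can also be packaged as small functions. The following is only a sketch under a few assumptions: the names load_checkpoint, save_checkpoint and run are made up for illustration, handle stands for whatever processing you do per number, and limit exists only to simulate an interrupted run. It writes the marker through a temporary file and os.replace, so a crash in the middle of a write should not leave a half-written marker behind.

```python
import os
from itertools import islice


def load_checkpoint(path):
    """Return the number of lines already processed, or 0 if unknown."""
    try:
        with open(path, "r") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def save_checkpoint(path, count):
    """Write the count to a temp file, then rename it over the marker."""
    tmp = path + ".tmp"  # hypothetical temp name next to the marker
    with open(tmp, "w") as f:
        f.write(str(count))
    os.replace(tmp, path)


def run(list_path, marker_path, handle, limit=None):
    """Process lines not handled yet; `limit` simulates an interruption."""
    start = load_checkpoint(marker_path)
    with open(list_path, "r") as f:
        # islice skips the first `start` lines without any per-line test.
        for offset, line in enumerate(islice(f, start, None)):
            if limit is not None and offset >= limit:
                break
            handle(line.rstrip("\n"))
            save_checkpoint(marker_path, start + offset + 1)
```

Calling run('list.txt', 'last_line.txt', my_function) repeatedly picks up where the previous call stopped. Saving the marker after every line is the safest choice but also the slowest; saving every few hundred lines would trade a little repeated work for speed.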

score 1 · Accepted Answer

Reading the data file should not be the bottleneck. The following code read a 36 MB text file with 697997 lines in about 0.2 seconds on my machine.

import time

start = time.perf_counter()  # time.clock() was removed in Python 3.8
with open('procmail.log', 'r') as f:
    lines = f.readlines()
end = time.perf_counter()
print('Readlines time:', end - start)

It produced the following result:

Readlines time: 0.1953125

Note that this code builds the list of all lines in one go.

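To know where you left off, just write the number of lines you have processed to a file. Then, if you want to start again, read all the lines and skip the ones you have already done.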

import os

# Read the data file
with open('list.txt', 'r') as f:
    lines = f.readlines()

skip = 0
try:
    # Did we try earlier? If so, skip what has already been processed.
    with open('lineno.txt', 'r') as lf:
        skip = int(lf.read())  # this should only be one number.
        del lines[:skip]  # Remove already processed lines from the list.
except (OSError, ValueError):
    pass  # no usable counter file: process everything

with open('lineno.txt', 'w+') as lf:
    lf.write(str(skip) + '\n')  # 'w+' truncated the file, so restore the old count
    for n, line in enumerate(lines):
        # Do your processing here.
        lf.seek(0)  # go to beginning of lf
        lf.write(str(n + skip + 1) + '\n')  # number of lines processed so far
        lf.flush()
        os.fsync(lf.fileno())  # flush and fsync make sure the lf file is written.
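
A variation on the same idea, not taken from the answer above, is to store the byte offset instead of the line count, so that a restart can seek straight to the right place without reading the skipped lines at all. This is a rough sketch: resume_by_offset, offset_path, handle and limit are names invented for the example, and it assumes the list file contains plain ASCII digits.

```python
def resume_by_offset(list_path, offset_path, handle, limit=None):
    """Process lines from the saved byte offset; `limit` simulates a stop."""
    try:
        with open(offset_path, "r") as f:
            offset = int(f.read())
    except (OSError, ValueError):
        offset = 0  # no usable marker: start from the top

    done = 0
    # Binary mode makes tell() and seek() work on exact byte positions.
    with open(list_path, "rb") as data:
        data.seek(offset)
        # readline() keeps tell() in step with the line just read.
        for raw in iter(data.readline, b""):
            if limit is not None and done >= limit:
                break
            handle(raw.decode("ascii").rstrip("\r\n"))
            done += 1
            # Remember where the next unprocessed line begins.
            with open(offset_path, "w") as f:
                f.write(str(data.tell()))
```

With a 22 MB file the difference from skipping lines is probably small, given the read timing measured above, so this mainly matters for much larger inputs.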

score 1 · Accepted Answer

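I use Redis for that sort of thing. Install Redis and then the Python client for it (the redis package, which the code below imports), and you can keep a persistent set in memory. Then you can do: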

import redis

r = redis.StrictRedis('localhost')
with open('list.txt', 'r') as file:
    for line in file:
        if r.sismember('done', line):
            continue  # this number was handled in an earlier run
        # process number and write file
        r.sadd('done', line)

If you don't want to install Redis, you can also use the shelve module; make sure you open it with the writeback=False option. I really do recommend Redis, though: it makes this kind of work much easier.
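
For the shelve route, something along these lines might work. It is a minimal sketch, not a tested recipe: process_pending, shelf_path and handle are hypothetical names, and the shelf is used simply as a persistent set whose keys are the numbers already done.

```python
import shelve


def process_pending(list_path, shelf_path, handle):
    """Call handle() for every number that is not yet recorded in the shelf."""
    with open(list_path, "r") as f, \
            shelve.open(shelf_path, writeback=False) as done:
        for line in f:
            number = line.rstrip("\n")
            if number in done:
                continue  # handled in an earlier run
            handle(number)
            done[number] = True  # shelf keys must be strings
            done.sync()  # ask the underlying dbm to write the entry out
```

Calling sync() after every number keeps the shelf as current as possible if the program is killed, but it may slow things down noticeably with 1.5 million entries; syncing in batches is a reasonable compromise.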
