python - How to read the last MB of a very large text file

Question

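I'm trying to find a string near the end of a text file. The problem is that the file size can vary widely, from 3 MB to 4 GB. Every time I run a script that searches for this string in a text file of around 3 GB, my computer runs out of memory. So I was wondering whether Python can find the size of the file and then read just the last megabyte.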

Here is the code I'm currently using, but as mentioned, I don't seem to have enough memory to read files that large.

find_str = "ERROR"
file = open(file_directory)                           
last_few_lines = file.readlines()[-20:]   

error = False  

for line in last_few_lines:
    if find_str in line:
        error = True

score 35 · Accepted Answer

Use file.seek():

import os
find_str = b"ERROR"  # bytes literal: in Python 3, binary mode returns bytes, not str
error = False
# Open file with 'b' to specify binary mode
with open(file_directory, 'rb') as file:
    file.seek(-1024 * 1024, os.SEEK_END)  # Note minus sign
    if find_str in file.read():
        error = True

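You must specify binary mode when you open the file, or you will get "undefined behavior". Under Python 2 it might work anyway (it did for me), but under Python 3, seek() raises an io.UnsupportedOperation exception if the file was opened in the default text mode. See the Python 3 documentation for details. It isn't clear from those docs, but the SEEK_* constants are still in the os module.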

Update: switched to a with statement for safer resource management, as suggested by Chris Betti.
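If a file could ever be smaller than 1 MB, seeking a full megabyte back from the end would typically fail with an OSError, because the target offset lies before the start of the file. Below is a minimal sketch of one way to guard against that by clamping the offset. The helper names read_tail and tail_contains are hypothetical, and the UTF-8 default is an assumption about the log's encoding:

```python
import os

def read_tail(path, max_bytes=1024 * 1024):
    """Return up to the last max_bytes bytes of the file at path (hypothetical helper)."""
    with open(path, 'rb') as f:
        # Jump to the end to learn the file size without reading anything
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Clamp the offset so files shorter than max_bytes are read from the start
        f.seek(max(size - max_bytes, 0), os.SEEK_SET)
        return f.read()

def tail_contains(path, needle, max_bytes=1024 * 1024, encoding='utf-8'):
    """Check whether the text needle occurs in the last max_bytes bytes.

    The encoding default is an assumption; adjust it to match the file.
    """
    return needle.encode(encoding) in read_tail(path, max_bytes)
```

With these, the check from the question becomes error = tail_contains(file_directory, "ERROR"). Memory use is bounded by max_bytes regardless of the file size.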

score 2 · Accepted Answer

You can use the tail recipe with a deque to get the last n lines of a large file:

from collections import deque

def tail(fn, n):
    with open(fn) as fin:
        return list(deque(fin, n))

To test this:

First, create a large file:

>>> with open('/tmp/lines.txt', 'w') as f:
...    for i in range(1, 10000000 + 1):
...       print('Line {}'.format(i), file=f)

# about 128 MB on my machine

Then test:

print(tail('/tmp/lines.txt', 20))
# ['Line 9999981\n', 'Line 9999982\n', 'Line 9999983\n', 'Line 9999984\n', 'Line 9999985\n', 'Line 9999986\n', 'Line 9999987\n', 'Line 9999988\n', 'Line 9999989\n', 'Line 9999990\n', 'Line 9999991\n', 'Line 9999992\n', 'Line 9999993\n', 'Line 9999994\n', 'Line 9999995\n', 'Line 9999996\n', 'Line 9999997\n', 'Line 9999998\n', 'Line 9999999\n', 'Line 10000000\n']

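This returns the last n lines, not the last X bytes of the file. The amount of data held in memory depends on the size of the lines, not the size of the file. The file object fin is used as an iterator over the lines of the file, so the entire file is never resident in memory at once.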
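To see why memory stays bounded, here is a small illustration of how a deque with a maximum length behaves when fed an iterator of lines. It uses an in-memory io.StringIO as a stand-in for a real file (an assumption made only to keep the example self-contained):

```python
from collections import deque
import io

# An in-memory stand-in for a file; iterating over it yields lines
fake_file = io.StringIO("".join("Line {}\n".format(i) for i in range(1, 8)))

# maxlen=3: each line appended beyond the third pushes the oldest one out
last_three = deque(fake_file, maxlen=3)
print(list(last_three))  # ['Line 5\n', 'Line 6\n', 'Line 7\n']
```

Note that the whole file is still read from disk once, line by line. Only the memory footprint is small.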

score 1 · Accepted Answer

The proposed answer using seek is a correct answer to your question, but I think it's not what you really want to do. Your solution loads the whole file into memory, just to get the last 20 lines. That's the main cause of your problem. The following would solve your memory issue:

with open(file_directory) as f:  # file() only exists in Python 2; open() works in both
    for line in f:
        if find_str in line:
            error = True

This iterates over all lines in the file, but releases each line after it has been processed. I would guess that this solution is already much faster than yours, so no further optimization is needed. But if you really want just the last 20 lines, put the lines in a deque with a maximum length of 20.
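A minimal sketch of that last suggestion, restricting the search to the final 20 lines. The function name error_in_last_lines is hypothetical, and the file is assumed to be readable in the platform's default text encoding:

```python
from collections import deque

def error_in_last_lines(path, find_str="ERROR", n=20):
    """Return True if find_str occurs in any of the last n lines of path."""
    with open(path) as f:
        # The deque keeps at most n lines; older lines are discarded as we go
        last_lines = deque(f, maxlen=n)
    return any(find_str in line for line in last_lines)
```

Unlike the plain loop, this ignores matches that occur earlier in the file, which matches the intent of the original readlines()[-20:] code.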
