python - Multiple string replacements in a 100 MB file with Python 2.6

Question

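I have a large 100 MB file on which I want to perform about 5000 string replacements. What is the most efficient way to do this?

Is there a better way than reading the file line by line and performing the 5000 replacements on each line?

I also tried reading the whole file into a string with the .read method and performing the 5000 replacements on that string, but this is even slower because it creates 5000 copies of the entire file.

This script needs to run on Windows with Python 2.6.

Thanks in advance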

score 2 · Accepted Answer

Instead of doing 5000 searches, I would suggest doing a single search for 5000 items:

import re

replacements = {
    "Abc-2454": "Gb-43",
    "This": "that",
    "you": "me"
}

pat = re.compile('(' + '|'.join(re.escape(key) for key in replacements.iterkeys()) + ')')
repl = lambda match: replacements[match.group(0)]

Now you can apply re.sub to the whole file,

with open("input.txt") as inf:
    s = inf.read()

s = pat.sub(repl, s)

# the output file must be opened for writing
with open("result.txt", "w") as outf:
    outf.write(s)

or line by line,

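# Python 2.6 does not accept several context managers in one `with`,
# so the two statements are nested; the output is opened for writing.
with open("input.txt") as inf:
    with open("result.txt", "w") as outf:
        outf.writelines(pat.sub(repl, line) for line in inf)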
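One detail worth checking with a pattern built this way: in a regex alternation the leftmost alternative that matches wins, so if one key is a prefix of another, the shorter key can shadow the longer one. The sketch below is one possible way to handle that by sorting keys longest-first. The names `make_replacer` and `table` are hypothetical, and the code is written to run on Python 2.6 as well as Python 3.

```python
import re


def make_replacer(replacements):
    """Return a function applying every replacement in a single pass."""
    # Longest keys first: in an alternation the leftmost alternative wins,
    # so "Abc" must not be tried before "Abc-2454".
    keys = sorted(replacements, key=len, reverse=True)
    pat = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pat.sub(lambda m: replacements[m.group(0)], text)


# Hypothetical table: one key is a prefix of another, and the output of
# one replacement ("you") is itself a key.
table = {"Abc": "X", "Abc-2454": "Gb-43", "This": "you", "you": "me"}
replace_all = make_replacer(table)
result = replace_all("This Abc-2454 and Abc for you")
# result == "you Gb-43 and X for me"
```

Because the text is scanned only once, the "you" produced from "This" is not replaced again. A loop of sequential replacements may behave differently, depending on the order of the pairs.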

score 2 · Accepted Answer

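Try the following steps, in this order, until you get something fast enough.

Read the file into one big string and perform each replacement in turn, overwriting the same variable.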

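# 'r+' opens for reading and writing; 'w' would empty the file before read()
with open(..., 'r+') as f:
    s = f.read()
    # replacements is a sequence of (src, dest) pairs
    for src, dest in replacements:
        s = s.replace(src, dest)
    f.seek(0)
    f.write(s)
    f.truncate()  # drop leftover bytes if the new text is shorter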

Memory-map the file and write a custom replacement function that performs the replacements.
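The answer does not show what the memory-mapped variant would look like, so the following is only a rough sketch under some assumptions: the replacements change the length of the text, so the result is written to a second file rather than edited in place, and keys and values are byte strings. `replace_mmap` and its parameters are hypothetical names. The code avoids features newer than Python 2.6.

```python
import mmap
import re


def replace_mmap(src_path, dst_path, replacements):
    """Copy src_path to dst_path, applying all replacements in one pass.

    `replacements` maps byte strings to byte strings.
    """
    # Longest keys first so a short key cannot shadow a longer one.
    keys = sorted(replacements, key=len, reverse=True)
    pat = re.compile(b"|".join(re.escape(k) for k in keys))

    with open(src_path, "rb") as inf:
        # Note: mmap raises an error for an empty file.
        mm = mmap.mmap(inf.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            with open(dst_path, "wb") as outf:
                pos = 0
                for m in pat.finditer(mm):
                    outf.write(mm[pos:m.start()])  # unchanged bytes
                    outf.write(replacements[m.group(0)])
                    pos = m.end()
                outf.write(mm[pos:])  # tail after the last match
        finally:
            mm.close()
```

The file is opened in binary mode, so on Windows the line endings pass through untouched. Whether this is actually faster than reading the whole file into a string would need to be measured on the real data.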

score 0 · Accepted Answer

You should read the text using open() and read(), then use a (compiled) regular expression to do the string replacements. A simple example:

import re

# read data
f = open("file.txt", "r")
txt = f.read()
f.close()

# list of patterns and what to replace them with
xs = [("foo","bar"), ("baz","foo")]

# do replacements
for (x,y) in xs:
    regexp = re.compile(x)
    txt = regexp.sub(y, txt)

# write back data
f = open("file.txt", "w")
f.write(txt)
f.close()
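Note that the loop above compiles each search string as a regular expression, so characters such as "." or "+" in a key act as regex syntax, and backslashes in the replacement are interpreted as well. If the 5000 strings are meant literally, one option is to escape them, as in this small sketch (`replace_literal` is a hypothetical helper, not part of the answer):

```python
import re


def replace_literal(text, pairs):
    """Apply (old, new) pairs in order, treating both as literal text."""
    for old, new in pairs:
        # re.escape stops "." or "+" in `old` from acting as regex syntax;
        # the lambda stops backslashes in `new` from being interpreted.
        text = re.compile(re.escape(old)).sub(lambda m, new=new: new, text)
    return text


# Only the literal "a.c" is replaced; "abc" is left alone.
print(replace_literal("a.c abc", [("a.c", r"x\1")]))
```

For purely literal pairs, str.replace does the same job without any regex involved.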
