python - Speeding up a series of regex replacements in Python

Question

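My Python script reads each line of a file and performs many regex replacements on each line.

If a regex succeeds, it skips to the next line.

Is there any way to speed up this kind of script?
Is it worth calling subn instead, checking whether a replacement was made, and then skipping the remaining ones?
If I compile the regexes, is it possible to keep all the compiled regexes in memory?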

for file in files:
    for line in file:
        re.sub() # <--- ~ 100 re.sub

PS: the replacement varies for each regex.

score 2 · Accepted Answer

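As @Tim Pietzcker said, you can reduce the number of regexes by turning them into alternatives of a single pattern. You can then use the match object's "lastindex" attribute to determine which alternative matched.

Here is an example of what you could do: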

>>> import re
>>> replacements = {1: "<UPPERCASE LETTERS>", 2: "<lowercase letters>", 3: "<Digits>"}
>>> def replace(m):
...     return replacements[m.lastindex]
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'<UPPERCASE LETTERS> <lowercase letters> <Digits>'
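
With around 100 rules, writing the alternation by hand gets tedious, so the combined pattern could be built from a list instead. The following is only a minimal sketch: the names RULES, COMBINED and replace_all are hypothetical, and it assumes that no individual pattern contains its own capturing group, since that would shift the group numbers that "lastindex" reports.

```python
import re

# Hypothetical rules: (pattern, replacement) pairs in priority order.
# Assumption: no pattern here contains a capturing group of its own.
RULES = [
    (r"[A-Z]+", "<UPPERCASE LETTERS>"),
    (r"[a-z]+", "<lowercase letters>"),
    (r"[0-9]+", "<Digits>"),
]

# Wrap each pattern in one capturing group and join them with "|".
COMBINED = re.compile("|".join("(%s)" % pattern for pattern, _ in RULES))

def replace_all(text):
    # m.lastindex is the 1-based number of the group that matched,
    # which maps directly onto the position of the rule in RULES.
    return COMBINED.sub(lambda m: RULES[m.lastindex - 1][1], text)

print(replace_all("ABC def 789"))
```

Alternatives are tried left to right at each position, so earlier rules take priority when two of them could match at the same place.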

score 2 · Accepted Answer

You should probably do three things:

1. Reduce the number of regexes. Depending on how much the substitution parts differ, you might be able to combine them all into a single one. With careful alternation, you can control the order in which the parts of the regex are tried.
2. If possible (depending on file size), read the file into memory completely.
3. Compile your regexes (mostly for readability; the re module caches recently compiled patterns, so it won't matter much in terms of speed as long as the number of regexes stays below 100).

This gives you something like:

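import re

# Compile once, outside the loop.
regex = re.compile(r"My big honking regex")
for datafile in files:  # files: the open file objects from the question
    content = datafile.read()  # read the whole file into memory
    result = regex.sub("Replacement", content)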
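
If the rules cannot be merged into one pattern, the "stop at the first regex that succeeds" behaviour from the question might still be kept with precompiled patterns and subn. This is only a sketch under assumptions: the rules and the name rewrite_line are made up for illustration, and whether it is faster than plain re.sub calls would need to be measured on real data.

```python
import re

# Hypothetical rules, compiled once up front and kept in a list.
RULES = [
    (re.compile(r"\bfoo\b"), "bar"),
    (re.compile(r"\d+"), "<num>"),
]

def rewrite_line(line):
    """Apply the first rule that matches, then skip the remaining rules."""
    for pattern, replacement in RULES:
        # subn returns the new string and the number of substitutions made.
        new_line, count = pattern.subn(replacement, line)
        if count:
            return new_line
    return line  # no rule matched

print(rewrite_line("foo 12"))
```

Unlike the single combined pattern, this applies at most one rule per line, which matches the behaviour described in the question.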
