python - Improving Python regex performance

Question

I'm trying to improve the following regexes:

urlpath=columns[4].strip()
urlpath=re.sub(r"(\?.*|\/[0-9a-f]{24})","",urlpath)
urlpath=re.sub(r"\/[0-9\/]*","/",urlpath)
urlpath=re.sub(r"\;.*","",urlpath)
urlpath=re.sub(r"\/",".",urlpath)
urlpath=re.sub(r"\.api","api",urlpath)
if urlpath in dlatency:

This converts a URL such as

/api/v4/path/apiCallTwo?host=wApp&trackId=1347158

into

api.v4.path.apiCallTwo

The script runs every 5 minutes over roughly 50,000 files and takes about 40 seconds in total, so I'd like to speed up the regexes.

Thanks

score 2 · Accepted Answer

Here is my one-liner solution (edited).

urlpath.partition("?")[0].strip("/").replace("/", ".")

As others have mentioned, the speed gain here is negligible. Apart from precompiling the expressions with re.compile(), I would start looking elsewhere.

import re


# Precompile the five patterns from the question once, at import time.
re1 = re.compile(r"(\?.*|\/[0-9a-f]{24})")
re2 = re.compile(r"\/[0-9\/]*")
re3 = re.compile(r"\;.*")
re4 = re.compile(r"\/")
re5 = re.compile(r"\.api")

def orig_regex(urlpath):
    urlpath = re1.sub("", urlpath)
    urlpath = re2.sub("/", urlpath)
    urlpath = re3.sub("", urlpath)
    urlpath = re4.sub(".", urlpath)
    urlpath = re5.sub("api", urlpath)
    return urlpath


# Single regex: join every run of non-slash characters before the "?".
myregex = re.compile(r"([^/]+)")

def my_regex(urlpath):
    return ".".join(x.group() for x in myregex.finditer(urlpath.partition('?')[0]))

# No regex at all: plain string methods.
def non_regex(urlpath):
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

def test_func(func, iterations, *args, **kwargs):
    # range works on both Python 2 and Python 3
    for i in range(iterations):
        func(*args, **kwargs)

if __name__ == "__main__":
    import cProfile as profile

    urlpath = u'/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
    profile.run("test_func(orig_regex, 10000, urlpath)")
    profile.run("test_func(my_regex, 10000, urlpath)")
    profile.run("test_func(non_regex, 10000, urlpath)")

Results

Iterating orig_regex 10000 times
     60003 function calls in 0.108 CPU seconds

....

Iterating my_regex 10000 times
     130003 function calls in 0.087 CPU seconds

....

Iterating non_regex 10000 times
     40003 function calls in 0.019 CPU seconds

Without running re.compile on the five regexes:

running <function orig_regex at 0x100532050> 10000 times
     210817 function calls (210794 primitive calls) in 0.208 CPU seconds
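
cProfile adds overhead to every function call, so the call counts above influence the timings. As a rough cross-check, here is a minimal sketch that uses timeit instead, assuming Python 3. The function names and the simplified query-stripping pattern are made up for this illustration and are not the code profiled above. Keep in mind that the re module caches recently used patterns, so the module-level re.sub mostly pays for a cache lookup rather than a full recompile.

```python
import re
import timeit

URL = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"

# Precompiled once at import time (hypothetical, simplified pattern).
QUERY_RE = re.compile(r"\?.*")

def with_module_level_sub(urlpath):
    # re.sub has to look the pattern up in the module's cache on every call.
    return re.sub(r"\?.*", "", urlpath).strip("/").replace("/", ".")

def with_compiled(urlpath):
    # Calls the compiled pattern's method directly, skipping the lookup.
    return QUERY_RE.sub("", urlpath).strip("/").replace("/", ".")

def without_regex(urlpath):
    # Plain string methods only.
    return urlpath.partition("?")[0].strip("/").replace("/", ".")

def time_all(number=10000):
    """Return {function name: seconds taken for `number` calls}."""
    return {
        func.__name__: timeit.timeit(lambda: func(URL), number=number)
        for func in (with_module_level_sub, with_compiled, without_regex)
    }

if __name__ == "__main__":
    for name, seconds in time_all().items():
        print(f"{name}: {seconds:.4f} s")
```

The absolute numbers depend on the machine and Python version, so treat them only as a relative comparison.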

score 2 · Accepted Answer

One-liner with urlparse:

urlpath = urlparse.urlsplit(url).path.strip('/').replace('/', '.')
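
The urlparse module used above is the Python 2 name. On Python 3 the same function lives in urllib.parse, so an equivalent could look like the sketch below. The helper name url_to_metric_name is invented for this example.

```python
from urllib.parse import urlsplit

def url_to_metric_name(url):
    # urlsplit separates the path from the query string and fragment,
    # so only the path part is turned into a dotted name.
    return urlsplit(url).path.strip("/").replace("/", ".")

if __name__ == "__main__":
    print(url_to_metric_name("/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"))
```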

answered 2012-06-05T15:23:35.103

score 2 · Accepted Answer

Try this:

s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'
re.sub(r'\?.+', '', s).replace('/', '.')[1:]
> 'api.v4.path.apiCallTwo'

To improve performance further, compile the regex once and reuse it, like this:

regexp = re.compile(r'\?.+')
s = '/api/v4/path/apiCallTwo?host=wApp&trackId=1347158'

# `s` changes, but you can reuse `regexp` as many times as needed
regexp.sub('', s).replace('/', '.')[1:]

An even simpler approach that uses no regex:

s[1:s.index('?')].replace('/', '.')
> 'api.v4.path.apiCallTwo'
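
One caveat with the slicing version: str.index raises ValueError when the string has no "?", which may happen if some logged URLs carry no query string (an assumption about the input). A small sketch of the difference, using partition as the tolerant alternative; both helper names are hypothetical:

```python
def strip_query_index(s):
    # Mirrors the slicing one-liner: raises ValueError when "?" is absent.
    return s[1:s.index("?")].replace("/", ".")

def strip_query_partition(s):
    # partition returns the whole string as the head when "?" is absent.
    return s.partition("?")[0].strip("/").replace("/", ".")

if __name__ == "__main__":
    with_query = "/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"
    without_query = "/api/v4/path/apiCallTwo"

    print(strip_query_index(with_query))
    print(strip_query_partition(without_query))
    try:
        strip_query_index(without_query)
    except ValueError as exc:
        print("index failed:", exc)
```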

score 1 · Accepted Answer

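Let's go through it line by line.

You are neither capturing nor grouping, so the ( and ) are unnecessary, and / is not a special character in Python regexes, so it does not need to be escaped.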

urlpath = re.sub(r"\?.*|/[0-9a-f]{24}", "", urlpath)

Replacing a / followed by zero repetitions of something with a / is pointless, so require at least one repetition (+ instead of *).

urlpath = re.sub("/[0-9/]+", "/", urlpath)

Removing a fixed character and everything after it is faster with a string method.

urlpath = urlpath.partition(";")[0]

Replacing one fixed string with another fixed string is also faster with a string method.

urlpath = urlpath.replace("/", ".")

urlpath = urlpath.replace(".api", "api")
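
Put together, the five rewritten steps could be wrapped in one function, roughly as sketched below. The function name, the constant names, and the second test URL are invented for illustration, and the two remaining patterns are precompiled as an extra assumption.

```python
import re

# The two steps that still need a regex, compiled once.
HEX_ID_OR_QUERY = re.compile(r"\?.*|/[0-9a-f]{24}")
NUMERIC_SEGMENTS = re.compile(r"/[0-9/]+")

def normalize(urlpath):
    urlpath = HEX_ID_OR_QUERY.sub("", urlpath)    # drop query string and 24-hex-digit ids
    urlpath = NUMERIC_SEGMENTS.sub("/", urlpath)  # collapse purely numeric segments
    urlpath = urlpath.partition(";")[0]           # drop ";" and everything after it
    urlpath = urlpath.replace("/", ".")           # slashes become dots
    urlpath = urlpath.replace(".api", "api")      # replaces EVERY ".api", not only the first
    return urlpath

if __name__ == "__main__":
    print(normalize("/api/v4/users/123/items/0123456789abcdef01234567;jsessionid=abc?x=1"))
    print(normalize("/api/v4/path/apiCallTwo?host=wApp&trackId=1347158"))
```

Note that the last step, like the re.sub it replaces, rewrites every ".api" in the string. With the sample URL from the question the segment apiCallTwo is therefore glued to the previous one, and this sketch returns api.v4.pathapiCallTwo. If the intent is only to drop the leading dot, stripping the leading "/" before the replacement would be one alternative.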

score 0 · Accepted Answer

Do you need a regex for this?
That is,

urlpath = columns[4].strip()
urlpath = urlpath.split("?")[0]
urlpath = urlpath.replace("/", ".")
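
As written, these three lines leave a leading "." in the result, because the leading "/" is also replaced. A minimal variant that strips the slashes first is sketched below. The function name url_to_key is hypothetical, and raw stands in for columns[4].

```python
def url_to_key(raw):
    urlpath = raw.strip()
    urlpath = urlpath.split("?")[0]
    # Drop the leading (and any trailing) "/" so no leading "." is produced.
    urlpath = urlpath.strip("/")
    return urlpath.replace("/", ".")

if __name__ == "__main__":
    raw = " /api/v4/path/apiCallTwo?host=wApp&trackId=1347158 "
    print(url_to_key(raw))
    # For comparison, the same steps without the strip("/") call:
    print(raw.strip().split("?")[0].replace("/", "."))
```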

score 0 · Accepted Answer

You can also compile the regex to improve performance.

For example:

compiled_re_for_words = re.compile(r"\w+")
compiled_re_for_words.match("test")
