python - Python hangs on long text when using re.match

Question

I have a text file containing a list of domains, and I want to match a domain and its subdomains using a Python regex.

Sample domain file

admin.happy.com
nothappy.com

I have the following regex:

import re

main_domain = 'happy.com'
mydomains = open('domains.txt','r').read().replace('\n',',')
matchobj = re.match(r'^(.*\.)*%s$' % main_domain,mydomains)

The code works fine on short text, but it hangs and freezes when the domain file has more than 100 entries.

Is there a way to optimize the regex so it can handle the contents of the text file?

score 0 · Accepted Answer

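You can simplify the regex to avoid catastrophic backtracking, and match line by line with re.M instead of joining the whole file into one string: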

import re

with open("domains.txt") as file:
    text = file.read()
main_domain = "happy.com"
subdomains = re.findall(r"^(.+)\.%s$" % re.escape(main_domain), text, re.M)
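For context, the original pattern nests one quantifier inside another in (.*\.)*, so when the overall match fails the engine can retry the split across the dots in a very large number of ways. Below is a minimal sketch of the per-line approach on in-memory text. The function name find_subdomains and the sample string are made up for illustration and are not part of the question.

```python
import re


def find_subdomains(text, main_domain):
    """Return the subdomain prefix of every line ending in '.<main_domain>'."""
    # re.escape keeps the dot in main_domain literal.
    # re.M makes ^ and $ anchor at each line, so no joining is needed.
    pattern = r"^(.+)\.%s$" % re.escape(main_domain)
    return re.findall(pattern, text, re.M)


# Hypothetical sample data standing in for domains.txt
sample = "admin.happy.com\nnothappy.com\nhappy.com\na.b.happy.com\n"
print(find_subdomains(sample, "happy.com"))  # ['admin', 'a.b']
```

Note that nothappy.com is rejected because the pattern requires a literal dot before the main domain, and the bare happy.com is skipped because at least one character must precede that dot.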

If you also want to match the main domain itself, use r"^(?:(.+)\.)?%s$" instead.
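As a rough illustration of that variant, the sketch below uses re.finditer so that both the full matched domain and the optional subdomain prefix are available. The helper name match_domains and the sample text are assumptions made for the example.

```python
import re


def match_domains(text, main_domain):
    """Return (full_domain, subdomain_prefix) for each matching line."""
    # The non-capturing group (?:...)? makes the "prefix." part optional,
    # so the bare main domain matches as well.
    pattern = re.compile(r"^(?:(.+)\.)?%s$" % re.escape(main_domain), re.M)
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]


# Hypothetical sample data standing in for domains.txt
sample = "admin.happy.com\nnothappy.com\nhappy.com\na.b.happy.com\n"
for full, prefix in match_domains(sample, "happy.com"):
    print(full, prefix)
```

For the bare main domain the capture group does not take part in the match, so group(1) is None there. With re.findall the same entry would come back as an empty string.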
