python - Matching non-contiguous/interrupted strings

Question

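I have a predefined list of strings that I want to match in a large text file. The problem is that many of these strings do occur in the text, but they are interrupted by spurious characters or HTML/XML tags, which I want to keep.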

For example, if I want to match "United Nations Headquarters", it can appear in the text in forms such as:

United Nations & Headquarters
United <br> Nations Headquarters
United Natio<b>ns Hea</b>dquarters

Basically, I need to know the positions of these strings; I will deal with the spurious characters afterwards. For uninterrupted strings, this is what I do:

string_locations = [v.span() for v in re.finditer(our_string, text)]

Is there a regex setting that ignores these interruptions in some way, or what would the solution be?

score 2 · Accepted Answer

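import re

text = """United Nations & Headquarters
United <br> Nations Headquarters
United Natio<b>ns Hea</b>dquarters"""

s = "United Nations Headquarters"

# Put a lazy "match anything" gap between every pair of consecutive characters.
# "." does not match newlines by default, so a match cannot span two lines.
r = re.compile(".*?".join(s))
print([v.span() for v in r.finditer(text)])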

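The key is ".*?".join(s), which inserts .*? between every pair of consecutive characters of s to turn it into a regular expression.

You may want to tighten .*? somewhat if you want to restrict which interruptions are allowed.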
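One possible way to tighten the gap, sketched under assumptions that are not part of the answer: escape each character with re.escape and cap every interruption at a fixed number of characters. The helper name build_gapped_pattern and its max_gap parameter are hypothetical, and [\s\S] is used instead of a dot so that an interruption may also contain a line break.

```python
import re

def build_gapped_pattern(search, max_gap=8):
    """Compile search so that at most max_gap arbitrary characters may sit
    between any two consecutive characters (hypothetical helper)."""
    # Lazy, bounded gap; [\s\S] also matches newlines, unlike a plain dot.
    gap = r"[\s\S]{0,%d}?" % max_gap
    # re.escape keeps characters such as "." or "(" literal.
    return re.compile(gap.join(re.escape(ch) for ch in search))

text = """United Nations & Headquarters
United <br> Nations Headquarters
United Natio<b>ns Hea</b>dquarters"""

search = "United Nations Headquarters"

loose_spans = [m.span() for m in build_gapped_pattern(search).finditer(text)]
strict_spans = [m.span() for m in build_gapped_pattern(search, max_gap=2).finditer(text)]

print(loose_spans)
print(strict_spans)
```

With the default cap all three sample lines match. With max_gap=2 only the first one does, because the interruptions around <br> and <b> are longer than two characters.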

score 1 · Accepted Answer

There are a couple of solutions that avoid catastrophic backtracking and allow an arbitrary amount of interruption!

Solution A

This is the cleanest solution, but it requires the regex module. It uses atomic grouping, (?>...), to avoid backtracking.

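import regex

strExampleFile = '''United Nations & Headquarters
United <br> Nations Headquarters
United Natio<b>ns Hea</b>dquarters'''

strSearch = 'United Nations Headquarters'

def fnAtomicGap(objMatch):
    # Wrap the lazy gap before each character in an atomic group (?>...),
    # so the engine never backtracks into a gap it has already matched.
    strChar = regex.escape(objMatch.group(1))
    return r'(?>[\s\S]*?(?=%s))%s' % (strChar, strChar)

# Rewrite every character except the first. A callable replacement keeps the
# \s and \S in the gap from being parsed as replacement-template escapes.
strRegex = regex.sub(r'((?<!^).)', fnAtomicGap, strSearch)
rexRegex = regex.compile(strRegex)

print([objMatch.span() for objMatch in rexRegex.finditer(strExampleFile)])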

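Solution B

If you do not have the regex module installed, or do not want to install it, you can mimic atomic grouping with re. However, the search string is limited to at most 100 characters (the pattern needs one capturing group per character, and older versions of re cap the number of groups at 100).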

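import re

strExampleFile = '''United Nations & Headquarters
United <br> Nations Headquarters
United Natio<b>ns Hea</b>dquarters'''

strSearch = 'United Nations Headquarters'

def fnEmulateAtomicGap(objMatch):
    # Capture the lazy gap inside a lookahead, then consume it through a
    # backreference; once consumed, the gap cannot be backtracked into.
    # objMatch.start() equals the number of the capturing group created here.
    strChar = re.escape(objMatch.group(1))
    return r'(?=([\s\S]*?(?=%s)))\%d%s' % (strChar, objMatch.start(), strChar)

# A callable replacement is used because re.sub() rejects "\s" in a
# replacement template on Python 3.7 and later.
strRegex = re.sub(r'((?<!^).)', fnEmulateAtomicGap, strSearch)

rexRegex = re.compile(strRegex)

print([objMatch.span() for objMatch in rexRegex.finditer(strExampleFile)])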
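To see what the emulated atomic grouping looks like, it can help to print the generated pattern for a very short search string. The sketch below is only an illustration: the helper name emulated_atomic_pattern is hypothetical, and it assembles the same lookahead-plus-backreference construction with plain string formatting.

```python
import re

def emulated_atomic_pattern(search):
    """Build the lookahead-plus-backreference pattern for search
    (hypothetical helper, mirroring the construction used above)."""
    parts = [re.escape(search[0])]
    for group_number, ch in enumerate(search[1:], start=1):
        literal = re.escape(ch)
        # (?=(gap(?=c))) captures the shortest gap that ends right before c;
        # the backreference then consumes exactly that text, so the engine
        # has no way to come back and try a longer gap.
        parts.append(r"(?=([\s\S]*?(?=%s)))\%d%s" % (literal, group_number, literal))
    return "".join(parts)

pattern = emulated_atomic_pattern("abc")
print(pattern)

match = re.search(pattern, "a<i>b</i>c")
print(match.span())
```

Each character after the first contributes one capturing group, which is why the group count grows with the length of the search string.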

Note: As femtoRgon pointed out, both of these methods can return false positives.
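A small illustration of such a false positive, and of one possible (assumed, not taken from the answers) way to reduce them by accepting only markup of the form <...> as an interruption. The names loose_pattern and tags_only_pattern are hypothetical.

```python
import re

def loose_pattern(search):
    # Any characters, including newlines, may sit between consecutive characters.
    return re.compile(r"[\s\S]*?".join(re.escape(ch) for ch in search))

def tags_only_pattern(search):
    # Assumption: only markup of the form <...> counts as an interruption.
    return re.compile(r"(?:<[^<>]*>)*".join(re.escape(ch) for ch in search))

# "cat" is found inside an unrelated word: a false positive.
false_positive = loose_pattern("cat").search("chocolate")
print(false_positive.span())

# The stricter gap rejects the unrelated word but still accepts real markup.
print(tags_only_pattern("cat").search("chocolate"))
print(tags_only_pattern("cat").search("c<b>a</b>t").span())
```

The stricter gap is a trade-off: it rejects everything that is not a tag, including the stray & and the extra space around <br> in the question's samples, so the set of permitted interruptions has to be chosen to fit the data.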
