python - Python正規表現の壊滅的なバックトラッキング

Question

Ms word から生成された XML ファイルでいくつかのフレーズを検索しています。問題は、次の例でわかるように、任意のフレーズが、単語の間、または単語の内部にある XML タグで中断される可能性があるということです。

</w:rPr><w:t> To i</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:lang w:fareast="JA"/></w:rPr><w:t>ncrease knowledge of and acquired skills for implementing social policies with a view to strengthening the capacity of developing countries at the national and community level.</w:t></w:r></w:p>

したがって、この問題を処理するための私のアプローチは、すべての XML タグを同じ長さの # 文字のクラスターに単純に削減することでした。これにより、任意のフレーズを見つけることができた場合、正規表現は各 2 文字間のすべての XML タグを無視します。

基本的に必要なのは、実際の xml ドキュメント内のこのフレーズのスパンです。そのため、このスパンを後で xml ドキュメントの処理に使用します。クローンは使用できません。

このアプローチは非常にうまく機能しますが、次の例のように破滅的なバックトラッキングを引き起こすフレーズもあります。

================================

次に例を示します。

このテキストには、その中に # 文字のクラスターがいくつかあり (保持したい)、次のようにスペースも予測できません。

2014年から2015年までの期間################戦略的枠組み######################との関係#### ################: プログラム 7、経済社会問題、サブプログラム 3、予定
(c)#######

次のフレーズに一致させるには:

2014 年から 2015 年の期間の戦略的枠組みとの関係: プログラム 7、経済社会問題、サブプログラム 3、期待される成果 (c)

予測できない # およびスペース文字に対応するために、次の正規表現を思い付きました。

u'R#*e#*l#*a#*t#*i#*o#*n#*s#*h#*i#*p#*\\s*#*t#*o#*\\s*#*t#*h#*e#*\\s*#*s#*t#*r#*a#*t#*e#*g#*i#*c#*\\s*#*f#*r#*a#*m#*e#*w#*o#*r#*k#*\\s*#*f#*o#*r#*\\s*#*t#*h#*e#*\\s*#*p#*e#*r#*i#*o#*d#*\\s*#*2#*0#*1#*4#*\\-#*2#*0#*1#*5#*:#*\\s*#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*7#*\\,#*\\s*#*E#*c#*o#*n#*o#*m#*i#*c#*\\s*#*a#*n#*d#*\\s*#*S#*o#*c#*i#*a#*l#*\\s*#*A#*f#*f#*a#*i#*r#*s#*\\,#*\\s*#*s#*u#*b#*p#*r#*o#*g#*r#*a#*m#*m#*e#*\\s*#*3#*\\,#*\\s*#*e#*x#*p#*e#*c#*t#*e#*d#*\\s*#*a#*c#*c#*o#*m#*p#*l#*i#*s#*h#*m#*e#*n#*t#*\\s*#*\\(#*c#*\\)'

そして、私が照合したい他のすべてのフレーズでは問題なく機能しますが、これには壊滅的なバックトラッキングにつながる問題があります。誰かがそれを見つけることができますか?

元のテキストは xml タグで区切られているため、正規表現を簡単にするために、タグをこれらの # クラスターに置き換えました。元のテキストは次のとおりです。

</w:rPr><w:t>Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t> for the period 2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)</w:t>

score 3 · Accepted Answer

問題を正しく理解していれば、異常な正規表現や文字ごとの解析に頼らずに問題に取り組む方法を次に示します。

def do_it(search, text_orig, verbose = False):
    # A copy of the text without any "#" markers.
    text_clean = text_orig.replace('#', '')

    # Start position of search text in the cleaned text.
    try:               i = text_clean.index(search)
    except ValueError: return [None, None]

    # Collect the widths of the runs of markers and non-markers.
    rgx    = re.compile(r'#+|[^#]+')
    widths = [len(m.group()) for m in rgx.finditer(text_orig)]

    # From that data, we can compute the span.
    return compute_span(i, len(search), widths, text_orig[0] == '#')

幅データからスパンを計算するかなり簡単な方法を次に示します。eyquem が指摘したように、私の最初の試みは間違っていました。2 番目の試みは正しいものでしたが、複雑でした。この 3 番目のアプローチは単純で正しいように思えます。

def compute_span(span_start, search_width, widths, is_marker):
    span_end       = span_start + search_width - 1
    to_consume     = span_start + search_width
    start_is_fixed = False

    for w in widths:
        if is_marker:
            # Shift start and end rightward.
            span_start += (0 if start_is_fixed else w)
            span_end   += w
        else:
            # Reduce amount of non-marker text we need to consume.
            # As that amount gets smaller, we'll first fix the
            # location of the span_start, and then stop.
            to_consume -= w
            if to_consume < search_width:
                start_is_fixed = True
                if to_consume <= 0: break
        # Toggle the flag.
        is_marker = not is_marker

    return [span_start, span_end]

そして、批評家を寄せ付けないための一連のテスト:

def main():
    tests = [
        #                0123456789012345678901234567890123456789
        ( [None, None], '' ),
        ( [ 0,  5],     'foobar' ),
        ( [ 0,  5],     'foobar###' ),
        ( [ 3,  8],     '###foobar' ),
        ( [ 2,  7],     '##foobar###' ),
        ( [25, 34],     'BLAH ##BLAH fo####o##ba##foo###b#ar' ),
        ( [12, 26],     'BLAH ##BLAH fo####o##ba###r## BL##AH' ),
        ( [None, None], 'jkh##jh#f' ),
        ( [ 1, 12],     '#f#oo##ba###r##' ),
        ( [ 4, 15],     'a##xf#oo##ba###r##' ),
        ( [ 4, 15],     'ax##f#oo##ba###r##' ),
        ( [ 7, 18],     'ab###xyf#oo##ba###r##' ),
        ( [ 7, 18],     'abx###yf#oo##ba###r##' ),
        ( [ 7, 18],     'abxy###f#oo##ba###r##' ),
        ( [ 8, 19],     'iji#hkh#f#oo##ba###r##' ),
        ( [ 8, 19],     'mn##pps#f#oo##ba###r##' ),
        ( [12, 23],     'mn##pab###xyf#oo##ba###r##' ),
        ( [12, 23],     'lmn#pab###xyf#oo##ba###r##' ),
        ( [ 0, 12],     'fo##o##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 0, 12],     'fo#o##ba####r## aaaaaBLfoob##ar#AH' ),
        ( [ 0, 12],     'f##oo##ba###r## aaaaaBLfoob##ar' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##arAH' ),
        ( [ 0, 12],     'f#oo##ba####r## aaaaBL#foob##ar#AH' ),
        ( [ 0, 12],     'foo##ba#####r## aaaaBL#foob##ar' ),
        ( [ 1, 12],     '#f#oo##ba###r## aaaBL##foob##arAH' ),
        ( [ 1, 12],     '#foo##ba####r## aaaBL##foob##ar#AH' ),
        ( [ 2, 12],     '#af#oo##ba##r## aaaBL##foob##ar' ),
        ( [ 3, 13],     '##afoo##ba###r## aaaaaBLfoob##arAH' ),
        ( [ 5, 17],     'BLAHHfo##o##ba###r aaBLfoob##ar#AH' ),
        ( [ 5, 17],     'BLAH#fo##o##ba###r aaBLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r###BLfoob##ar' ),
        ( [ 5, 17],     'BLA#Hfo##o##ba###r#BL##foob##ar' ),
    ]
    for exp, t in tests:
        span = do_it('foobar', t, verbose = True)
        if exp != span:
            print '\n0123456789012345678901234567890123456789'
            print t
            print n
            print dict(got = span, exp = exp)

main()

score 3 · Accepted Answer

状況は非常に複雑であるため、正規表現を使用しないでください。行記号を記号ごとに調べてください。

etalone = "String to find"
etalone_length = len(etalone)
counter = 0
for symbol in your_line:
    if symbol == etalone[counter]:
        counter += 1
        if counter == etalone_length:
            print("String matches")
            break
    elif symbol != " " and sybmol != "#":
        # Bad char found
        print("Does not match!")
else:  # exited 'for' before full etalone matched
    print("Does not match!")

最初に一致したシンボルが探しているものではない場合、上記は実際には機能しないことがわかりました。代わりにこれはどうですか：

文字列を複製する
クローンから「#」を削除
パターンと一致
パターンが一致する場合 - 一致した結果の場所を取得する
その場所で - 最初のシンボルの正確な出現箇所を見つけます。フルラインがa#b##ca#d#fで、探しているラインがである場合のように、 2 番目のシンボルadfからマッチングを開始します。 a
a元の行でn 番目に出現するシンボルを検索します。カウンタを設定 =
上記のアルゴリズムを使用 (スパン開始として格納し、スパン終了として前にカウンタを格納break)

score 1 · Accepted Answer

別のより簡単な解決策は、ポンドキーを削除することです

your_string.replace('#', '')

そして、replace によって返された文字列に対して (すべての #* なしで) 正規表現をテストします。

score 1 · Accepted Answer

以前の回答ではre、difflibモジュールとすべてのタグを文字に置き換えるという原則を使用しました。しかし、あなたの問題は、任意の文字で置換する必要なく、
使用するだけで解決できることに気付きました。re

輸入

import re

データ

タプルを使用して、実行中にデータをより読みやすい形式で表示できるようにしました

いくつかの問題を回避するために、データをわずかに変更したことに注意してください:フレームワークとピリオドの
間に空白を 1 つだけ配置し、2 つの文字列のProgramm 7に主要なPを配置するなど。

また、この場合でもコードが機能することを示すために、一連の文字###をphraseand (日付 2014-2015 の前) に追加したことにも注意してください。xmltext他の回答は、この不測の事態を管理しません。

段階

tu_phrase = ('Relationship to the ',
             'strategic framework ',
             'for the period ###2014-2015',
             ': Programme 7, Economic and Social Affairs, ',
             'subprogramme 3, expected accomplishment (c)')
phrase = ''.join(tu_phrase)

XML テキスト

tu_xmltext = ('EEEEEEE',
              '<w:rPr>',
              'AAAAAAA',
              '</w:rPr><w:t>',
              'Relationship to the ',
              '</w:t></w:r><w:r>',
              '<w:rPr><w:i/>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>'
              'strategic framework ',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076">',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              'for the period ###2014-2015',
              '</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr>',
              '<w:sz w:val="17"/><w:sz-cs w:val="17"/>',
              '</w:rPr><w:t>',
              ': Programme 7, Economic and Social Affairs, ',
              'subprogramme 3, expected accomplishment (c)',
              '</w:t>',
              '321354641331')
xmltext = ''.join(tu_xmltext)

働く機能

関数olding_the_new(stuvw , pat_for_sub)は、およびの共通シーケンスの位置の対応を表すトリプル(pmod,w,pori)のリストを返します。これらのシーケンスは of によってキャッチされないものです: - シーケンスを記述します -はその位置です -その幅は [re.sub(pat_for_sub, stuvw) と stuvw で同じです] -これの位置ですオリジナルのシーケンス

stuvwre.sub(pat_for_sub, stuvw)
stuvwgroup(1)pat_for_sub
(pmod,w)re.sub(pat_for_sub, stuvw)
pmodre.sub(pat_for_sub, stuvw)
w
poristuvw

def olding_the_new(stuvw,pat_for_sub):
    triples = []
    pmod = 0 # pmod = position in modified stuvw,
             # that is to say in re.sub(pat_for_sub,'',stuvw)
    for mat in re.finditer('{0}|([\s\S]+?)(?={0}|\Z)'.format(pat_for_sub),
                           stuvw):
        if mat.group(1):
            triples.append((pmod,mat.end()-mat.start(),mat.start()))
            pmod += mat.end()-mat.start()
    return triples


def finding(LITTLE,BIG,pat_for_sub,
            olding_the_new=olding_the_new):
    triples = olding_the_new(BIG,'(?:%s)+' % pat_for_sub)
    modBIG = re.sub(pat_for_sub,'',BIG)
    modLITTLE = re.escape(LITTLE)
    for mat in re.finditer(modLITTLE,modBIG):
        st,nd = mat.span() # in modBIG
        sori = -1 # start original, id est in BIG
        for tr in triples:
            if st < tr[0]+tr[1] and sori<0:
                sori = tr[2] + st - tr[0] 
            if nd<=tr[0]+tr[1]:
                yield(sori, tr[2] + nd - tr[0])
                break

実行

if __name__ == '__main__':
    print ('---------- phrase ----------\n%s\n'
           '\n------- phrase written in a readable form --------\n'
           '%s\n\n\n'
           '---------- xmltext ----------\n%s\n'
           '\n------- xmltext written in a readable form --------\n'
           '%s\n\n\n'
           %
           (phrase  , '\n'.join(tu_phrase),
            xmltext , '\n'.join(tu_xmltext))    )

    print ('*********************************************************\n'
           '********** Searching for phrase in xmltext **************\n'
           '*********************************************************')

    spans = finding(phrase,xmltext,'</?w:[^>]*>')
    if spans:
        for s,e in spans:
            print ("\nspan in string 'xmltext' :  (%d , %d)\n\n"
                   'xmltext[%d:%d] :\n%s'
                   % (s,e,s,e,xmltext[s:e]))
    else:
        print ("-::: The first string isn't in second string :::-")

結果

*********************************************************
********** Searching for phrase in xmltext **************
*********************************************************

span in string 'xmltext' :  (34 , 448)

xmltext[34:448] :
Relationship to the </w:t></w:r><w:r><w:rPr><w:i/><w:sz w:val="17"/><w:sz-cs w:val="17"/>strategic framework </w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>for the period ###2014-2015</w:t></w:r><w:r wsp:rsidRPr="00EC3076"><w:rPr><w:sz w:val="17"/><w:sz-cs w:val="17"/></w:rPr><w:t>: Programme 7, Economic and Social Affairs, subprogramme 3, expected accomplishment (c)

のたベネ

私のコードは、2 つの単語間の空白のシーケンスが句と XML テキストで正確に同じでない場合、XML ドキュメントで句を検出できません。
この可能性を取得しようとしましたが、複雑すぎます。
あなたの例では、表示する XML シーケンスでは、戦略的フレームワークと次のタグの間に空白があり、これらのタグと次のピリオドの間に別の空白があります。この状態では、私のコードは機能しませんでした (この点で他の回答がうまくいくとは思えませxmltextん)。

一方、私のコードは置換文字を使用していないため、置換文字として使用されているときに文字に問題がなくても、XML ドキュメントとフレーズに任意の文字が含まれている可能性があります。

私のコードは、置換文字で変更された中間テキストではなく、元の XML ドキュメントでスパンを直接指定します。

phrase最初のだけでなく、XML ドキュメント内のすべての出現を示します。

................................................

次のデータを使用します。

print ('\n*********************************************************\n'
       "********* Searching for 'foobar' in samples *************\n"
       '*********************************************************')

for xample in ('fo##o##ba###r## aaaaaBLfoob##arAH',
               '#fo##o##ba###r## aaaaaBLfoob##arAH',
               'BLAHHfo##o##ba###r   BLfoob##arAH',
               'BLAH#fo##o##ba###rBLUHYfoob##arAH',
               'BLA# fo##o##ba###rBLyyyfoob##ar',
               'BLA# fo##o##ba###rBLy##foob##ar',
               'kjhfqshqsk'):
    spans = list(finding('foobar',xample,'#'))
    if spans:
        print ('\n%s\n%s'
               %
               (xample,
                '\n'.join('%s  %s'
                          % (sp,xample[sp[0]:sp[1]])
                          for sp in spans))
               )
    else:
        print ("\n%s\n-::: Not found :::-" % xample)

結果は次のとおりです。

*********************************************************
********* Searching for 'foobar' in samples *************
*********************************************************

fo##o##ba###r## aaaaaBLfoob##arAH
(0, 13)  fo##o##ba###r
(23, 31)  foob##ar

#fo##o##ba###r## aaaaaBLfoob##arAH
(1, 14)  fo##o##ba###r
(24, 32)  foob##ar

BLAHHfo##o##ba###r   BLfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLAH#fo##o##ba###rBLUHYfoob##arAH
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLyyyfoob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

BLA# fo##o##ba###rBLy##foob##ar
(5, 18)  fo##o##ba###r
(23, 31)  foob##ar

kjhfqshqsk
-::: Not found :::-

...................................................

次のコードを使用して、質問を調べました。

import urllib

sock = urllib.urlopen('http://stackoverflow.com/'
                      'questions/17381982/'
                      'python-regex-catastrophic-backtracking-where')
r =sock.read()
sock.close()

i = r.find('unpredictable, such as the following')
j = r.find('in order to match the following phrase')
k = r.find('I came up with this regex ')

print 'i == %d   j== %d' % (i,j)
print repr(r[i:j])


print
print 'j == %d   k== %d' % (j,k)
print repr(r[j:k])

結果は次のとおりです。

i == 10408   j== 10714
'unpredictable, such as the following:</p>\n\n<blockquote>\n  Relationship to the #################strategic framework ################## for the period 2014-2015####################: Programme 7, Economic and Social Affairs, subprogramme 3, expected\n  \n  <p>accomplishment (c)#######</p>\n</blockquote>\n\n<p>so '

j == 10714   k== 10955
'in order to match the following phrase:</p>\n\n<blockquote>\n  <p>Relationship to the strategic framework for the period 2014-2015:\n  programme 7, Economic and Social Affairs, subprogramme 3, expected\n  accomplishment (c)</p>\n</blockquote>\n\n<p>'

program 7\nの前の追加、 achievement の前の追加、Program 7とprogram 7の違い、およびフレームワークと文字列フレームワークのピリオドの間の 2 つの空白の存在に注意してください########### ####### の期間 これは、あなたの例であなたの問題を説明するかもしれません.\n <p>

score 1 · Accepted Answer

次のコードは、FMcのコードが機能しないことを示しています。

この行
from name_of_file import olding_the_new,findingは、この質問に対するこのスレッドの私の個人的な回答の私のコードを参照しています。
*name_of_file私のコードのスクリプトを含むファイルに名前を付けます（このスレッドの他の回答にあります）。実行されます。
* または、私のコードをコピーして貼り付けるのが面倒な場合は、インポートのこの行をコメントするだけで、次のコードが実行されますolding_the_new。finding

私はFMcのコードの結果を検証するために 2 つの方法を使用しました:
-1/ 彼のコードによって返されたスパンを'f' のインデックスと'r' のインデックスと比較します。私のコードによって返された最初のスパンと比較すると、foobar -2/のもの以外にfとrはありません。したがって、上記のインポートが必要です
name_of_file

のたベネ

disp = Noneがに変更された場合disp == True、実行はアルゴリズムを理解するのに役立つ中間結果を表示します。

.

import re
from name_of_file import olding_the_new,finding

def main():
    # Two versions of the text: the original,
    # and one without any of the "#" markers.
    for text_orig  in ('BLAH ##BLAH fo####o##ba###r## BL##AH',
                       'jkh##jh#f',
                       '#f#oo##ba###r##',
                       'a##xf#oo##ba###r##',
                       'ax##f#oo##ba###r##',
                       'ab###xyf#oo##ba###r##',
                       'abx###yf#oo##ba###r##',
                       'abxy###f#oo##ba###r##',
                       'iji#hkh#f#oo##ba###r##',
                       'mn##pps#f#oo##ba###r##',
                       'mn##pab###xyf#oo##ba###r##',
                       'lmn#pab###xyf#oo##ba###r##',
                       'fo##o##ba###r## aaaaaBLfoob##arAH',
                       'fo#o##ba####r## aaaaaBLfoob##ar#AH',
                       'f##oo##ba###r## aaaaaBLfoob##ar',
                       'f#oo##ba####r## aaaaBL#foob##arAH',
                       'f#oo##ba####r## aaaaBL#foob##ar#AH',
                       'foo##ba#####r## aaaaBL#foob##ar',
                       '#f#oo##ba###r## aaaBL##foob##arAH',
                       '#foo##ba####r## aaaBL##foob##ar#AH',
                       '#af#oo##ba##r## aaaBL##foob##ar',
                       '##afoo##ba###r## aaaaaBLfoob##arAH',
                       'BLAHHfo##o##ba###r aaBLfoob##ar#AH',
                       'BLAH#fo##o##ba###r aaBLfoob##ar',
                       'BLA#Hfo##o##ba###r###BLfoob##ar',
                       'BLA#Hfo##o##ba###r#BL##foob##ar',
                       ):

        text_clean = text_orig.replace('#', '')
        # Collect data on the positions and widths
        # of the markers in the original text.
        rgx     = re.compile(r'#+')
        markers = [(m.start(), len(m.group()))
                   for m in rgx.finditer(text_orig)]

        # Find the location of the search phrase in the cleaned text.
        # At that point you'll have all the data you need to compute
        # the span of the phrase in the original text.
        search = 'foobar'
        try:
            i = text_clean.index(search)
            print ('text_clean == %s\n'
                   "text_clean.index('%s')==%d   len('%s') == %d\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,i,search,len(search),
                      text_orig,
                      markers))
            S,E = compute_span(i, len(search), markers)
            print "span = (%d,%d)  %s %s     %s"\
                  % (S,E,
                     text_orig.index('f')==S,
                     text_orig.index('r')+1==E,
                     list(finding(search,text_orig,'#+')))
        except ValueError:
            print ('text_clean == %s\n'
                   "text_clean.index('%s')   ***Not found***\n"
                   'text_orig  == %s\n'
                   'markers  == %s'
                   % (text_clean,
                      search,
                      text_orig,
                      markers))
        print '--------------------------------'

.

def compute_span(start, width, markers):
    # start and width are in expurgated text
    # markers are in original text
    disp = None # if disp==True => displaying of intermediary results
    span_start = start
    if disp:
        print ('\nAt beginning in compute_span():\n'
               '  span_start==start==%d   width==%d'
               % (start,width))
    for s, w in markers: # s and w are in original text
        if disp:
            print ('\ns,w==%d,%d'
                   '   s+w-1(%d)<start(%d) %s'
                   '   s(%d)==start(%d) %s'
                   % (s,w,s+w-1,start,s+w-1<start,s,start,s==start))
        if s + w - 1 < start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # the following if-else section is justified to be used
            # only after correction of the above line to this one:
            # if s+w-1 <= start or s==start:
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwm
            if s + w - 1 <= start and disp:
                print '  1a) s + w - 1 (%d) <= start (%d)   marker at left'\
                      % (s+w-1, start)
            elif disp:
                print '  1b) s(%d) == start(%d)' % (s,start)
            #mwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmwmmwmwmwmwmwm
            # Situation: marker fully to left of our text.
            # Adjust our start points rightward.
            start      += w
            span_start += w
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
        elif start + width - 1 < s:
            if disp:
                print ('  2) start + width - 1 (%d) < s (%d)   marker at right\n'
                       '  break' % (start+width-1, s))
            # Situation: marker fully to the right of our text.
            break
        else:
            # Situation: marker interrupts our text.
            # Advance the start point for the remaining text
            # rightward, and reduce the remaining width.
            if disp:
                print "  3) In 'else': s - start == %d   marker interrupts" % (s - start)
            start += w
            width = width - (s - start)
            if disp:
                print '  span_start == %d   start, width == %d, %d' % (span_start, start, width)
    return (span_start, start + width)

.

main()

結果

>>> 
text_clean == BLAH BLAH foobar BLAH
text_clean.index('foobar')==10   len('foobar') == 6
text_orig  == BLAH ##BLAH fo####o##ba###r## BL##AH
markers  == [(5, 2), (14, 4), (19, 2), (23, 3), (27, 2), (32, 2)]
span = (12,26)  True False     [(12, 27)]
--------------------------------
text_clean == jkhjhf
text_clean.index('foobar')   ***Not found***
text_orig  == jkh##jh#f
markers  == [(3, 2), (7, 1)]
--------------------------------
text_clean == foobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r##
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2)]
span = (0,11)  False False     [(1, 13)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == a##xf#oo##ba###r##
markers  == [(1, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,16)  False True     [(4, 16)]
--------------------------------
text_clean == axfoobar
text_clean.index('foobar')==2   len('foobar') == 6
text_orig  == ax##f#oo##ba###r##
markers  == [(2, 2), (5, 1), (8, 2), (12, 3), (16, 2)]
span = (2,15)  False False     [(4, 16)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == ab###xyf#oo##ba###r##
markers  == [(2, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abx###yf#oo##ba###r##
markers  == [(3, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,18)  False False     [(7, 19)]
--------------------------------
text_clean == abxyfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == abxy###f#oo##ba###r##
markers  == [(4, 3), (8, 1), (11, 2), (15, 3), (19, 2)]
span = (4,19)  False True     [(7, 19)]
--------------------------------
text_clean == ijihkhfoobar
text_clean.index('foobar')==6   len('foobar') == 6
text_orig  == iji#hkh#f#oo##ba###r##
markers  == [(3, 1), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnppsfoobar
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == mn##pps#f#oo##ba###r##
markers  == [(2, 2), (7, 1), (9, 1), (12, 2), (16, 3), (20, 2)]
span = (7,18)  False False     [(8, 20)]
--------------------------------
text_clean == mnpabxyfoobar
text_clean.index('foobar')==7   len('foobar') == 6
text_orig  == mn##pab###xyf#oo##ba###r##
markers  == [(2, 2), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == lmnpabxyfoobar
text_clean.index('foobar')==8   len('foobar') == 6
text_orig  == lmn#pab###xyf#oo##ba###r##
markers  == [(3, 1), (7, 3), (13, 1), (16, 2), (20, 3), (24, 2)]
span = (9,24)  False True     [(12, 24)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo##o##ba###r## aaaaaBLfoob##arAH
markers  == [(2, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,9)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == fo#o##ba####r## aaaaaBLfoob##ar#AH
markers  == [(2, 1), (4, 2), (8, 4), (13, 2), (27, 2), (31, 1)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f##oo##ba###r## aaaaaBLfoob##ar
markers  == [(1, 2), (5, 2), (9, 3), (13, 2), (27, 2)]
span = (0,11)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##arAH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == f#oo##ba####r## aaaaBL#foob##ar#AH
markers  == [(1, 1), (4, 2), (8, 4), (13, 2), (22, 1), (27, 2), (31, 1)]
span = (0,8)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaaBLfoobar
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == foo##ba#####r## aaaaBL#foob##ar
markers  == [(3, 2), (7, 5), (13, 2), (22, 1), (27, 2)]
span = (0,7)  True False     [(0, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #f#oo##ba###r## aaaBL##foob##arAH
markers  == [(0, 1), (2, 1), (5, 2), (9, 3), (13, 2), (21, 2), (27, 2)]
span = (0,11)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == foobar aaaBLfoobarAH
text_clean.index('foobar')==0   len('foobar') == 6
text_orig  == #foo##ba####r## aaaBL##foob##ar#AH
markers  == [(0, 1), (4, 2), (8, 4), (13, 2), (21, 2), (27, 2), (31, 1)]
span = (0,12)  False False     [(1, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaBLfoobar
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == #af#oo##ba##r## aaaBL##foob##ar
markers  == [(0, 1), (3, 1), (6, 2), (10, 2), (13, 2), (21, 2), (27, 2)]
span = (2,10)  True False     [(2, 13), (23, 31)]
--------------------------------
text_clean == afoobar aaaaaBLfoobarAH
text_clean.index('foobar')==1   len('foobar') == 6
text_orig  == ##afoo##ba###r## aaaaaBLfoob##arAH
markers  == [(0, 2), (6, 2), (10, 3), (14, 2), (28, 2)]
span = (1,14)  False True     [(3, 14), (24, 32)]
--------------------------------
text_clean == BLAHHfoobar aaBLfoobarAH
text_clean.index('foobar')==5   len('foobar') == 6
text_orig  == BLAHHfo##o##ba###r aaBLfoob##ar#AH
markers  == [(7, 2), (10, 2), (14, 3), (27, 2), (31, 1)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobar aaBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLAH#fo##o##ba###r aaBLfoob##ar
markers  == [(4, 1), (7, 2), (10, 2), (14, 3), (27, 2)]
span = (4,16)  False False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r###BLfoob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 3), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
text_clean == BLAHfoobarBLfoobar
text_clean.index('foobar')==4   len('foobar') == 6
text_orig  == BLA#Hfo##o##ba###r#BL##foob##ar
markers  == [(3, 1), (7, 2), (10, 2), (14, 3), (18, 1), (21, 2), (27, 2)]
span = (5,14)  True False     [(5, 18), (23, 31)]
--------------------------------
>>>

.

----------------------------------------------

FMcのコードは非常に微妙で、その原理を理解して修正するのに長い間苦労しました。
アルゴリズムを理解するタスクは誰にでも任せます。FMcのコードを正しく動作させるために必要な修正のみを述べます。

.

最初の修正:

if s + w - 1 < start:
# must be changed to  
if s + w - 1 <= start or (s==start):

編集

私の最初の現在の回答では、
私は書いていました... or (s<=start)。
それは私の間違いでした、実際には私は書くつもりでした
.. or (s==start)

この編集に関するNOTA BENE：

このエラーは、 FMcの最初のコードを修正するためにここで説明する 2 つの修正で修正されたコードには何の影響もありませんでした (現在、彼は 2 回変更しているため、最初のコードです)。
実際、コードを 2 つの修正で修正すると、とで取得した 25 の例すべてで正しい結果が得られtext_origます。したがって、最初の条件が Falseの場合、True のケースは決して発生しないと考えました。これは、おそらく常に 0 より大きいという事実と、マーカーと非マーカーシーケンスの構成によるその他の理由に基づいています... .. というわけで、この印象のデモンストレーションを見つけようとしましたが、失敗しました。... or (s <= start)... or (s==start)
s < starts+w-1 <= startw

さらに、私はFMcのアルゴリズム(彼が行った編集の前の最初のアルゴリズム) さえも理解できないという精神状態に達しました!!
それにもかかわらず、私はこの回答をそのままにして、この回答の最後に、これらの修正が必要な理由を説明しようとする説明を投稿します。しかし、私は警告します: FMc
の最初のアルゴリズムは、2 つの異なる文字列に属するインデックスを比較するため、非常に風変わりで理解しにくいものです。 .そして今、私はそれが理にかなっているかもしれないと確信していません....

.

2 番目の修正::

start += w
width = width - (s - start)
# must be changed to   
width -= (s-start) # this line MUST BE before the following one
start = s + w # because start += (s-start) + w

-------------------

間違ったコードを与えているにもかかわらず、2 人が FMc の回答に賛成票を投じたことに唖然としました。これは、指定されたコードをテストせずに回答に賛成票を投じたことを意味します。

--------------------------------------------

.

編集

条件を次のif s + w - 1 < start:ように変更する必要があるのはなぜ
if s + w - 1 <= start or (s==start):ですか?

s + w - 1 < startFalse と一緒にs等しいことが起こる可能性があるためです。この場合、実行はセクションに移動し、(修正されたコードで) 実行します。 start
else

width -= (s - start)
start = s + w

したがって、width関係するシーケンスを見ると明らかに変化するはずですが、変化しません。

このケースは、次のシーケンスのように、最初のマーカーが検査される瞬間に発生する可能性があります。

'#f#oo##ba###r##' : s,w==0,1 , 0==s==start==0  
'ax##f#oo##ba###r##' : s,w==2,2 , 2==s==start==2    
'abxy###f#oo##ba###r##' : s,w==4,3 , 4==s==start==4  
'#f#oo##ba###r## aaaBL##foob##arAH' : s,w==0,1 , 0==s==start==0  
'BLAH#fo##o##ba###r aaBLfoob##ar' : s,w==4,1 4==s==start==4

次のものでは、2 番目のマーカーの検査で発生します。

'iji#hkh#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7  
'mn##pps#f#oo##ba###r##' : s,w==7,1 , 7==s==start==7

私のコードを設定して実行すると、よりよく理解できますdisp = True。

がs + w - 1 <= start検証されると、実行がセクションに移動せず、 toと toの追加のみがある最初のセクションに移動するため、 smay equalであることは問題ありません。しかしis False while equalsの場合、命令の実行がwidthの値に何も変化しないセクションに実行が移ってしまい、それが面倒です。そのため、この宛先を防ぐために条件を追加する必要があり、いくつかの例が示すように、False の場合でもこの宛先を防ぐために条件を追加する必要があります。startelsewsstart
s + w - 1 <= startsstartelsewidth -= (s-start)
or (s==start)elseors+w-1 <= start

.

命令を(with =) にs+w-1 < start変更しなければならないのは、 (2 番目のマーカー) と(1 番目のマーカー) の場合と同様に、1 文字#のみに対応するケースのためです。s+w-1 <= start
w==1

mn##pps#f#oo##ba###r##
BLAH#fo##o##ba###r

score 0 · Accepted Answer

使用しなくてregexも、やりたいことを取得できます。

text.replace('#','').replace('  ',' ')

score 0 · Accepted Answer

XMLパーサーで深さ優先検索？

後で逆引きするために、テキストノードが見つかった xml ドキュメント内の位置を覚えておくとよいでしょう。あなたの実際の目標はまだ不明です。

python - Python正規表現の壊滅的なバックトラッキング

8 に答える 8

輸入

データ

段階

XML テキスト

働く機能

実行

結果

のたベネ

................................................

...................................................

のたベネ

結果

----------------------------------------------

編集

この編集に関するNOTA BENE：

-------------------

--------------------------------------------

編集

Related

Reference