python - Pyparsing: Detecting tokens with a specific ending

Question

What am I doing wrong here? Maybe someone can give me a hint about this problem. I want to use pyparsing to detect specific tokens that end with the string _Init.

As an example, I have the following lines stored in text:

one
two_Init
threeInit
four_foo_Init
five_foo_bar_Init

I want to extract the following lines:

two_Init
four_foo_Init
five_foo_bar_Init

So far, I have reduced my problem to the following lines:

    import pyparsing as pp

    ident = pp.Word(pp.alphas, pp.alphanums + "_")
    ident_init = pp.Combine(ident + pp.Literal("_Init"))

    for detected, s, e in ident_init.scanString(text):
        print(detected)

With this code I get no results. If I remove the "_" in the Word statement, I can at least detect the lines that have _Init at their ends, but the result is not complete:

['two_Init']
['foo_Init']
['bar_Init']

Does anyone have an idea what I am doing completely wrong here?

score 2 · Accepted Answer

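The problem is that you want to accept '_' as long as it is not the '_' in the terminating '_Init'. Here are two pyparsing solutions: one is more "pure" pyparsing, and the other simply gives up on that and uses an embedded regular expression.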

    samples = """\
    one
    two_Init
    threeInit
    four_foo_Init
    six_seven_Init_eight_Init
    five_foo_bar_Init"""


    from pyparsing import Combine, OneOrMore, Word, alphas, alphanums, Literal, WordEnd, Regex

    # implement explicit lookahead: allow '_' as part of your Combined OneOrMore,
    # as long as it is not followed by "Init" and the end of the word
    option1 = Combine(OneOrMore(Word(alphas, alphanums) |
                                '_' + ~(Literal("Init") + WordEnd()))
                      + "_Init")

    # sometimes regular expressions and their implicit lookahead/backtracking do
    # make things easier
    option2 = Regex(r'\b[a-zA-Z_][a-zA-Z0-9_]*_Init\b')

    for expr in (option1, option2):
        # searchString returns one token list per match; t[0] is the matched text
        print('\n'.join(t[0] for t in expr.searchString(samples)))
        print()

Both options print the following:

two_Init
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init
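
As a side note on why the attempt in the question finds nothing: Word(alphas, alphanums + "_") is greedy and pyparsing does not backtrack, so the Word already consumes the trailing _Init and the Literal that follows has nothing left to match. The following is only a minimal sketch of that behaviour, assuming the sample lines from the question are stored in text. It also shows one possible alternative that is not part of the answer above: match whole identifiers and filter on the suffix in plain Python.

```python
import pyparsing as pp

# Sample input from the question (assumed to be stored in `text`).
text = """\
one
two_Init
threeInit
four_foo_Init
five_foo_bar_Init"""

# Same identifier definition as in the question: "_" is a legal body character.
ident = pp.Word(pp.alphas, pp.alphanums + "_")

# The greedy Word swallows the whole token, including the trailing "_Init" ...
greedy = ident.parseString("two_Init")[0]

# ... so an expression that requires "_Init" after the Word never matches.
ident_init = pp.Combine(ident + pp.Literal("_Init"))
no_matches = list(ident_init.scanString(text))

# Alternative sketch: match complete identifiers, then filter on the suffix.
matches = [tokens[0] for tokens, start, end in ident.scanString(text)
           if tokens[0].endswith("_Init")]

print(greedy)
print(no_matches)
print("\n".join(matches))
```

Filtering after the match keeps the grammar trivial, but it only works when the suffix test can be applied to a complete token; the lookahead in option1 stays preferable when the expression is part of a larger grammar.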
