python - Use pyparsing to parse a word escape-split over multiple lines

Question

I'm trying to use pyparsing to parse words that can be split across multiple lines with a backslash-newline combination ("\\n"). Here is what I have done:

from pyparsing import *

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))

The output I get is ['super'], while the expected output is ['super', 'cali', 'fragi', 'listic']. Better still would be to have them all joined as one word (which I think I can do with multi_line_word.setParseAction(lambda t: ''.join(t))).

I tried looking at this code in the pyparsing helper, but it gives me the error maximum recursion depth exceeded.

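EDIT 2009-11-15: I realized later that pyparsing is a little generous with regard to whitespace, which led me to some poor assumptions: what I thought I was parsing for was actually matched far more loosely. That is to say, there must be no whitespace between any of the parts of the word, the escape, and the EOL character.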

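I realized that the small example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should match what I intuitively think of as an escape-split word, and only an escape-split word. It will not match a basic word that is not escape-split. We can (and I believe should) use a different grammatical construct for that. Keeping the two separate keeps everything tidy.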

import unittest
import pyparsing

# Assumes you named your module 'multiline.py'
import multiline

class MultiLineTests(unittest.TestCase):

    def test_continued_ending(self):

        case = '\\\n'
        expected = ['\\', '\n']
        result = multiline.continued_ending.parseString(case).asList()
        self.assertEqual(result, expected)


    def test_continued_ending_space_between_parse_error(self):

        case = '\\ \n'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.continued_ending.parseString,
            case
        )


    def test_split_word(self):

        cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
        expected = ['shiny']
        for case in cases:
            result = multiline.split_word.parseString(case).asList()
            self.assertEqual(result, expected)


    def test_split_word_no_escape_parse_error(self):

        case = 'shiny'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.split_word.parseString,
            case
        )


    def test_split_word_space_parse_error(self):

        cases = ('shiny \\', 'shiny\r\\', 'shiny\t\\', 'shiny\\ ')
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.split_word.parseString,
                case
            )


    def test_multi_line_word(self):

        cases = (
                'shiny\\',
                'shi\\\nny',
                'sh\\\ni\\\nny\\\n',
                ' shi\\\nny\\',
                'shi\\\nny ',
                'shi\\\nny captain'
        )
        expected = ['shiny']
        for case in cases:
            result = multiline.multi_line_word.parseString(case).asList()
            self.assertEqual(result, expected)


    def test_multi_line_word_spaces_parse_error(self):

        cases = (
                'shi \\\nny',
                'shi\\ \nny',
                'sh\\\n iny',
                'shi\\\n\tny',
        )
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.multi_line_word.parseString,
                case
            )


if __name__ == '__main__':
    unittest.main()

score 6 · Accepted Answer

After poking around a bit more, I came upon this help thread, which contained this notable bit:

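I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...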

So I got the idea to change these two lines

multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

to

multi_line_word = ZeroOrMore(split_word) + word

This produced the output I was looking for: ['super', 'cali', 'fragi', 'listic'].

Next, I added a parse action that joins these tokens together:

multi_line_word.setParseAction(lambda t: ''.join(t))

This gives a final output of ['supercalifragilistic'].
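The two steps above can be put together in one small script. This is only a sketch: it assumes pyparsing is installed and that the camelCase API used in this thread (parseString, setParseAction) is available. The names text, pieces, joined_word and joined are illustrative and not part of the original code.

```python
from pyparsing import Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

# Grammar pieces from the question (the loose version, before any whitespace checks).
continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Repetition replaces the recursive Forward definition.
multi_line_word = ZeroOrMore(split_word) + word

# Same input as in the question: a backslash right before each newline.
text = 'super\\\ncali\\\nfragi\\\nlistic'

pieces = multi_line_word.parseString(text).asList()
print(pieces)   # ['super', 'cali', 'fragi', 'listic']

# A second expression with the joining parse action attached.
joined_word = (ZeroOrMore(split_word) + word).setParseAction(lambda t: ''.join(t))
joined = joined_word.parseString(text).asList()
print(joined)   # ['supercalifragilistic']
```

The parse action is attached to a separate expression here only so that both outputs can be shown side by side; attaching it directly to multi_line_word, as in the answer, works the same way.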

The take-home message I learned is that one doesn't simply walk into Mordor.

Just kidding.

The take-home message is that one can't simply implement a one-to-one translation of BNF with pyparsing. Some tricks using the repetition types (such as ZeroOrMore and OneOrMore) are needed.
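To make the difference concrete, here is a short sketch that runs the literal BNF-style rule and the repetition-based rule on the same input. It assumes pyparsing is installed; the names bnf_style, repeated and the shortened two-line input are chosen for illustration only.

```python
from pyparsing import Forward, Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

text = 'super\\\ncali'

# Literal BNF translation: '|' takes the first alternative that matches,
# so the bare word wins and parsing stops right after it.
bnf_style = Forward()
bnf_style << (word | (split_word + bnf_style))
bnf_result = bnf_style.parseString(text).asList()
print(bnf_result)       # ['super']

# Repetition-based version of the same rule.
repeated = ZeroOrMore(split_word) + word
repeated_result = repeated.parseString(text).asList()
print(repeated_result)  # ['super', 'cali']
```

Note that parseString does not complain about the unparsed remainder here because it is not asked to consume the whole input, which is why the first version silently returns only the first fragment.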

EDIT 2009-11-25: To account for my more strenuous test cases, I modified the code to the following:

no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))

This has the benefit of making sure that no whitespace comes between any of the elements (with the exception of the newline after each escaping backslash).
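A quick way to see the stricter grammar at work is to feed it one well-formed input and two inputs with stray spaces. This is a sketch under the assumption that pyparsing is installed; try_parse is a hypothetical helper introduced here for illustration, and the inputs are taken from the unit tests in the question.

```python
from pyparsing import (Literal, NotAny, OneOrMore, Optional, ParseException,
                       Suppress, White, Word, alphas, lineEnd)

# The grammar from the edit above, unchanged.
no_space = NotAny(White(' \t\r'))
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))


def try_parse(text):
    """Hypothetical helper: return the token list, or None if the text is rejected."""
    try:
        return multi_line_word.parseString(text).asList()
    except ParseException:
        return None


accepted = try_parse('shi\\\nny')           # well-formed escape-split word
rejected_before = try_parse('shi \\\nny')   # space between the word and the backslash
rejected_after = try_parse('shi\\ \nny')    # space between the backslash and the newline

print(accepted)         # ['shiny']
print(rejected_before)  # None
print(rejected_after)   # None
```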

score 5 · Accepted Answer

You are pretty close with your code. Any of these mods would work:

# '|' means MatchFirst, so you had a left-recursive expression
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)

# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))

# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)

# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))

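As you found in your pyparsing Googling, BNF->pyparsing translations should be done with a particular view to using pyparsing features in place of the corresponding BNF constructs. I was actually in the middle of composing a longer answer, going further into the BNF translation issues, but you have already found this material (on the wiki, I assume).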
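As a rough check of two of the variants listed above (the reordered alternatives and the Combine version), the following sketch runs both against the example from the question. It assumes pyparsing is installed and that delimitedList is still exposed under that name, which recent releases keep as a compatibility alias. The names recursive_word and combined_word are illustrative.

```python
from pyparsing import (Combine, Forward, Literal, Suppress, Word, alphas,
                       delimitedList, lineEnd)

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

text = 'super\\\ncali\\\nfragi\\\nlistic'

# Variant 1: keep the recursion, but try the longer alternative first.
recursive_word = Forward()
recursive_word << ((split_word + recursive_word) | word)
recursive_result = recursive_word.parseString(text).asList()
print(recursive_result)  # ['super', 'cali', 'fragi', 'listic']

# Variant 2: treat backslash-newline as a delimiter and merge the pieces with Combine.
combined_word = Combine(delimitedList(word, continued_ending))
combined_result = combined_word.parseString(text).asList()
print(combined_result)   # ['supercalifragilistic']
```

Combine also requires its matched pieces to be adjacent by default, so the second variant gets part of the no-whitespace behavior without a separate parse action.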
