python - Tips on how to parse a custom file format

Question

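Sorry about the vague title, but I really don't know how to describe this problem concisely.

I've created a (more or less) simple domain-specific language that I will use to specify which validation rules to apply to different entities (typically forms submitted from a web page). I've included a sample of what the language looks like at the end of this post.

My problem is that I have no idea how to begin parsing this language into a form that I can use (I will be using Python to do the parsing). My goal is to end up with a list of rules/filters (as strings, including arguments, e.g. 'cocoa(99)') that should be applied (in order) to each object/entity (also a string, e.g. 'chocolate', 'chocolate.lindt', etc.).

I am not sure which technique to start with, or even which techniques exist for problems like this. What do you think is the best way to go about it? I'm not looking for a complete solution, just a general nudge in the right direction.

Thanks.

Sample file of the language: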

# Comments start with the '#' character and last until the end of the line
# Indentation is significant (as in Python)


constant NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`


*:      # Applies to all data
    isYummy             # Everything must be yummy

chocolate:              # To validate, say `validate("chocolate", object)`
    sweet               # chocolate must be sweet (but not necessarily chocolate.*)

    lindt:              # To validate, say `validate("chocolate.lindt", object)`
        tasty           # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        *:              # Applies to all data under chocolate.lindt
            smooth      # Could also be written smooth()
            creamy(1)   # Level 1 creamy
        dark:           # dark has no special validation rules
            extraDark:
                melt            # Filter that modifies the object being examined
                c:bitter        # Must be bitter, but only validated on client
                s:cocoa(NINETY_NINE)    # Must contain 99% cocoa, but only validated on server. Note constant
        milk:
            creamy(2)   # Level 2 creamy, overrides creamy(1) of chocolate.lindt.* for chocolate.lindt.milk
            creamy(3)   # Overrides creamy(2) of previous line (all but the last specification of a given rule are ignored)



ruleset food:       # To define a chunk of validation rules that can be expanded from the placeholder `food` (think macro)
    caloriesWithin(10, 2000)        # Unlimited parameters allowed
    edible
    leftovers:      # Nested rules allowed in rulesets
        stale

# Rulesets may be nested and/or include other rulesets in their definition



chocolate:              # Previously defined groups can be re-opened and expanded later
    ferrero:
        hasHazelnut



cake:
    tasty               # Same rule used for different data (see chocolate.lindt)
    isLie
    ruleset food        # Substitutes with rules defined for food; cake.leftovers must now be stale


pasta:
    ruleset food        # pasta.leftovers must also be stale




# Sample use (in JavaScript):

# var choc = {
#   lindt: {
#       cocoa: {
#           percent: 67,
#           mass:    '27g'
#       }
#   }
#   // Objects/groups that are omitted (e.g. ferrero in this example) are not validated and raise no errors
#   // Objects that are not defined in the validation rules do not raise any errors (e.g. cocoa in this example)
# };
# validate('chocolate', choc);

# `validate` called isYummy(choc), sweet(choc), isYummy(choc.lindt), smooth(choc.lindt), creamy(choc.lindt, 1), and tasty(choc.lindt) in that order
# `validate` returned an array of any validation errors that were found

# Order of rule validation for objects:
# The current object is initially the object passed in to the validation function (second argument).
# The entry point in the rule group hierarchy is given by the first argument to the validation function.
# 1. First all rules that apply to all objects (defined using '*') are applied to the current object,
#    starting with the most global rules and ending with the most local ones.
# 2. Then all specific rules for the current object are applied.
# 3. Then a depth-first traversal of the current object is done, repeating steps 1 and 2 with each object found as the current object
# When two rules have equal priority, they are applied in the order they were defined in the file.



# No need to end on blank line

score 9 · Accepted Answer

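First off, if you want to learn about parsing, then write your own recursive descent parser. The language you've defined only requires a handful of productions. I suggest using Python's tokenize library to spare yourself the tedious task of converting a stream of bytes into a stream of tokens.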
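
As a rough illustration of that suggestion, here is a minimal recursive descent sketch built on the standard `tokenize` module, which already emits INDENT and DEDENT tokens. It assumes a reduced grammar (groups and rules with optional arguments only); constants, rulesets, the `c:`/`s:` prefixes, and merging of re-opened groups are left out. The function name `parse_rules` and the nested-dict output shape are hypothetical choices, not part of the original language definition.

```python
import io
import tokenize

SKIPPED = {tokenize.COMMENT, tokenize.NL}  # comments and blank lines


def parse_rules(source):
    """Parse groups and rules into nested dicts (hypothetical output shape)."""
    readline = io.StringIO(source).readline
    tokens = [t for t in tokenize.generate_tokens(readline)
              if t.type not in SKIPPED]
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect(tok_type):
        tok = advance()
        if tok.type != tok_type:
            raise SyntaxError("unexpected %r on line %d"
                              % (tok.string, tok.start[0]))

    def parse_args():
        # args := [ '(' [ value (',' value)* ] ')' ]
        args = []
        if peek().string == "(":
            advance()
            while peek().string != ")":
                args.append(advance().string)
                if peek().string == ",":
                    advance()
            advance()  # consume ')'
        return args

    def parse_block(end_type):
        # block := (group | rule)*
        node = {"rules": [], "groups": {}}
        while peek().type != end_type:
            name = advance()
            if name.type != tokenize.NAME and name.string != "*":
                raise SyntaxError("expected a name on line %d" % name.start[0])
            if peek().string == ":":
                # group := name ':' NEWLINE [ INDENT block DEDENT ]
                advance()
                expect(tokenize.NEWLINE)
                child = {"rules": [], "groups": {}}
                if peek().type == tokenize.INDENT:
                    advance()
                    child = parse_block(tokenize.DEDENT)
                    expect(tokenize.DEDENT)
                node["groups"][name.string] = child
            else:
                # rule := name args NEWLINE
                node["rules"].append((name.string, parse_args()))
                expect(tokenize.NEWLINE)
        return node

    return parse_block(tokenize.ENDMARKER)


SAMPLE = """\
*:      # Applies to all data
    isYummy
chocolate:
    sweet
    lindt:
        tasty
        *:
            smooth
            creamy(1)
"""
tree = parse_rules(SAMPLE)
```

Each grammar production maps to one small function, which is the main appeal of recursive descent for a language of this size. Arguments are kept as raw token strings here; resolving constants such as `NINETY_NINE` would be a separate pass.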

For practical parsing options, read on...

A quick and dirty solution is to use Python itself:

NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`

rules = {
  '*': {     # Applies to all data
    'isYummy': {},      # Everything must be yummy

    'chocolate': {        # To validate, say `validate("chocolate", object)`
      'sweet': {},        # chocolate must be sweet (but not necessarily chocolate.*)

      'lindt': {          # To validate, say `validate("chocolate.lindt", object)`
        'tasty': {},      # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        '*': {            # Applies to all data under chocolate.lindt
          'smooth': {},   # Could also be written smooth()
          'creamy': 1,    # Level 1 creamy
        },
        # ...
      },
    },
  },
}

There are several ways to pull off this trick. For instance, here's a cleaner (albeit somewhat unusual) approach using classes:

class _:
    class isYummy: pass

    class chocolate:
        class sweet: pass

        class lindt:
            class tasty: pass

            class _:
                class smooth: pass
                class creamy: level = 1
# ...

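As an intermediate step towards a full parser, you can use the "batteries-included" Python parser, which parses Python syntax and returns an AST. The AST is very deep, with lots of (IMO) unnecessary levels. You can filter these down to a much simpler structure by culling any nodes that have only one child. With this approach, you can do something like this: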

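# Note: the `parser` and `symbol` modules were removed in Python 3.10,
# so this example only runs on Python 3.9 or earlier.
import parser, token, symbol, pprint

# Map numeric node codes to readable names (works on Python 2 and 3).
_map = dict(token.tok_name)
_map.update(symbol.sym_name)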

def clean_ast(ast):
    if not isinstance(ast, list):
        return ast
    elif len(ast) == 2: # Elide single-child nodes.
        return clean_ast(ast[1])
    else:
        return [_map[ast[0]]] + [clean_ast(a) for a in ast[1:]]

ast = parser.expr('''{

'*': {     # Applies to all data
  isYummy: _,    # Everything must be yummy

  chocolate: {        # To validate, say `validate("chocolate", object)`
    sweet: _,        # chocolate must be sweet (but not necessarily chocolate.*)

    lindt: {          # To validate, say `validate("chocolate.lindt", object)`
      tasty: _,        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

      '*': {            # Applies to all data under chocolate.lindt
        smooth: _,  # Could also be written smooth()
        creamy: 1   # Level 1 creamy
      }
# ...
    }
  }
}

}''').tolist()
pprint.pprint(clean_ast(ast))

This approach does have its limitations. The final AST is still a bit noisy, and the language you define has to be interpretable as valid Python code. For instance, you couldn't support this...

*:
    isYummy

...because this syntax doesn't parse as Python code. Its big advantage, however, is that you control the AST conversion, so it is impossible to inject arbitrary Python code.
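
On current Python versions, the standard `ast` module offers a similar route and produces a much shallower tree. The sketch below is an assumption-laden illustration rather than the answer's own code: the helper name `to_rules` and the sample layout are made up, and only dict literals, bare names, and constants are accepted. Anything else (a function call, for instance) is rejected, which is what keeps arbitrary code from being injected.

```python
import ast


def to_rules(node):
    """Convert a restricted expression AST into plain Python data."""
    if isinstance(node, ast.Expression):
        return to_rules(node.body)
    if isinstance(node, ast.Dict):
        return {to_rules(k): to_rules(v)
                for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Name):
        return node.id  # bare identifiers become plain strings
    if isinstance(node, ast.Constant):
        return node.value
    # Calls, attribute access, operators, etc. are all refused.
    raise ValueError("unsupported syntax: %s" % type(node).__name__)


# Hypothetical sample; `_` marks a rule that takes no arguments.
source = """{
    '*': {isYummy: _},
    chocolate: {
        sweet: _,
        lindt: {tasty: _, '*': {smooth: _, creamy: 1}},
    },
}"""

rules = to_rules(ast.parse(source, mode="eval"))
```

The source is only parsed, never evaluated, so names such as `isYummy` do not need to exist anywhere.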

score 5 · Answer

Again, this doesn't teach you about parsing, but your format is so close to legal YAML that you might want to just redefine your language as a subset of YAML and use a standard YAML parser.

score 3 · Answer

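If your goal is to learn about parsing, I highly recommend an OO-style library like PyParsing. Such libraries are not as fast as the more sophisticated ANTLR, lex, and yacc options, but you can get started with parsing right away.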

score 2 · Answer

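As "Marcelo Cantos" suggested, you can use a Python dict. The benefit is that you don't have to parse anything: you can use the same rules on the server side as a Python dict and on the client side as JavaScript objects, and pass them from server to client (or vice versa) as JSON.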
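
A small sketch of that round trip using the standard `json` module. The `rules` layout here is a hypothetical example in the spirit of the dict shown in the earlier answer, not a fixed schema.

```python
import json

# Hypothetical rule set: an empty dict marks a rule without arguments.
rules = {
    "*": {"isYummy": {}},
    "chocolate": {
        "sweet": {},
        "lindt": {"tasty": {}, "*": {"smooth": {}, "creamy": 1}},
    },
}

# Server side: serialize the rules so they can be sent to the browser.
payload = json.dumps(rules)

# Either side can rebuild the same structure from the JSON text.
restored = json.loads(payload)
```

On the client, `JSON.parse` on the same payload would yield the equivalent JavaScript object.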

If you really want to do the parsing yourself, see http://nedbatchelder.com/text/python-parsers.html

But I am not sure whether you will be able to parse an indented language easily.
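
For what it's worth, indentation can be tracked by hand with a stack of open groups. The following is a minimal sketch under simplifying assumptions: consistent space indentation, `#` never appearing inside an argument, and no special handling of `constant`, `ruleset`, or the `c:`/`s:` prefixes. The function name `flatten_rules` is hypothetical.

```python
def flatten_rules(text):
    """Map dotted group paths to rule strings, using an indentation stack."""
    result = {}
    stack = []  # (indent, group name) pairs for the currently open groups
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue  # skip blank and comment-only lines
        indent = len(line) - len(line.lstrip())
        content = line.strip()
        # Close every group that is indented at least as far as this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if content.endswith(":"):
            stack.append((indent, content[:-1]))  # open a new group
        else:
            path = ".".join(name for _, name in stack)
            result.setdefault(path, []).append(content)
    return result


sample = """
chocolate:              # To validate, say `validate("chocolate", object)`
    sweet
    lindt:
        tasty
        *:
            smooth
            creamy(1)
        milk:
            creamy(2)
"""
flat = flatten_rules(sample)
```

The result maps each entity path (such as `chocolate.lindt.*`) to its rule strings in file order, which is close to the output the question asks for.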

score 1 · Answer

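The language you've given an example of is probably too complex to handle with a simple (and bug-free) parsing function. I would suggest reading up on parsing techniques such as recursive descent, or table-driven parsing such as LL(1), LL(k), etc.

But that may be too general and/or too complex. It might be easier to simplify your rules language into something simple like delimited text.

For example, something like

chocolate:sweet
chocolate.lindt:tasty
chocolate.lindt.*:smooth,creamy(1)

This would be easy to parse and could be done without a formal parser.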
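
A sketch of how such lines might be read, assuming one `path:rule,rule,...` entry per line and no nested parentheses in arguments. The name `parse_line` and the regular expression are illustrative choices, not part of the proposal above.

```python
import re

# A rule is a run of characters up to a comma or '(',
# optionally followed by one parenthesised argument list.
RULE = re.compile(r"[^,(]+(?:\([^)]*\))?")


def parse_line(line):
    """Split 'path:rule,rule(args)' into (path, [rule strings])."""
    path, _, rule_text = line.partition(":")
    found = (match.strip() for match in RULE.findall(rule_text))
    return path.strip(), [rule for rule in found if rule]


lines = [
    "chocolate:sweet",
    "chocolate.lindt:tasty",
    "chocolate.lindt.*:smooth,creamy(1)",
]
parsed = dict(parse_line(line) for line in lines)
```

Matching rules with a regular expression instead of a plain `split(",")` keeps arguments such as `caloriesWithin(10, 2000)` in one piece.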

score 0 · Answer

There are libraries and tools that make parsing easier. Among the better known ones are lex / yacc. There is a Python library called 'lex' and a tutorial on how to use it.

score 0 · Answer

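What is your motivation for a customized file structure? Would it be possible to remodel your data into a better-known structure like XML? If so, you could use one of the many existing parsers to read your file. Using an established parsing tool could save you a lot of debugging time, and it might make your file more readable as well.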
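
As one possible shape for this, the sketch below reads a made-up XML layout with the standard `xml.etree.ElementTree` module. The `<rules>`, `<group>`, and `<rule>` element names and the `args` attribute are assumptions for illustration only; `*` is stored as an attribute value because it is not a legal XML element name.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for a fragment of the rule file.
XML = """
<rules>
  <group name="chocolate">
    <rule name="sweet"/>
    <group name="lindt">
      <rule name="tasty"/>
      <group name="*">
        <rule name="smooth"/>
        <rule name="creamy" args="1"/>
      </group>
    </group>
  </group>
</rules>
"""


def collect(element, prefix=()):
    """Yield (dotted path, rule name, args) for every <rule> under element."""
    for child in element:
        if child.tag == "rule":
            yield ".".join(prefix), child.get("name"), child.get("args", "")
        elif child.tag == "group":
            yield from collect(child, prefix + (child.get("name"),))


rows = list(collect(ET.fromstring(XML.strip())))
```

The trade-off is verbosity: the XML version needs no custom parser but is noticeably longer than the indented original.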
