c++ - 有限ステートマシンパーサー

Question

C++ でFSM のようなパーサーを使用して、自己設計したファイル形式を解析したいと思います(これはteach-myself-c++-the-hard-way-by-doing-something-big-and-difficult一種のプロジェクトです:))。euh... 行の終わりを示す改行を含むトークン化された文字列があります。入力例はこちらをご覧ください。すべてのコメントは除外され、ジャンクは除外されるため、次のような std::string があります。

global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...

構文の説明:

{ } はスコープであり、大文字の単語はオプション/ファイルのリストが続くことを示します。
\n は、オプション/ファイルのリストでのみ重要であり、リストの終わりを示します。

そのため、FSM は私のニーズや知識に対して十分にシンプルで拡張性があると考えました。私が知る限り (そして私のファイル設計をそうあるようにしたい)、並行状態やそのような凝ったものは必要ありません。いくつかの設計/実装に関する質問:

enum状態に anまたは抽象class+ 導関数を使用する必要がありますか? 最初の構文はおそらく小さな構文には適していますが、後で醜くなる可能性があり、2 番目の構文は正反対です。その単純さのために、私は最初のものに傾いています。enum例とクラスの例。編集: に対するこの提案gotoはどうですか? C++ では悪だと思いましたか?
リストを読むとき、無視する必要はありません\n。stringviaを使用する私の好みの方法は、デフォルトstringstreamで無視\nされます。stringstreamしたがって、特定の状態が有効になっているときに改行を無視しないように（同じように）伝える簡単な方法が必要です。
enum複数レベルの解析 ( scopes 内のスコープ) には単純な状態で十分です{...{...}...}か、それともハッキーな実装が必要ですか?
私が考えているドラフト状態は次のとおりです。
- upper: グローバル、exe、lib+ ターゲット名を読み取ります...
- normal: スコープ内で、SOURCES を読み取ったり、ユーザー変数を作成したりできます...
- list: 改行に遭遇するまで項目をリストに追加します。

各スコープには一種の条件 (例: win32:global { gcc:CFLAGS = ... }) があり、どこでもまったく同じ方法で処理する必要があります (list状態であっても、アイテムごとに)。

ご意見ありがとうございます。

score 18 · Accepted Answer

ネストスコープがある場合、有限状態マシンは適切な方法ではないため、文脈自由文法パーサーを確認する必要があります。LL（1）パーサーは、再帰関数のセットとして記述できます。または、LALR（1）パーサーは、Bisonなどのパーサージェネレーターを使用して記述できます。

FSMにスタックを追加すると、プッシュダウンオートマトンの領域に入ります。非決定性プッシュダウンオートマトンは、文脈自由文法と同等です（ただし、決定性プッシュダウンオートマトンは厳密にはそれほど強力ではありません）。LALR（1）パーサージェネレーターは、実際には決定性プッシュダウンオートマトンを内部で生成します。優れたコンパイラ設計の教科書は、プッシュダウンオートマトンが文法から構築される正確なアルゴリズムをカバーしています。（このように、スタックを追加することは「ハッキー」ではありません。）このウィキペディアの記事では、文法からLR（1）プッシュダウンオートマトンを構築する方法についても説明していますが、IMOの記事はそれほど明確ではありません。

スコープが有限の深さでのみネストする場合（つまり、レベルとレベルはあるがupper、ネストされたsまたはネストされたsがない場合）、スタックなしでFSMを使用できます。normallistlistnormal

score 3 · Accepted Answer

There are two stages to analyzing a text input stream for parsing:

Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analagous to word in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decision about the lexical structure. From your data above, individal lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokesn (e.g. "{" "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.

Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.

So in response to your questions:

Should I use an enum or an abstract class + derivatives for my states?

I found that for small tokenizers, a matrix of state / transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, the next_state and current_state are some integer types (including possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of an token is identified based on the state identification of valid endstates with no valid transition available to another state given the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making thing more difficult than you need.

When reading a list, I need to NOT ignore \n.

You can either create a token called "\n", or a more generalize escape token (an identifier preceded by a backslash. If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the differnce between Unix and Windows line breaks, however; you could create a FSM that operates on either).

Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?

This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.

Here's the draft states I have in mind: ...

See my commments on lexical and grammatical analysis above.

score 1 · Accepted Answer

構文解析には、常に機能することが証明されているものを使用するようにしています。ANTLR with ANTLRWorksは、文法の設計とテストに非常に役立ちます。C/C++ (およびその他の言語) のコードを生成できますが、それらの言語の ANTLR ランタイムをビルドする必要があります。

もちろん、flexやbisonの方が使いやすいと思う場合は、それらも使用できます (これらが C と C++ のみを生成することは知っていますが、しばらく使用していなかったので間違っている可能性があります)。

c++ - 有限ステート マシン パーサー

3 に答える 3

Related

Reference

c++ - 有限ステートマシンパーサー