Parsing tokens into an Abstract Syntax Tree
OK, let's take your grammar
Dig ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Int ::= Dig | Dig Int
Var ::= a | b | ... z | A | B | C | ... | Z
Expr ::= Int | - Expr | + Expr Expr | * Expr Expr | Var | let Var = Expr in Expr
This is a nice easy grammar, because you can tell from the first token what sort of epression it will be.
(If there was something more complicated, like +
coming between numbers, or -
being used for subtraction as
well as negation, you'd need the list-of-successes trick, explained in
Functional Parsers.)
Let's have some sample raw input:
rawinput = "- 6 + 45 let x = - 5 in * x x"
Which I understand from the grammar represents "(- 6 (+ 45 (let x=-5 in (* x x))))"
,
and I'll assume you tokenised it as
tokenised_input' = ["-","6","+","4","5","let","x","=","-","5","in","*","x","x"]
which fits the grammar, but you might well have got
tokenised_input = ["-","6","+","45","let","x","=","-","5","in","*","x","x"]
which fits your sample AST
better. I think it's good practice to name your AST after bits of your grammar,
so I'm going to go ahead and replace
data AST = Leaf Int | Sum AST AST | Min AST | ...
with
data Expr = E_Int Int | E_Neg Expr | E_Sum Expr Expr | E_Prod Expr Expr | E_Var Char
| E_Let {letvar::Char,letequal:: Expr,letin::Expr}
deriving Show
I've named the bits of an E_Let
to make it clearer what they represent.
Writing a parsing function
You could use isDigit
by adding import Data.Char (isDigit)
to help out:
expr :: [String] -> (Expr,[String])
expr [] = error "unexpected end of input"
expr (s:ss) | all isDigit s = (E_Int (read s),ss)
| s == "-" = let (e,ss') = expr ss in (E_Neg e,ss')
| s == "+" = (E_Sum e e',ss'') where
(e,ss') = expr ss
(e',ss'') = expr ss'
-- more cases
Yikes! Too many let clauses obscuring the meaning,
and we'll be writing the same code for E_Prod
and very much worse for E_Let
.
Let's get this sorted out!
The standard way of dealing with this is to write some combinators;
instead of tiresomely threading the input [String]
s through our definition, define ways to
mess with the output of parsers (map) and combine
multiple parsers into one (lift).
Clean up the code 1: map
First we should define pmap
, our own equivalent of the map
function so we can do pmap E_Neg (expr1 ss)
instead of let (e,ss') = expr1 ss in (E_Neg e,ss')
pmap :: (a -> b) -> ([String] -> (a,[String])) -> ([String] -> (b,[String]))
nonono, I can't even read that! We need a type synonym:
type Parser a = [String] -> (a,[String])
pmap :: (a -> b) -> Parser a -> Parser b
pmap f p = \ss -> let (a,ss') = p ss
in (f a,ss')
But really this would be better if I did
data Parser a = Par [String] -> (a,[String])
so I could do
instance Functor Parser where
fmap f (Par p) = Par (pmap f p)
I'll leave that for you to figure out if you fancy.
Clean up the code 2: combining two parsers
We also need to deal with the situation when we have two parsers to run,
and we want to combine their results using a function. This is called lifting the function to parsers.
liftP2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
liftP2 f p1 p2 = \ss0 -> let
(a,ss1) = p1 ss0
(b,ss2) = p2 ss1
in (f a b,ss2)
or maybe even three parsers:
liftP3 :: (a -> b -> c -> d) -> Parser a -> Parser b -> Parser c -> Parser d
I'll let you think how to do that.
In the let statement you'll need liftP5
to parse the sections of a let statement,
lifting a function that ignores the "="
and "in"
. You could make
equals_ :: Parser ()
equals_ [] = error "equals_: expected = but got end of input"
equals_ ("=":ss) = ((),ss)
equals_ (s:ss) = error $ "equals_: expected = but got "++s
and a couple more to help out with this.
Actually, pmap
could also be called liftP1
, but map is the traditional name for that sort of thing.
Rewritten with the nice combinators
Now we're ready to clean up expr
:
expr :: [String] -> (Expr,[String])
expr [] = error "unexpected end of input"
expr (s:ss) | all isDigit s = (E_Int (read s),ss)
| s == "-" = pmap E_Neg expr ss
| s == "+" = liftP2 E_Sum expr expr ss
-- more cases
That'd all work fine. Really, it's OK. But liftP5
is going to be a bit long, and feels messy.
Taking the cleanup further - the ultra-nice Applicative way
Applicative Functors is the way to go.
Remember I suggested refactoring as
data Parser a = Par [String] -> (a,[String])
so you could make it an instance of Functor
? Perhaps you don't want to,
because all you've gained is a new name fmap
for the perfectly working pmap
and
you have to deal with all those Par
constructors cluttering up your code.
Perhaps this will make you reconsider, though; we can import Control.Applicative
,
then using the data
declaration, we can
define <*>
, which sort-of means then
and use <$>
instead of pmap
, with *>
meaning
<*>-but-forget-the-result-of-the-left-hand-side
so you would write
expr (s:ss) | s == "let" = E_Let <$> var *> equals_ <*> expr <*> in_ *> expr
Which looks a lot like your grammar definition, so it's easy to write code that works first time.
This is how I like to write Parsers. In fact, it's how I like to write an awful lot of things.
You'd only have to define fmap
, <*>
and pure
, all simple ones, and no long repetative liftP3
, liftP4
etc.
Read up about Applicative Functors. They're great.
Hints for making Parser applicative: pure
doesn't change the list.
<*>
is like liftP2
, but the function doesn't come from outside, it comes as the output from p1
.