This is intended as an introduction into parsing with parser combinators. I won’t cover more advanced techniques like memoization or lookahead, and the idea is that you’ll end up with a simple LL(1) recursive descent parser.
Parsing is an interesting topic, since it forms the basis of how programs interpret streams of characters and construct internal representations of them. Today I’m going to give a brief introduction to basic parsing theory and show you how to build a parser with parser combinators from scratch!
Parsing is when we take a grammar (a list of rules on how to interpret input) and try to match our input to the rules in the grammar. For instance, if I had the rule “match a”, then an input of “a” will match the rule and an input of “b” will fail the rule. We can implement this relatively easily:
def match_a(s): return s == "a" match_a("a") # ok match_a("b") # not ok!
Usually we want our result back, so:
def match_a(s): if s == "a": return s return None match_a("a") # returns "a" match_a("b") # returns None
However, our very first parser can only recognize the string “a”. That’s not very useful, mostly because we want to match things other than “a”. Suppose we wanted to extend this to match “ab”. We could implement a function called
match_ab and check if the input is “ab”, but we can reuse our “a” parser with a few modifications:
def match_a(s): if s == "a": # We put the result in a tuple because each individual parse result # is placed in a tuple, and it could be possible to parse input into # an empty result (which we'll see later with the use of Optional). return ("a",), s[1:] return None match_a("a") # returns (("a",), "") match_a("b") # returns None match_a("ab") # returns (("a",), "b")
match_a parser now checks if the first character is “a” — if it is, it returns a tuple containing the match and then the rest of the string; otherwise it just returns
None to indicate that parsing has failed. We can now implement a
match_b function in the same way we implemented
def match_b(s): if input == "b": return ("b",), s[1:] return None
And now we can chain our two parsers together:
def match_ab(s): fst = match_a(s) if fst is None: return None fst_match, s = fst snd = match_b(s) if snd is None: return None snd_match, s = snd return (fst_match, snd_match), s match_ab("ab") # hooray, returns ((("a",), ("b",)), "")
Whew! Although slightly verbose, we now have a very primitive way of chaining parsers together.
Right, okay, so what’s a parser combinator?
Parser combinators give us ways to chain parsers together. In the previous example, we’ve demonstrated that we can chain two parsers together sequentially relatively easily. We can generalize those patterns into something like this:
class Parser(object): def __add__(self, rhs): """Overloads the + operator for parser instances.""" return Concat(self, rhs) def __or__(self, rhs): """Overloads the | operator for parser instances.""" return Alternate(self, rhs) class Concat(Parser): """Represents the concatenation of two parsers, i.e parse using the first parser, then the second, then return both results.""" def __init__(self, fst, snd): self.fst = fst self.snd = snd def parse(self, s): fst = self.fst.parse(s) if fst is None: return None fst_match, s = fst snd = self.snd.parse(s) if snd is None: return None snd_match, s = snd return (fst_match, snd_match), s class Alternate(Parser): """Represents the alternation of two parsers, i.e. parse using the first parser and, if that doesn't match, parse with the second parser.""" def __init__(self, fst, snd): self.fst = fst self.snd = snd def parse(self, s): return self.fst.parse(s) or self.snd.parse(s) class Literal(Parser): """Parse a literal.""" def __init__(self, x): self.x = x def parse(self, s): l = len(self.x) if s[:l] != self.x: return None return (self.x,), s[l:] match_ab = Literal("a") + Literal("b") match_ab.parse("ab") # returns ((("a",), ("b",)), "") match_ab.parse("aa") # returns None, as expected
We’ve generalized our matching functions into composable parsers, which are joined using parser combinators —
Alternate! For instance, we can do:
match_abab = match_ab + match_ab match_abab.parse("abab") # returns (((("a",), ("b",)), (("a",), ("b",))), "") match_a_or_b = Literal("a") | Literal("b") match_a_or_b.parse("a") # returns (("a",), "") match_a_or_b.parse("b") # returns (("b",), "")
Wow, your examples are really contrived. How about a mandatory calculator example?
For a slightly more practical use of parser combinators, we can make a very simple calculator — one that supports
* with operator precedence. We can define a few parsers to start off with:
class OneOrMore(Parser): """Represents the Kleene star operation, i.e. attempt to parse one or more results using the given parser.""" def __init__(self, parser): self.parser = parser def parse(self, s): items =  while True: result = self.parser.parse(s) if result is None: return tuple(items), s result, s = result items.extend(result) return None class Optional(Parser): """Represents a parser that doesn't fail if there is no match.""" def __init__(self, parser): self.parser = parser def parse(self, s): result = self.parser.parse(s) if result is None: return (), s return result class Lambda(Parser): """Apply a function to parse results for processing.""" def __init__(self, parser, f): self.parser = parser self.f = f def parse(self, s): result = self.parser.parse(s) if result is None: return None result_match, s = result return self.f(result_match), s # Parse a digit. digit = Literal("0") | Literal("1") | Literal("2") | Literal("3") \ | Literal("4") | Literal("5") | Literal("6") | Literal("7") \ | Literal("8") | Literal("9") def _process_number(x): """Process a number. Our parser gives us back something silly that looks like (('-'), ('1', '2', '3')) for -123, so we just want to join that back up into -123, with the annotation NUMBER.""" neg, num = x return 'NUMBER', ''.join(neg + num) number = Lambda(Optional(Literal("-")) + OneOrMore(digit), _process_number) multiplicative = Literal("*") \ | Literal("/") additive = Literal("+") \ | Literal("-") def _process_expression(x): """Process an expression. Our parser gives an absolutely absurd looking creation for 1*1 (('NUMBER', '1'), (('*',), ('NUMBER', '1'))), so we want to turn this into reverse Polish notation to make things easier for us.""" # We want to eliminate the nesting we have, due to the fact that the Concat # parser adds a level of nesting. x = (x,) + x # If we don't have a right-hand side, just return the left-hand side. if len(x) == 1: return x # Extract our left-hand side, our operator and the rest of our expression. lhs, (op,), *tail = x tail.reverse() current = (op, lhs, tail.pop()) # We can reconstruct our equation in reverse Polish notation by popping off # the rest of the tail. while tail: op, = tail.pop() current = (op, current, tail.pop()) return current # Multiplicative expressions are parsed first, because they have a higher # precedence than additive ones. Why we use OneOrMore might not be immediately # obvious, but the idea is to associate all multiplicative operations together # at once. multiplicative_expression = Lambda(number + OneOrMore(multiplicative + number), _process_expression) # We do the same thing for additive expressions. expression = Lambda(multiplicative_expression + OneOrMore(additive + multiplicative_expression), _process_expression)
Holy crap, that’s a mouthful! Don’t worry, it doesn’t get too much more complicated after this. We can now use our parser to parse a simple expression:
expression.parse("1+1") # returns (('+', ('NUMBER', '1'), ('NUMBER', '1')), '')
We can now define an evaluation function relatively easily:
def evaluate(expr): tag, *tail = expr if tag == "NUMBER": v, = tail return int(v) lhs, rhs = tail if tag == "+": return evaluate(lhs) + evaluate(rhs) if tag == "-": return evaluate(lhs) - evaluate(rhs) if tag == "*": return evaluate(lhs) * evaluate(rhs) if tag == "/": return evaluate(lhs) / evaluate(rhs) expr, _ = expression.parse("1+2*3") evaluate(expr) # returns 7!
That pile of code was scary!
I’m sorry for spewing code out like that, but I hope you now understand how parsing works, and how to use parser combinators to do it. It doesn’t handle things like parenthesized expressions (bonus points if you know how to implement them) or whitespace, but I’m sure this is relatively well-trodden ground that you can explore yourself.
The parser given in the examples above is known as an LL(1) top-down parser, which is probably the most intuitive kind of parser there is — we go through the stream, matching rules from left-to-right until we succeed or eventually fail. However, it’s not capable of parsing more complex grammars that recurse on the left-hand side, or parse more ambiguous grammars, but I won’t go into that here.