rfw.name

Baby’s First Parser Combinators

Posted on

This is intended as an introduction into parsing with parser combinators. I won’t cover more advanced techniques like memoization or lookahead, and the idea is that you’ll end up with a simple LL(1) recursive descent parser.

Parsing is an interesting topic, since it forms the basis of how programs interpret streams of characters and construct internal representations of them. Today I’m going to give a brief introduction to basic parsing theory and show you how to build a parser with parser combinators from scratch!

What’s parsing?

Parsing is when we take a grammar (a list of rules on how to interpret input) and try to match our input to the rules in the grammar. For instance, if I had the rule “match a”, then an input of “a” will match the rule and an input of “b” will fail the rule. We can implement this relatively easily:

def match_a(s):
    return s == "a"

match_a("a")  # ok
match_a("b")  # not ok!

Usually we want our result back, so:

def match_a(s):
    if s == "a":
        return s
    return None

match_a("a")  # returns "a"
match_a("b")  # returns None

However, our very first parser can only recognize the string “a”. That’s not very useful, mostly because we want to match things other than “a”. Suppose we wanted to extend this to match “ab”. We could implement a function called match_ab and check if the input is “ab”, but we can reuse our “a” parser with a few modifications:

def match_a(s):
    if s[0] == "a":
        # We put the result in a tuple because each individual parse result
        # is placed in a tuple, and it could be possible to parse input into
        # an empty result (which we'll see later with the use of Optional).
        return ("a",), s[1:]
    return None

match_a("a")  # returns (("a",), "")
match_a("b")  # returns None
match_a("ab")  # returns (("a",), "b")

Our match_a parser now checks if the first character is “a” — if it is, it returns a tuple containing the match and then the rest of the string; otherwise it just returns None to indicate that parsing has failed. We can now implement a match_b function in the same way we implemented match_a:

def match_b(s):
    if input[0] == "b":
        return ("b",), s[1:]
    return None

And now we can chain our two parsers together:

def match_ab(s):
    fst = match_a(s)
    if fst is None:
        return None
    fst_match, s = fst

    snd = match_b(s)
    if snd is None:
        return None
    snd_match, s = snd

    return (fst_match, snd_match), s

match_ab("ab")  # hooray, returns ((("a",), ("b",)), "")

Whew! Although slightly verbose, we now have a very primitive way of chaining parsers together.

Right, okay, so what’s a parser combinator?

Parser combinators give us ways to chain parsers together. In the previous example, we’ve demonstrated that we can chain two parsers together sequentially relatively easily. We can generalize those patterns into something like this:

class Parser(object):
    def __add__(self, rhs):
        """Overloads the + operator for parser instances."""
        return Concat(self, rhs)

    def __or__(self, rhs):
        """Overloads the | operator for parser instances."""
        return Alternate(self, rhs)

class Concat(Parser):
    """Represents the concatenation of two parsers, i.e parse using the first
    parser, then the second, then return both results."""
    def __init__(self, fst, snd):
        self.fst = fst
        self.snd = snd

    def parse(self, s):
        fst = self.fst.parse(s)
        if fst is None:
            return None
        fst_match, s = fst

        snd = self.snd.parse(s)
        if snd is None:
            return None
        snd_match, s = snd

        return (fst_match, snd_match), s


class Alternate(Parser):
    """Represents the alternation of two parsers, i.e. parse using the first
    parser and, if that doesn't match, parse with the second parser."""
    def __init__(self, fst, snd):
        self.fst = fst
        self.snd = snd

    def parse(self, s):
        return self.fst.parse(s) or self.snd.parse(s)


class Literal(Parser):
    """Parse a literal."""
    def __init__(self, x):
        self.x = x

    def parse(self, s):
        l = len(self.x)
        if s[:l] != self.x:
            return None
        return (self.x,), s[l:]


match_ab = Literal("a") + Literal("b")
match_ab.parse("ab")  # returns ((("a",), ("b",)), "")
match_ab.parse("aa")  # returns None, as expected

We’ve generalized our matching functions into composable parsers, which are joined using parser combinatorsConcat and Alternate! For instance, we can do:

match_abab = match_ab + match_ab
match_abab.parse("abab")  # returns (((("a",), ("b",)), (("a",), ("b",))), "")

match_a_or_b = Literal("a") | Literal("b")
match_a_or_b.parse("a")  # returns (("a",), "")
match_a_or_b.parse("b")  # returns (("b",), "")

Wow, your examples are really contrived. How about a mandatory calculator example?

For a slightly more practical use of parser combinators, we can make a very simple calculator — one that supports +, -, / and * with operator precedence. We can define a few parsers to start off with:

class OneOrMore(Parser):
    """Represents the Kleene star operation, i.e. attempt to parse one or more
    results using the given parser."""
    def __init__(self, parser):
        self.parser = parser

    def parse(self, s):
        items = []

        while True:
            result = self.parser.parse(s)

            if result is None:
                return tuple(items), s

            result, s = result
            items.extend(result)

        return None


class Optional(Parser):
    """Represents a parser that doesn't fail if there is no match."""
    def __init__(self, parser):
        self.parser = parser

    def parse(self, s):
        result = self.parser.parse(s)
        if result is None:
            return (), s
        return result


class Lambda(Parser):
    """Apply a function to parse results for processing."""
    def __init__(self, parser, f):
        self.parser = parser
        self.f = f

    def parse(self, s):
        result = self.parser.parse(s)
        if result is None:
            return None
        result_match, s = result

        return self.f(result_match), s


# Parse a digit.
digit = Literal("0") | Literal("1") | Literal("2") | Literal("3") \
      | Literal("4") | Literal("5") | Literal("6") | Literal("7") \
      | Literal("8") | Literal("9")


def _process_number(x):
    """Process a number. Our parser gives us back something silly that looks
    like (('-'), ('1', '2', '3')) for -123, so we just want to join that
    back up into -123, with the annotation NUMBER."""
    neg, num = x
    return 'NUMBER', ''.join(neg + num)

number = Lambda(Optional(Literal("-")) + OneOrMore(digit), _process_number)

multiplicative = Literal("*") \
               | Literal("/")

additive = Literal("+") \
         | Literal("-")


def _process_expression(x):
    """Process an expression. Our parser gives an absolutely absurd looking
    creation for 1*1 (('NUMBER', '1'), (('*',), ('NUMBER', '1'))), so we
    want to turn this into reverse Polish notation to make things easier for
    us."""

    # We want to eliminate the nesting we have, due to the fact that the Concat
    # parser adds a level of nesting.
    x = (x[0],) + x[1]

    # If we don't have a right-hand side, just return the left-hand side.
    if len(x) == 1:
        return x[0]

    # Extract our left-hand side, our operator and the rest of our expression.
    lhs, (op,), *tail = x

    tail.reverse()

    current = (op, lhs, tail.pop())

    # We can reconstruct our equation in reverse Polish notation by popping off
    # the rest of the tail.
    while tail:
        op, = tail.pop()
        current = (op, current, tail.pop())

    return current

# Multiplicative expressions are parsed first, because they have a higher
# precedence than additive ones. Why we use OneOrMore might not be immediately
# obvious, but the idea is to associate all multiplicative operations together
# at once.
multiplicative_expression = Lambda(number + OneOrMore(multiplicative + number),
                                   _process_expression)

# We do the same thing for additive expressions.
expression = Lambda(multiplicative_expression + OneOrMore(additive + multiplicative_expression),
                   _process_expression)

Holy crap, that’s a mouthful! Don’t worry, it doesn’t get too much more complicated after this. We can now use our parser to parse a simple expression:

expression.parse("1+1")  # returns (('+', ('NUMBER', '1'), ('NUMBER', '1')), '')

We can now define an evaluation function relatively easily:

def evaluate(expr):
    tag, *tail = expr

    if tag == "NUMBER":
        v, = tail
        return int(v)

    lhs, rhs = tail

    if tag == "+":
        return evaluate(lhs) + evaluate(rhs)

    if tag == "-":
        return evaluate(lhs) - evaluate(rhs)

    if tag == "*":
        return evaluate(lhs) * evaluate(rhs)

    if tag == "/":
        return evaluate(lhs) / evaluate(rhs)

expr, _ = expression.parse("1+2*3")
evaluate(expr)  # returns 7!

That pile of code was scary!

I’m sorry for spewing code out like that, but I hope you now understand how parsing works, and how to use parser combinators to do it. It doesn’t handle things like parenthesized expressions (bonus points if you know how to implement them) or whitespace, but I’m sure this is relatively well-trodden ground that you can explore yourself.

The parser given in the examples above is known as an LL(1) top-down parser, which is probably the most intuitive kind of parser there is — we go through the stream, matching rules from left-to-right until we succeed or eventually fail. However, it’s not capable of parsing more complex grammars that recurse on the left-hand side, or parse more ambiguous grammars, but I won’t go into that here.

comments powered by Disqus