Wednesday, July 24, 2013

Operator parsing

Interested in simple, expressive, and efficient parsing algorithms? Ever heard of "Pratt", "operator precedence", or "top-down operator" parsing? It's a great technique, whose many advantages are described in Pratt's original paper as well as in excellent articles by more recent authors, including Douglas Crockford of JavaScript fame. The chief advantages are:

  • simplified description of the operator expression language portions
  • simplified implementation of parser for operator expressions
  • user extensibility of operator grammar

Despite its advantages, learning how the method works was very difficult for me. Although the papers were great resources, I found their explanations quite hard to grok. In particular, the original paper was confusing both in terminology and in explanation, while Crockford's implementation made heavy use of global variables, mutable state, and inheritance, and both lacked specific examples illustrating the key concepts and corner cases. They also used the same approach for parsing syntactic constructs that would not normally be considered as operators, which, although effective, did not help me to get "it".

In learning the method, I ended up using my own take on the algorithm and writing my own implementation in order to run examples. I ended up with something similar to work by Annika Aasa. Also, operator precedence climbing seems to be the same or similar.

With the help of this algorithm, I'd like to demonstrate:

  • what operator expressions are
  • what the problems with specifying and parsing operator expressions are
  • the precedence/associativity solution to describing operator expressions
  • the model used by my parser to parse operator expressions
  • lots and lots of examples of the parser in action to point out corner cases and confusing scenarios

What are operator expressions?

Operators can be seen as a special case of functions, being called with a different syntax, argument order, and operator and operand positions. The main advantage of operator expressions is that they don't have to be fully parenthesized, making them more compact and (arguably) more convenient to read and write. Compare:

1 + !2 * -8 * ~3 ^ x ++

(1 + (((!2) * (-8)) * ((~3) ^ (x++))))

(+ 1 
   (* (! 2) 
      (* (- 8)
         (^ (~ 3)
            (post-++ x)))))
The first version is much shorter; is it easier to read as well?

There are four main types of operators that I'm going to focus on:

prefix:   !x             =>  (! x)
          not not x      =>  (not (not x))

infix:    x + 3          =>  (x + 3)
          a * b + c      =>  ((a * b) + c)

postfix:  y ++           =>  (y ++)

mixfix:   a ? b : c      =>  (a ? b : c)
          x if y else z  =>  (x if y else z)

Problems with specifying and parsing operator expressions

It's easy to write context-free grammars that describe operator expressions -- a common example for specifying simple mathematical expressions is something like:

E  :=  E  '-'  E  |
       E  '*'  E  |
       E  '^'  E  |
       int
Unfortunately, the grammar is ambiguous; even simple expressions could be parsed multiple ways:
3 - 2 - 1         =>   (3 - (2 - 1))       or       ((3 - 2) - 1)  ??

4 - 5 * 6         =>   (4 - (5 * 6))       or       ((4 - 5) * 6)  ??

To express operator precedence, we can create a rule for each precedence level in the grammar:

E       :=  Factor  ( '-'  Factor )(*)

Factor  :=  Term  ( '*'  Term )(*)

Term    :=  int  ( '^'  int )(*)
It should be clear that this is a tedious and ugly solution. Plus, extending it to also deal with prefix, postfix, and mixfix operators further complicates things.
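To make the tedium concrete, here's a minimal recursive-descent sketch of the layered grammar above in Python (the helper names are my own; `parse_level` folds to the left, as the repetition in each rule implies). Note that every new precedence level costs another function:

```python
# One function per precedence level -- exactly the tedium described above.
# Adding a new precedence level means adding and threading through a new
# function.

def parse(tokens):
    tree, rest = parse_e(tokens)
    assert not rest, f"unconsumed input: {rest}"
    return tree

def parse_e(tokens):       # E      :=  Factor ('-' Factor)(*)
    return parse_level(tokens, parse_factor, '-')

def parse_factor(tokens):  # Factor :=  Term ('*' Term)(*)
    return parse_level(tokens, parse_term, '*')

def parse_term(tokens):    # Term   :=  int ('^' int)(*)
    return parse_level(tokens, parse_int, '^')

def parse_int(tokens):
    return int(tokens[0]), tokens[1:]

def parse_level(tokens, sub, op):
    left, rest = sub(tokens)
    while rest and rest[0] == op:
        right, rest = sub(rest[1:])
        left = (op, left, right)   # the repetition folds to the left
    return left, rest

print(parse(['4', '-', '5', '*', '6']))   # ('-', 4, ('*', 5, 6))
```

Notice how `4 - 5 * 6` comes out as `(4 - (5 * 6))` only because `*` was buried a level deeper in the grammar, not because anyone wrote down "precedence" anywhere.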

One last point to consider is whether the set of operators and their precedences are set in stone. With the above schemes, in which the grammar is fixed long before any actual parsing is done, the user will not be able to define new operators and their precedences.

Specification solution: precedence and associativity

Fortunately, there are better ways to specify operator grammars than using context-free grammars. One such way is to specify precedence and associativity of each operator. Then, when the parser needs to decide which operator an operand belongs to, it consults the precedence and associativity tables.

For an example, let's revisit the earlier problem of parsing `4 - 5 * 6`. The crux of the association problem is whether the `5` belongs with the `-` or with the `*`. As we know from math, the `*` has higher precedence, and so the expression is parsed as `4 - (5 * 6)`. That's precedence.

Using associativity, we can resolve the other example -- `3 - 2 - 1`. In this example, the association problem is whether the `2` belongs with the first or second `-`. Obviously, `-` has the same precedence as itself, so precedence won't help us here. However, infix `-` is left-associative, which means that `2` associates to the left, for a final result of `((3 - 2) - 1)`.

Precedences must also be specified for the other operator types -- prefix, postfix, and mixfix -- since each of these may participate in an association problem. The above two examples are infix/infix problems; here are the additional possible association problems:

prefix/infix:      !x + 3               =>    !(x + 3)               or    (!x) + 3

infix/postfix:     a * b--              =>    (a * b)--              or    a * (b--)

prefix/postfix:    - c ++               =>    - (c ++)               or    (- c) ++

mixfix/mixfix:     a ? b : c ? d : e    =>    (a ? b : c) ? d : e    or    a ? b : (c ? d : e)

prefix/mixfix:     ! a ? b : c          =>    (! a) ? b : c          or    !(a ? b : c)

mixfix/postfix:    a ? b : c ++         =>    (a ? b : c) ++         or    a ? b : (c ++)

infix/mixfix:      w + x ? y : z        =>    w + (x ? y : z)        or    (w + x) ? y : z

mixfix/infix:      n ? o : p + q        =>    (n ? o : p) + q        or    n ? o : (p + q)
Note that postfix and prefix operators can't be the first or second operator, respectively, in an association problem for obvious reasons.

As the above examples demonstrated, precedence handles cases between operators with different binding priorities, while associativity handles cases between operators with equal priorities. Associativity can be right or left:

left associativity:    a + b + c    =>    ((a + b) + c)

right associativity:   d = e = f    =>    (d = (e = f))
Now what happens if we mix operators of equal precedence but opposite associativity? In the following examples, assume `+` and `=` have equal precedences but are left- and right-associative, respectively:
operator +  precedence 50, left-associative
operator =  precedence 50, right-associative

a = b + c    =>    (a = b) + c   violates associativity of `=`
             =>    a = (b + c)   violates associativity of `+`

d + e = f    =>    (d + e) = f   violates associativity of `=`
             =>    d + (e = f)   violates associativity of `+`
There's no good way to parse mixed-associativity expressions; systems can arbitrarily choose a parse, or report an error. My system chooses to report an error.
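The whole decision procedure fits in a small function. Here's a hypothetical sketch in Python -- the table entries are illustrative, not a real language's tables:

```python
# Hypothetical operator tables; the numbers are illustrative.
PRECEDENCE = {'+': 50, '=': 50, '*': 90}
ASSOCIATIVITY = {'+': 'left', '=': 'right', '*': 'left'}

def operand_goes_left(left_op, right_op):
    """Given `a LEFT_OP b RIGHT_OP c`, decide whether `b` associates
    with the left operator; report an error on mixed associativity."""
    lp, rp = PRECEDENCE[left_op], PRECEDENCE[right_op]
    if lp != rp:
        return lp > rp                       # precedence decides
    la, ra = ASSOCIATIVITY[left_op], ASSOCIATIVITY[right_op]
    if la == ra == 'left':
        return True                          # associativity decides
    if la == ra == 'right':
        return False
    raise ValueError(f"equal precedence, mixed associativity: "
                     f"{left_op!r} vs {right_op!r}")

print(operand_goes_left('+', '*'))   # False: b belongs with `*`
print(operand_goes_left('+', '+'))   # True:  left-associative
```

Mixing `+` and `=`, as in the examples above, raises the error rather than arbitrarily choosing a parse.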

The most common way to define precedence is by assigning a number to each operator; the higher the number, the higher the precedence of the operator (or sometimes vice versa, confusingly). See Java and Python examples. Another approach is to define relative precedences between some (but not necessarily all) operator pairs, so that a total precedence order need not be defined.

Associativity is defined for each infix and mixfix operator, either as left-, right-, or non-associative. Prefix and postfix operators do not need associativities because they are always right- and left-associative, respectively. As can be seen in these examples, there is only one way to parse these operator expressions:

! ! ! ! x   =>   ! (! (! (! x)))

x ++ ++ ++  =>   ((x ++) ++) ++
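Since their nesting is forced, stacked prefix and postfix operators are each just a fold -- a sketch, using tuples to play the role of the parenthesized trees above:

```python
from functools import reduce

# Prefix operators always nest to the right, postfix operators to the
# left, so no associativity table entry is needed for either.

def apply_prefixes(ops, operand):
    # ['!', '~'], 'x'  ->  (! (~ x)): first operator ends up outermost
    return reduce(lambda tree, op: (op, tree), reversed(ops), operand)

def apply_postfixes(operand, ops):
    # 'x', ['++', '--']  ->  ((x ++) --): last operator ends up outermost
    return reduce(lambda tree, op: (op, tree), ops, operand)

print(apply_prefixes(['!', '~'], 'x'))    # ('!', ('~', 'x'))
print(apply_postfixes('x', ['++', '--'])) # ('--', ('++', 'x'))
```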

A parsing algorithm

Now I'd like to demonstrate a parsing algorithm that works for prefix, infix, mixfix, and postfix operators of arbitrary precedence and associativity.

First, I need to give a rough definition of the expression language we'll parse. Even though, as I mentioned before, BNF-like grammars are a poor means for defining operator expressions, I'll use one informally just to give a succinct outline of what we'll be dealing with:

Expr        ::   PrePostfix  |
                 Infix       |
                 Mixfix

Infix       ::   Expr  InfixOp  Expr

Mixfix      ::   Expr  MixfixOp1  Expr  MixfixOp2  Expr

PrePostfix  ::   PrefixOp(*)  Operand  PostfixOp(*)

Operand     ::   Number  |  Variable
(Of course, this grammar is highly ambiguous, whereas our parser will be unambiguous.) Basically, this grammar says that we can do stuff like:
infix:        x * 123

add prefix
operators:    ! ~ x * 123

add postfix
operators:    ! ~ x ++ -- * 123

add infix
operator:     ! ~ x ++ -- * 123 & y

add mixfix
operator:     ! ~ x ++ -- * 123 & y ? 4 : p
We can have as many prefix and postfix operators as we want surrounding any operand, and infix and mixfix operator expressions recursively use the `Expr` rule.

The basic parsing algorithm needs five rules which must be repeatedly applied in order to parse the operator expressions: one for each type of operator, as well as one for finding operands. Since we're working left-to-right, there's a natural order in which the rules must be applied:

  1. find prefix operators
  2. find the operand
  3. find postfix operators
  4. find an infix operator, if possible, or ...
  5. find the first part of a mixfix operator, an expression, and the second part, if possible
Applying these rules looks like this:
start:         ! ~ x ++ -- * 123 & y ? 4 : p

find prefix
  operators:       x ++ -- * 123 & y ? 4 : p     

find operand:        ++ -- * 123 & y ? 4 : p

find postfix
  operators:               * 123 & y ? 4 : p

find infix
  or mixfix
  operator:                  123 & y ? 4 : p
Now, if we found just an operand and postfix operators, we have a complete expression:
x ++   <-- complete expression
However, if we found any prefix operators, we may not yet have found the complete operand; using Python's rules:
not x      parse prefix operator 'not'
x          parse operand variable 'x'
-- done -- complete expression

not 3 + 2 + 1    parse prefix operator 'not'
3 + 2 + 1        parse operand number '3'
+ 2 + 1          parse infix operator '+'
2 + 1            tokens '2', '+', and '1' remaining
-- not done -- `not` has lower precedence than `+`, so we haven't yet found `not`'s operand
If we found an infix or mixfix operator, we have found the left-hand operand, but we definitely haven't found the right-side operand(s):
3 + 4 * 5    parse operand number '3'
+ 4 * 5      parse infix operator '+'
4 * 5
-- not done -- we have not yet found `+`'s right-hand operand

Introducing the Stack

To store the operators whose arguments we have not yet found, we'll use a stack. Each layer of the stack records an operator's name, precedence, associativity, and any arguments already found. Of course, since it's a stack, layers are only added and removed at one end, representing where the parser is in the expression. Here's a prefix example:

tokens   stack                                  scratch space   action
! ~ x    []                                                     consume prefix operator
~ x      []                                     !               prefix operator, so push
~ x      [(!, 110, right)]                                      consume prefix operator
x        [(!, 110, right)]                      ~               prefix operator, so push
x        [(!, 110, right), (~, 110, right)]
We can see that as each prefix operator is consumed, an entry is pushed on to the top of the stack. Notice that the associativity is 'right' for prefix operators, and that we haven't found any operands for them yet.

Now let's see what happens when we continue the above example. When we do find the operands, we can start popping stack entries. If we've consumed the entire input, we pop every stack level. Meanwhile, we maintain a scratch value: the operand that we pass to the next-highest stack frame, after which that frame is popped and the process repeats:

tokens   stack                                  scratch space   action
x        [(!, 110, right), (~, 110, right)]                     consume operand
         [(!, 110, right), (~, 110, right)]     x               tokens empty, so pop
         [(!, 110, right)]                      (~ x)           tokens empty, so pop
         []                                     (! (~ x))       tokens and stack empty, so done

An infix example

The algorithm works similarly for infix and mixfix operators, except that when an entry is pushed onto the stack for such an operator, the left-hand argument has already been found. Let's work through an example:

tokens          stack                                  scratch space         action
a + b * 3 - 4   []                                                           consume operand and infix operator
b * 3 - 4       []                                     +, a                  stack is empty, so push
b * 3 - 4       [(+, 80, left, a)]                                           consume operand and infix operator
3 - 4           [(+, 80, left, a)]                     *, b                  * > +, so push
3 - 4           [(+, 80, left, a), (*, 90, left, b)]                         consume operand and infix operator
4               [(+, 80, left, a), (*, 90, left, b)]   -, 3                  - < *, so pop
4               [(+, 80, left, a)]                     -, (* b 3)            - == +, but - is left-associative, so pop
4               []                                     -, (+ a (* b 3))      stack is empty, so push
4               [(-, 80, left, (+ a (* b 3)))]                               consume operand
                [(-, 80, left, (+ a (* b 3)))]         4                     input is empty, so pop
                []                                     (- (+ a (* b 3)) 4)   input and stack empty, so done
The association problem is continually decided based on operator precedences and associativities, and is implemented through pushes and pops to the pending operator stack.

A postfix example

Postfix operators do not require stack pushes, but may require pops: since their operands are always to the left, no further parsing is needed to find them. Here's a small example; assume the precedences of `+`, `--`, and `@` are 80, 120, and 50, respectively:

tokens       stack                scratch space        action
x + y -- @   []                                        consume operand and infix operator
y -- @       []                   +, x                 stack empty, so push
y -- @       [(+, 80, left, x)]                        consume operand and postfix operator
@            [(+, 80, left, x)]   --, y                -- > +, so apply -- to y
@            [(+, 80, left, x)]   (-- y)               consume postfix operator
             [(+, 80, left, x)]   @, (-- y)            @ < +, so pop
             []                   @, (+ x (-- y))      stack empty, so apply @ to argument
             []                   (@ (+ x (-- y)))     stack and input empty, so done

To sum up how the algorithm works:

  • use a stack to represent operators that we're not done parsing yet
  • relative precedence and associativity of the current operator and the topmost operator on the stack tell us whether we need to push or pop
  • use a scratch space to temporarily hold an operator and operand, if necessary
  • prefix operators always push a stack frame
  • postfix operators may pop stack frames
  • infix and mixfix operators may pop 0 or more frames, followed by pushing a single new frame
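The summary above can be turned into a compact sketch. This is my own illustrative Python rendition, not the article's actual implementation: mixfix is omitted for brevity, and the operator tables are made up (note `-` appears as both prefix and infix, and `@` is the odd postfix operator from the example above):

```python
# A sketch of the stack algorithm for prefix, infix, and postfix
# operators. Malformed input is not handled gracefully.

PREFIX  = {'!': 110, '~': 110, '-': 110}           # op -> precedence
INFIX   = {'+': (80, 'left'), '-': (80, 'left'),
           '*': (90, 'left'), '=': (40, 'right')}  # op -> (prec, assoc)
POSTFIX = {'++': 120, '--': 120, '@': 50}          # op -> precedence

def parse(tokens):
    stack = []        # pending operators: (op, prec, assoc, *found_args)
    scratch = None    # the most recently completed operand

    def pop_tighter(prec, assoc):
        # Pop frames that bind at least as tightly as the incoming operator.
        nonlocal scratch
        while stack and (stack[-1][1] > prec or
                         (stack[-1][1] == prec and assoc == 'left')):
            op, _, _, *args = stack.pop()
            scratch = (op, *args, scratch)

    pos = 0
    while pos < len(tokens):
        while tokens[pos] in PREFIX:                    # 1. prefix operators
            stack.append((tokens[pos], PREFIX[tokens[pos]], 'right'))
            pos += 1
        scratch = tokens[pos]; pos += 1                 # 2. the operand
        while pos < len(tokens) and tokens[pos] in POSTFIX:
            pop_tighter(POSTFIX[tokens[pos]], 'left')   # 3. postfix operators
            scratch = (tokens[pos], scratch); pos += 1
        if pos < len(tokens):                           # 4. infix operator
            op = tokens[pos]; pos += 1
            prec, assoc = INFIX[op]
            pop_tighter(prec, assoc)
            stack.append((op, prec, assoc, scratch))
    while stack:                                        # input empty: pop the rest
        op, _, _, *args = stack.pop()
        scratch = (op, *args, scratch)
    return scratch

print(parse(['a', '+', 'b', '*', '3', '-', '4']))
# ('-', ('+', 'a', ('*', 'b', '3')), '4')
```

Running it on the worked examples above reproduces their traces: `d = e = f` nests to the right, `x + y -- @` gives `(@ (+ x (-- y)))`, and `140 - - 26` gives `(- 140 (- 26))`.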

Disambiguating ambiguities

It's great to be able to use `-` and `+` both as infix, binary operators and as unary prefix ones. Can this algorithm deal with those cases? Easily! Let's work through a short example:

tokens       stack                                   scratch space    action
140 - - 26   []                                                       consume operand and infix operator
- 26         []                                      -, 140           stack empty, so push
- 26         [(-, 80, left, 140)]                                     consume prefix operator
26           [(-, 80, left, 140)]                    -                prefix operator, so push
26           [(-, 80, left, 140), (-, 110, right)]                    consume operand
             [(-, 80, left, 140), (-, 110, right)]   26               input empty, so pop
             [(-, 80, left, 140)]                    (- 26)           input empty, so pop
             []                                      (- 140 (- 26))   input and stack empty, so done
The algorithm handles the problem by reading the two `-` tokens in two different contexts: when it reads the first one, the algorithm is expecting an infix operator; once an infix operator is found, the algorithm looks for an operand. Since an operand can have prefix operators, it next looks for those.
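The two-context trick can be isolated into a few lines. In this hypothetical sketch, the parser tracks only whether it currently expects an operand, and that single bit classifies each `-`:

```python
# Whether a token like '-' is prefix or infix depends entirely on what
# the parser expects next (hypothetical operator sets).
PREFIX_OPS = {'-'}
INFIX_OPS = {'-', '+'}

def classify(tokens):
    expecting_operand = True
    roles = []
    for t in tokens:
        if expecting_operand and t in PREFIX_OPS:
            roles.append((t, 'prefix'))    # still expecting an operand
        elif not expecting_operand and t in INFIX_OPS:
            roles.append((t, 'infix'))
            expecting_operand = True
        else:
            roles.append((t, 'operand'))
            expecting_operand = False
    return roles

print(classify(['140', '-', '-', '26']))
# [('140', 'operand'), ('-', 'infix'), ('-', 'prefix'), ('26', 'operand')]
```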

Operators can also be used in both prefix and postfix contexts, or prefix and mixfix, without confusing the parser. However, using an operator as both postfix and infix/mixfix *will* screw things up, since the parser will no longer know when to stop parsing postfix operators and switch over to infix/mixfix.

User extensibility

One of the major advantages of operator parsing is its extensibility: to create a brand new operator, you just need to add an entry to your operator tables giving its precedence and associativity (if necessary). This allows users to define their own operators. For an example, check out the Haskell programming language, which allows user-defined operators as described here.
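With table-driven parsing, "defining" an operator is just a table update -- no grammar regeneration required. A hypothetical sketch (the table format mirrors the examples in this article; `<+>` is a made-up operator):

```python
# A user-extensible infix operator table: name -> (precedence, assoc).
infix_table = {'+': (80, 'left'), '*': (90, 'left')}

def define_infix(name, precedence, associativity):
    # What a fixity declaration boils down to in a table-driven parser.
    if associativity not in ('left', 'right', 'non'):
        raise ValueError(f"bad associativity: {associativity!r}")
    infix_table[name] = (precedence, associativity)

# Roughly the effect of Haskell's `infixr 5 <+>`:
define_infix('<+>', 5, 'right')
print(infix_table['<+>'])   # (5, 'right')
```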

Limitations

For simplicity, I intentionally chose an algorithm that is more limiting than the ones used by Pratt and Crockford. The problem is caused by the forced classification of operators into prefix, postfix, infix, or mixfix. If you have an operator that doesn't fit into one of those categories exactly, you will have to extend the algorithm to deal with it. An example is Python's `lambda` operator, since it requires a parameter list after the keyword and before the `:`, but is otherwise similar to a prefix operator.

In the Pratt & Crockford method, the tokens themselves are responsible for deciding what parse rules to use. This allows any grammatical rule to be parsed as an operator expression, but again, I find it more difficult to comprehend and demonstrate, precisely because it's so flexible.

Wrap up

Well, this has been quite a long article! I hope you enjoyed reading it as much as I enjoyed writing it!

If you'd like to see the parser in action, you can find my implementation of the algorithm on github here. In addition to the actual code, there's a decent suite of tests to ensure that tricky corner cases are handled correctly. Might be worth a look!

Lastly, I'd like to mention again that this was all inspired by the works of Pratt and Crockford, which are excellent and definitely worth reading.

3 comments:

  1. I'd got to a similar stage with Douglas Crockford's code, my JavaScript isn't good enough to "see" the algorithm in his code. You might be interested in Eli Bendersky's website. He also illustrates top down parsing using Python http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/

    Thanks for taking the time to write a really nice article.

    1. Hi Mike! Thanks for your comment, I'm glad you enjoyed it (Sorry I took so long to reply).

      You're right about Eli Bendersky's blog -- it's a great resource. Thanks for pointing that out!
