Wednesday, July 24, 2013

Operator parsing

Interested in simple, expressive, and efficient parsing algorithms? Ever heard of "Pratt", "operator precedence" or "top-down operator" parsing? It's a great technique, whose many advantages are described in Pratt's original paper as well as in excellent articles from more recent authors, including Douglas Crockford of Javascript fame. The chief advantages are:

  • simplified description of the operator expression language portions
  • simplified implementation of parser for operator expressions
  • user extensibility of operator grammar

Despite these advantages, I found the method very difficult to learn. Although the papers were great resources, their explanations were quite hard to grok. In particular, the original paper was confusing both in terminology and in explanation, while Crockford's implementation made heavy use of global variables, mutable state, and inheritance, and both lacked specific examples illustrating the key concepts and corner cases. They also used the same approach for parsing syntactic constructs that would not normally be considered operators, which, although effective, did not help me to get "it".

In learning the method, I ended up taking my own approach to the algorithm and writing my own implementation in order to run examples. What I ended up with is similar to work by Annika Aasa; operator precedence climbing also appears to be the same technique, or at least a similar one.

With the help of this algorithm, I'd like to demonstrate:

  • what operator expressions are
  • what the problems with specifying and parsing operator expressions are
  • the precedence/associativity solution to describing operator expressions
  • the model used by my parser to parse operator expressions
  • lots and lots of examples of the parser in action to point out corner cases and confusing scenarios

What are operator expressions?

Operators can be seen as a special case of functions, called with a different syntax and with different positions for the operator and its operands. The main advantage of operator expressions is that they don't have to be fully parenthesized, making them more compact and (arguably) more convenient to read and write. Compare:

1 + !2 * -8 * ~3 ^ x ++

(1 + (((!2) * (-8)) * ((~3) ^ (x++))))

(+ 1
   (* (* (! 2)
         (- 8))
      (^ (~ 3)
         (post-++ x))))
The first version is much shorter; is it easier to read as well?

There are four main types of operators that I'm going to focus on:

prefix:   !x             =>  (! x)
          not not x      =>  (not (not x))

infix:    x + 3          =>  (x + 3)
          a * b + c      =>  ((a * b) + c)

postfix:  y ++           =>  (y ++)

mixfix:   a ? b : c      =>  (a ? b : c)
          x if y else z  =>  (x if y else z)

Problems with specifying and parsing operator expressions

It's easy to write context-free grammars that describe operator expressions -- a common example for specifying simple mathematical expressions is something like:

E  :=  E  '-'  E  |
       E  '*'  E  |
       E  '^'  E  |
       int
Unfortunately, the grammar is ambiguous; even simple expressions could be parsed multiple ways:
3 - 2 - 1         =>   (3 - (2 - 1))       or       ((3 - 2) - 1)  ??

4 - 5 * 6         =>   (4 - (5 * 6))       or       ((4 - 5) * 6)  ??

To express operator precedence, we can create a rule for each precedence level in the grammar:

E       :=  Factor  ( '-'  Factor )(*)

Factor  :=  Term  ( '*'  Term )(*)

Term    :=  int  ( '^'  int )(*)
It should be clear that this is a tedious and ugly solution. Plus, extending it to also deal with prefix, postfix, and mixfix operators further complicates things.
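
To make the tedium concrete, here is a rough sketch of what a recursive-descent parser for the stratified grammar above might look like (in Python, over a pre-tokenized list); each precedence level gets its own function, so every new level or operator type means another hand-written function:

def parse_expr(tokens, i):                     # E := Factor ('-' Factor)*
    tree, i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] == '-':
        right, i = parse_factor(tokens, i + 1)
        tree = ('-', tree, right)
    return tree, i

def parse_factor(tokens, i):                   # Factor := Term ('*' Term)*
    tree, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == '*':
        right, i = parse_term(tokens, i + 1)
        tree = ('*', tree, right)
    return tree, i

def parse_term(tokens, i):                     # Term := int ('^' int)*
    tree, i = int(tokens[i]), i + 1
    while i < len(tokens) and tokens[i] == '^':
        tree, i = ('^', tree, int(tokens[i + 1])), i + 2
    return tree, i

# parse_expr(['4', '-', '5', '*', '6'], 0)  ->  (('-', 4, ('*', 5, 6)), 5)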

One last point to consider is whether the set of operators and their precedences are set in stone. With the above schemes, in which the grammar is fixed long before any actual parsing is done, the user will not be able to define new operators and their precedences.

Specification solution: precedence and associativity

Fortunately, there are better ways to specify operator grammars than using context-free grammars. One such way is to specify precedence and associativity of each operator. Then, when the parser needs to decide which operator an operand belongs to, it consults the precedence and associativity tables.

For an example, let's revisit the earlier problem of parsing `4 - 5 * 6`. The crux of the association problem is whether the `5` belongs with the `-` or with the `*`. As we know from math, the `*` has higher precedence, and so the expression is parsed as `4 - (5 * 6)`. That's precedence.

Using associativity, we can resolve the other example -- `3 - 2 - 1`. In this example, the association problem is whether the `2` belongs with the first or second `-`. Obviously, `-` has the same precedence as itself, so precedence won't help us here. However, infix `-` is left-associative, which means that `2` associates to the left, for a final result of `((3 - 2) - 1)`.

Precedences must also be specified for the other operator types -- prefix, postfix, and mixfix -- since each of these may participate in an association problem. The above two examples are infix/infix problems; here are the additional possible association problems:

prefix/infix:      !x + 3               =>    !(x + 3)               or    (!x) + 3

infix/postfix:     a * b--              =>    (a * b)--              or    a * (b--)

prefix/postfix:    - c ++               =>    - (c ++)               or    (- c) ++

mixfix/mixfix:     a ? b : c ? d : e    =>    (a ? b : c) ? d : e    or    a ? b : (c ? d : e)

prefix/mixfix:     ! a ? b : c          =>    (! a) ? b : c          or    !(a ? b : c)

mixfix/postfix:    a ? b : c ++         =>    (a ? b : c) ++         or    a ? b : (c ++)

infix/mixfix:      w + x ? y : z        =>    w + (x ? y : z)        or    (w + x) ? y : z

mixfix/infix:      n ? o : p + q        =>    (n ? o : p) + q        or    n ? o : (p + q)
Note that postfix and prefix operators can't be the first or second operator, respectively, in an association problem for obvious reasons.

As the above examples demonstrated, precedence handles cases between operators with different binding priorities, while associativity handles cases between operators with equal priorities. Associativity can be right or left:

left associativity:    a + b + c    =>    ((a + b) + c)

right associativity:   d = e = f    =>    (d = (e = f))
Now what happens if we mix operators of equal precedence but opposite associativity? In the following examples, assume `+` and `=` have equal precedences but are left- and right- associative, respectively:
operator +  precedence 50, left-associative
operator =  precedence 50, right-associative

a = b + c    =>    (a = b) + c   violates associativity of `=`
                   a = (b + c)   violates associativity of `+`

d + e = f    =>    (d + e) = f   violates associativity of `=`
                   d + (e = f)   violates associativity of `+`
There's no good way to parse mixed-associativity expressions; systems can arbitrarily choose a parse, or report an error. My system chooses to report an error.

The most common way to define precedence is by assigning a number to each operator; the higher the number, the higher the precedence of the operator (or sometimes vice versa, confusingly). See Java and Python examples. Another approach is to define relative precedences between some (but not necessarily all) operator pairs, so that a total precedence order need not be defined.

Associativity is defined for each infix and mixfix operator, either as left-, right-, or non-associative. Prefix and postfix operators do not need associativities because they are always right- and left-associative, respectively. As can be seen in these examples, there is only one way to parse these operator expressions:

! ! ! ! x   =>   ! (! (! (! x)))

x ++ ++ ++  =>   ((x ++) ++) ++
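
One simple representation of such tables is a map from operator name to precedence (and associativity, where needed). Here is a sketch in Python; the numbers roughly follow the worked examples later in the article, while the `not` and `? :` entries are made-up values for illustration:

PREFIX  = {'!': 110, '~': 110, '-': 110, 'not': 30}           # name -> precedence
POSTFIX = {'++': 120, '--': 120, '@': 50}                     # name -> precedence
INFIX   = {'+': (80, 'left'), '-': (80, 'left'),              # name -> (precedence, associativity)
           '*': (90, 'left'), '=': (50, 'right')}
MIXFIX  = {('?', ':'): (20, 'right')}                         # (first, second) -> (precedence, associativity)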

A parsing algorithm

Now I'd like to demonstrate a parsing algorithm that works for prefix, infix, mixfix, and postfix operators of arbitrary precedence and associativity.

First, I need to give a rough definition of the expression language we'll parse. Even though, as I mentioned before, BNF-like grammars are a poor means for defining operator expressions, I'll use one informally just to give a succinct outline of what we'll be dealing with:

Expr        ::   PrePostfix  |
                 Infix       |
                 Mixfix

Infix       ::   Expr  InfixOp  Expr

Mixfix      ::   Expr  MixfixOp1  Expr  MixfixOp2  Expr

PrePostfix  ::   PrefixOp(*)  Operand  PostfixOp(*)

Operand     ::   Number  |  Variable
(Of course, this grammar is highly ambiguous, whereas our parser will be unambiguous.) Basically, this grammar says that we can do stuff like:
infix:        x * 123

add prefix
operators:    ! ~ x * 123

add postfix
operators:    ! ~ x ++ -- * 123

add infix
operator:     ! ~ x ++ -- * 123 & y

add mixfix
operator:     ! ~ x ++ -- * 123 & y ? 4 : p
We can have as many prefix and postfix operators as we want surrounding any operand, and infix and mixfix operator expressions recursively use the `Expr` rule.

The basic parsing algorithm needs five rules which must be repeatedly applied in order to parse the operator expressions: one for each type of operator, as well as one for finding operands. Since we're working left-to-right, there's a natural order in which the rules must be applied:

  1. find prefix operators
  2. find the operand
  3. find postfix operators
  4. find an infix operator, if possible, or ...
  5. find the first part of a mixfix operator, an expression, and the second part, if possible
Applying these rules looks like this:
start:         ! ~ x ++ -- * 123 & y ? 4 : p

find prefix
  operators:       x ++ -- * 123 & y ? 4 : p     

find operand:        ++ -- * 123 & y ? 4 : p

find postfix
  operators:               * 123 & y ? 4 : p --

find infix
  or mixfix
  operator:                  123 & y ? 4 : p --
Now, if we found just an operand and postfix operators, we have a complete expression:
x ++   <-- complete expression
However, if we found any prefix operators, we may not yet have found the complete operand; using Python's rules:
not x      parse prefix operator 'not'
x          parse operand variable 'x'
-- done -- complete expression

not 3 + 2 + 1    parse prefix operator 'not'
3 + 2 + 1        parse operand number '3'
+ 2 + 1          parse infix operator '+'
2 + 1            tokens '2', '+', and '1' remaining
-- not done -- `not` has lower precedence than `+`, so we haven't yet found `not`'s operand
If we found an infix or mixfix operator, we have found the left-hand operand, but we definitely haven't found the right-side operand(s):
3 + 4 * 5    parse operand number '3'
+ 4 * 5      parse infix operator '+'
4 * 5
-- not done -- we have not yet found `+`'s right-hand operand

Introducing the Stack

To store the operators whose arguments we have not yet found, we'll use a stack. Each stack frame records an operator's name, precedence, associativity, and any arguments already found. Since it's a stack, frames are only added and removed at one end, which represents where the parser is in the expression. Here's a prefix example:

tokens  |  stack                               |  scratch space  |  action
! ~ x   |  []                                  |                 |  consume prefix operator
~ x     |  []                                  |  !              |  prefix operator, so push
~ x     |  [(!, 110, right)]                   |                 |  consume prefix operator
x       |  [(!, 110, right)]                   |  ~              |  prefix operator, so push
x       |  [(!, 110, right), (~, 110, right)]  |                 |
We can see that as each prefix operator is consumed, an entry is pushed on to the top of the stack. Notice that the associativity is 'right' for prefix operators, and that we haven't found any operands for them yet.

Now let's see what happens when we continue the above example. Once we find the operands, we can start popping stack entries; if we've consumed the entire input, we pop every level of the stack. Along the way we maintain a scratch value: the operand that gets passed to the topmost stack frame, after which that frame is popped and the process repeats:

tokens  |  stack                               |  scratch space  |  action
x       |  [(!, 110, right), (~, 110, right)]  |                 |  consume operand
        |  [(!, 110, right), (~, 110, right)]  |  x              |  tokens empty, so pop
        |  [(!, 110, right)]                   |  (~ x)          |  tokens empty, so pop
        |  []                                  |  (! (~ x))      |  tokens and stack empty, so done

An infix example

The algorithm works similarly for infix and mixfix operators, except that when an entry is pushed onto the stack for such an operator, its left-hand argument has already been found. Let's work through an example:

tokens         |  stack                                 |  scratch space        |  action
a + b * 3 - 4  |  []                                    |                       |  consume operand and infix operator
b * 3 - 4      |  []                                    |  +, a                 |  stack is empty, so push
b * 3 - 4      |  [(+, 80, left, a)]                    |                       |  consume operand and infix operator
3 - 4          |  [(+, 80, left, a)]                    |  *, b                 |  * > +, so push
3 - 4          |  [(+, 80, left, a), (*, 90, left, b)]  |                       |  consume operand and infix operator
4              |  [(+, 80, left, a), (*, 90, left, b)]  |  -, 3                 |  - < *, so pop
4              |  [(+, 80, left, a)]                    |  -, (* b 3)           |  - == +, but is left-associative, so pop
4              |  []                                    |  -, (+ a (* b 3))     |  stack is empty, so push
4              |  [(-, 80, left, (+ a (* b 3)))]        |                       |  consume operand
               |  [(-, 80, left, (+ a (* b 3)))]        |  4                    |  input is empty, so pop
               |  []                                    |  (- (+ a (* b 3)) 4)  |  input and stack empty, so done
The association problem is continually decided based on operator precedences and associativities, and is implemented through pushes and pops to the pending operator stack.

A postfix example

Postfix operators do not require stack pushes, but may require pops -- since their operands are always to the left, meaning that further parsing is not needed to find them. Here's a small example; assume the precedences of +, --, and @ are 80, 120, and 50 respectively:

tokens      |  stack               |  scratch space     |  action
x + y -- @  |  []                  |  (none)            |  consume operand and infix operator
y -- @      |  []                  |  +, x              |  stack empty, so push
y -- @      |  [(+, 80, left, x)]  |  (none)            |  consume operand and postfix operator
@           |  [(+, 80, left, x)]  |  --, y             |  -- > +, so apply -- to y
@           |  [(+, 80, left, x)]  |  (-- y)            |  consume postfix operator
            |  [(+, 80, left, x)]  |  @, (-- y)         |  @ < +, so pop
            |  []                  |  @, (+ x (-- y))   |  stack empty, so apply @ to arg
            |  []                  |  (@ (+ x (-- y)))  |  stack and input empty, so done

To sum up how the algorithm works (a code sketch follows the list):

  • use a stack to represent operators that we're not done parsing yet
  • relative precedence and associativity of the current operator and the topmost operator on the stack tell us whether we need to push or pop
  • use a scratch space to temporarily hold an operator and operand, if necessary
  • prefix operators always push a stack frame
  • postfix operators may pop stack frames
  • infix and mixfix operators may pop 0 or more frames, followed by pushing a single new frame
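
Here is a minimal sketch of those rules in Python, under a few simplifying assumptions: mixfix operators are omitted, the input is a pre-tokenized list of strings, operands are left as bare strings, and the operator tables are hypothetical. It is meant only to make the push/pop rules concrete, not to reproduce the implementation from my github project:

# Hypothetical operator tables; the precedences follow the worked examples above.
PREFIX  = {'!': 110, '~': 110, '-': 110}                 # name -> precedence
POSTFIX = {'++': 120, '--': 120, '@': 50}                # name -> precedence
INFIX   = {'+': (80, 'left'), '-': (80, 'left'),
           '*': (90, 'left'), '=': (50, 'right')}        # name -> (precedence, associativity)

def parse(tokens):
    toks = list(tokens)
    stack = []    # frames: (operator, precedence, associativity, left argument or None)

    def pop_frames(prec, assoc, scratch):
        # pop every frame that binds at least as tightly as the operator being considered
        while stack:
            top_op, top_prec, top_assoc, left = stack[-1]
            if top_prec > prec or (top_prec == prec and assoc == top_assoc == 'left'):
                stack.pop()
                scratch = (top_op, scratch) if left is None else (top_op, left, scratch)
            elif top_prec == prec and assoc != top_assoc:
                raise SyntaxError('equal precedence but opposite associativity')
            else:
                break
        return scratch

    while True:
        while toks and toks[0] in PREFIX:                # rule 1: prefix operators always push
            op = toks.pop(0)
            stack.append((op, PREFIX[op], 'right', None))
        scratch = toks.pop(0)                            # rule 2: the operand
        while toks and toks[0] in POSTFIX:               # rule 3: postfix operators may pop
            op = toks.pop(0)
            scratch = (op, pop_frames(POSTFIX[op], 'left', scratch))
        if not toks:                                     # input exhausted: pop everything
            return pop_frames(-1, 'left', scratch)
        op = toks.pop(0)                                 # rule 4: infix operator -- pop, then push
        prec, assoc = INFIX[op]
        stack.append((op, prec, assoc, pop_frames(prec, assoc, scratch)))

print(parse(['a', '+', 'b', '*', '3', '-', '4']))   # ('-', ('+', 'a', ('*', 'b', '3')), '4')

Running it on the infix example from earlier produces the same tree that the table trace ended with.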

Disambiguating ambiguities

It's great to be able to use `-` and `+` both as binary infix operators and as unary prefix ones. Can this algorithm deal with those cases? Easily! Let's work through a short example:

tokens      |  stack                                  |  scratch space   |  action
140 - - 26  |  []                                     |                  |  consume operand and infix operator
- 26        |  []                                     |  -, 140          |  stack empty, so push
- 26        |  [(-, 80, left, 140)]                   |                  |  consume prefix operator
26          |  [(-, 80, left, 140)]                   |  -               |  prefix operator, so push
26          |  [(-, 80, left, 140), (-, 110, right)]  |                  |  consume operand
            |  [(-, 80, left, 140), (-, 110, right)]  |  26              |  input empty, so pop
            |  [(-, 80, left, 140)]                   |  (- 26)          |  input empty, so pop
            |  []                                     |  (- 140 (- 26))  |  input and stack empty, so done
The algorithm handles the problem by reading the two `-` tokens in two different contexts: when it reads the first one, the algorithm is expecting an infix operator; once an infix operator is found, the algorithm looks for an operand. Since an operand can have prefix operators, it next looks for those.

Operators can also be used in both prefix and postfix contexts, or prefix and mixfix, without confusing the parser. However, using an operator as both postfix and infix/mixfix *will* screw things up, since the parser will no longer know when to stop parsing postfix operators and switch over to infix/mixfix.

User extensibility

One of the major advantages of operator parsing is its extensibility: to create a brand new operator, you just need to add an entry to your operator tables giving its precedence and associativity (where applicable). This allows users to define their own operators. For an example, check out the Haskell programming language, which allows user-defined operators as described here.
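
For instance, with the sketch from the previous section (and assuming its `parse` function and `INFIX` table are in scope), a user-defined operator is nothing more than a new table entry; the operator and its precedence here are made up:

INFIX['<+>'] = (85, 'left')                   # hypothetical user-defined operator
print(parse(['a', '<+>', 'b', '*', 'c']))     # ('<+>', 'a', ('*', 'b', 'c'))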

Limitations

For simplicity, I intentionally chose an algorithm that is more limiting than the ones used by Pratt and Crockford. The problem is caused by the forced classification of operators into prefix, postfix, infix, or mixfix. If you have an operator that doesn't fit into one of those categories exactly, you will have to extend the algorithm to deal with it. An example is Python's `lambda` operator, since it requires a parameter list after the keyword and before the `:`, but is otherwise similar to a prefix operator.

In the Pratt & Crockford method, the tokens themselves are responsible for deciding what parse rules to use. This allows any grammatical rule to be parsed as an operator expression, but again, I find it more difficult to comprehend and demonstrate, precisely because it's so flexible.

Wrap up

Well, this has been quite a long article! I hope you enjoyed reading it as much as I did writing it!

If you'd like to see the parser in action, you can find my implementation of the algorithm on github here. In addition to the actual code, there's a decent suite of tests to ensure that tricky corner cases are handled correctly. Might be worth a look!

Lastly, I'd like to mention again that this was all inspired by the works of Pratt and Crockford, which are excellent and definitely worth reading.

Wednesday, July 3, 2013

Parsing the NMR-Star format

The NMR-Star format

NMR, or Nuclear Magnetic Resonance, is a technique for studying molecules at the atomic level; it is also the physical principle behind MRI machines. The Biological Magnetic Resonance Data Bank is a great resource for archived NMR data, which can be accessed as text files in the NMR-Star format.

However, the NMR-Star format has a number of tricky corner cases that I had to deal with while writing my NMR-Star parser. This article will discuss the problems and their solutions. The full code can be found on github in my NMR-Star parser project.

The parsing strategy

I used parser combinators supporting error reporting, backtracking, and line/column position tracking, and added a couple of extra phases for clean-up:

  • scanner. Breaks the input text into a sequence of tokens
  • token clean-up. Discards whitespace and comment tokens, and classifies unquoted strings as keywords or values
  • CST parsing. Builds a concrete syntax tree from the token sequence
  • AST construction. Checks context-sensitive constraints while building an abstract syntax tree

Problem: whitespace/comments between tokens and semicolon-delimited strings

The NMR-Star format allows insignificant whitespace and comments between tokens. A standard parsing solution is to discard whitespace and comments after every token. However, NMR-Star defines semicolon-delimited strings to be opened by the sequence "newline, semicolon" and closed by the same sequence, which breaks the "munch junk after every token" rule: the newline would be counted as whitespace and discarded before the semicolon-string rule is ever tried. For example:

  ;abc  # <-- not a semicolon-string because preceding character is not a newline

;abc  # <-- semicolon-string because preceding character *is* a newline
def
; 

In my first solution to this problem, I pre-munched junk before every token. Semicolon-delimited strings got a special munching rule, which verified that the last junk character before the opening semicolon was indeed a newline. However, this had a couple disadvantages: 1) position reporting was much more difficult than with post-munching, since junk had to be parsed before finding the beginning of the next token, and 2) it was a special case, which complicated the specification and implementation.

My second solution took advantage of the position information -- that the opening semicolon immediately follows a newline implies that it is in column 1 of its line. So, the parser must inspect the position before deciding to try parsing a semicolon-delimited string or not. This allowed one single rule for insignificant whitespace and comments, and also allowed me to use post-munching.

def _sc_rest(position):
    _, column = position
    # a semicolon-delimited string must be preceded by a newline -- thus, column must be 1
    if column == 1:
        return node('scstring',
                    ('open', sc),
                    ('value', many0(not1(_end_sc))),
                    ('close', cut('newline-semicolon', _end_sc)))
    return zero
    
scstring = bind(getState, _sc_rest)

Problem: semicolons inside semicolon-delimited strings

Since the newline/semicolon sequence closes semicolon-delimited strings, this means that semicolons can not be the first character on a line within a semicolon-string. For example:


; this ; starts the string
 this ; is *within* the string and does not end it
; # this ; ends the string, because it follows a newline
; # which means this is an extraneous ;
This problem is solved with a single character of lookahead: while parsing a semicolon-string, if a newline is encountered and the next character is a semicolon, end the string; otherwise, continue parsing the semicolon-string.
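
As a plain-Python illustration of just this lookahead rule (separate from the combinator-based parser), the logic might look something like:

def read_scstring(text, i):
    """Assumes text[i] is an opening ';' in column 1; returns (value, next index)."""
    chars = []
    i += 1
    while i < len(text):
        if text[i] == '\n' and i + 1 < len(text) and text[i + 1] == ';':
            return ''.join(chars), i + 2    # the newline/semicolon sequence closes the string
        chars.append(text[i])
        i += 1
    raise SyntaxError('unterminated semicolon-delimited string')

# read_scstring('; abc\n def ; ghi\n; rest', 0)  ->  (' abc\n def ; ghi', 18)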

Problem: non-ending quotes of quoted values

NMR-Star has double-quoted values:

"abc 123"
Double-quotes can be within the string, as long as they are not followed by whitespace or EOF. Thus, this is a single, valid double-quoted value, since the second '"' is followed by '1':
"abc"123"
This is easily solved using a single character of lookahead: if a '"' is followed by whitespace (or EOF), end the string; if it is followed by anything else, consume the '"' and continue parsing the string.

NMR-Star also has single-quoted values, which work in the same way.
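
Again as a plain-Python sketch of just the lookahead rule (parameterized over the quote character), not of the full combinator-based value parser:

def read_quoted(text, i, quote='"'):
    """Assumes text[i] is the opening quote; returns (value, next index)."""
    chars = []
    i += 1
    while i < len(text):
        if text[i] == quote and (i + 1 == len(text) or text[i + 1] in ' \t\r\n'):
            return ''.join(chars), i + 1    # quote followed by whitespace or EOF ends the value
        chars.append(text[i])
        i += 1
    raise SyntaxError('unterminated quoted value')

# read_quoted('"abc"123" rest', 0)  ->  ('abc"123', 9)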

Problem: keyword vs. unquoted value

The lexical definitions of keywords and unquoted values in the original Star grammar are ambiguous; this is resolved with an additional rule stating that unquoted values cannot be keywords. For example:

loop_  # <- 'loop_' is a keyword, so it can not be an unquoted value
 _a _b
 loop_1  # <- 'loop_1' is not a keyword, so it can be an unquoted value
 loop_2 
stop_
An alternate definition, and the one which I used in my parser, is to match the longest string of the unquoted value pattern; then classify the string as either a keyword or an unquoted value in the token clean-up phase. This avoids unnecessary grammatical ambiguity and more clearly captures the intent of the specification.

unquoted = node('unquoted',
                ('first', not1(special)),
                ('rest', many0(not1(space))))
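
The clean-up-phase classification is then a simple lookup; the keyword set below is abbreviated for illustration, not the full set from the spec:

KEYWORDS = set(['loop_', 'stop_', 'save_', 'global_'])    # illustrative subset

def classify(text):
    return ('keyword', text) if text in KEYWORDS else ('value', text)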

Problem: context sensitive rules

There are a number of constructs in the NMR-Star format that are context-sensitive, meaning that they cannot be parsed correctly using a context-free grammar/parser:

  • no duplicate save frame names
  • no duplicate identifiers in save frames
  • no duplicate identifiers in loops
  • framecode references
  • the number of values in a loop must be an integer multiple of the number of identifiers
For example, the following loop follows the context-free rules but has duplicate keys:
  loop_

    _a _b _a

    v1 v2 v3
    v4 v5 v6

  stop_
What we'd like to have the parser do is report an error including location in the input and the nature of the violation.

One way to solve this problem is to use a context-sensitive grammar formalism; however, I've never tried this approach. One important disadvantage is that it would prevent context-free parsing of files that violate the context-sensitive rules -- which may be a pain if you have invalid data.

A second approach is to parse the input according to the context-free rules and generate a concrete syntax tree. Then, run a second pass over the CST to produce the abstract syntax tree. The second pass is now responsible for implementing the context-sensitive rules. This is the method I used and seems to work well in practice. It also seems to be easier to maintain due to better separation of concerns.
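
As a sketch of what such a second pass might look like for two of the rules, using a hypothetical dictionary-shaped CST node for loops (not the actual node type from my parser):

def check_loop(loop_node):
    """loop_node: {'identifiers': [(name, position), ...], 'values': [...]}"""
    errors, seen = [], {}
    for name, position in loop_node['identifiers']:
        if name in seen:
            errors.append(('duplicate identifier in loop', name, position))
        else:
            seen[name] = position
    n_ids = len(loop_node['identifiers'])
    if n_ids == 0 or len(loop_node['values']) % n_ids != 0:
        errors.append(('number of values is not a multiple of number of identifiers', None, None))
    return errors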

Wrap up

As far as formats go, the NMR-Star format is relatively clear and succinct. However, as I've shown, it does have a few pitfalls for the unwary, as well as some redundancies.

Scannerless Parsing: a JSON case study

Scannerless parsing

Scannerless parsing is not a very popular or well-understood approach to building parsers. So many parsing tools take it for granted that parser implementors want/need/accept separate scanning and context-free parsing phases that it's hard to get good information on the how, what, and why of scannerless parsers. That is, one would like to know:

  • What is 'scannerless parsing'?
  • What's so different about it?
  • How are scannerless parsers implemented?
  • What are some of the advantages?
  • What are some of the drawbacks, pitfalls, and difficulties?
I will explore these issues using a parser that I recently built, for the JSON data format, as a case study. I hope this will provide some practical insight into scannerless parsers!

What is scannerless parsing?

A scannerless parser is a parser that does not arbitrarily separate parsing into lexical and context-free phases; instead, these are combined into a single phase, as we'll see in the rest of this section.

Programming and data languages are often defined by grammars; parsers are programs that implement grammars; they read in strings and: 1) decide whether the string is part of the language; 2) build a structure representing the input; and 3) report any problems with the input.

Common tools for converting grammars to executable parsers (see the Wikipedia page for examples) often enforce an artificial separation of parsing into two stages. In the first stage, often called lexing (or tokenization or lexical analysis), the input string is broken up into a sequence of tokens. These tokens are then used as the input for the second stage.

I'm not sure what the next stage is called; I've seen it called context-free parsing, hierarchical parsing, and simply parsing. But the important point is that in this stage, the tokens are assembled into syntactic units to build some kind of parse tree which represents the structure that was recognized in the input string.

Each phase has its own separate grammar to describe what language it accepts. In a typical parser, the lexical grammar is (or tries to be) regular in the formal sense, and may require only minimal lookahead and/or backtracking; these characteristics help the tokenizer to do its job faster. The grammar for the second phase is context-free, and its terminals are tokens. Sometimes the grammars are ambiguous (meaning some strings can be parsed multiple ways); ambiguities are often resolved using additional rules outside the grammar.

This separation of parsing into two phases is completely arbitrary; it is perfectly possible to create a single-phase parser that accepts the exact same language as a two-phase parser. After all, it's quite easy to combine the regular token grammar into the hierarchical context-free grammar. Advantages of the two-phase approach are that it's familiar, tools implementing it are often quite fast, and the separate grammars may be simpler than a single combined grammar.

On the other hand, there are also important disadvantages when splitting parsing into two phases:

  • Artificially constraining the token grammar can create unnecessary ambiguities. See the lexer hack for an example. These types of ambiguities are not inherent in the language itself, but are caused by the technology. Another example occurs with Java's generics; see this question. It's solved by reinterpreting the token stream during the second phase, if necessary.
  • More phases means more glue code between phases. Information such as token type and position must pass from the lexer to the parser; data may also be passed back to the lexer as in one resolution of the lexer hack. Also, error reporting and handling between the phases must be accounted for.
  • Language composability may be reduced.
  • Language description is more complicated. Two parallel grammars must be maintained, and the interactions between them well understood in order to effectively make changes.
  • Two tools and all their idiosyncrasies, as well as the interface between them, must be mastered.

Parser combinators

The parsers will all be built using parser combinators; if you've worked with parser combinators before, it should be straightforward to follow along (aside from any idiosyncrasies of my approach). If you've never used them, there are plenty of libraries in every language which are easy to download, or you can check out my library. I prefer using parser combinators because:

  • they are powerful -- they parse not only regular and context-free, but also context-sensitive grammars
  • they are expressive -- parsers are approximately as long as the grammar they implement
  • as they are expressed within a programming language, they benefit from that language's ecosystem, including functions, classes, unit testing libraries and tools, build tools, type systems, runtime, JIT compiler, etc. This also means that they can be extended when new patterns are identified.
  • they are composable. Parsers are built by successively combining parsers into ever-larger parsers; combinator libraries come with a rich set of functions for combining parsers.
  • they are easy to test. Since each parser -- no matter how small -- is an object, it can be tested independently. This is a big win for me -- if I'm not confident that I have the little things right, I know I have no chance of getting the big things right.
  • easy to deploy and use. Since both parsers and the combinators are libraries, one needs only to import them, then hand them data to get started.
  • results, such as parse trees, are easy to build while parsing
  • they allow good, meaningful error messages to be produced

The combinators I'll use support four computational effects:

  1. backtracking. This allows choice between alternatives.
  2. error reporting. Errors will not be created using exceptions, but rather through a special datatype.
  3. state(1). The token sequence.
  4. state(2). This will be used to keep track of the position (line and column) in the input that the parser is at.

The how of scannerless parsers

Now let's get started! I'll point out problems that I faced, then describe how my scannerless parser solved them. The full code can be found on github here.

Problem: whitespace/comments between tokens

Many languages, including JSON, allow insignificant whitespace between tokens. This means whitespace must be discarded before and after tokens. One solution is to implement a parser for each token, and then wrap them with a combinator which runs the token parser and discards junk before or after the parser. Should we pre-munch or post-munch the junk? I found two advantages of post-munching: 1) after parsing a token, the parser always stops at the beginning of the next token, making it easy to report position; 2) if the parser must backtrack on a token, it will not have to re-parse the junk, only to discard it again. Of course, pre-munching before the very first token is still necessary.

This implements a `whitespace` parser and a `tok` combinator:

whitespace = many0(oneOf(' \t\n\r'))

def tok(parser):
    return seq2L(parser, whitespace)
The `tok` combinator implements post-munching by running its parser argument and then running the whitespace parser; its result is that of the parser argument. Note that the whitespace parser can't fail, since it matches zero or more whitespace characters -- this allows there to not be any whitespace after tokens.

Parsing a single token

By getting rid of the scanning phase, we haven't thrown the baby out with the bathwater, have we? Of course not! Since context-free grammars are more powerful than regular ones, it is no problem to express token patterns. One can simply create a parser for each token type. Since the JSON specification defines lexical syntax as a regular language, the token parsers simply don't use the more powerful features of CFGs.

The implementation uses the `literal` combinator:

os = tok(literal('['))
The `literal` combinator matches a given character exactly; the resulting parser is then passed to `tok` to allow for optional trailing whitespace, which is post-munched.

Problem: position tracking

Keeping track of the line and column number while parsing allows position-specific errors to be reported, and also construction of a parse tree that includes position information. The latter is useful if you need to link later analyses to positions in the file.

One method for tracking position is to pre-process the input, converting a sequence of characters into a new sequence of annotated characters, including both the original character and the calculated position. This approach seems to work okay, but has a few drawbacks. First, many of the combinators have to be updated to deal with matching annotated characters, which means they are no longer compatible with simple characters, so you need two versions. Second, you have to extract the underlying characters; like the first problem, it's not a big deal but is certainly annoying. Third, you can only look at the position by monadically pulling a token into scope, which means: 1) extra parameters to pass around, and 2) you can't look at the position if the token sequence is empty.

A separate approach is to track the position using monadic state. This has the advantages that the parser combinators work the exact same way and do not need to be updated, the values don't have to be fiddled with to extract the underlying characters, and you can look at the position whenever necessary using the appropriate combinator. One disadvantage is that under backtracking, position calculations may be repeated.
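
The position update itself is tiny; here is a sketch of the kind of helper the state-based approach relies on (the names are mine, not the library's):

def update_position(char, position):
    line, column = position
    return (line + 1, 1) if char == '\n' else (line, column + 1)

# functools.reduce(lambda pos, c: update_position(c, pos), 'ab\nc', (1, 1))  ->  (2, 2)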

Problem: parsing complex syntactic structures

The JSON format defines several multi-token syntactic structures -- key/value pairs, arrays, and objects. Again, for each of these I implemented a single parser. However, these parsers don't touch the token sequence directly, but only indirectly through the token parsers. In other words, each structural parser is implemented as a combination of token parsers.

This is similar to how the two-phase parsing strategy works. However, there is a key difference -- the structural parsers use only the token parsers because they *choose* to; it is no problem to use a different tokenization strategy at any time if necessary.

The syntax rules for array, object, and value are mutually recursive; however, Python's variable-binding semantics are not let-rec compatible, so a little trick is used to mock a forward declaration:

array = error('unimplemented')

value = alt(jsonstring, number, keyword, obj, array)

array.parse = node('array',
                   ('open', os),
                   ('body', sepBy0(value, comma)),
                   ('close', cut('close', cs))).parse
The `value` rule then refers to the `array` object; later, we replace the `parse` attribute of `array` with the actual parsing function that we want. Ugly, but effective. As for the rest of the parser, note how it doesn't touch the characters directly -- it is built in terms of the token parsers (and `value`).

Problem: keyword matching

There are three literal keywords in the JSON spec. Using the `string` parser, which matches a sequence of tokens exactly, a keyword is matched:

_keyword = node('keyword', 
                ('value', alt(*map(string, ['true', 'false', 'null']))))
If the keyword is matched, then the appropriate keyword is presented as the return value.

Strings: hex escape sequences

The JSON spec allows four-digit hexadecimal escape sequences in string literals, opened by the sequence '\u', and followed by four hexadecimal digits (case-insensitive):

_hexC = oneOf('0123456789abcdefABCDEF')

_unic = node('unicode escape',
             ('open', string('\\u')),
             ('value', cut('4 hexadecimal digits', quantity(_hexC, 4))))

What's missing

This article doesn't cover the details of error-reporting, which was extremely complicated. While correctly reporting the where and why of malformed input is critical in a real parser, the JSON spec does not explicitly say what is an error (it's implied), so it is difficult to know what should be reported. This article also doesn't cover the design, implementation, and combinators of the parsing library. If you would like to see more about either of these topics, check out the code on github!