Wednesday, July 3, 2013

Scannerless Parsing: a JSON case study

Scannerless parsing

Scannerless parsing is not a very popular or well-understood approach to building parsers. So many parsing tools take it for granted that parser implementors want/need/accept separate scanning and context-free parsing phases that it's hard to get good information on the how, what, and why of scannerless parsers. That is, one would like to know:

  • What is 'scannerless parsing'?
  • What's so different about it?
  • How are scannerless parsers implemented?
  • What are some of the advantages?
  • What are some of the drawbacks, pitfalls, and difficulties?

I will explore these issues using a parser that I recently built for the JSON data format as a case study. I hope this will provide some practical insight into scannerless parsers!

What is scannerless parsing?

A scannerless parser is a parser that does not arbitrarily separate parsing into lexical and context-free phases; instead, these are combined into a single phase, as we'll see in the rest of this section.

Programming and data languages are often defined by grammars, and parsers are programs that implement those grammars: they read in strings and 1) decide whether the string belongs to the language, 2) build a structure representing the input, and 3) report any problems with the input.

Common tools for converting grammars to executable parsers (see the Wikipedia page for examples) often enforce an artificial separation of parsing into two stages. In the first stage, often called lexing (or tokenization or lexical analysis), the input string is broken up into a sequence of tokens. These tokens are then used as the input for the second stage.

I'm not sure what the next stage is called; I've seen it called context-free parsing, hierarchical parsing, and simply parsing. But the important point is that in this stage, the tokens are assembled into syntactic units to build some kind of parse tree which represents the structure that was recognized in the input string.
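
As a schematic example (not tied to any particular tool, and with the token and tree representations simplified), here is how the two phases might treat a small JSON document:

text   = '{"a": [1, 2]}'

# phase 1 (lexing): break the text into tokens, discarding whitespace
tokens = ['{', '"a"', ':', '[', '1', ',', '2', ']', '}']

# phase 2 (parsing): assemble the tokens into a parse tree
tree   = ('object', [('pair', '"a"', ('array', ['1', '2']))])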

Each phase has its own separate grammar to describe what language it accepts. In a typical parser, the lexical grammar is (or tries to be) regular in the formal sense, and may require only minimal lookahead and/or backtracking; these characteristics help the tokenizer to do its job faster. The grammar for the second phase is context-free, and its terminals are tokens. Sometimes the grammars are ambiguous (meaning some strings can be parsed multiple ways); ambiguities are often resolved using additional rules outside the grammar.

This separation of parsing into two phases is completely arbitrary; it is perfectly possible to create a single-phase parser that accepts the exact same language as a two-phase parser. After all, it's quite easy to combine the regular token grammar into the hierarchical context-free grammar. Advantages of the two-phase approach are that it's familiar, tools implementing it are often quite fast, and the separate grammars may be simpler than a single combined grammar.

On the other hand, there are also important disadvantages when splitting parsing into two phases:

  • Artificially constraining the token grammar can create unnecessary ambiguities. See the lexer hack for an example. These types of ambiguities are not inherent in the language itself, but are caused by the technology. Another example occurs with Java's generics, where the `>>` closing a nested generic type can be lexed as a right-shift operator; see this question. It's solved by reinterpreting the token stream during the second phase, if necessary.
  • More phases means more glue code between phases. Information such as token type and position must pass from the lexer to the parser; data may also be passed back to the lexer as in one resolution of the lexer hack. Also, error reporting and handling between the phases must be accounted for.
  • Language composability may be reduced.
  • Language description is more complicated. Two parallel grammars must be maintained, and the interactions between them well understood in order to effectively make changes.
  • Two tools and all their idiosyncrasies, as well as the interface between them, must be mastered.

Parser combinators

The parsers will all be built using parser combinators; if you've worked with parser combinators before, it should be straightforward to follow along (aside from any idiosyncrasies of my approach). If you've never used them, there are plenty of libraries in every language that are easy to download, or you can check out my library. I prefer using parser combinators because:

  • they are powerful -- they can parse not only regular and context-free grammars, but also context-sensitive ones
  • they are expressive -- parsers are approximately as long as the grammar they implement
  • as they are expressed within a programming language, they benefit from that language's ecosystem, including functions, classes, unit testing libraries and tools, build tools, type systems, runtime, JIT compiler, etc. This also means that they can be extended when new patterns are identified.
  • they are composable. Parsers are built by successively combining parsers into ever-larger parsers; combinator libraries come with a rich set of functions for combining parsers.
  • they are easy to test. Since each parser -- no matter how small -- is an object, it can be tested independently. This is a big win for me -- if I'm not confident that I have the little things right, I know I have no chance of getting the big things right.
  • they are easy to deploy and use. Since both the parsers and the combinators are libraries, one need only import them and hand them data to get started.
  • results, such as parse trees, are easy to build while parsing
  • they allow good, meaningful error messages to be produced

The combinators I'll use support four computational effects:

  1. backtracking. This allows choice between alternatives.
  2. error reporting. Errors will not be created using exceptions, but rather through a special datatype.
  3. state(1). This will be used to keep track of the remaining token sequence.
  4. state(2). This will be used to keep track of the position (line and column) that the parser has reached in the input.
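
To make those effects concrete, here is a minimal sketch of the shape such a parser can take. This is an illustration only -- the names and representation are mine, not my library's actual implementation:

# a parser here is a function from (chars, position) to either a failure
# or a success carrying the parsed value, the remaining input, and the
# updated position; errors are ordinary data, not exceptions
def success(value, rest, position):
    return ('success', value, rest, position)

def failure(message, position):
    return ('failure', message, position)

def item(chars, position):
    # consume a single character, updating the (line, column) state
    if not chars:
        return failure('unexpected end of input', position)
    first, (line, column) = chars[0], position
    new_position = (line + 1, 1) if first == '\n' else (line, column + 1)
    return success(first, chars[1:], new_position)

def either(p1, p2):
    # backtracking choice: if p1 fails, run p2 on the original input
    def parser(chars, position):
        result = p1(chars, position)
        return result if result[0] == 'success' else p2(chars, position)
    return parser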

The how of scannerless parsers

Now let's get started! I'll point out problems that I faced, then describe how my scannerless parser solved them. The full code can be found on GitHub here.

Problem: whitespace/comments between tokens

Many languages, including JSON, allow insignificant whitespace between tokens. This means whitespace must be discarded before and after tokens. One solution is to implement a parser for each token, and then wrap them with a combinator which runs the token parser and discards junk before or after the parser. Should we pre-munch or post-munch the junk? I found two advantages of post-munching: 1) after parsing a token, the parser always stops at the beginning of the next token, making it easy to report position; 2) if the parser must backtrack on a token, it will not have to re-parse the junk, only to discard it again. Of course, pre-munching before the very first token is still necessary.

This is implemented with a `whitespace` parser and a `tok` combinator:

whitespace = many0(oneOf(' \t\n\r'))

def tok(parser):
    return seq2L(parser, whitespace)

The `tok` combinator implements post-munching: it runs its parser argument and then the whitespace parser, returning the result of the parser argument. Note that the whitespace parser can't fail, since it matches zero or more whitespace characters -- so tokens need not be followed by any whitespace at all.
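
The one remaining pre-munch can then live in the top-level rule. As a sketch -- assuming a `seq2R` combinator that runs two parsers and keeps only the second result (the actual name and organization in the repository may differ), and using the `value` parser defined later in this article:

json = seq2R(whitespace, value)   # discard leading junk, then parse the top-level value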

Parsing a single token

By getting rid of the scanning phase, we haven't thrown the baby out with the bathwater, have we? Of course not! Since context-free grammars are more powerful than regular ones, it is no problem to express token patterns. One can simply create a parser for each token type. Since the JSON specification defines lexical syntax as a regular language, the token parsers simply don't use the more powerful features of CFGs.

The implementation uses the `literal` combinator:

os = tok(literal('['))

`literal` matches a given character exactly; the resulting parser is then passed to `tok`, which post-munches any optional trailing whitespace.
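
The remaining single-character tokens can be defined the same way. In this sketch, the names other than `os` are my own guesses and may differ from those used in the repository:

cs    = tok(literal(']'))   # close square bracket
oc    = tok(literal('{'))   # open curly brace
cc    = tok(literal('}'))   # close curly brace
comma = tok(literal(','))
colon = tok(literal(':'))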

Problem: position tracking

Keeping track of the line and column number while parsing allows position-specific errors to be reported, and also construction of a parse tree that includes position information. The latter is useful if you need to link later analyses to positions in the file.

One method for tracking position is to pre-process the input, converting a sequence of characters into a new sequence of annotated characters, each including both the original character and the calculated position. This approach seems to work okay, but has a few drawbacks. First, many of the combinators have to be updated to deal with matching annotated characters, which means they are no longer compatible with simple characters, so you need two versions. Second, you have to extract the underlying characters; like the first problem, it's not a big deal but is certainly annoying. Third, you can only look at the position by monadically pulling a token into scope, which means: 1) extra parameters to pass around, and 2) you can't look at the position if the token sequence is empty.
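
For concreteness, a pre-processing pass of this kind might look something like the following (a standalone sketch, not code from the parser itself):

# annotate each character with the line and column at which it occurs
def annotate(text):
    line, column = 1, 1
    annotated = []
    for char in text:
        annotated.append((char, line, column))
        if char == '\n':
            line, column = line + 1, 1
        else:
            column += 1
    return annotated

# annotate('ab\ncd')  =>  [('a', 1, 1), ('b', 1, 2), ('\n', 1, 3), ('c', 2, 1), ('d', 2, 2)]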

A different approach is to track the position using monadic state. This has several advantages: the parser combinators work exactly the same way and do not need to be updated, the values don't have to be unwrapped to extract the underlying characters, and you can look at the position whenever necessary using the appropriate combinator. One disadvantage is that under backtracking, position calculations may be repeated.

Problem: parsing complex syntactic structures

The JSON format defines several multi-token syntactic structures -- key/value pairs, arrays, and objects. Again, for each of these I implemented a single parser. However, these parsers don't touch the token sequence directly, but only indirectly through the token parsers. In other words, each structural parser is implemented as a combination of token parsers.

This is similar to how the two-phase parsing strategy works. However, there is a key difference -- the structural parsers use only the token parsers because they *choose* to; it is no problem to use a different tokenization strategy at any time if necessary.

The syntax rules for array, object, and value are mutually recursive; however, Python's variable-binding semantics are not let-rec compatible, so a little trick is used to mock a forward declaration:

array = error('unimplemented')

value = alt(jsonstring, number, keyword, obj, array)

array.parse = node('array',
                   ('open', os),
                   ('body', sepBy0(value, comma)),
                   ('close', cut('close', cs))).parse

The `value` rule then refers to the `array` object; later, we replace the `parse` attribute of `array` with the actual parsing function that we want. Ugly, but effective. As for the rest of the parser, note how it doesn't touch the characters directly -- it is built in terms of the token parsers (and `value`).
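
The object and key/value parsers follow the same pattern; since `obj` also participates in the mutual recursion (it appears in the `alt` above), it presumably needs the same forward-declaration trick. A sketch, with names and node labels that are my approximations rather than the repository's exact code:

keyVal = node('key/value pair',
              ('key', jsonstring),
              ('colon', cut('colon', colon)),
              ('value', value))

obj.parse = node('object',
                 ('open', oc),
                 ('body', sepBy0(keyVal, comma)),
                 ('close', cut('close', cc))).parse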

Problem: keyword matching

There are three literal keywords in the JSON spec. Each is matched using the `string` combinator, which matches a given sequence of tokens exactly:

_keyword = node('keyword', 
                ('value', alt(*map(string, ['true', 'false', 'null']))))

If one of the keywords is matched, it is presented as the value of the resulting node.

Strings: hex escape sequences

The JSON spec allows four-digit hexadecimal escape sequences in string literals, opened by the sequence '\u', and followed by four hexadecimal digits (case-insensitive):

_hexC = oneOf('0123456789abcdefABCDEF')

_unic = node('unicode escape',
             ('open', string('\\u')),
             ('value', cut('4 hexadecimal digits', quantity(_hexC, 4))))
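
For context, this escape parser slots into the string-literal rule alongside ordinary characters and simple escapes. Here is a rough sketch of that rule -- it assumes a `not1` combinator meaning "any character not matched by the given parser", and its names and node labels are approximations, so the repository's actual rule differs in detail:

# ordinary character: anything except a backslash or a double quote
_char = node('character', ('value', not1(oneOf('\\"'))))

# simple escape: a backslash followed by a single escape character
_escape = node('escape',
               ('open', literal('\\')),
               ('value', oneOf('"\\/bfnrt')))

# string literal: a double-quoted sequence of characters and escapes,
# post-munched like any other token
jsonstring = tok(node('string',
                      ('open', literal('"')),
                      ('value', many0(alt(_char, _unic, _escape))),
                      ('close', cut('double-quote', literal('"')))))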

What's missing

This article doesn't cover the details of error reporting, which was extremely complicated: while correctly reporting the where and why of malformed input is critical in a real parser, the JSON spec does not explicitly say what counts as an error (it's implied), so it is difficult to know what should be reported. This article also doesn't cover the design, implementation, and combinators of the parsing library. If you would like to see more about either of these topics, check out the code on GitHub!
