Tuesday, November 6, 2012

A classy approach to parser combinators

Parser combinators

Parser combinators serve as a great introduction to Functional Programming, and are one of the most-studied topics in the field. Nevertheless, they are a very complex and broad topic, covering concepts such as non-determinism, monads, and higher-order functions.

What people may not get as much exposure to, at least in my experience, is many of the Haskell type classes and their relationship to parsers. As we'll see in this article, the most common and useful combinators are actually parser-specific versions of more widely useful generic operations.

Type classes

The definitions of the type classes used are based on both the standard Haskell classes of the same name (minus the '), and Brent Yorgey's Typeclassopedia. I've added a ' to the end of each of their names to indicate that they're not identical to the standard Haskell classes, and in some cases are quite different. I've also added one type class of my own -- Switch' -- which represents the ability to convert a failing computation into a successful one with a default value, and to convert a successful computation into a failing one. I wasn't able to find a type class providing this functionality on Hoogle.

Parser definition and basic combinators

Following convention, parsers are modeled as functions that operate on token streams, either producing a result paired with the rest of the token stream, or failing. A convenient choice for representing possible failure is the Maybe data type.

newtype Parser t a = Parser { 
        getParser :: [t] -> Maybe ([t], a) 
    }

run :: Parser t a -> [t] -> Maybe ([t], a)
run = getParser
In addition, we'll use these basic parsers repeatedly throughout the examples to build bigger and more exotic parsers:
-- succeeds, consuming one token, as
--   long as input is not empty
getOne :: Parser s s
getOne = Parser (\xs -> case xs of 
                        (y:ys) -> pure (ys, y);
                        _      -> empty)

-- runs the parser, and if it succeeds,
--   checks that its result satisfies a predicate
check :: (a -> Bool) -> Parser s a -> Parser s a
check f p = p >>= \x -> 
  guard (f x) >> 
  pure x

-- consumes one token if the token
--   satisfies a predicate
satisfy :: (a -> Bool) -> Parser a a
satisfy p = check p getOne

-- builds a parser that only
--   matches the given token
literal :: Eq a => a -> Parser a a
literal tok = satisfy (== tok)

Alternation and failure

Alternation and failure are covered by the semigroup and monoid classes, respectively. Semigroups are characterized by an associative, binary, closed operation.

The parser interpretation of semigroups is choice: given two parsers, use the first one if it succeeds, but use the second one if the first fails.

class Semigroup' a where
  (<|>)  :: a -> a -> a

instance Semigroup' (Parser s a) where
  Parser f <|> Parser g = Parser (\xs -> f xs <|> g xs)
This implementation exploits the fact that the Maybe datatype can also form a left-biased semigroup.

Monoids are semigroups whose binary operation has an identity element; for parsers, this means that applying the choice operator to any parser plus the identity parser will always return the result of the first parser, regardless of whether it fails or succeeds. The identity parser always ignores its input and fails:

class Semigroup' a => Monoid' a where
  empty :: a

instance Monoid' (Parser s a) where
  empty = Parser (const Nothing)

Here are some examples:

-- combining two parsers with choice:  succeeds if either parser succeeds
a :: Parser Char Char
a = literal 'a'
b :: Parser Char Char
b = literal 'b'
ab :: Parser Char Char
ab = a <|> b

$ run ab "babcd"
Just ("abcd",'b')
$ run ab "abcd"
Just ("bcd",'a')

-- the empty parser always fails
fail :: Parser Char Char
fail = empty

$ run fail "abcd"
$ Nothing

-- the empty parser is both a right and a left identity
a_ :: Parser Char Char
a_ = a    <|>  fail
_a :: Parser Char Char
_a = fail <|>  a

$ run a_ "abcd"
Just ("bcd",'a')
$ run a_ "babcd"
Nothing
$ run _a "abcd"
Just ("bcd",'a')
$ run _a "babcd"
Nothing
We're not limited to combining two parsers at a time, of course; there is also the 'mconcat' combinator:
mconcat :: Monoid' a => [a] -> a
mconcat = foldr (<|>) empty

$ run (mconcat []) "abcde"
Nothing

digits :: [Parser Char Char]
digits = map literal ['0' .. '9']

$ run (mconcat digits) "4hi!!"
Just ("hi!!", '4')
'mconcat' combines a list of monoids using the binary operation, and the identity element as the base case. This means that using 'mconcat' on an empty list will generate a parser that always fails.

Success

Similarly to the parser that always fails, we have a parser that always succeeds. This is captured by the pointed class, which is the 'pure' part of the Applicative class in the standard Haskell libraries. This class allows you to lift a value into a context; for parsers, we build a parser that always succeeds, with the specified value as its result, and consuming zero tokens.

  
class Pointed' f where
  pure :: a -> f a

instance Pointed' (Parser s) where
  pure a = Parser (\xs -> Just (xs, a))

Examples:

pass :: Parser Integer String
pass = pure "Hello, world!"


$ run pass []
Just ([],"Hello, world!")

$ run pass [1,100,31]
Just ([1,100,31],"Hello, world!")
The parser 'pass' always succeeds, even with empty input; it simply returns its input token stream along with its value.

Mapping and sequencing

It's also useful to have access to a parser's value for further processing; a common use case is building up a parse tree. This concept is captured by the Functor class, which lifts a normal function to a function that operates on the result value of a parser. The parser interpretation is that, given a function and a parser, if the parser succeeds, map the function over its results; whereas if the parser fails, just propagate the failure.

class Functor' f where
  fmap :: (a -> b) -> f a -> f b

instance Functor' (Parser s) where
  -- one 'fmap' for the Maybe, one for the tuple
  fmap f (Parser g) = Parser (fmap (fmap f) . g)

The Applicative class enables not just lifting, but application in which both the function and its arguments are in contexts. It allows parsers to be run in sequence, where the first parser is run, and if it fails, the whole chain fails; if it succeeds, the rest of the token stream is passed to the next parser and its result is collected, and so on. This implementation makes use of the Monad instance of Maybe, although it could also be implemented without such an assumption.

class Functor' f => Applicative' f where
  (<*>) :: f (a -> b) -> f a -> f b

instance Applicative' (Parser s) where
  Parser f <*> Parser x = Parser h
    where
      h xs = f xs >>= \(ys, f') -> 
        x ys >>= \(zs, x') ->
        Just (zs, f' x')

Here are some examples:

one :: Parser Char Char
one = literal '1'
oneInt :: Parser Char Int
oneInt = fmap (\x -> (read :: String -> Int) [x .. '9']) one

$ run oneInt "123"
Just ("23",123456789)

two :: Parser Char Char
two = literal '2'
twelve :: Parser Char (Char, Char)
twelve = pure (,) <*> one <*> two

$ run twelve "123"
Just ("3",('1','2'))

$ run twelve "1123"
Nothing

The first example shows a Char parser ('one') that is converted into an Int parser using 'fmap' and a function of type 'Char -> Int'. The second example applies the '(,)' function within an Applicative parser context, tupling the results of the parsers 'one' and 'two'. The third example shows that parsers run in sequence must all succeed for the entire match to succeed; although the '1' is matched, the '2' cannot be.

The power of Applicative parsers can also be harnessed to create parsers that ignore the results (but not the effects!) of some or all of their parsers:

(*>) :: Parser t a -> Parser t b -> Parser t b
l *> r = fmap (flip const) l <*> r 

(<*) :: Parser t a -> Parser t b -> Parser t a
l <* r = fmap const l <*> r
Both '(*>)' and '(<*)' will only succeed if both of their arguments succeed in sequence; the difference is that '(*>)' only returns the result of the 2nd parser, while '(<*)' only returns the result of the 1st parser. Examples, using the 'one' and 'two' parsers defined above:
$ run (two *> one) "212345"
Just ("2345",'1')

$ run (two <* one) "212345"
Just ("2345",'2')

Combining Applicatives with Semigroups, we can create repeating parsers:

many :: Parser t a -> Parser t [a]
many p = some p <|> pure []

some :: Parser t a -> Parser t [a]
some p = fmap (:) p <*> many p
(note that 'some' and 'many' are mutually recursive). 'many' tries to run its parser as many times as possible, progressively chewing up input; it always succeeds since it's fine with matching 0 times. On the other hand, 'some' matches its parser at least once, failing if it can't match it at all, but other than that is identical to 'many'. Examples (using 'one' from above):
$ run (fmap length $ many one) "111111234"
Just ("234",6)
$ run (many one) "23434593475dkljdfs"
Just ("23434593475dkljdfs","")

$ run (fmap length $ one) "111111234"
Just ("234",6)
$ run (some one) "23434593475dkljdfs"
Nothing

Negations

Oftentimes, parsing conditions are easier to state in the negative than in the positive. For instance, if you were parsing a string, you might look for a double-quote character to open the string, and another double-quote to end the string. Meanwhile, anything that's *not* a double-quote which comes after the opening will be part of the string. To capture this pattern, I created the 'Switch' class:

class Switch' f where
  switch :: f a -> f ()

instance Switch' (Parser s) where
  switch (Parser f) = Parser h
    where h xs = fmap (const (xs, ())) $ switch (f xs)
This converts a failing parser to a successful one and vice versa. Importantly, it consumes no input from the token stream -- it acts as a negative lookahead parser, which allows us to build flexible parsers on top of it. Examples:
not1 :: Parser t b -> Parser t t
not1 p = switch p *> getOne

dq :: Parser Char Char
dq = literal '"'

not_dq :: Parser Char Char
not_dq = not1 dq

dq_string :: Parser Char String
dq_string = dq *> many not_dq <* dq

$ run dq_string "\"no ending double-quote"
Nothing

$ run dq_string "\"I'm a string\"abcxyz"
Just ("abcxyz","I'm a string")
The 'not1' combinator takes a parser as input, runs that parser, and if it succeeds, 'not1' fails; if that parser fails, 'not1' then tries to consume a single token (any token). In other words, it's like saying "I want anything but ".

The 'not_dq' parser matches any character that's not a double-quote; the string parser matches a double-quote followed by any number of non-double-quotes, followed by another double-quote; it throws away the results of both double-quote parsers, only returning the body of the string.

Running many parsers in sequence

Traversable is an interesting type class. It allows you to 'commute' two functors; i.e. if you have '[Maybe Int]', it allows you to create 'Maybe [Int]' (that is, turn a list of 'Maybe Int's into a 'Maybe' list of Ints. This is also useful for parsing, where it allows one to convert a list of parsers into a (single) parser of lists. In this case, we don't need to supply an instance for 'Parser' because the Functor in question is lists:
class Functor' t => Traversable' t where
  commute :: (Pointed' f, Applicative' f) => t (f a) -> f (t a)
Here are some examples (using 'digits' from above):
six_fours :: [Parser Char Char]
six_fours = replicate 6 (literal '4')

$ run (commute digits) "0123456789abcxyz"
Just ("abcxyz","0123456789")

$ run (commute six_fours) "4444449999999"
Just ("9999999","444444")
$ run (commute six_fours) "44444 oops that was only 5 fours"
Nothing

Monads

What parsing article could be complete without mentioning monads? Monads are similar to applicatives, but add the extra ability to have computations depend on the result of previous computations. Here's the class definition and parser implementation:

class (Applicative' m, Pointed' m) => Monad' m where
  join :: m (m a) -> m a 

instance Monad' (Parser s) where
  join (Parser f) = Parser h
    where
      h xs = f xs >>= \(o, Parser g) -> g o  
A good example of putting this extra power to work is this combinator:
twice :: Eq a => Parser a a -> Parser a a
twice p = p >>= \x ->
  literal x
It runs its input parser, and if it succeeds, attempts to match the *same* output a second time. Thus, the second match depends on the results of the first. We can't build such a parser using applicatives (although we can build less general versions by enumerating multiple cases). Here's an example showing how it's different from an Applicative version, using the 'ab' parser from earlier:
ab_twice :: Parser Char Char
ab_twice = twice ab

-- using monads
$ run ab_twice "aa123"
Just ("123",'a')
$ run ab_twice "ab123"
Nothing

-- using applicatives
$ run (pure (,) <*> ab <*> ab) "aa123"
Just ("123",('a','a'))
$ run (pure (,) <*> ab <*> ab) "ab123"
Just ("123",('a','b'))
In the first example, which uses monadic parsing, 'ab_twice' parses the first input and fails on the second. However, the second example -- with applicatives -- successfully parses both inputs. It sees the two parsers as being totally independent of each other and thus isn't able to require that the second one match the same tokens as the first one.

Relationship to BNF grammars, regular expressions, etc.

Of course, all of these useful parsing combinators have also been applied in other parsing approaches, such as grammars and regular expressions. Here's a quick correspondence:
BNF/regex combinators
| <|> of semigroups
sequencing <*> of applicatives
* many
+ some
grouping always explicitly grouped

What's next & further reading

There are a few topics that weren't covered in this article. First and foremost, good error detection and reporting is a key component of a parser library that's friendly and easy to use. Second, although I chose to use the Maybe data type to model the results, this could be extended to use any arbitrary monad -- resulting in a much richer set of parsers. Two examples are the list monad, to allow non-deterministic parses, and the state monad, two allow context-dependent parses.

If you're interested in learning more about parsing, Philip Wadler, Graham Hutton, and Doaitse Swierstra have published some excellent papers over the years on the topic; reading their papers was what really helped me to understand parsing. And of course there's also the powerful Parsec tool, a Haskell-based library for parser combinators which illustrates these ideas in a practical context.