Wednesday, July 3, 2013

Parsing the NMR-Star format

The NMR-Star format

NMR, or Nuclear Magnetic Resonance, is a technique for studying molecules at the molecular level; it is also the technology behind NMR machines. The Biological Magnetic Resonance Data Bank is a great resource for archived NMR data, which can be access in text files using the NMR-Star format.

However, the NMR-Star format has a number of corner cases that I had to solve while writing the NMR-Star parser which were quite tricky. This article will discuss the problems and their solutions. The full code can be found on github in my NMR-Star parser project.

The parsing strategy

I used parser combinators supporting error reporting, backtracking, and line/column position, and added a a couple of extra phases for clean-up:

  • scanner
  • token clean-up. Discards whitespace and comment tokens, and classifies unquoted strings as keywords or values
  • CST parsing. Builds a concrete syntax tree from the token sequence
  • AST construction. Checks context-sensitive constraints while building an abstract syntax tree

Problem: whitespace/comments between tokens and semicolon-delimited strings

The NMR-Star format allows insignificant whitespace and comments between tokens. A standard parsing solution is to discard whitespace and comments after every token. However, the NMR-Star language defines semicolon-delimited strings to be opened by the sequence "newline, semicolon" and ended by the same sequence, which breaks the "munch junk after every token" rule, since newlines would be counted as whitespace and be discarded by the time a semicolon-string is tried. For example:

  ;abc  # <-- not a semicolon-string because preceding character is not a newline

;abc  # <-- semicolon-string because preceding character *is* a newline
def
; 

In my first solution to this problem, I pre-munched junk before every token. Semicolon-delimited strings got a special munching rule, which verified that the last junk character before the opening semicolon was indeed a newline. However, this had a couple disadvantages: 1) position reporting was much more difficult than with post-munching, since junk had to be parsed before finding the beginning of the next token, and 2) it was a special case, which complicated the specification and implementation.

My second solution took advantage of the position information -- that the opening semicolon immediately follows a newline implies that it is in column 1 of its line. So, the parser must inspect the position before deciding to try parsing a semicolon-delimited string or not. This allowed one single rule for insignificant whitespace and comments, and also allowed me to use post-munching.

def _sc_rest(position):
    _, column = position
    # a semicolon-delimited string must be preceded by a newline -- thus, column must be 1
    if column == 1:
        return node('scstring',
                    ('open', sc),
                    ('value', many0(not1(_end_sc))),
                    ('close', cut('newline-semicolon', _end_sc)))
    return zero
    
scstring = bind(getState, _sc_rest)

Problem: semicolons inside semicolon-delimited strings

Since the newline/semicolon sequence closes semicolon-delimited strings, this means that semicolons can not be the first character on a line within a semicolon-string. For example:


; this ; starts the string
 this ; is *within* the string and does not end it
; # this ; ends the string, because it follows a newline
; # which means this is an extraneous ;
This problem is solved with a single character of lookahead: while parsing a semicolon-string, if a newline is encountered and the next character is a semicolon, end the string; otherwise, continue parsing the semicolon-string.

Problem: non-ending quotes of quoted values

NMR-Star has double-quoted values:

"abc 123"
Double-quotes can be within the string, as long as they are not followed by whitespace or EOF. Thus, this is a single, valid double-quoted value, since the second '"' is followed by '1':
"abc"123"
This is easily solved using a single character of lookahead: '"' followed by whitespace, end the string; '"' followed by not whitespace, consume the '"' and continue parsing the string.

NMR-Star also has single-quote values which work in the same way.

Problem: keyword vs. unquoted value

The lexical definitions of keywords and unquoted values in the original Star grammar is ambiguous; this is resolved with an additional rule stating that unquoted values can not be keywords. For example:

loop_  # <- 'loop_' is a keyword, so it can not be an unquoted value
 _a _b
 loop_1  # <- 'loop_1' is not a keyword, so it can be an unquoted value
 loop_2 
stop_
An alternate definition, and the one which I used in my parser, is to match the longest string of the unquoted value pattern; then classify the string as either a keyword or an unquoted value in the token clean-up phase. This avoids unnecessary grammatical ambiguity and more clearly captures the intent of the specification.

unquoted = node('unquoted',
                ('first', not1(special)),
                ('rest', many0(not1(space))))

Problem: context sensitive rules

There are a number of constructs in the NMR-Star format that are context-sensitive, meaning that they cannot be parsed correctly using a context-free grammar/parser:

  • no duplicate save frame names
  • no duplicate identifiers in save frames
  • no duplicate identifiers in loops
  • framecode references
  • the number of values in a loop must be an integer multiple of the number of identifiers
For example, the following loop follows the context-free rules but has duplicate keys:
  loop_

    _a _b _a

    v1 v2 v3
    v4 v5 v6

  stop_
What we'd like to have the parser do is report an error including location in the input and the nature of the violation.

One way to solve this problem is to use a context-sensitive grammar formalism; however, I've never tried this approach. One important disadvantage is that it would prevent context-free parsing of files that violate the context-sensitive rules -- which may be a pain if you have invalid data.

A second approach is to parse the input according to the context-free rules and generate a concrete syntax tree. Then, run a second pass over the CST to produce the abstract syntax tree. The second pass is now responsible for implementing the context-sensitive rules. This is the method I used and seems to work well in practice. It also seems to be easier to maintain due to better separation of concerns.

Wrap up

As far as formats go, the NMR-Star format is relatively clear and succinct. However, as I've shown, it does have a few pitfalls for the unwary, as well as some redundancies.

No comments:

Post a Comment