Tuesday, March 4, 2014

Simplifying a formal grammar using transformations and BNF extensions

Integer literals

A BNF grammar is a great way to specify a language, due to its concise and declarative nature. However, one major limitation of BNF is that it's not extendable -- which means if you identify a common pattern, you can't factor it out. You're stuck with repeating the boilerplate everywhere it occurs.

For an example of boilerplate making a grammar far more difficult to understand, let's take a look at the Java Language Specification While the specification is formal, it's rather verbose and it's difficult to intuitively see what language the grammar really generates -- especially the corner cases:

DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit Underscores Digits 

Digits:
    Digit
    Digit DigitsAndUnderscores(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitsAndUnderscores:
    DigitOrUnderscore
    DigitsAndUnderscores DigitOrUnderscore 

DigitOrUnderscore:
    Digit
    _

Underscores:
    _
    Underscores _
That's 7 rules on 21 lines of code (27 if you include blank lines), and describes the syntax for base 10 integer literals.

The challenge: can we transform this grammar into one that generates the same language, but is clearer and more concise? (We are allowed to extend BNF with additional operators, if necessary)

Three simple transformations

Repetitions: + quantifier

First, let's borrow some regex notation -- whenever we see this pattern:
RuleName:
    terminal
    RuleName  terminal
that means there's one or more repetion of "terminal", so we can instead use the "+" quantifier:
RuleName:
    terminal(+)
That pattern appears in two places, and after substituting our new shorthand for both of them, we now have:
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit Underscores Digits 

Digits:
    Digit
    Digit DigitsAndUnderscores(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitsAndUnderscores:
    DigitOrUnderscore(+)

DigitOrUnderscore:
    Digit
    _

Underscores:
    _(+)

Factoring in

I use the "factor in" transformation when a rule is:
  1. only used once
  2. very simple
  3. given a semantically void name such as "Underscores"
Here's the basic idea of factoring in (it's the opposite of factoring out repeated code):
// before -- need to look up `SubRule1` in order to understand `Rule1`
Rule1: 
    SubRule1

SubRule1:
    ... some pattern ...

// after -- no additional rule to look up, also shorter
Rule1:
    ... some pattern ...
Applying this transformation to "DigitsAndUnderscores" and "Underscores" yields:
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitOrUnderscore:
    Digit
    _
Which saves 4 more lines, and two meaningless names. This transformation can be easily abused, leading to grammars that are more difficult to read instead of less so. The trick is to decide when "factoring in" clarifies the grammar and when it obfuscates.

Character classes

This borrows more regex short-hand -- square bracket notation and character ranges. It means the rule must match any one of the characters in the given range. I'll apply it to shorten "NonZeroDigit":
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit:
    [1-9]

DigitOrUnderscore:
    Digit
    _
Okay, that didn't help much yet -- but we'll use it again shortly.

The plot thickens

Now we can start reusing those transformations we just covered. First, let's factor in "NonZeroDigit":

DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    [1-9]

DigitOrUnderscore:
    Digit
    _
Now, combine "Digit"s two alternatives, using the square bracket notation, factor it in to "DigitOrUnderscore", and then combine "DigitOrUnderscore"s two alternatives:
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    [0-9]

DigitOrUnderscore:
    [0-9_]
Now factor in both "Digit" and "DigitOrUnderscore":
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](+)(?) [0-9]
The quantifiers "+" and "?", when used together, mean the same as "*":
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](*) [0-9]
And we can get rid of the "?" quantifier using its definition, splitting an alternative into two:
DecimalNumeral:
    0
    [1-9]
    [1-9] Digits
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](*) [0-9]

Closing steps

Let's now factor in "Digits" -- but we'll have to be careful since it has two alternative rules. This means the factored-in result we'll have two alternatives wherever "Digits" is replaced:
DecimalNumeral:
    0
    [1-9]
    [1-9] [0-9]
    [1-9] [0-9] [0-9_](*) [0-9]
    [1-9] _(+) [0-9] 
    [1-9] _(+) [0-9] [0-9_](*) [0-9] 
And now let's combine the first two alternatives:
DecimalNumeral:
    [0-9]
    [1-9] [0-9]
    [1-9] [0-9] [0-9_](*) [0-9]
    [1-9] _(+) [0-9] 
    [1-9] _(+) [0-9] [0-9_](*) [0-9] 
The final transformation requires a bit of intuition -- notice that underscores are only allowed in the interior of the numeral, never on the edges. Let's combine the last 4 alternatives:
DecimalNumeral:
    [0-9]
    [1-9] [0-9_](*) [0-9]
And ... voila! We've gone from 21 lines, down to 3 concise and simple ones. This makes it easier to see common error cases -- for example, a number cannot end with an underscore. Try it out on your Java compiler!

Conclusion

The length and complexity of the original grammar caused quite a few problems:

  • more code means more chances for error -- misinterpretations, typos, omissions, etc.
  • we couldn't tell what the code was saying. This makes it tough to write effective tests
  • semantically void names -- "DigitsAndUnderscores" -- were distracting
  • corner cases were not obvious
  • difficult to maintain in the face of change
To deal with this, we defined several transformations, as well as extended BNF with a couple of new operators borrowed from regular expressions:
  • the "+" repetition quantifier -- one or more
  • square bracket notation and character ranges
  • factor in simple rules, one-off rules, and poor names
We then used these transformations and extensions to transform the grammar, creating a clearer, more concise grammar.

No comments:

Post a Comment