Integer literals
A BNF grammar is a great way to specify a language, due to its concise and declarative nature. However, one major limitation of BNF is that it's not extendable -- which means if you identify a common pattern, you can't factor it out. You're stuck with repeating the boilerplate everywhere it occurs.
For an example of boilerplate making a grammar far more difficult to understand, let's take a look at the Java Language Specification While the specification is formal, it's rather verbose and it's difficult to intuitively see what language the grammar really generates -- especially the corner cases:
DecimalNumeral: 0 NonZeroDigit Digits(?) NonZeroDigit Underscores Digits Digits: Digit Digit DigitsAndUnderscores(?) Digit Digit: 0 NonZeroDigit NonZeroDigit: one of 1 2 3 4 5 6 7 8 9 DigitsAndUnderscores: DigitOrUnderscore DigitsAndUnderscores DigitOrUnderscore DigitOrUnderscore: Digit _ Underscores: _ Underscores _That's 7 rules on 21 lines of code (27 if you include blank lines), and describes the syntax for base 10 integer literals.
The challenge: can we transform this grammar into one that generates the same language, but is clearer and more concise? (We are allowed to extend BNF with additional operators, if necessary)
Three simple transformations
Repetitions: + quantifier
First, let's borrow some regex notation -- whenever we see this pattern:RuleName: terminal RuleName terminalthat means there's one or more repetion of "terminal", so we can instead use the "+" quantifier:
RuleName: terminal(+)That pattern appears in two places, and after substituting our new shorthand for both of them, we now have:
DecimalNumeral: 0 NonZeroDigit Digits(?) NonZeroDigit Underscores Digits Digits: Digit Digit DigitsAndUnderscores(?) Digit Digit: 0 NonZeroDigit NonZeroDigit: one of 1 2 3 4 5 6 7 8 9 DigitsAndUnderscores: DigitOrUnderscore(+) DigitOrUnderscore: Digit _ Underscores: _(+)
Factoring in
I use the "factor in" transformation when a rule is:- only used once
- very simple
- given a semantically void name such as "Underscores"
// before -- need to look up `SubRule1` in order to understand `Rule1` Rule1: SubRule1 SubRule1: ... some pattern ... // after -- no additional rule to look up, also shorter Rule1: ... some pattern ...Applying this transformation to "DigitsAndUnderscores" and "Underscores" yields:
DecimalNumeral: 0 NonZeroDigit Digits(?) NonZeroDigit _(+) Digits Digits: Digit Digit DigitOrUnderscore(+)(?) Digit Digit: 0 NonZeroDigit NonZeroDigit: one of 1 2 3 4 5 6 7 8 9 DigitOrUnderscore: Digit _Which saves 4 more lines, and two meaningless names. This transformation can be easily abused, leading to grammars that are more difficult to read instead of less so. The trick is to decide when "factoring in" clarifies the grammar and when it obfuscates.
Character classes
This borrows more regex short-hand -- square bracket notation and character ranges. It means the rule must match any one of the characters in the given range. I'll apply it to shorten "NonZeroDigit":DecimalNumeral: 0 NonZeroDigit Digits(?) NonZeroDigit _(+) Digits Digits: Digit Digit DigitOrUnderscore(+)(?) Digit Digit: 0 NonZeroDigit NonZeroDigit: [1-9] DigitOrUnderscore: Digit _Okay, that didn't help much yet -- but we'll use it again shortly.
The plot thickens
Now we can start reusing those transformations we just covered. First, let's factor in "NonZeroDigit":
DecimalNumeral: 0 [1-9] Digits(?) [1-9] _(+) Digits Digits: Digit Digit DigitOrUnderscore(+)(?) Digit Digit: 0 [1-9] DigitOrUnderscore: Digit _Now, combine "Digit"s two alternatives, using the square bracket notation, factor it in to "DigitOrUnderscore", and then combine "DigitOrUnderscore"s two alternatives:
DecimalNumeral: 0 [1-9] Digits(?) [1-9] _(+) Digits Digits: Digit Digit DigitOrUnderscore(+)(?) Digit Digit: [0-9] DigitOrUnderscore: [0-9_]Now factor in both "Digit" and "DigitOrUnderscore":
DecimalNumeral: 0 [1-9] Digits(?) [1-9] _(+) Digits Digits: [0-9] [0-9] [0-9_](+)(?) [0-9]The quantifiers "+" and "?", when used together, mean the same as "*":
DecimalNumeral: 0 [1-9] Digits(?) [1-9] _(+) Digits Digits: [0-9] [0-9] [0-9_](*) [0-9]And we can get rid of the "?" quantifier using its definition, splitting an alternative into two:
DecimalNumeral: 0 [1-9] [1-9] Digits [1-9] _(+) Digits Digits: [0-9] [0-9] [0-9_](*) [0-9]
Closing steps
Let's now factor in "Digits" -- but we'll have to be careful since it has two alternative rules. This means the factored-in result we'll have two alternatives wherever "Digits" is replaced:DecimalNumeral: 0 [1-9] [1-9] [0-9] [1-9] [0-9] [0-9_](*) [0-9] [1-9] _(+) [0-9] [1-9] _(+) [0-9] [0-9_](*) [0-9]And now let's combine the first two alternatives:
DecimalNumeral: [0-9] [1-9] [0-9] [1-9] [0-9] [0-9_](*) [0-9] [1-9] _(+) [0-9] [1-9] _(+) [0-9] [0-9_](*) [0-9]The final transformation requires a bit of intuition -- notice that underscores are only allowed in the interior of the numeral, never on the edges. Let's combine the last 4 alternatives:
DecimalNumeral: [0-9] [1-9] [0-9_](*) [0-9]And ... voila! We've gone from 21 lines, down to 3 concise and simple ones. This makes it easier to see common error cases -- for example, a number cannot end with an underscore. Try it out on your Java compiler!
Conclusion
The length and complexity of the original grammar caused quite a few problems:
- more code means more chances for error -- misinterpretations, typos, omissions, etc.
- we couldn't tell what the code was saying. This makes it tough to write effective tests
- semantically void names -- "DigitsAndUnderscores" -- were distracting
- corner cases were not obvious
- difficult to maintain in the face of change
- the "+" repetition quantifier -- one or more
- square bracket notation and character ranges
- factor in simple rules, one-off rules, and poor names
No comments:
Post a Comment