Thursday, March 6, 2014

Four different ways to inspect types and prototypes in Javascript

What is a type in Javascript?

It's hard to succinctly and accurately define `type` in Javascript. A couple of complications are Javascript's dynamic typing (meaning that variables don't have types), and that the tools that Javascript provides don't unambiguously and uniquely divide values in separate groups (meaning that there's overlap and differences between the various ways, and no one single way is more "right" than the others). Given these difficulties, let's forget about coming up with an unambiguous definition of "type" and also forget about what "type" means in other languages. Instead, let's just say that for the purposes of this article, "type" will mean "some way to group objects based on certain similar properties".

What are some of the properties that we can use to group Javascript's values? First, we can look at whether a value is an object or a primitive. We can further divide and group the objects by their prototype chains or constructors. We can divide the primitives into smaller groups based on their primitive types.

However, as I mentioned earlier, there are alternative, meaningful ways to group values. For instance, objects with identical prototype chains can be separated (`arguments` vs `{}`). There is also some overlap between primitives and certain objects -- some primitives are automatically converted to objects under certain circumstances, similar to Java's autoboxing.

This article will take a close look at Javascript's type-inspecting tools, and the different notions of "type" that they provide. Using a small amount of code -- and a large set of test cases -- we'll gather some data about each of the tools. This data will give us insight into the pros and cons of each tool -- and perhaps help to indicate when each should be used.

What are the tools at our disposal?

Object.getPrototypeOf

Can be used recursively to get the full prototype chain of an object. See the MDN docs. Related: Object.prototype.isPrototypeOf.

Object.prototype.toString

Used by Underscore for type inspection.

instanceof

Performs inheritance checks which respect the prototype chain.

typeof

Mostly useful for distinguishing between primitives and objects.

Array.isArray

A special method for checking if an object is an array. (Since this kind of method only exists for arrays, I won't use it in the rest of this article).

The code

We want to inspect as many different kinds of values as possible -- so let's make sure that we have each of the primitives, each of the commonly-used built-in object types, functions, user-defined types, `arguments`, object wrappers for primitives, and the root object for good measure. Remember, for each of these example values, we'll try each of the four previously-mentioned tools for inspecting types.

For each example, there's a meaningful string for display purposes, an expression that will evaluate to the desired value, and also a constructor function that we'll use for an `instanceof` check. The last part is pretty arbitrary -- I just use it to show that a given value can satisfy `instanceof` for multiple different constructors.

// put examples inside a function to get access to `arguments`
function getExamples() {
    var functionText = "new Function('x', 'return x + 1;')";
    return [
        // schema:  
        //   0: human-readable text
        //   1: expression to be inspected
        //   2: (optional) constructor for instanceof-checking
        ['undefined'        , undefined                         , null     ],
        ['null'             , null                              , null     ],
        ["'abc'"            , 'abc'                             , String   ],
        ["new String('abc')", new String('abc')                 , String   ],
        ['123'              , 123                               , Number   ],
        ['new Number(123)'  , new Number(123)                   , Number   ],
        ['Infinity'         , Infinity                          , Number   ],
        ['NaN'              , NaN                               , Number   ],
        ['true'             , true                              , Boolean  ],
        ['new Boolean(true)', new Boolean(true)                 , Boolean  ],
        ['function g(x) {}' , function g(x) {}                  , Function ],
        [functionText       , new Function('x', 'return x + 1;'), Function ],
        ["{'a': 1, 'b': 2}" , {'a': 1, 'b': 2}                  , null     ],
        ['new Object()'     , new Object()                      , null     ],
        ['new ObjectExt()'  , new ObjectExt()                   , ObjectExt],
        ['[1, 2, 3]'        , [1, 2, 3]                         , Array    ],
        ['new Array()'      , new Array()                       , Array    ],
        ['new ArrayExt()'   , new ArrayExt()                    , Array    ],
        ['/a/'              , /a/                               , RegExp   ],
        ["new RegExp('a')"  , new RegExp('a')                   , RegExp   ],
        ['new RegExpExt()'  , new RegExpExt()                   , RegExp   ],
        ['new Date()'       , new Date()                        , Date     ],
        ["new Error('!')"   , new Error('!')                    , Error    ],
        ['Math'             , Math                              , null     ],
        ['JSON'             , JSON                              , null     ],
        ['arguments'        , arguments                         , null     ],
        ['this'             , this /* the root object, right? */, Window   ]
    ];
}
Here's the code for setting up the three user-defined constructors. Each of Array, Object, and RegExp are extended:
// extend Array
function ArrayExt() {}
ArrayExt.prototype = [1, 2, 3];

// extend Object 
function ObjectExt() {}
 
// extend RegExp
function RegExpExt() {}
RegExpExt.prototype = /matt/;
Finally, the function used to grab an object's prototype chain. This throws an exception if `obj` is a primitive:
function getParents(obj) {
    var parents = [],
        par = obj;
    while ( true ) {
        par = Object.getPrototypeOf(par);
        if ( par === null ) {
            break;
        }
        parents.push(par);
    }
    return parents;
}

The data

Now we take each of the example values, and apply each of the four tests to it -- plus an extra `instanceof` test to show inheritance. For each expression "e", we'll do:
typeof e

Object.prototype.toString.call(e)

e instanceof Object

e instanceof [subtype]

getParents(e)
Here are the results. Especially surprising, inconsistent, and strange results are in red:
example typeof Object.prototype.toString instanceof Object instanceof subtype prototype chain
undefined undefined [object Undefined] false -- --
null object [object Null] false -- --
'abc' string [object String] false String: false --
new String('abc') object [object String] true String: true String,Object
123 number [object Number] false Number: false --
new Number(123) object [object Number] true Number: true Number,Object
Infinity number [object Number] false Number: false --
NaN number [object Number] false Number: false --
true boolean [object Boolean] false Boolean: false --
new Boolean(true) object [object Boolean] true Boolean: true Boolean,Object
function g(x) {} function [object Function] true Function: true Function,Object
new Function('x', 'return x + 1;') function [object Function] true Function: true Function,Object
{'a': 1, 'b': 2} object [object Object] true -- Object
new Object() object [object Object] true -- Object
new ObjectExt() object [object Object] true ObjectExt: true ObjectExt,Object
[1, 2, 3] object [object Array] true Array: true Array,Object
new Array() object [object Array] true Array: true Array,Object
new ArrayExt() object [object Object] true Array: true ArrayExt,Array,Object
/a/ object [object RegExp] true RegExp: true RegExp,Object
new RegExp('a') object [object RegExp] true RegExp: true RegExp,Object
new RegExpExt() object [object Object] true RegExp: true RegExpExt,RegExp,Object
new Date() object [object Date] true Date: true Date,Object
new Error('!') object [object Error] true Error: true Error,Object
Math object [object Math] true -- Object
JSON object [object JSON] true -- Object
arguments object [object Arguments] true -- Object
this object [object global] true Window: true Window,EventTarget,Object
Notes:
  • these results were obtained in Firefox, Javascript 1.5; Chrome, Javascript 1.7
  • the results in the last row vary by implementation

The analysis

typeof

`typeof` distinguishes between primitives and objects. However:
  • `typeof null` is "object"
  • it returns "function" for functions, even though they are objects -- this is not wrong, just misleading
  • for String, Number, and Boolean: `typeof` return "object" for wrapped values
  • it doesn't distinguish between different objects -- arrays, dates, regexps, user-defined, etc.: all are just "object"

instanceof

`instanceof` checks whether a constructor's prototype property is in an object's prototype chain. However:
  • the 2nd argument must be a constructor. The constructor is used to look up a prototype. This is a problem if creating objects with `Object.create` -- there is no constructor function (that you have access to).
  • may not work if there are objects moving across frames or windows
  • different results for corresponding objects and primitives
  • doesn't tell you what the prototypes actually are

Object.prototype.toString.call(...)

This seems to differentiate between the built-ins correctly. See this for more information. However:
  • it doesn't differentiate between corresponding objects and primitives
  • it reports all primitives as objects
  • doesn't seem to work for user-defined constructors and objects. Apparently, it depends on an internal [[Class]] property which can't be touched, according to this.

prototype chain using Object.getPrototypeOf

Gets the prototype objects. However:
  • can't distinguish `arguments` from `Math`, `JSON`, and other objects. In fact, can't distinguish between any objects that share the same prototype chain.
  • doesn't work on primitives -- even those which have corresponding objects
  • may fail for passing objects between windows/frames -- like `instanceof` -- (not sure)

Conclusion

It appears that none of these approaches is capable of dealing with this problem by itself. I'm not even sure if it's possible to come up with a 100% accurate and unbreakable method for classifying Javascript values. However, using a combination of these tools will probably get you most of the way there. Good luck!

Tuesday, March 4, 2014

Simplifying a formal grammar using transformations and BNF extensions

Integer literals

A BNF grammar is a great way to specify a language, due to its concise and declarative nature. However, one major limitation of BNF is that it's not extendable -- which means if you identify a common pattern, you can't factor it out. You're stuck with repeating the boilerplate everywhere it occurs.

For an example of boilerplate making a grammar far more difficult to understand, let's take a look at the Java Language Specification While the specification is formal, it's rather verbose and it's difficult to intuitively see what language the grammar really generates -- especially the corner cases:

DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit Underscores Digits 

Digits:
    Digit
    Digit DigitsAndUnderscores(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitsAndUnderscores:
    DigitOrUnderscore
    DigitsAndUnderscores DigitOrUnderscore 

DigitOrUnderscore:
    Digit
    _

Underscores:
    _
    Underscores _
That's 7 rules on 21 lines of code (27 if you include blank lines), and describes the syntax for base 10 integer literals.

The challenge: can we transform this grammar into one that generates the same language, but is clearer and more concise? (We are allowed to extend BNF with additional operators, if necessary)

Three simple transformations

Repetitions: + quantifier

First, let's borrow some regex notation -- whenever we see this pattern:
RuleName:
    terminal
    RuleName  terminal
that means there's one or more repetion of "terminal", so we can instead use the "+" quantifier:
RuleName:
    terminal(+)
That pattern appears in two places, and after substituting our new shorthand for both of them, we now have:
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit Underscores Digits 

Digits:
    Digit
    Digit DigitsAndUnderscores(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitsAndUnderscores:
    DigitOrUnderscore(+)

DigitOrUnderscore:
    Digit
    _

Underscores:
    _(+)

Factoring in

I use the "factor in" transformation when a rule is:
  1. only used once
  2. very simple
  3. given a semantically void name such as "Underscores"
Here's the basic idea of factoring in (it's the opposite of factoring out repeated code):
// before -- need to look up `SubRule1` in order to understand `Rule1`
Rule1: 
    SubRule1

SubRule1:
    ... some pattern ...

// after -- no additional rule to look up, also shorter
Rule1:
    ... some pattern ...
Applying this transformation to "DigitsAndUnderscores" and "Underscores" yields:
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitOrUnderscore:
    Digit
    _
Which saves 4 more lines, and two meaningless names. This transformation can be easily abused, leading to grammars that are more difficult to read instead of less so. The trick is to decide when "factoring in" clarifies the grammar and when it obfuscates.

Character classes

This borrows more regex short-hand -- square bracket notation and character ranges. It means the rule must match any one of the characters in the given range. I'll apply it to shorten "NonZeroDigit":
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit:
    [1-9]

DigitOrUnderscore:
    Digit
    _
Okay, that didn't help much yet -- but we'll use it again shortly.

The plot thickens

Now we can start reusing those transformations we just covered. First, let's factor in "NonZeroDigit":

DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    [1-9]

DigitOrUnderscore:
    Digit
    _
Now, combine "Digit"s two alternatives, using the square bracket notation, factor it in to "DigitOrUnderscore", and then combine "DigitOrUnderscore"s two alternatives:
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    [0-9]

DigitOrUnderscore:
    [0-9_]
Now factor in both "Digit" and "DigitOrUnderscore":
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](+)(?) [0-9]
The quantifiers "+" and "?", when used together, mean the same as "*":
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](*) [0-9]
And we can get rid of the "?" quantifier using its definition, splitting an alternative into two:
DecimalNumeral:
    0
    [1-9]
    [1-9] Digits
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](*) [0-9]

Closing steps

Let's now factor in "Digits" -- but we'll have to be careful since it has two alternative rules. This means the factored-in result we'll have two alternatives wherever "Digits" is replaced:
DecimalNumeral:
    0
    [1-9]
    [1-9] [0-9]
    [1-9] [0-9] [0-9_](*) [0-9]
    [1-9] _(+) [0-9] 
    [1-9] _(+) [0-9] [0-9_](*) [0-9] 
And now let's combine the first two alternatives:
DecimalNumeral:
    [0-9]
    [1-9] [0-9]
    [1-9] [0-9] [0-9_](*) [0-9]
    [1-9] _(+) [0-9] 
    [1-9] _(+) [0-9] [0-9_](*) [0-9] 
The final transformation requires a bit of intuition -- notice that underscores are only allowed in the interior of the numeral, never on the edges. Let's combine the last 4 alternatives:
DecimalNumeral:
    [0-9]
    [1-9] [0-9_](*) [0-9]
And ... voila! We've gone from 21 lines, down to 3 concise and simple ones. This makes it easier to see common error cases -- for example, a number cannot end with an underscore. Try it out on your Java compiler!

Conclusion

The length and complexity of the original grammar caused quite a few problems:

  • more code means more chances for error -- misinterpretations, typos, omissions, etc.
  • we couldn't tell what the code was saying. This makes it tough to write effective tests
  • semantically void names -- "DigitsAndUnderscores" -- were distracting
  • corner cases were not obvious
  • difficult to maintain in the face of change
To deal with this, we defined several transformations, as well as extended BNF with a couple of new operators borrowed from regular expressions:
  • the "+" repetition quantifier -- one or more
  • square bracket notation and character ranges
  • factor in simple rules, one-off rules, and poor names
We then used these transformations and extensions to transform the grammar, creating a clearer, more concise grammar.