Thursday, March 31, 2016

Hudl Technique and Core Data: Two Worlds Collide

For the past year-and-a-half, I've been working on the Hudl Technique (formerly Ubersense; formerly SwingReader) iOS app. Hudl Technique is an app for recording, playing, analyzing, and sharing high-quality video to help you get better at your sport of choice!

About a year ago, we migrated the app from using leveldb to using core data for on-device data storage. This helped out a ton with some performance issues we were seeing, which were especially bad for power users -- but as you can imagine, moving to core data came with some issues of its own.

I learned so much about core data from this experience, and I'd like to share some of that knowledge -- I'm going to describe how our core data stack works, walk through issues we ran into and how we solved them, and show several examples.

Core data rules

Real quick, let's go over the core data threading rules: a managed object (MO) or managed object context (MOC) may only be accessed from the single thread (or queue) associated with that MOC. In practice, this means that whenever you use core data, you must be aware of what thread you're on; there's one right thread, and lots of wrong ones.

Later in this article, we'll talk about how to follow these rules, and how to determine if you're following them correctly.

Our core data stack

We have one managed object context (MOC) connected to the persistent store coordinator (PSC). We'll call this the master MOC. This is a long-lived MOC, meaning that it's created when the app starts up, and hangs around until the app is terminated. It uses the NSPrivateQueueConcurrencyType, so it doesn't block the main thread.

We have a second long-lived MOC which we call the main MOC. We call it this because it uses the NSMainQueueConcurrencyType, and it's intended to be used from code which interacts with the UI. This MOC is a child of the master MOC.

We also have methods for creating additional child MOCs. These MOCs can use either NSPrivateQueueConcurrencyType or NSMainQueueConcurrencyType -- depending on whether they're going to be used from UI code or not. These are also children of the master MOC, but unlike the main MOC, they're not long-lived. They're intended to be used and then quickly discarded. They can also be used as scratchpads: you get one of these MOCs, make some changes, then change your mind and wish to just discard your changes -- no problem, all you have to do is just lose your references to the MOC!

In summary, there's one master MOC, which is long-lived and points to the PSC. There's lots of child MOCs, and of those, just one -- the main MOC -- is long-lived. The master MOC is private, so client code isn't able to use the master MOC directly -- clients must either use the main MOC, or create a new child MOC.
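Concretely, a stack along these lines can be set up like so (this is just a sketch with hypothetical names and no error handling, not our exact code):

#import <CoreData/CoreData.h>

@interface CoreDataStack : NSObject
@property (nonatomic, strong, readonly) NSManagedObjectContext *mainMOC;
- (instancetype)initWithPersistentStoreCoordinator:(NSPersistentStoreCoordinator *)psc;
- (NSManagedObjectContext *)newChildMOCWithConcurrencyType:(NSManagedObjectContextConcurrencyType)type;
@end

@implementation CoreDataStack {
    NSManagedObjectContext *_masterMOC;   // private: client code never sees this one
}

- (instancetype)initWithPersistentStoreCoordinator:(NSPersistentStoreCoordinator *)psc {
    if ((self = [super init])) {
        // The master MOC: private queue, attached directly to the PSC.
        _masterMOC = [[NSManagedObjectContext alloc] initWithConcurrencyType:NSPrivateQueueConcurrencyType];
        _masterMOC.persistentStoreCoordinator = psc;

        // The main MOC: main queue, child of the master, for code that touches the UI.
        _mainMOC = [[NSManagedObjectContext alloc] initWithConcurrencyType:NSMainQueueConcurrencyType];
        _mainMOC.parentContext = _masterMOC;
    }
    return self;
}

- (NSManagedObjectContext *)newChildMOCWithConcurrencyType:(NSManagedObjectContextConcurrencyType)type {
    // Short-lived scratchpad MOCs are also children of the master.
    NSManagedObjectContext *child = [[NSManagedObjectContext alloc] initWithConcurrencyType:type];
    child.parentContext = _masterMOC;
    return child;
}

@end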

Saving your changes

Let's talk about how information moves between MOCs. The default behavior in core data is that, when you save a child MOC, that only pushes the changes up to its parent MOC, but doesn't push those into the PSC. This is confusing at first, because you're saving your changes, but when you quit and restart the app, your data is gone! To deal with this, we have a little bit of magic that kicks in whenever you save a child MOC: saving a child MOC changes the master MOC, and whenever the master MOC gets changed, we automatically save it, which pushes the changes out to the PSC and ensures that they'll get persisted.
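In code, the effect of that magic looks roughly like this -- a simplified, explicit sketch written as a method on the hypothetical CoreDataStack from above (in the app, the master MOC save is triggered automatically rather than chained by hand):

- (void)saveChildMOC:(NSManagedObjectContext *)childMOC {
    [childMOC performBlockAndWait:^{
        NSError *childError = nil;
        if (![childMOC save:&childError]) {          // pushes changes up to the master MOC
            NSLog(@"child save failed: %@", childError);
            return;
        }
        [_masterMOC performBlock:^{
            NSError *masterError = nil;
            if (![_masterMOC save:&masterError]) {   // pushes changes out to the PSC, so they're persisted
                NSLog(@"master save failed: %@", masterError);
            }
        }];
    }];
}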

What about the main MOC? It's long-lived, but it's a sibling of the other child MOCs, so when they save, those changes don't go to the main MOC. Does it end up with stale, out-of-date data?

That's a good question, but the answer is no! We have an extra bit of magic that pushes changes from the master MOC down to the main MOC, whenever the master MOC is changed.

Let's review really quick: let's say you get a child MOC and save it -- what happens? Well, after you save, those changes get pushed up to the master MOC; then they get pushed up to the PSC and persisted by our first bit of magic; next, those changes get pushed down to the main MOC by our second bit of magic.

It's important to note that this is the only case in which changes go down! The rest of the time, changes go child->parent, parent->PSC -- but never parent->child. This means that if you get a child MOC (other than the main one), it won't receive updates when the master MOC changes. So it will get stale, and have out-of-date data, and potentially cause crashes if you ask it to fulfill faults for objects that have been deleted. But we don't want this to happen for the main MOC, so we push changes to it!
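One way this kind of downward push can be wired up is with the did-save notification. This is only a sketch, using masterMOC and mainMOC as stand-ins for the two long-lived contexts described above:

[[NSNotificationCenter defaultCenter] addObserverForName:NSManagedObjectContextDidSaveNotification
                                                  object:masterMOC
                                                   queue:nil
                                              usingBlock:^(NSNotification *note) {
    // Merge the master's saved changes into the main MOC, on the main MOC's queue.
    [mainMOC performBlock:^{
        [mainMOC mergeChangesFromContextDidSaveNotification:note];
    }];
}];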

Choosing a MOC

When we want to use core data within our app, we have three different choices for MOCs: the main MOC, a main queue child MOC, and a background queue child MOC. How do we decide which one to use?

  • Are we dealing with data for the UI? If so, we should use the main MOC or a main queue child MOC, so that we can safely use the MOC and its MOs from UI methods, since those will be called on the main thread.
  • Do we need to avoid blocking the main thread? If so, we should use a background queue child MOC.
  • Will we need to undo changes? If so, we should use a main queue child MOC, or a background queue child MOC, because undoing changes in those MOCs is as easy as getting rid of all our references to the MOC.
  • Will we need to hold on to the MOC for a long time? ("A long time" is a bit nebulous, but in practice, it means: "after acquiring the MOC, and before getting rid of it, is any other MOC saving changes?") If yes, we'll want to use a long-lived MOC so that its snapshot of the data doesn't become stale; in our core data setup, this means we'll need to use the main MOC.
  • Will the MOC need to get updates? If yes, we'll want to use the main MOC, since it's our only publicly accessible MOC which gets updates.

Usage

Once you've obtained a MOC from our core data stack, you can do any of the CRUD (Create, Read, Update, Delete) operations! Of course, you'll have to make sure that you're using managed objects and MOCs on the right thread. Here are three strategies for doing that:

  1. use a main queue MOC and its MOs only from UI methods
  2. use NSManagedObjectContext's performBlock or performBlockAndWait helper methods
  3. pass the MOC as the final parameter to a method; the method assumes it's called on the correct thread, and the caller takes responsibility for ensuring that it is, using either strategy 1 or strategy 2
Make sure to wrap all access to MOs and MOCs in a performBlock or performBlockAndWait. Patterns for doing this:
  • a local performBlock
  • pass the MOC as a parameter to a method, and make sure the method gets called from within a performBlock/performBlockAndWait
Then:
  • save
  • pay attention to errors (not much you can do about them other than log, though)
  • get rid of the MOC by letting go of your references to it
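Putting those pieces together, a typical use of a short-lived background queue child MOC looks roughly like this (the entity and attribute names here are made up, and `stack` is the hypothetical CoreDataStack sketched earlier):

NSManagedObjectContext *moc = [stack newChildMOCWithConcurrencyType:NSPrivateQueueConcurrencyType];
[moc performBlock:^{
    NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Video"];
    NSError *error = nil;
    NSArray *videos = [moc executeFetchRequest:request error:&error];
    if (!videos) {
        NSLog(@"fetch failed: %@", error);       // not much to do here other than log
        return;
    }
    for (NSManagedObject *video in videos) {
        [video setValue:@YES forKey:@"uploaded"];
    }
    if (![moc save:&error]) {
        NSLog(@"save failed: %@", error);
    }
}];
// once the block has run, just let go of `moc` and it goes away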

Pointers

It's a good idea to enable the core data multithreading assertions in Xcode. These check that you're using core data on the right thread, helping you to spot and fix problems with your core data usage. You can enable the assertions by editing your scheme and adding this launch argument:
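-com.apple.CoreData.ConcurrencyDebug 1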

Once that's set, whenever you run your app from Xcode -- whether on the simulator or on a device -- the core data threading assertions will be active. Don't worry: this won't affect the app's behavior when it's in the App Store! It's purely a debugging aid.

Managed Objects

We use Mogenerator to help convert our xcdatamodel files into Objective-C classes. There are a couple of nice reasons to use it. First, it creates a pair of classes for each entity: _EntityName and EntityName. _EntityName is where it puts all the auto-generated boilerplate, and it gets updated whenever your model changes. EntityName is where you put your code, and it doesn't get whacked when your model changes. Second, it creates subclasses of NSManagedObjectID, one per entity. If your entity is EntityName, then it creates EntityNameID, and sets up your EntityName class so that its objectID property is of type EntityNameID. This isn't an absolutely critical feature, but I do like the additional type checking and compiler support that it provides.

Examples

If you need to use an MO on multiple threads, you can't pass the object itself between threads. That's against core data's threading rules. Instead, you'll have to pass the objectID, and on each thread on which you need to use the object, you'll have to acquire a MOC and use that to get the MO.
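For example (a sketch -- `video` stands in for whatever MO you already have, and `backgroundMOC` for a MOC belonging to the other thread):

NSManagedObjectID *videoID = video.objectID;     // object IDs are safe to pass between threads

[backgroundMOC performBlock:^{
    NSError *error = nil;
    NSManagedObject *backgroundVideo = [backgroundMOC existingObjectWithID:videoID error:&error];
    if (!backgroundVideo) {
        NSLog(@"couldn't load object: %@", error);
        return;
    }
    // use backgroundVideo here, and only inside this block
}];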

Sometimes, you may need to get rid of your changes. If you've made your changes in a short-lived child context, this is easy: simply throw away the MOC without saving, and get rid of your references to the MOC.

Difficulties

We changed our core data setup to see if we could clear up an issue we were facing: fetch request batch sizes are ignored for child MOCs. Since we were trying to set the batch size on our main MOC, we decided to change the main MOC to point directly at the PSC instead of at the master MOC. This made it a sibling of the master MOC, instead of a child.

Unfortunately, this caused a gigantic number of core data crashes, each of which mentioned 'statement is still active' somewhere in the crash report. We never were able to figure out what the underlying problem was, so we ended up reverting this change.

Saturday, June 13, 2015

iOS Video Quality

Resolution, framerate, and quality of video captured using iOS devices

This is a brief overview of my findings while working on a recent project at Hudl on the Ubersense app.

Which resolutions are available on an iOS device?

This depends on what model device you have, and whether you're talking about the front or back camera. Using Apple APIs, you can get a list of all the AVCaptureDeviceFormats for a camera -- here's a code snippet doing just that:
#import <AVFoundation/AVFoundation.h>

AVCaptureDevice *camera = ...;   // e.g. the back camera
for (AVCaptureDeviceFormat *format in [camera formats])
{
    // the format's dimensions (width x height)
    CMVideoDimensions dims = CMVideoFormatDescriptionGetDimensions([format formatDescription]);
    // the format's maximum supported frame rate
    CGFloat fps = 0;
    if ([format.videoSupportedFrameRateRanges count] > 0)
    {
        AVFrameRateRange *range = format.videoSupportedFrameRateRanges[0];
        fps = range.maxFrameRate;
    }
    NSString *title = [NSString stringWithFormat:@"%@ x %@ @ %@", @(dims.width), @(dims.height), @(fps)];
    NSLog(@"%@", title);
}
Let's see what's available on the iPhone 6 back camera:
  • 192x144, 30 FPS
  • 352x288, 30 FPS
  • 480x360, 30 FPS
  • 640x480, 30 FPS
  • 960x540, 30 FPS
  • 1280x720, 30 FPS
  • 1280x720, 240 FPS
  • 1920x1080, 30 FPS
  • 1920x1080, 60 FPS
  • 2592x1936, 30 FPS
  • 3264x2448, 30 FPS
And the iPhone 6 front camera:
  • 192x144, 60 FPS
  • 352x288, 60 FPS
  • 480x360, 60 FPS
  • 640x480, 60 FPS
  • 960x540, 60 FPS
  • 1280x720, 60 FPS
  • 1280x960, 60 FPS
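If you then want to capture with one of these formats, you lock the camera for configuration and make it the active format -- roughly like this (a sketch; `camera` and `format` are assumed to come from the enumeration snippet above):

NSError *error = nil;
if ([camera lockForConfiguration:&error]) {
    camera.activeFormat = format;
    // ask for the format's maximum frame rate (minFrameDuration corresponds to maxFrameRate)
    AVFrameRateRange *range = format.videoSupportedFrameRateRanges.firstObject;
    camera.activeVideoMinFrameDuration = range.minFrameDuration;
    camera.activeVideoMaxFrameDuration = range.minFrameDuration;
    [camera unlockForConfiguration];
} else {
    NSLog(@"couldn't lock the camera for configuration: %@", error);
}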

Examples

What differences can you detect between these 5 different AVCaptureDeviceFormats?

I recorded 5 videos, each with a different AVCaptureDeviceFormat in Ubersense at 3.5x zoom. I then opened the videos in the Ubersense player, zoomed in again to 3.5x, and took a screenshot.

(The formats I used are listed at the end of this post.)

Conclusions

On iOS devices, you can't use resolution as a proxy for quality. In other words, the naive assumption -- that higher resolution means better quality, and lower resolution means worse quality -- is false.

Quality depends on more than just resolution. It's a good idea to do some tests using real data.

Formats used

The formats I used, in order, were:
  • 480x360, 30 FPS
  • 1280x720, 30 FPS
  • 1280x720, 240 FPS
  • 1920x1080, 30 FPS
  • 1920x1080, 60 FPS
The most obvious difference is that the 240 FPS format looks way worse than everything else; presumably that's a tradeoff for such a high frame rate. Otherwise, I find it difficult to choose between the various formats -- it's surprising to me that 1920x1080 isn't clearly superior to 480x360, given the gigantic difference in resolution.

What do you think?

Monday, May 11, 2015

Make git merges easier, safer, and more transparent

Merges between semantically diverged branches are difficult to resolve

Lately at work, I've had to merge git branches that have significantly diverged since their common ancestor. When I ran a "git merge" command, I ended up with many conflicts to resolve. Since I couldn't be sure I was resolving them correctly, I came up with a strategy for handling difficult merges that makes it easy to see exactly what the merge conflicts were, and how I resolved them.

Briefly, the strategy is:
  • create a new branch pointing to the same commit as your target branch, and run "git merge source-branch" (the purpose of creating a new branch specifically for resolving the merge is to help prevent accidentally screwing up your branches)
  • there will be conflicts; simply add all the files with conflicts as-is and commit (I prefer not to change the default commit message)
  • in subsequent commits, manually resolve all the conflicted files
  • push the new resolution branch up to github and open a PR against the target branch, so that your team members can review the merge conflicts and the resolution, and comment on them on github
  • after addressing any comments, merge the PR (this is now a fast-forward merge, if no changes have been made to the github repo's target-branch in the meantime)

Example

You can find a complete git repo of this example here.

The git repo has a single file, "a.txt". Two branches, named "target-branch" and "source-branch", introduce conflicting changes to the same line of "a.txt".

Attempting to merge these branches results in conflicts.

Here are the contents of "a.txt" in the common ancestor (517537):

1
2
3
"a.txt" in the target branch (7ac885):
1
1.4
2
3
and "a.txt" in the source branch (d65a7e):
1
1.6
2
3

As you can see, the two versions of "a.txt" in the target and source branches differ at line 2.

I'm going to create a new branch "resolution-branch" (which initially points to the same commit as "target-branch"), and merge "source-branch" into "resolution-branch":

$ git checkout target-branch
Switched to branch 'target-branch'

$ git checkout -b resolution-branch
Switched to a new branch 'resolution-branch'

$ git merge source-branch
Auto-merging a.txt
CONFLICT (content): Merge conflict in a.txt
Automatic merge failed; fix conflicts and then commit the result.

$ git status
On branch resolution-branch
You have unmerged paths.
  (fix conflicts and run "git commit")

Unmerged paths:
  (use "git add ..." to mark resolution)

 both modified:   a.txt

no changes added to commit (use "git add" and/or "git commit -a")

$ git diff
diff --cc a.txt
index a41e2e2,26a22ad..0000000
--- a/a.txt
+++ b/a.txt
@@@ -1,4 -1,4 +1,8 @@@
  1
++<<<<<<< HEAD
 +1.4
++=======
+ 1.6
++>>>>>>> source-branch
  2
  3

$ git add a.txt

$ git commit
[resolution-branch f621093] Merge branch 'source-branch' into resolution-branch
I then manually resolve "a.txt", so that the diff looks like this:
$ git diff
diff --git a/a.txt b/a.txt
index ce6567d..2357f17 100644
--- a/a.txt
+++ b/a.txt
@@ -1,8 +1,4 @@
 1
-<<<<<<< HEAD
-1.4
-=======
-1.6
->>>>>>> source-branch
+1.5
 2
 3
and add and commit "a.txt":
$ git add a.txt 

$ git commit -m "resolve a.txt: compromise on 1.5"
[resolution-branch e244fea] resolve a.txt: compromise on 1.5
 1 file changed, 1 insertion(+), 5 deletions(-)
This is what the commit graph now looks like on "resolution-branch":
$ git log --graph --all
* commit e244fea9d08a3a8351b954c6ab0e71622087eae8
| Author: Matt Fenwick <...>
| Date:   Mon May 11 12:26:39 2015 -0500
| 
|     resolve a.txt: compromise on 1.5
|    
*   commit f62109387f047b16b1b0591067550e57813b8164
|\  Merge: 7ac885c d65a7ec
| | Author: Matt Fenwick <...>
| | Date:   Mon May 11 12:22:48 2015 -0500
| | 
| |     Merge branch 'source-branch' into resolution-branch
| |     
| |     Conflicts:
| |             a.txt
| |   
| * commit d65a7ecab55155ee8506a3a676f967bb1e16d87c        <-- this is "source-branch"
| | Author: Matt Fenwick <...>
| | Date:   Mon May 11 12:22:04 2015 -0500
| | 
| |     1.6
| |   
* | commit 7ac885c2f463ea24a5d40239a301d6cc4d1a421e        <-- this is "target-branch"
|/  Author: Matt Fenwick <...>
|   Date:   Mon May 11 12:21:08 2015 -0500
|   
|       1.4
|  
* commit 517537fd892e39c68917eeaf012b5ebe2c1bc374          <-- this is "common-ancestor-branch"
  Author: Matt Fenwick <...>
  Date:   Mon May 11 12:20:26 2015 -0500
  
      1,2,3

Finally, I open a pull request for merging "resolution-branch" into "target-branch". Merging this PR is a fast-forward (and thus, trivial) since there have been no updates to "target-branch" in the meantime.

You can find a complete example of this here.

Wednesday, September 3, 2014

Writing a PhD dissertation in Latex


Having just finished writing my PhD dissertation using Latex, I'd like to share my experiences -- what was easy, what was hard, and some problems to watch out for.

Writing a PhD dissertation is a daunting task. Besides the actual content, we also need to worry about formatting, references, bibliographies, tables of contents, page numbering, figure layout, dedication pages, ... Using a good document preparation program can help you with these issues, leaving you free to focus on the science. A bad program can suck up your time on annoying, trivial details.

If you've never heard of Latex before, the short version is that it's a program used to create documents, much like Microsoft Word. I decided to use Latex to write my PhD dissertation because I didn't want to waste time on side issues, and I thought Latex would be able to handle those. (Also, I didn't know how to do those in Word, which would otherwise have been the default choice).

Why Latex?

In Latex, your document is plain text. This is advantageous because of the wealth of tools that work with plain text, such as git. I knew before starting that I wanted to manage my dissertation with git, due to its project management features: tags to help me keep track of which versions I shared with my committee, the ability to compare revisions with line-based diffs, and integration with cloud-hosting services such as bitbucket and github. Since it's far more pleasant to work with text files than binary ones in git, writing my dissertation in Latex was a natural fit.

Latex distributions are free, and lots of people use them, contributing and testing code and sharing answers with the community. There are also lots of features, both built-in and available through add-on packages, to help build documents efficiently.

Latex's pleasant surprises

  • the pdflatex program produces beautiful PDFs from your plain text Latex source files
  • references are easy to manage and share. The Bibtex format is pretty standard and it's easy to get references in this format from most journals
  • citing references from within the text is easy; Latex makes sure the citations are consistently numbered, formatted, and named
  • Latex can automatically generate the bibliography
  • Latex can automatically generate a table of contents
  • Latex can automatically generate internal hyperlinks within a PDF
  • beautiful rendering of mathematical equations
  • can refer to figures, tables, and other sections of text using labels and anchors

There's a learning curve

If you're coming from Word, then learning to use Latex is going to take some time -- it does things differently, and you'll have to figure out a new approach to writing documents. I expected that I would spend a decent chunk of time learning how to use Latex, debugging and fixing my mistakes, and solving problems. This did indeed turn out to be the case.

You can also expect to face your fair share of standard problems (that you'll probably run into no matter what program you use). This includes finding a version of the program that runs on your platform, figuring out where to find add-ons and how to install them, and building a conceptual understanding of how the program works -- so that you understand why errors occur and how to fix them.

Problems and issues I encountered

More of my time than expected was spent managing Latex (instead of working on the content). While there are undoubtedly solutions for each of these problems, they're tough for a beginner to find. Here are some of the problems that I encountered:

  • the error messages produced by pdflatex were pretty cryptic, which made it difficult to understand and google for the problem
  • default settings and parameterizations are occasionally surprising, and often difficult to discover
  • the built-in "report" document class did not exactly meet my university's formatting requirements -- that's okay, however, it took a lot of work to figure out how to fix the formatting
  • there are multiple contexts, and some characters mean different things in different contexts
  • conflicts between packages. Some Latex packages don't play nice with each other. Sometimes this means that you can't use certain packages together, other times it means that you'll silently get weird results. I believe there are also cases where packages have to be imported in a certain order to get them to work correctly
  • entries in the bibliography had different capitalization in the output PDF than what I had put into the bibliography file
  • it's hard to see where things start and end. Some commands aren't delimited, but are implicitly ended by later commands. Others have effects in some scope, which again is implicitly defined
  • margins were routinely violated. I had assumed that the default behavior would be to respect margins, but this was not the case
  • special characters. If you're not familiar with them, you may accidentally write something totally different from what you meant, without realizing it. Syntax-highlighting text editors are a big help here. Also, figuring out how to write a non-special version of the special characters
  • I had to spend time manually checking the PDF to ensure that everything turned out correctly. Sometimes, there were surprising problems in the output that I wouldn't have found except by actually looking at the PDF (that is, there wasn't an error or warning generated by pdflatex)
  • I had a very hard time finding complete, correct answers to problems. Many answers did not attempt to solve the problem, but rather argued with the premise of the OP. Many worked sometimes, but not in all contexts. Many others had unstated caveats, which later blew up in my face
  • I was unable to find complete, precise documentation for packages, macros, document classes, and commands -- what they are intended to do, and how they are intended to be used. For instance, I needed to know what the "report" document class entailed and what its options meant, so that I could compare to my university's formatting requirements. I couldn't find this information anywhere
  • it was difficult to choose between multiple competing packages solving the same problem -- it was hard to find good comparisons which included caveats, pros and cons, etc.
  • I wasn't able to find resources to help me build a conceptual understanding of how Latex works. This meant that I wasn't able to understand why errors occurred

Conclusion: was using Latex the right decision for me?

Yes. I wanted to write my dissertation in plain text, manage it using git, and have my table of contents, references, and bibliography automatically generated. Latex had no trouble handling these. It was usually fun to use Latex, and the output from pdflatex was beautiful.

On the other hand, I spent much more time than expected troubleshooting, debugging, digging through old forums looking for answers, deciphering cryptic error messages, and wondering why things didn't work. Latex can be a very frustrating and complicated tool, and it's difficult to find help when you need it. I ran into numerous problems that I just couldn't solve and couldn't find solutions to using the internet. These left a bad taste in my mouth. I feel like a lot of Latex goes against standard principles of building robust software, such as encapsulation, abstraction, composition, and invariants.

Nevertheless, it was more than worthwhile to learn to use Latex for my dissertation. I think these issues are traps for beginners, but don't prevent advanced users from getting work done. I expect the downsides of using Latex shrink as one gains more experience using it.

Disclaimer: please keep in mind that I am only reporting my experience and my thoughts, and that it is certainly possible that my conclusions are flawed. I intended for this to be a fair portrayal of Latex.

Thursday, March 6, 2014

Four different ways to inspect types and prototypes in Javascript

What is a type in Javascript?

It's hard to succinctly and accurately define `type` in Javascript. A couple of complications are Javascript's dynamic typing (meaning that variables don't have types), and the fact that the tools Javascript provides don't unambiguously and uniquely divide values into separate groups (meaning that there's overlap and there are differences between the various ways, and no single way is more "right" than the others). Given these difficulties, let's forget about coming up with an unambiguous definition of "type" and also forget about what "type" means in other languages. Instead, let's just say that for the purposes of this article, "type" will mean "some way to group values based on certain similar properties".

What are some of the properties that we can use to group Javascript's values? First, we can look at whether a value is an object or a primitive. We can further divide and group the objects by their prototype chains or constructors. We can divide the primitives into smaller groups based on their primitive types.

However, as I mentioned earlier, there are alternative, meaningful ways to group values. For instance, objects with identical prototype chains can be separated (`arguments` vs `{}`). There is also some overlap between primitives and certain objects -- some primitives are automatically converted to objects under certain circumstances, similar to Java's autoboxing.

This article will take a close look at Javascript's type-inspecting tools, and the different notions of "type" that they provide. Using a small amount of code -- and a large set of test cases -- we'll gather some data about each of the tools. This data will give us insight into the pros and cons of each tool -- and perhaps help to indicate when each should be used.

What are the tools at our disposal?

Object.getPrototypeOf

Can be used recursively to get the full prototype chain of an object. See the MDN docs. Related: Object.prototype.isPrototypeOf.

Object.prototype.toString

Used by Underscore for type inspection.

instanceof

Performs inheritance checks which respect the prototype chain.

typeof

Mostly useful for distinguishing between primitives and objects.

Array.isArray

A special method for checking if an object is an array. (Since this kind of method only exists for arrays, I won't use it in the rest of this article).

The code

We want to inspect as many different kinds of values as possible -- so let's make sure that we have each of the primitives, each of the commonly-used built-in object types, functions, user-defined types, `arguments`, object wrappers for primitives, and the root object for good measure. Remember, for each of these example values, we'll try each of the four previously-mentioned tools for inspecting types.

For each example, there's a meaningful string for display purposes, an expression that will evaluate to the desired value, and also a constructor function that we'll use for an `instanceof` check. The last part is pretty arbitrary -- I just use it to show that a given value can satisfy `instanceof` for multiple different constructors.

// put examples inside a function to get access to `arguments`
function getExamples() {
    var functionText = "new Function('x', 'return x + 1;')";
    return [
        // schema:  
        //   0: human-readable text
        //   1: expression to be inspected
        //   2: (optional) constructor for instanceof-checking
        ['undefined'        , undefined                         , null     ],
        ['null'             , null                              , null     ],
        ["'abc'"            , 'abc'                             , String   ],
        ["new String('abc')", new String('abc')                 , String   ],
        ['123'              , 123                               , Number   ],
        ['new Number(123)'  , new Number(123)                   , Number   ],
        ['Infinity'         , Infinity                          , Number   ],
        ['NaN'              , NaN                               , Number   ],
        ['true'             , true                              , Boolean  ],
        ['new Boolean(true)', new Boolean(true)                 , Boolean  ],
        ['function g(x) {}' , function g(x) {}                  , Function ],
        [functionText       , new Function('x', 'return x + 1;'), Function ],
        ["{'a': 1, 'b': 2}" , {'a': 1, 'b': 2}                  , null     ],
        ['new Object()'     , new Object()                      , null     ],
        ['new ObjectExt()'  , new ObjectExt()                   , ObjectExt],
        ['[1, 2, 3]'        , [1, 2, 3]                         , Array    ],
        ['new Array()'      , new Array()                       , Array    ],
        ['new ArrayExt()'   , new ArrayExt()                    , Array    ],
        ['/a/'              , /a/                               , RegExp   ],
        ["new RegExp('a')"  , new RegExp('a')                   , RegExp   ],
        ['new RegExpExt()'  , new RegExpExt()                   , RegExp   ],
        ['new Date()'       , new Date()                        , Date     ],
        ["new Error('!')"   , new Error('!')                    , Error    ],
        ['Math'             , Math                              , null     ],
        ['JSON'             , JSON                              , null     ],
        ['arguments'        , arguments                         , null     ],
        ['this'             , this /* the root object, right? */, Window   ]
    ];
}
Here's the code for setting up the three user-defined constructors. Each of Array, Object, and RegExp is extended:
// extend Array
function ArrayExt() {}
ArrayExt.prototype = [1, 2, 3];

// extend Object 
function ObjectExt() {}
 
// extend RegExp
function RegExpExt() {}
RegExpExt.prototype = /matt/;
Finally, the function used to grab an object's prototype chain. This throws an exception if `obj` is a primitive:
function getParents(obj) {
    var parents = [],
        par = obj;
    while ( true ) {
        par = Object.getPrototypeOf(par);
        if ( par === null ) {
            break;
        }
        parents.push(par);
    }
    return parents;
}

The data

Now we take each of the example values, and apply each of the four tests to it -- plus an extra `instanceof` test to show inheritance. For each expression "e", we'll do:
typeof e

Object.prototype.toString.call(e)

e instanceof Object

e instanceof [subtype]

getParents(e)
Here are the results. Watch out for the especially surprising, inconsistent, and strange entries:
example | typeof | Object.prototype.toString | instanceof Object | instanceof subtype | prototype chain
undefined | undefined | [object Undefined] | false | -- | --
null | object | [object Null] | false | -- | --
'abc' | string | [object String] | false | String: false | --
new String('abc') | object | [object String] | true | String: true | String,Object
123 | number | [object Number] | false | Number: false | --
new Number(123) | object | [object Number] | true | Number: true | Number,Object
Infinity | number | [object Number] | false | Number: false | --
NaN | number | [object Number] | false | Number: false | --
true | boolean | [object Boolean] | false | Boolean: false | --
new Boolean(true) | object | [object Boolean] | true | Boolean: true | Boolean,Object
function g(x) {} | function | [object Function] | true | Function: true | Function,Object
new Function('x', 'return x + 1;') | function | [object Function] | true | Function: true | Function,Object
{'a': 1, 'b': 2} | object | [object Object] | true | -- | Object
new Object() | object | [object Object] | true | -- | Object
new ObjectExt() | object | [object Object] | true | ObjectExt: true | ObjectExt,Object
[1, 2, 3] | object | [object Array] | true | Array: true | Array,Object
new Array() | object | [object Array] | true | Array: true | Array,Object
new ArrayExt() | object | [object Object] | true | Array: true | ArrayExt,Array,Object
/a/ | object | [object RegExp] | true | RegExp: true | RegExp,Object
new RegExp('a') | object | [object RegExp] | true | RegExp: true | RegExp,Object
new RegExpExt() | object | [object Object] | true | RegExp: true | RegExpExt,RegExp,Object
new Date() | object | [object Date] | true | Date: true | Date,Object
new Error('!') | object | [object Error] | true | Error: true | Error,Object
Math | object | [object Math] | true | -- | Object
JSON | object | [object JSON] | true | -- | Object
arguments | object | [object Arguments] | true | -- | Object
this | object | [object global] | true | Window: true | Window,EventTarget,Object
Notes:
  • these results were obtained in Firefox, Javascript 1.5; Chrome, Javascript 1.7
  • the results in the last row vary by implementation

The analysis

typeof

`typeof` distinguishes between primitives and objects. However:
  • `typeof null` is "object"
  • it returns "function" for functions, even though they are objects -- this is not wrong, just misleading
  • for String, Number, and Boolean: `typeof` returns "object" for wrapped values
  • it doesn't distinguish between different objects -- arrays, dates, regexps, user-defined, etc.: all are just "object"

instanceof

`instanceof` checks whether a constructor's prototype property is in an object's prototype chain. However:
  • the 2nd argument must be a constructor. The constructor is used to look up a prototype. This is a problem if creating objects with `Object.create` -- there is no constructor function (that you have access to).
  • may not work if there are objects moving across frames or windows
  • different results for corresponding objects and primitives
  • doesn't tell you what the prototypes actually are

Object.prototype.toString.call(...)

This seems to differentiate between the built-ins correctly. See this for more information. However:
  • it doesn't differentiate between corresponding objects and primitives
  • it reports all primitives as objects
  • doesn't seem to work for user-defined constructors and objects. Apparently, it depends on an internal [[Class]] property which can't be touched, according to this.

prototype chain using Object.getPrototypeOf

Gets the prototype objects. However:
  • can't distinguish `arguments` from `Math`, `JSON`, and other objects. In fact, can't distinguish between any objects that share the same prototype chain.
  • doesn't work on primitives -- even those which have corresponding objects
  • may fail for passing objects between windows/frames -- like `instanceof` -- (not sure)

Conclusion

It appears that none of these approaches is capable of dealing with this problem by itself. I'm not even sure if it's possible to come up with a 100% accurate and unbreakable method for classifying Javascript values. However, using a combination of these tools will probably get you most of the way there. Good luck!
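For instance, here's a rough sketch of one way to combine the tools (these cutoffs and fallbacks are my own ad-hoc choices, not a definitive classifier):

function describe(value) {
    if (value === null) { return 'null'; }
    var t = typeof value;
    if (t !== 'object' && t !== 'function') {
        return t;                                  // primitive: 'undefined', 'string', 'number', 'boolean'
    }
    // for objects, Object.prototype.toString handles the built-ins ...
    var tag = Object.prototype.toString.call(value).slice(8, -1);
    if (tag !== 'Object') { return tag; }
    // ... and the prototype chain handles user-defined types
    // (constructor.name isn't available in every old engine)
    var proto = Object.getPrototypeOf(value);
    return (proto && proto.constructor && proto.constructor.name) || 'Object';
}

console.log(describe(undefined));        // 'undefined'
console.log(describe([1, 2, 3]));        // 'Array'
console.log(describe(new ObjectExt()));  // 'ObjectExt' (using ObjectExt from above)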

Tuesday, March 4, 2014

Simplifying a formal grammar using transformations and BNF extensions

Integer literals

A BNF grammar is a great way to specify a language, due to its concise and declarative nature. However, one major limitation of BNF is that it's not extensible -- which means that if you identify a common pattern, you can't factor it out. You're stuck with repeating the boilerplate everywhere it occurs.

For an example of boilerplate making a grammar far more difficult to understand, let's take a look at the Java Language Specification. While the specification is formal, it's rather verbose, and it's difficult to intuitively see what language the grammar really generates -- especially the corner cases:

DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit Underscores Digits 

Digits:
    Digit
    Digit DigitsAndUnderscores(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitsAndUnderscores:
    DigitOrUnderscore
    DigitsAndUnderscores DigitOrUnderscore 

DigitOrUnderscore:
    Digit
    _

Underscores:
    _
    Underscores _
That's 7 rules on 21 lines of code (27 if you include blank lines), and describes the syntax for base 10 integer literals.

The challenge: can we transform this grammar into one that generates the same language, but is clearer and more concise? (We are allowed to extend BNF with additional operators, if necessary)

Three simple transformations

Repetitions: + quantifier

First, let's borrow some regex notation -- whenever we see this pattern:
RuleName:
    terminal
    RuleName  terminal
that means there's one or more repetition of "terminal", so we can instead use the "+" quantifier:
RuleName:
    terminal(+)
That pattern appears in two places, and after substituting our new shorthand for both of them, we now have:
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit Underscores Digits 

Digits:
    Digit
    Digit DigitsAndUnderscores(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitsAndUnderscores:
    DigitOrUnderscore(+)

DigitOrUnderscore:
    Digit
    _

Underscores:
    _(+)

Factoring in

I use the "factor in" transformation when a rule is:
  1. only used once
  2. very simple
  3. given a semantically void name such as "Underscores"
Here's the basic idea of factoring in (it's the opposite of factoring out repeated code):
// before -- need to look up `SubRule1` in order to understand `Rule1`
Rule1: 
    SubRule1

SubRule1:
    ... some pattern ...

// after -- no additional rule to look up, also shorter
Rule1:
    ... some pattern ...
Applying this transformation to "DigitsAndUnderscores" and "Underscores" yields:
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit: one of
    1 2 3 4 5 6 7 8 9

DigitOrUnderscore:
    Digit
    _
Which saves 4 more lines, and two meaningless names. This transformation can be easily abused, leading to grammars that are more difficult to read instead of less so. The trick is to decide when "factoring in" clarifies the grammar and when it obfuscates.

Character classes

This borrows more regex short-hand -- square bracket notation and character ranges. It means the rule must match any one of the characters in the given range. I'll apply it to shorten "NonZeroDigit":
DecimalNumeral:
    0
    NonZeroDigit Digits(?)
    NonZeroDigit _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    NonZeroDigit

NonZeroDigit:
    [1-9]

DigitOrUnderscore:
    Digit
    _
Okay, that didn't help much yet -- but we'll use it again shortly.

The plot thickens

Now we can start reusing those transformations we just covered. First, let's factor in "NonZeroDigit":

DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    0
    [1-9]

DigitOrUnderscore:
    Digit
    _
Now, combine "Digit"s two alternatives, using the square bracket notation, factor it in to "DigitOrUnderscore", and then combine "DigitOrUnderscore"s two alternatives:
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    Digit
    Digit DigitOrUnderscore(+)(?) Digit 

Digit:
    [0-9]

DigitOrUnderscore:
    [0-9_]
Now factor in both "Digit" and "DigitOrUnderscore":
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](+)(?) [0-9]
The quantifiers "+" and "?", when used together, mean the same as "*":
DecimalNumeral:
    0
    [1-9] Digits(?)
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](*) [0-9]
And we can get rid of the "?" quantifier using its definition, splitting an alternative into two:
DecimalNumeral:
    0
    [1-9]
    [1-9] Digits
    [1-9] _(+) Digits 

Digits:
    [0-9]
    [0-9] [0-9_](*) [0-9]

Closing steps

Let's now factor in "Digits" -- but we'll have to be careful since it has two alternative rules. This means the factored-in result we'll have two alternatives wherever "Digits" is replaced:
DecimalNumeral:
    0
    [1-9]
    [1-9] [0-9]
    [1-9] [0-9] [0-9_](*) [0-9]
    [1-9] _(+) [0-9] 
    [1-9] _(+) [0-9] [0-9_](*) [0-9] 
And now let's combine the first two alternatives:
DecimalNumeral:
    [0-9]
    [1-9] [0-9]
    [1-9] [0-9] [0-9_](*) [0-9]
    [1-9] _(+) [0-9] 
    [1-9] _(+) [0-9] [0-9_](*) [0-9] 
The final transformation requires a bit of intuition -- notice that underscores are only allowed in the interior of the numeral, never on the edges. Let's combine the last 4 alternatives:
DecimalNumeral:
    [0-9]
    [1-9] [0-9_](*) [0-9]
And ... voila! We've gone from 21 lines, down to 3 concise and simple ones. This makes it easier to see common error cases -- for example, a number cannot end with an underscore. Try it out on your Java compiler!
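As a quick sanity check, the final grammar translates directly into a regular expression. Here's a small Python sketch exercising it (Python is used here only because it's handy for a one-off check):

import re

# DecimalNumeral:  [0-9]  |  [1-9] [0-9_](*) [0-9]
DECIMAL_NUMERAL = re.compile(r'[0-9]|[1-9][0-9_]*[0-9]')

for s in ['0', '7', '10', '1_000', '1_', '_1', '012']:
    print(s, bool(DECIMAL_NUMERAL.fullmatch(s)))
# expected: '0', '7', '10', and '1_000' are accepted; '1_', '_1', and '012' are rejected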

Conclusion

The length and complexity of the original grammar caused quite a few problems:

  • more code means more chances for error -- misinterpretations, typos, omissions, etc.
  • we couldn't tell what the code was saying. This makes it tough to write effective tests
  • semantically void names -- "DigitsAndUnderscores" -- were distracting
  • corner cases were not obvious
  • difficult to maintain in the face of change
To deal with this, we defined several transformations, as well as extended BNF with a couple of new operators borrowed from regular expressions:
  • the "+" repetition quantifier -- one or more
  • square bracket notation and character ranges
  • factor in simple rules, one-off rules, and poor names
We then used these transformations and extensions to transform the grammar, creating a clearer, more concise grammar.

Wednesday, July 24, 2013

Operator parsing


Interested in simple, expressive, and efficient parsing algorithms? Ever heard of "Pratt", "operator precedence" or "top-down operator" parsing? It's a great technique, whose many advantages are described in Pratt's original paper as well as in excellent articles from more recent authors, including Douglas Crockford of Javascript fame. The chief advantages are:

  • simplified description of the operator expression language portions
  • simplified implementation of parser for operator expressions
  • user extensibility of operator grammar

Despite its advantages, learning how the method works was very difficult for me. Although the papers were great resources, I found their explanations quite hard to grok. In particular, the original paper was confusing both in terminology and in explanation, while Crockford's implementation made heavy use of global variables, mutable state, and inheritance, and both lacked specific examples illustrating the key concepts and corner cases. They also used the same approach for parsing syntactic constructs that would not normally be considered as operators, which, although effective, did not help me to get "it".

In learning the method, I took my own angle on the algorithm and wrote my own implementation in order to run examples. I ended up with something similar to work by Annika Aasa. Also, operator precedence climbing seems to be the same or similar.

With the help of this algorithm, I'd like to demonstrate:

  • what operator expressions are
  • what the problems with specifying and parsing operator expressions are
  • the precedence/associativity solution to describing operator expressions
  • the model used by my parser to parse operator expressions
  • lots and lots of examples of the parser in action to point out corner cases and confusing scenarios

What are operator expressions?

Operators can be seen as a special case of functions, being called with a different syntax, argument order, and operator and operand positions. The main advantage of operator expressions is that they don't have to be fully parenthesized, making them more compact and (arguably) more convenient to read and write. Compare:

1 + !2 * -8 * ~3 ^ x ++

(1 + (((!2) * (-8)) * ((~3) ^ (x++))))

(+ 1 
   (* (! 2) 
      (* (- 8)
         (^ (~ 3)
            (post-++ x)))))
The first version is much shorter; is it easier to read as well?

There are four main types of operators that I'm going to focus on:

prefix:   !x             =>  (! x)
          not not x      =>  (not (not x))

infix:    x + 3          =>  (x + 3)
          a * b + c      =>  ((a * b) + c)

postfix:  y ++           =>  (y ++)

mixfix:   a ? b : c      =>  (a ? b : c)
          x if y else z  =>  (x if y else z)

Problems with specifying and parsing operator expressions

It's easy to write context-free grammars that describe operator expressions -- a common example for specifying simple mathematical expressions is something like:

E  :=  E  '-'  E  |
       E  '*'  E  |
       E  '^'  E  |
       int
Unfortunately, the grammar is ambiguous; even simple expressions could be parsed multiple ways:
3 - 2 - 1         =>   (3 - (2 - 1))       or       ((3 - 2) - 1)  ??

4 - 5 * 6         =>   (4 - (5 * 6))       or       ((4 - 5) * 6)  ??

To express operator precedence, we can create a rule for each precedence level in the grammar:

E       :=  Factor  ( '-'  Factor )(*)

Factor  :=  Term  ( '*'  Term )(*)

Term    :=  int  ( '^'  int )(*)
It should be clear that this is a tedious and ugly solution. Plus, extending it to also deal with prefix, postfix, and mixfix operators further complicates things.

One last point to consider is whether the set of operators and their precedences are set in stone. With the above schemes, in which the grammar is fixed long before any actual parsing is done, the user will not be able to define new operators and their precedences.

Specification solution: precedence and associativity

Fortunately, there are better ways to specify operator grammars than using context-free grammars. One such way is to specify precedence and associativity of each operator. Then, when the parser needs to decide which operator an operand belongs to, it consults the precedence and associativity tables.

For an example, let's revisit the earlier problem of parsing `4 - 5 * 6`. The crux of the association problem is whether the `5` belongs with the `-` or with the `*`. As we know from math, the `*` has higher precedence, and so the expression is parsed as `4 - (5 * 6)`. That's precedence.

Using associativity, we can resolve the other example -- `3 - 2 - 1`. In this example, the association problem is whether the `2` belongs with the first or second `-`. Obviously, `-` has the same precedence as itself, so precedence won't help us here. However, infix `-` is left-associative, which means that `2` associates to the left, for a final result of `((3 - 2) - 1)`.

Precedences must also be specified for the other operator types -- prefix, postfix, and mixfix -- since each of these may participate in an association problem. The above two examples are infix/infix problems; here are the additional possible association problems:

prefix/infix:      !x + 3               =>    !(x + 3)               or    (!x) + 3

infix/postfix:     a * b--              =>    (a * b)--              or    a * (b--)

prefix/postfix:    - c ++               =>    - (c ++)               or    (- c) ++

mixfix/mixfix:     a ? b : c ? d : e    =>    (a ? b : c) ? d : e    or    a ? b : (c ? d : e)

prefix/mixfix:     ! a ? b : c          =>    (! a) ? b : c          or    !(a ? b : c)

mixfix/postfix:    a ? b : c ++         =>    (a ? b : c) ++         or    a ? b : (c ++)

infix/mixfix:      w + x ? y : z        =>    w + (x ? y : z)        or    (w + x) ? y : z

mixfix/infix:      n ? o : p + q        =>    (n ? o : p) + q        or    n ? o : (p + q)
Note that postfix and prefix operators can't be the first or second operator, respectively, in an association problem for obvious reasons.

As the above examples demonstrated, precedence handles cases between operators with different binding priorities, while associativity handles cases between operators with equal priorities. Associativity can be right or left:

left associativity:    a + b + c    =>    ((a + b) + c)

right associativity:   d = e = f    =>    (d = (e = f))
Now what happens if we mix operators of equal precedence but opposite associativity? In the following examples, assume `+` and `=` have equal precedences but are left- and right- associative, respectively:
operator +  precedence 50, left-associative
operator =  precedence 50, right-associative

a = b + c    =>    (a = b) + c   violates associativity of `=`
                   a = (b + c)   violates associativity of '+'

d + e = f    =>    (d + e) = f   violates associativity of `=`
             =>    d + (e = f)   violates associativity of '+'
There's no good way to parse mixed-associativity expressions; systems can arbitrarily choose a parse, or report an error. My system chooses to report an error.

The most common way to define precedence is by assigning a number to each operator; the higher the number, the higher the precedence of the operator (or sometimes vice versa, confusingly). See Java and Python examples. Another approach is to define relative precedences between some (but not necessarily all) operator pairs, so that a total precedence order need not be defined.

Associativity is defined for each infix and mixfix operator, either as left-, right-, or non-associative. Prefix and postfix operators do not need associativities because they are always right- and left-associative, respectively. As can be seen in these examples, there is only one way to parse these operators expressions:

! ! ! ! x   =>   ! (! (! (! x)))

x ++ ++ ++  =>   ((x ++) ++) ++

A parsing algorithm

Now I'd like to demonstrate a parsing algorithm that works for prefix, infix, mixfix, and postfix operators of arbitrary precedence and associativity.

First, I need to give a rough definition of the expression language we'll parse. Even though, as I mentioned before, BNF-like grammars are a poor means for defining operator expressions, I'll use one a bit informally just to give a succinct outline of what we'll be dealing with:

Expr        ::   PrePostfix  |
                 Infix       |
                 Mixfix

Infix       ::   Expr  InfixOp  Expr

Mixfix      ::   Expr  MixfixOp1  Expr  MixfixOp2  Expr

PrePostfix  ::   PrefixOp(*)  Operand  PostfixOp(*)

Operand     ::   Number  |  Variable
(Of course, this grammar is highly ambiguous, whereas our parser will be unambiguous.) Basically, this grammar says that we can do stuff like:
infix:        x * 123

add prefix
operators:    ! ~ x * 123

add postfix
operators:    ! ~ x ++ -- * 123

add infix
operator:     ! ~ x ++ -- * 123 & y

add mixfix
operator:     ! ~ x ++ -- * 123 & y ? 4 : p
We can have as many prefix and postfix operators as we want surrounding any operand, and infix and mixfix operator expressions recursively use the `Expr` rule.

The basic parsing algorithm needs five rules which must be repeatedly applied in order to parse the operator expressions: one for each type of operator, as well as one for finding operands. Since we're working left-to-right, there's a natural order in which the rules must be applied:

  1. find prefix operators
  2. find the operand
  3. find postfix operators
  4. find an infix operator, if possible, or ...
  5. find the first part of a mixfix operator, an expression, and the second part, if possible
Applying these rules looks like this:
start:         ! ~ x ++ -- * 123 & y ? 4 : p

find prefix
  operators:       x ++ -- * 123 & y ? 4 : p     

find operand:        ++ -- * 123 & y ? 4 : p

find postfix
  operators:               * 123 & y ? 4 : p --

find infix
  or mixfix
  operator:                  123 & y ? 4 : p --
Now, if we found just an operand and postfix operators, we have a complete expression:
x ++   <-- complete expression
However, if we found any prefix operators, we may not yet have found the complete operand; using Python's rules:
not x      parse prefix operator 'not'
x          parse operand variable 'x'
-- done -- complete expression

not 3 + 2 + 1    parse prefix operator 'not'
3 + 2 + 1        parse operand number '3'
+ 2 + 1          parse infix operator '+'
2 + 1            tokens '2', '+', and '1' remaining
-- not done -- `not` has lower precedence than `+`, so we haven't yet found `not`s operand
If we found an infix or mixfix operator, we have found the left-hand operand, but we definitely haven't found the right-side operand(s):
3 + 4 * 5    parse operand number '3'
+ 4 * 5      parse infix operator '+'
4 * 5
-- not done -- we have not yet found `+`s right-hand operand

Introducing the Stack

To store the operators whose arguments we have not yet found, we'll use a stack. Each layer of the stack records an operator's name, precedence, associativity, and any arguments already found. Of course, since it's a stack, layers are only added and removed from one end, representing where the parser is in the expression. Here's a prefix example:

tokens | stack | scratch space | action
! ~ x | [] | | consume prefix operator
~ x | [] | ! | prefix operator, so push
~ x | [(!, 110, right)] | | consume prefix operator
x | [(!, 110, right)] | ~ | prefix operator, so push
x | [(!, 110, right), (~, 110, right)] | |
We can see that as each prefix operator is consumed, an entry is pushed on to the top of the stack. Notice that the associativity is 'right' for prefix operators, and that we haven't found any operands for them yet.

Now let's see what happens when we continue the above example. When we do find the operands, we can start popping stack entries. If we've consumed the entire input, we pop every stack level. Meanwhile, we're also maintaining a scratch value that is the operand which we pass to the next highest stack frame, after which the frame is popped and the process is repeated:

tokens | stack | scratch space | action
x | [(!, 110, right), (~, 110, right)] | | consume operand
 | [(!, 110, right), (~, 110, right)] | x | tokens empty, so pop
 | [(!, 110, right)] | (~ x) | tokens empty, so pop
 | [] | (! (~ x)) | tokens and stack empty, so done

An infix example

The algorithm works similarly for infix and mixfix operators, except that when an entry is pushed onto the stack for such an operator, the left-hand argument has already been found. Let's work through an example:

tokens | stack | scratch space | action
a + b * 3 - 4 | [] | | consume operand and infix operator
b * 3 - 4 | [] | +, a | stack is empty, so push
b * 3 - 4 | [(+, 80, left, a)] | | consume operand and infix operator
3 - 4 | [(+, 80, left, a)] | *, b | * > +, so push
3 - 4 | [(+, 80, left, a), (*, 90, left, b)] | | consume operand and infix operator
4 | [(+, 80, left, a), (*, 90, left, b)] | -, 3 | - < *, so pop
4 | [(+, 80, left, a)] | -, (* b 3) | - == +, but - is left-associative, so pop
4 | [] | -, (+ a (* b 3)) | stack is empty, so push
4 | [(-, 80, left, (+ a (* b 3)))] | | consume operand
 | [(-, 80, left, (+ a (* b 3)))] | 4 | input is empty, so pop
 | [] | (- (+ a (* b 3)) 4) | input and stack empty, so done
The association problem is continually decided based on operator precedences and associativities, and is implemented through pushes and pops to the pending operator stack.

A postfix example

Postfix operators do not require stack pushes, but may require pops -- since their operands are always to the left, meaning that further parsing is not needed to find them. Here's a small example; assume the precedences of +, --, and @ are 80, 120, and 50 respectively:

tokens | stack | scratch space | action
x + y -- @ | [] | (none) | consume operand and infix operator
y -- @ | [] | +, x | stack empty, so push
y -- @ | [(+, 80, left, x)] | (none) | consume operand and postfix operator
@ | [(+, 80, left, x)] | --, y | -- > +, so apply -- to y
@ | [(+, 80, left, x)] | (-- y) | consume postfix operator
 | [(+, 80, left, x)] | @, (-- y) | @ < +, so pop
 | [] | @, (+ x (-- y)) | stack empty, so apply @ to arg
 | [] | (@ (+ x (-- y))) | stack, input empty so done

To sum up how the algorithm works:

  • use a stack to represent operators that we're not done parsing yet
  • relative precedence and associativity of the current operator and the topmost operator on the stack tell us whether we need to push or pop
  • use a scratch space to temporarily hold an operator and operand, if necessary
  • prefix operators always push a stack frame
  • postfix operators may pop stack frames
  • infix and mixfix operators may pop 0 or more frames, followed by pushing a single new frame
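To make that loop concrete, here's a minimal sketch of just the infix part in Python (the real implementation also handles prefix, postfix, and mixfix operators; the operator table and tuple-based parse trees below are purely for illustration):

# operator -> (precedence, associativity); these particular operators and numbers are illustrative
OPERATORS = {'=': (50, 'right'), '+': (80, 'left'), '-': (80, 'left'), '*': (90, 'left')}

def parse_infix(tokens):
    """Parse a token list like ['a', '+', 'b', '*', '3', '-', '4'] into a nested tuple."""
    tokens = list(tokens)
    stack = []                     # pending operators: (op, precedence, associativity, left operand)
    scratch = tokens.pop(0)        # the first operand
    while tokens:
        op = tokens.pop(0)
        prec, assoc = OPERATORS[op]
        # pop while the operator on top of the stack binds at least as tightly
        # (mixed associativity at equal precedence should really be an error -- simplified here)
        while stack and (stack[-1][1] > prec or (stack[-1][1] == prec and assoc == 'left')):
            top, _, _, left = stack.pop()
            scratch = (top, left, scratch)
        stack.append((op, prec, assoc, scratch))
        scratch = tokens.pop(0)    # the operand to the right of `op` (so far)
    while stack:                   # input is empty: pop everything that's left
        top, _, _, left = stack.pop()
        scratch = (top, left, scratch)
    return scratch

print(parse_infix(['a', '+', 'b', '*', '3', '-', '4']))
# ('-', ('+', 'a', ('*', 'b', '3')), '4')   i.e.  (- (+ a (* b 3)) 4)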

Disambiguating ambiguities

It's great to be able to use `-` and `+` both as infix, binary operators and as unary prefix ones. Can this algorithm deal with those cases? Easily! Let's work through a short example:

tokens     | stack                                 | scratch space  | action
140 - - 26 | []                                    |                | consume operand and infix operator
- 26       | []                                    | -, 140         | stack empty, so push
- 26       | [(-, 80, left, 140)]                  |                | consume prefix operator
26         | [(-, 80, left, 140)]                  | -              | prefix operator, so push
26         | [(-, 80, left, 140), (-, 110, right)] |                | consume operand
           | [(-, 80, left, 140), (-, 110, right)] | 26             | input empty, so pop
           | [(-, 80, left, 140)]                  | (- 26)         | input empty, so pop
           | []                                    | (- 140 (- 26)) | input and stack empty, so done

The algorithm handles the problem by reading the two `-` tokens in two different contexts: when it reads the first one, the algorithm is expecting an infix operator; once an infix operator is found, the algorithm looks for an operand. Since an operand can have prefix operators, it next looks for those.

Operators can also be used in both prefix and postfix contexts, or prefix and mixfix, without confusing the parser. However, using an operator as both postfix and infix/mixfix *will* screw things up, since the parser will no longer know when to stop parsing postfix operators and switch over to infix/mixfix.

User extensibility

One of the major advantages of operator parsing is its extensibility: to create a brand new operator, you just add an entry to the operator tables giving its precedence and associativity (if necessary). This allows users to define their own operators. For an example, check out the Haskell programming language, which allows user-defined operators as described here.
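
To make that concrete with the hypothetical tables from the sketch above, adding an operator really is just a table entry; for example, a right-associative exponentiation operator could be registered like this (the name and precedence are my own illustrative choices):

INFIX['**'] = (100, 'right')   # tighter than '*' (90), looser than the prefix operators (110)

# parse('2 ** 3 ** 2'.split())  =>  ('**', '2', ('**', '3', '2'))   -- right-associative, as requested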

Limitations

For simplicity, I intentionally chose an algorithm that is more limited than the ones used by Pratt and Crockford. The limitation comes from the forced classification of operators into prefix, postfix, infix, or mixfix: if you have an operator that doesn't fit one of those categories exactly, you will have to extend the algorithm to deal with it. An example is Python's `lambda` operator, which requires a parameter list after the keyword and before the `:`, but is otherwise similar to a prefix operator.

In the Pratt & Crockford method, the tokens themselves are responsible for deciding what parse rules to use. This allows any grammatical rule to be parsed as an operator expression, but again, I find it more difficult to comprehend and demonstrate, precisely because it's so flexible.

Wrap up

Well, this has been quite a long article! I hope you enjoyed reading it as much as I enjoyed writing it!

If you'd like to see the parser in action, you can find my implementation of the algorithm on github here. In addition to the actual code, there's a decent suite of tests to ensure that tricky corner cases are handled correctly. Might be worth a look!

Lastly, I'd like to mention again that this was all inspired by the works of Pratt and Crockford, which are excellent and definitely worth reading.

Wednesday, July 3, 2013

Parsing the NMR-Star format

The NMR-Star format

NMR, or Nuclear Magnetic Resonance, is a technique for studying the structure and dynamics of molecules; it is also the physical phenomenon that MRI machines are based on. The Biological Magnetic Resonance Data Bank is a great resource for archived NMR data, which can be accessed as text files in the NMR-Star format.

However, the NMR-Star format has a number of tricky corner cases that I had to handle while writing my NMR-Star parser. This article discusses the problems and their solutions. The full code can be found on github in my NMR-Star parser project.

The parsing strategy

I used parser combinators supporting error reporting, backtracking, and line/column positions, and added a couple of extra phases for clean-up:

  • scanner. Breaks the raw character sequence into tokens
  • token clean-up. Discards whitespace and comment tokens, and classifies unquoted strings as keywords or values
  • CST parsing. Builds a concrete syntax tree from the token sequence
  • AST construction. Checks context-sensitive constraints while building an abstract syntax tree

Problem: whitespace/comments between tokens and semicolon-delimited strings

The NMR-Star format allows insignificant whitespace and comments between tokens. A standard parsing solution is to discard whitespace and comments after every token. However, NMR-Star defines semicolon-delimited strings as being opened by the sequence "newline, semicolon" and closed by the same sequence. That breaks the "munch junk after every token" rule: the newline would be counted as whitespace and already be discarded by the time a semicolon-string is tried. For example:

  ;abc  # <-- not a semicolon-string because preceding character is not a newline

;abc  # <-- semicolon-string because preceding character *is* a newline
def
; 

In my first solution to this problem, I pre-munched junk before every token. Semicolon-delimited strings got a special munching rule, which verified that the last junk character before the opening semicolon was indeed a newline. However, this had a couple of disadvantages: 1) position reporting was much more difficult than with post-munching, since junk had to be parsed before the beginning of the next token could be found, and 2) it was a special case, which complicated both the specification and the implementation.

My second solution took advantage of the position information: the fact that the opening semicolon immediately follows a newline means it must be in column 1 of its line. So the parser simply inspects the position before deciding whether to try parsing a semicolon-delimited string. This allowed a single rule for insignificant whitespace and comments, and also allowed me to use post-munching.

def _sc_rest(position):
    _, column = position
    # a semicolon-delimited string must be preceded by a newline -- thus, column must be 1
    if column == 1:
        return node('scstring',
                    ('open', sc),
                    ('value', many0(not1(_end_sc))),
                    ('close', cut('newline-semicolon', _end_sc)))
    return zero
    
scstring = bind(getState, _sc_rest)

Problem: semicolons inside semicolon-delimited strings

Since the newline/semicolon sequence closes semicolon-delimited strings, a semicolon cannot be the first character on a line within a semicolon-string. For example:


; this ; starts the string
 this ; is *within* the string and does not end it
; # this ; ends the string, because it follows a newline
; # which means this is an extraneous ;
This problem is solved with a single character of lookahead: while parsing a semicolon-string, if a newline is encountered and the next character is a semicolon, end the string; otherwise, continue parsing the semicolon-string.
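
Here's the same rule written as a standalone, character-level sketch; the real parser expresses it through the `_end_sc` rule used above, so this version is only meant to show the lookahead:

def read_scstring_body(text, i):
    """text[i] is the first character after the opening ';'.
    Returns (body, index just past the closing newline-semicolon)."""
    chars = []
    while True:
        # one character of lookahead: a newline followed by ';' closes the string
        if text[i] == '\n' and i + 1 < len(text) and text[i + 1] == ';':
            return ''.join(chars), i + 2
        chars.append(text[i])        # anything else -- semicolons included -- is content
        i += 1

A real implementation would report an unclosed string instead of running off the end of the input, but that's the essence of the rule.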

Problem: non-ending quotes of quoted values

NMR-Star has double-quoted values:

"abc 123"
Double-quotes can be within the string, as long as they are not followed by whitespace or EOF. Thus, this is a single, valid double-quoted value, since the second '"' is followed by '1':
"abc"123"
This is easily solved using a single character of lookahead: if a '"' is followed by whitespace (or EOF), end the string; if it is followed by anything else, consume the '"' and continue parsing the string.
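
The rule can be sketched the same way as the semicolon-string lookahead (again, a standalone illustration rather than the combinator code):

def read_dquoted_value(text, i):
    """text[i] is the opening '"'.  Returns (value, index just past the closing '"')."""
    i, chars = i + 1, []
    while True:
        # one character of lookahead: '"' followed by whitespace or EOF closes the value
        if text[i] == '"' and (i + 1 == len(text) or text[i + 1] in ' \t\r\n'):
            return ''.join(chars), i + 1
        chars.append(text[i])        # any other character, embedded '"' included, is content
        i += 1

Running it on the example above, read_dquoted_value('"abc"123"', 0) yields ('abc"123', 9).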

NMR-Star also has single-quoted values, which work in the same way.

Problem: keyword vs. unquoted value

The lexical definitions of keywords and unquoted values in the original Star grammar are ambiguous; this is resolved with an additional rule stating that unquoted values cannot be keywords. For example:

loop_  # <- 'loop_' is a keyword, so it can not be an unquoted value
 _a _b
 loop_1  # <- 'loop_1' is not a keyword, so it can be an unquoted value
 loop_2 
stop_
An alternate definition, and the one I used in my parser, is to match the longest string fitting the unquoted-value pattern, and then classify the string as either a keyword or an unquoted value in the token clean-up phase. This avoids unnecessary grammatical ambiguity and more clearly captures the intent of the specification.

unquoted = node('unquoted',
                ('first', not1(special)),
                ('rest', many0(not1(space))))
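
The clean-up phase then decides what the matched text actually is. Here's a minimal sketch of that classification step; the keyword set and return shape are illustrative, not the parser's actual representation:

KEYWORDS = {'loop_', 'stop_'}        # illustrative subset; the real parser knows the full set

def classify_unquoted(text):
    """Token clean-up: decide whether an 'unquoted' match is a keyword or a value."""
    return ('keyword', text) if text in KEYWORDS else ('value', text)

# classify_unquoted('loop_')   =>  ('keyword', 'loop_')
# classify_unquoted('loop_1')  =>  ('value', 'loop_1')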

Problem: context sensitive rules

There are a number of constructs in the NMR-Star format that are context-sensitive, meaning that they cannot be parsed correctly using a context-free grammar/parser:

  • no duplicate save frame names
  • no duplicate identifiers in save frames
  • no duplicate identifiers in loops
  • framecode references
  • the number of values in a loop must be an integer multiple of the number of identifiers

For example, the following loop follows the context-free rules but has duplicate keys:

  loop_
    _a _b _a
    v1 v2 v3
    v4 v5 v6
  stop_

What we'd like the parser to do is report an error that includes the location in the input and the nature of the violation.

One way to solve this problem is to use a context-sensitive grammar formalism; however, I've never tried this approach. One important disadvantage is that it would prevent context-free parsing of files that violate the context-sensitive rules -- which may be a pain if you have invalid data.

A second approach is to parse the input according to the context-free rules and generate a concrete syntax tree, then run a second pass over the CST to produce the abstract syntax tree. The second pass is responsible for implementing the context-sensitive rules. This is the method I used, and it seems to work well in practice. It also seems easier to maintain, due to better separation of concerns.
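
As a sketch of what the second pass does for one of these rules -- rejecting duplicate identifiers in a loop -- assume the CST loop node exposes its identifiers together with their positions (this node shape is invented here for illustration):

def check_loop_identifiers(identifiers):
    """identifiers: list of (name, (line, column)) pairs taken from a CST loop node.
    Returns a list of error descriptions; an empty list means the rule is satisfied."""
    seen, errors = {}, []
    for name, position in identifiers:
        if name in seen:
            errors.append(('duplicate loop identifier', name, position, seen[name]))
        else:
            seen[name] = position
    return errors

# For the loop above, check_loop_identifiers([('_a', (2, 5)), ('_b', (2, 8)), ('_a', (2, 11))])
# reports one duplicate, pointing at both occurrences of '_a'.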

Wrap up

As far as formats go, the NMR-Star format is relatively clear and succinct. However, as I've shown, it does have a few pitfalls for the unwary, as well as some redundancies.

Scannerless Parsing: a JSON case study

Scannerless parsing

Scannerless parsing is not a very popular or well-understood approach to building parsers. So many parsing tools take it for granted that parser implementors want/need/accept separate scanning and context-free parsing phases that it's hard to get good information on the how, what, and why of scannerless parsers. That is, one would like to know:

  • What is 'scannerless parsing'?
  • What's so different about it?
  • How are scannerless parsers implemented?
  • What are some of the advantages?
  • What are some of the drawbacks, pitfalls, and difficulties?

I will explore these issues using a parser I recently built for the JSON data format as a case study. I hope this will provide some practical insight into scannerless parsers!

What is scannerless parsing?

A scannerless parser is a parser that does not arbitrarily separate parsing into lexical and context-free phases; instead, these are combined into a single phase, as we'll see in the rest of this section.

Programming and data languages are often defined by grammars; parsers are programs that implement grammars; they read in strings and: 1) decide whether the string is part of the language; 2) build a structure representing the input; and 3) report any problems with the input.

Common tools for converting grammars to executable parsers (see the Wikipedia page for examples) often enforce an artificial separation of parsing into two stages. In the first stage, often called lexing (or tokenization or lexical analysis), the input string is broken up into a sequence of tokens. These tokens are then used as the input for the second stage.

I'm not sure what the next stage is called; I've seen it called context-free parsing, hierarchical parsing, and simply parsing. But the important point is that in this stage, the tokens are assembled into syntactic units to build some kind of parse tree which represents the structure that was recognized in the input string.

Each phase has its own separate grammar to describe what language it accepts. In a typical parser, the lexical grammar is (or tries to be) regular in the formal sense, and may require only minimal lookahead and/or backtracking; these characteristics help the tokenizer to do its job faster. The grammar for the second phase is context-free, and its terminals are tokens. Sometimes the grammars are ambiguous (meaning some strings can be parsed multiple ways); ambiguities are often resolved using additional rules outside the grammar.

This separation of parsing into two phases is completely arbitrary; it is perfectly possible to create a single-phase parser that accepts the exact same language as a two-phase parser. After all, it's quite easy to combine the regular token grammar into the hierarchical context-free grammar. Advantages of the two-phase approach are that it's familiar, tools implementing it are often quite fast, and the separate grammars may be simpler than a single combined grammar.

On the other hand, there are also important disadvantages when splitting parsing into two phases:

  • Artificially constraining the token grammar can create unnecessary ambiguities. See the lexer hack for an example. These types of ambiguities are not inherent in the language itself, but are caused by the technology. Another example occurs with Java's generics; see this question. It's solved by reinterpreting the token stream during the second phase, if necessary.
  • More phases means more glue code between phases. Information such as token type and position must pass from the lexer to the parser; data may also be passed back to the lexer as in one resolution of the lexer hack. Also, error reporting and handling between the phases must be accounted for.
  • Language composability may be reduced.
  • Language description is more complicated. Two parallel grammars must be maintained, and the interactions between them well understood in order to effectively make changes.
  • Two tools and all their idiosyncrasies, as well as the interface between them, must be mastered.

Parser combinators

The parsers will all be built using parser combinators; if you've worked with parser combinators before, it should be straightforward to follow along (aside from any idiosyncrasies of my approach). If you've never used them, there are plenty of libraries in every language which are easy to download, or you can check out my library. I prefer using parser combinators because:

  • they are powerful -- they parse not only regular and context-free, but also context-sensitive grammars
  • they are expressive -- parsers are approximately as long as the grammar they implement
  • as they are expressed within a programming language, they benefit from that language's ecosystem, including functions, classes, unit testing libraries and tools, build tools, type systems, runtime, JIT compiler, etc. This also means that they can be extended when new patterns are identified.
  • they are composable. Parsers are built by successively combining parsers into ever-larger parsers; combinator libraries come with a rich set of functions for combining parsers.
  • they are easy to test. Since each parser -- no matter how small -- is an object, it can be tested independently. This is a big win for me -- if I'm not confident that I have the little things right, I know I have no chance of getting the big things right.
  • easy to deploy and use. Since both parsers and the combinators are libraries, one needs only to import them, then hand them data to get started.
  • results, such as parse trees, are easy to build while parsing
  • they allow good, meaningful error messages to be produced

The combinators I'll use support four computational effects:

  1. backtracking. This allows choice between alternatives.
  2. error reporting. Errors will not be created using exceptions, but rather through a special datatype.
  3. state(1). The token sequence, i.e. the remaining input that the parser has not yet consumed.
  4. state(2). This will be used to keep track of the position (line and column) in the input that the parser is at.

The how of scannerless parsers

Now let's get started! I'll point out problems that I faced, then describe how my scannerless parser solved them. The full code can be found on github here.

Problem: whitespace/comments between tokens

Many languages, including JSON, allow insignificant whitespace between tokens. This means whitespace must be discarded before and after tokens. One solution is to implement a parser for each token, and then wrap them with a combinator which runs the token parser and discards junk before or after the parser. Should we pre-munch or post-munch the junk? I found two advantages of post-munching: 1) after parsing a token, the parser always stops at the beginning of the next token, making it easy to report position; 2) if the parser must backtrack on a token, it will not have to re-parse the junk, only to discard it again. Of course, pre-munching before the very first token is still necessary.

This implements a `whitespace` parser and a `tok` combinator:

whitespace = many0(oneOf(' \t\n\r'))

def tok(parser):
    return seq2L(parser, whitespace)

The `tok` combinator implements post-munching by running its parser argument and then running the whitespace parser; its result is that of the parser argument. Note that the whitespace parser can't fail, since it matches zero or more whitespace characters -- so a token isn't required to be followed by any whitespace.

Parsing a single token

By getting rid of the scanning phase, we haven't thrown the baby out with the bathwater, have we? Of course not! Since context-free grammars are more powerful than regular ones, it is no problem to express token patterns. One can simply create a parser for each token type. Since the JSON specification defines lexical syntax as a regular language, the token parsers simply don't use the more powerful features of CFGs.

The implementation uses the `literal` combinator:

os = tok(literal('['))

`literal` matches a given character exactly; the resulting parser is then passed to `tok` to allow for optional trailing whitespace, which is post-munched.

Problem: position tracking

Keeping track of the line and column number while parsing allows position-specific errors to be reported, and also construction of a parse tree that includes position information. The latter is useful if you need to link later analyses to positions in the file.

One method for tracking position is to pre-process the input, converting the sequence of characters into a new sequence of annotated characters, each containing both the original character and its calculated position. This approach works okay, but it has a few drawbacks. First, many of the combinators have to be updated to deal with matching annotated characters, which means they are no longer compatible with plain characters, so you need two versions. Second, you have to extract the underlying characters; like the first problem, this is not a big deal but is certainly annoying. Third, you can only look at the position by monadically pulling a token into scope, which means: 1) extra parameters to pass around, and 2) you can't look at the position if the token sequence is empty.

A different approach is to track the position using monadic state. This has several advantages: the parser combinators work exactly the same way and do not need to be updated, the values don't have to be fiddled with to extract the underlying characters, and you can inspect the position whenever necessary using the appropriate combinator. One disadvantage is that, under backtracking, position calculations may be repeated.
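
As a standalone illustration of the state-based approach (not the library's actual API), the only place the position has to be computed is the primitive that consumes a single character; every other combinator just threads the (line, column) pair through unchanged:

def item(chars, position):
    """Consume one character, updating (line, column) state.
    Returns (value, remaining input, new position), or None on empty input."""
    if not chars:
        return None
    first, rest = chars[0], chars[1:]
    line, column = position
    new_position = (line + 1, 1) if first == '\n' else (line, column + 1)
    return first, rest, new_position

# item('a\nb', (1, 1))  =>  ('a', '\nb', (1, 2))
# item('\nb', (1, 2))   =>  ('\n', 'b', (2, 1))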

Problem: parsing complex syntactic structures

The JSON format defines several multi-token syntactic structures -- key/value pairs, arrays, and objects. Again, I implemented a single parser for each of these. However, these parsers don't touch the input characters directly, but only indirectly, through the token parsers. In other words, each structural parser is implemented as a combination of token parsers.

This is similar to how the two-phase parsing strategy works. However, there is a key difference -- the structural parsers use only the token parsers because they *choose* to; it is no problem to use a different tokenization strategy at any time if necessary.

The syntax rules for array, object, and value are mutually recursive; however, Python's variable-binding semantics are not let-rec compatible, so a little trick is used to mock a forward declaration:

array = error('unimplemented')

value = alt(jsonstring, number, keyword, obj, array)

array.parse = node('array',
                   ('open', os),
                   ('body', sepBy0(value, comma)),
                   ('close', cut('close', cs))).parse
The `value` rule then refers to the `array` object; later, we replace the `parse` attribute of `array` with the actual parsing function that we want. Ugly, but effective. As for the rest of the parser, note how it doesn't touch the characters directly -- it is built in terms of the token parsers (and `value`).

Problem: keyword matching

There are three literal keywords in the JSON spec. Each is matched using the `string` parser, which matches a given sequence of characters exactly:

_keyword = node('keyword', 
                ('value', alt(*map(string, ['true', 'false', 'null']))))

If one of the keywords is matched, it is presented as the return value.

Strings: hex escape sequences

The JSON spec allows four-digit hexadecimal escape sequences in string literals, opened by the sequence '\u', and followed by four hexadecimal digits (case-insensitive):

_hexC = oneOf('0123456789abcdefABCDEF')

_unic = node('unicode escape',
             ('open', string('\\u')),
             ('value', cut('4 hexadecimal digits', quantity(_hexC, 4))))

What's missing

This article doesn't cover the details of error reporting, which were extremely complicated. While correctly reporting the where and why of malformed input is critical in a real parser, the JSON spec does not explicitly say what constitutes an error (it's implied), so it is difficult to know what should be reported. This article also doesn't cover the design, implementation, and combinators of the parsing library. If you would like to see more about either of these topics, check out the code on github!