Tuesday, November 6, 2012

A classy approach to parser combinators

Parser combinators

Parser combinators serve as a great introduction to functional programming, and are one of the most-studied topics in the field. At the same time, they are a complex and broad topic, touching on concepts such as non-determinism, monads, and higher-order functions.

What people may not get as much exposure to, at least in my experience, are the many Haskell type classes and their relationship to parsers. As we'll see in this article, the most common and useful combinators are actually parser-specific versions of more widely useful generic operations.

Type classes

The definitions of the type classes used are based on both the standard Haskell classes of the same name (minus the '), and Brent Yorgey's Typeclassopedia. I've added a ' to the end of each of their names to indicate that they're not identical to the standard Haskell classes, and in some cases are quite different. I've also added one type class of my own -- Switch' -- which represents the ability to convert a failing computation into a successful one with a default value, and to convert a successful computation into a failing one. I wasn't able to find a type class providing this functionality on Hoogle.

Parser definition and basic combinators

Following convention, parsers are modeled as functions that operate on token streams, either producing a result paired with the rest of the token stream, or failing. A convenient choice for representing possible failure is the Maybe data type.

newtype Parser t a = Parser { 
        getParser :: [t] -> Maybe ([t], a) 
    }

run :: Parser t a -> [t] -> Maybe ([t], a)
run = getParser
In addition, we'll use these basic parsers repeatedly throughout the examples to build bigger and more exotic parsers:
-- succeeds, consuming one token, as
--   long as input is not empty
getOne :: Parser s s
getOne = Parser (\xs -> case xs of 
                        (y:ys) -> pure (ys, y);
                        _      -> empty)

-- runs the parser, and if it succeeds,
--   checks that its result satisfies a predicate
check :: (a -> Bool) -> Parser s a -> Parser s a
check f p = p >>= \x -> 
  guard (f x) >> 
  pure x

-- consumes one token if the token
--   satisfies a predicate
satisfy :: (a -> Bool) -> Parser a a
satisfy p = check p getOne

-- builds a parser that only
--   matches the given token
literal :: Eq a => a -> Parser a a
literal tok = satisfy (== tok)

Alternation and failure

Alternation and failure are covered by the semigroup and monoid classes, respectively. Semigroups are characterized by an associative, binary, closed operation.

The parser interpretation of semigroups is choice: given two parsers, use the first one if it succeeds, but use the second one if the first fails.

class Semigroup' a where
  (<|>)  :: a -> a -> a

instance Semigroup' (Parser s a) where
  Parser f <|> Parser g = Parser (\xs -> f xs <|> g xs)
This implementation exploits the fact that the Maybe datatype can also form a left-biased semigroup.
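That left-biased Maybe instance isn't shown in this post; a minimal sketch of what it could look like under the same Semigroup' class:

instance Semigroup' (Maybe a) where
  Just x  <|> _ = Just x   -- keep the first success
  Nothing <|> y = y        -- otherwise fall back to the second computation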

Monoids are semigroups whose binary operation has an identity element; for parsers, this means that combining any parser with the identity parser via the choice operator always returns the result of that parser, regardless of whether it fails or succeeds. The identity parser simply ignores its input and fails:

class Semigroup' a => Monoid' a where
  empty :: a

instance Monoid' (Parser s a) where
  empty = Parser (const Nothing)

Here are some examples:

-- combining two parsers with choice:  succeeds if either parser succeeds
a :: Parser Char Char
a = literal 'a'
b :: Parser Char Char
b = literal 'b'
ab :: Parser Char Char
ab = a <|> b

$ run ab "babcd"
Just ("abcd",'b')
$ run ab "abcd"
Just ("bcd",'a')

-- the empty parser always fails
fail :: Parser Char Char
fail = empty

$ run fail "abcd"
Nothing

-- the empty parser is both a right and a left identity
a_ :: Parser Char Char
a_ = a    <|>  fail
_a :: Parser Char Char
_a = fail <|>  a

$ run a_ "abcd"
Just ("bcd",'a')
$ run a_ "babcd"
Nothing
$ run _a "abcd"
Just ("bcd",'a')
$ run _a "babcd"
Nothing
We're not limited to combining two parsers at a time, of course; there is also the 'mconcat' combinator:
mconcat :: Monoid' a => [a] -> a
mconcat = foldr (<|>) empty

$ run (mconcat []) "abcde"
Nothing

digits :: [Parser Char Char]
digits = map literal ['0' .. '9']

$ run (mconcat digits) "4hi!!"
Just ("hi!!", '4')
'mconcat' combines a list of monoids using the binary operation, and the identity element as the base case. This means that using 'mconcat' on an empty list will generate a parser that always fails.

Success

Similarly to the parser that always fails, we have a parser that always succeeds. This is captured by the pointed class, which is the 'pure' part of the Applicative class in the standard Haskell libraries. This class allows you to lift a value into a context; for parsers, we build a parser that always succeeds, with the specified value as its result, and consuming zero tokens.

  
class Pointed' f where
  pure :: a -> f a

instance Pointed' (Parser s) where
  pure a = Parser (\xs -> Just (xs, a))

Examples:

pass :: Parser Integer String
pass = pure "Hello, world!"


$ run pass []
Just ([],"Hello, world!")

$ run pass [1,100,31]
Just ([1,100,31],"Hello, world!")
The parser 'pass' always succeeds, even with empty input; it simply returns its input token stream along with its value.

Mapping and sequencing

It's also useful to have access to a parser's value for further processing; a common use case is building up a parse tree. This concept is captured by the Functor class, which lifts a normal function to a function that operates on the result value of a parser. The parser interpretation is that, given a function and a parser, if the parser succeeds, map the function over its results; whereas if the parser fails, just propagate the failure.

class Functor' f where
  fmap :: (a -> b) -> f a -> f b

instance Functor' (Parser s) where
  -- one 'fmap' for the Maybe, one for the tuple
  fmap f (Parser g) = Parser (fmap (fmap f) . g)

The Applicative class enables not just lifting, but application in which both the function and its arguments are in contexts. It allows parsers to be run in sequence, where the first parser is run, and if it fails, the whole chain fails; if it succeeds, the rest of the token stream is passed to the next parser and its result is collected, and so on. This implementation makes use of the Monad instance of Maybe, although it could also be implemented without such an assumption.

class Functor' f => Applicative' f where
  (<*>) :: f (a -> b) -> f a -> f b

instance Applicative' (Parser s) where
  Parser f <*> Parser x = Parser h
    where
      h xs = f xs >>= \(ys, f') -> 
        x ys >>= \(zs, x') ->
        Just (zs, f' x')
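As noted, the same behavior can be written without leaning on Maybe's Monad instance; here's an equivalent sketch using explicit pattern matching (the helper name 'applyP' is just for illustration):

applyP :: Parser s (a -> b) -> Parser s a -> Parser s b
applyP (Parser f) (Parser x) = Parser h
  where
    h xs = case f xs of
      Nothing       -> Nothing               -- the function parser failed
      Just (ys, f') -> case x ys of
        Nothing       -> Nothing             -- the argument parser failed
        Just (zs, x') -> Just (zs, f' x')    -- both succeeded: apply and keep the rest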

Here are some examples:

one :: Parser Char Char
one = literal '1'
oneInt :: Parser Char Int
oneInt = fmap (\x -> (read :: String -> Int) [x .. '9']) one

$ run oneInt "123"
Just ("23",123456789)

two :: Parser Char Char
two = literal '2'
twelve :: Parser Char (Char, Char)
twelve = pure (,) <*> one <*> two

$ run twelve "123"
Just ("3",('1','2'))

$ run twelve "1123"
Nothing

The first example shows a Char parser ('one') that is converted into an Int parser using 'fmap' and a function of type 'Char -> Int'. The second example applies the '(,)' function within an Applicative parser context, tupling the results of the parsers 'one' and 'two'. The third example shows that parsers run in sequence must all succeed for the entire match to succeed; although the '1' is matched, the '2' cannot be.

The power of Applicative parsers can also be harnessed to create parsers that ignore the results (but not the effects!) of some or all of their parsers:

(*>) :: Parser t a -> Parser t b -> Parser t b
l *> r = fmap (flip const) l <*> r 

(<*) :: Parser t a -> Parser t b -> Parser t a
l <* r = fmap const l <*> r
Both '(*>)' and '(<*)' will only succeed if both of their arguments succeed in sequence; the difference is that '(*>)' only returns the result of the 2nd parser, while '(<*)' only returns the result of the 1st parser. Examples, using the 'one' and 'two' parsers defined above:
$ run (two *> one) "212345"
Just ("2345",'1')

$ run (two <* one) "212345"
Just ("2345",'2')

Combining Applicatives with Semigroups, we can create repeating parsers:

many :: Parser t a -> Parser t [a]
many p = some p <|> pure []

some :: Parser t a -> Parser t [a]
some p = fmap (:) p <*> many p
(note that 'some' and 'many' are mutually recursive). 'many' tries to run its parser as many times as possible, progressively chewing up input; it always succeeds since it's fine with matching 0 times. On the other hand, 'some' matches its parser at least once, failing if it can't match it at all, but other than that is identical to 'many'. Examples (using 'one' from above):
$ run (fmap length $ many one) "111111234"
Just ("234",6)
$ run (many one) "23434593475dkljdfs"
Just ("23434593475dkljdfs","")

$ run (fmap length $ some one) "111111234"
Just ("234",6)
$ run (some one) "23434593475dkljdfs"
Nothing

Negations

Oftentimes, parsing conditions are easier to state in the negative than in the positive. For instance, if you were parsing a string, you might look for a double-quote character to open the string, and another double-quote to end the string. Meanwhile, anything that's *not* a double-quote which comes after the opening will be part of the string. To capture this pattern, I created the 'Switch' class:

class Switch' f where
  switch :: f a -> f ()

instance Switch' (Parser s) where
  switch (Parser f) = Parser h
    where h xs = fmap (const (xs, ())) $ switch (f xs)
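The Parser instance above delegates to a Switch' instance for the underlying Maybe value, which isn't shown in the post; a minimal sketch of what it might look like:

instance Switch' Maybe where
  switch Nothing  = Just ()   -- a failure becomes a success carrying ()
  switch (Just _) = Nothing   -- a success becomes a failure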
The 'switch' combinator converts a failing parser into a successful one (whose result is ()) and vice versa. Importantly, it consumes no input from the token stream -- it acts as a negative lookahead, which allows us to build flexible parsers on top of it. Examples:
not1 :: Parser t b -> Parser t t
not1 p = switch p *> getOne

dq :: Parser Char Char
dq = literal '"'

not_dq :: Parser Char Char
not_dq = not1 dq

dq_string :: Parser Char String
dq_string = dq *> many not_dq <* dq

$ run dq_string "\"no ending double-quote"
Nothing

$ run dq_string "\"I'm a string\"abcxyz"
Just ("abcxyz","I'm a string")
The 'not1' combinator takes a parser as input, runs that parser, and if it succeeds, 'not1' fails; if that parser fails, 'not1' then tries to consume a single token (any token). In other words, it's like saying "I want anything but what that parser matches".

The 'not_dq' parser matches any character that's not a double-quote; the string parser matches a double-quote followed by any number of non-double-quotes, followed by another double-quote; it throws away the results of both double-quote parsers, only returning the body of the string.

Running many parsers in sequence

Traversable is an interesting type class. It allows you to 'commute' two functors; i.e. if you have '[Maybe Int]', it allows you to create 'Maybe [Int]' (that is, turn a list of 'Maybe Int's into a 'Maybe' of a list of Ints). This is also useful for parsing, where it allows one to convert a list of parsers into a (single) parser of lists. In this case, we don't need to supply an instance for 'Parser', because the type being traversed is the list:
class Functor' t => Traversable' t where
  commute :: (Pointed' f, Applicative' f) => t (f a) -> f (t a)
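The examples below rely on the list instance (along with a corresponding Functor' instance for lists), which the post doesn't spell out; a minimal sketch:

instance Traversable' [] where
  -- run the actions left to right, collecting their results in a list
  commute = foldr (\x acc -> fmap (:) x <*> acc) (pure [])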
Here are some examples (using 'digits' from above):
six_fours :: [Parser Char Char]
six_fours = replicate 6 (literal '4')

$ run (commute digits) "0123456789abcxyz"
Just ("abcxyz","0123456789")

$ run (commute six_fours) "4444449999999"
Just ("9999999","444444")
$ run (commute six_fours) "44444 oops that was only 5 fours"
Nothing

Monads

What parsing article could be complete without mentioning monads? Monads are similar to applicatives, but add the extra ability to have computations depend on the result of previous computations. Here's the class definition and parser implementation:

class (Applicative' m, Pointed' m) => Monad' m where
  join :: m (m a) -> m a 

instance Monad' (Parser s) where
  join (Parser f) = Parser h
    where
      h xs = f xs >>= \(o, Parser g) -> g o  
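Both 'check' above and the examples below use '(>>=)' rather than 'join' directly; assuming the standard correspondence between the two, bind can be recovered from 'join' and 'fmap':

(>>=) :: Monad' m => m a -> (a -> m b) -> m b
m >>= f = join (fmap f m)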
A good example of putting this extra power to work is this combinator:
twice :: Eq a => Parser a a -> Parser a a
twice p = p >>= \x ->
  literal x
It runs its input parser, and if it succeeds, attempts to match the *same* output a second time. Thus, the second match depends on the results of the first. We can't build such a parser using applicatives (although we can build less general versions by enumerating multiple cases). Here's an example showing how it's different from an Applicative version, using the 'ab' parser from earlier:
ab_twice :: Parser Char Char
ab_twice = twice ab

-- using monads
$ run ab_twice "aa123"
Just ("123",'a')
$ run ab_twice "ab123"
Nothing

-- using applicatives
$ run (pure (,) <*> ab <*> ab) "aa123"
Just ("123",('a','a'))
$ run (pure (,) <*> ab <*> ab) "ab123"
Just ("123",('a','b'))
In the first example, which uses monadic parsing, 'ab_twice' parses the first input and fails on the second. However, the second example -- with applicatives -- successfully parses both inputs. It sees the two parsers as being totally independent of each other and thus isn't able to require that the second one match the same tokens as the first one.

Relationship to BNF grammars, regular expressions, etc.

Of course, all of these useful parsing combinators have also been applied in other parsing approaches, such as grammars and regular expressions. Here's a quick correspondence:
BNF/regex     combinators
|             <|> of semigroups
sequencing    <*> of applicatives
*             many
+             some
grouping      always explicitly grouped

What's next & further reading

There are a few topics that weren't covered in this article. First and foremost, good error detection and reporting is a key component of a parser library that's friendly and easy to use. Second, although I chose to use the Maybe data type to model the results, this could be extended to use any arbitrary monad -- resulting in a much richer set of parsers. Two examples are the list monad, to allow non-deterministic parses, and the state monad, to allow context-dependent parses.
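As a sketch of that generalization (the 'ParserT' name here is hypothetical, not taken from any particular library), the result type could simply be made a parameter:

newtype ParserT m t a = ParserT {
        getParserT :: [t] -> m ([t], a)
    }
-- ParserT Maybe t a  behaves like the Parser in this article
-- ParserT []    t a  would allow non-deterministic (multiple) parses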

If you're interested in learning more about parsing, Philip Wadler, Graham Hutton, and Doaitse Swierstra have published some excellent papers on the topic over the years; reading their papers is what really helped me to understand parsing. And of course there's also Parsec, a powerful Haskell parser combinator library that illustrates these ideas in a practical context.

Wednesday, October 10, 2012

A Haskell library for relational algebra

Why Haskell? Why relational algebra?

If you've never heard of Haskell before, and you like programming languages, you might want to check it out. It's a functional, statically-typed, type-inferred, elegant language in the ML family. The most important benefit that I've received from working with Haskell is a much better understanding of the static typing discipline, and the practice of designing a program or a library by figuring out a few concepts, capturing them as types -- either data or functions -- and building the rest of the code around them.

Relational algebra (RA) is a powerful mathematical tool for working with sets and functions on sets. Most common database products implement some flavor of SQL, a practical and standard language which contains many constructs from RA. Using SQL, a programmer can easily accomplish many complex data querying, manipulation, and transformation operations.

Unfortunately, applying RA often necessitates the use of a database. For whatever reason, there seem to be very few 'pure' RA libraries available for common languages. Yet there are enormous benefits to be gained from a standalone RA library instead of a database: 1) no coupling to a database product or its failure modes; 2) one less dependency for a deployed program; 3) code can mix RA with general-purpose code, enhancing the effectiveness and efficiency of RA.

This post discusses the basic design and implementation of an RA library in Haskell.

Data model

The basic data types in RA are tuples, fixed-size units of n primitive values, and relations, or unordered sets composed of multiple named n-tuples of the same type. To distinguish RA tuples from Haskell tuples, I'll call them 'rows' for the rest of this post. Here's an example:

(first name: "Matthew", country: "USA", age: 25) <-- a named 3-tuple

a relation:

 first name | country | age   <-- the schema
-----------------------------
 "Matthew"  |  "USA"  |  25   <-- a tuple
  "Jimbo"   | "Spain" |  32   <-- another tuple of the same type
 "Jessica"  |  "USA"  |  27   <-- a third compatible tuple

However, the Haskell data model is a little bit different. Instead of restricting our row representation to just Haskell n-tuples, we'll let *any type* be a row, as long as we can compare any two instances of it for equality and ordering. And instead of using sets, we'll use lists -- Haskell has a large number of functions for working with lists, so it's a lot more convenient to use them than sets (although lists are intrinsically ordered and do allow duplicates; we can ignore the first problem by never assuming the ordering is meaningful, but the second is potentially more dangerous). So what we have is:

-- the type of a relation:
:: (Eq a, Ord a) => [a]

-- and some examples of relations:
r1 :: Num a => [a]
r1 = [1,2,3]

r2 :: Num a => [(String, a)]
r2 = [("Matt", 1), ("Kevin", 14)]

r3 :: [a]
r3 = [] 

r4 :: [[String]]
r4 = [["abc", "xyz"], ["ghi"], []]
Note how the types of the rows can be almost anything -- numbers, Haskell tuples, polymorphic, or even lists.

Primitive RA operators

Rename: change attribute names, without changing any values. This operator is meaningless in the Haskell version, since it's not restricted to named tuples.

(name: "Matt", total: 32) -> (first name: "Matt", sum: 32)

Projection: apply a function to each tuple in a relation. Unfortunately, this could result in duplicates for either of two reasons: 1) the input contained duplicates, or 2) the mapping function created duplicates. The 2nd case has to be dealt with.

$ project (length . fst) [("Matt", 30), ("Bob", 22), ("Jimbo", 39), ("Sarah", 28)]
[4,3,5]

Row selection: select some rows, discarding the rest; none of the rows are changed.

$ rfilter (\(name, age) -> age >= 30) [("Matt", 30), ("Bob", 22), ("Jimbo", 39), ("Sarah", 28)]
[("Matt",30),("Jimbo",39)]

Cartesian product: combine two relations of size m and n rows respectively, resulting in a relation of size (m * n) rows, where an output row consists of a row from each of the input tables glued together. SQL and RA typically restrict result sets to *flat* tuples; since we've already gotten rid of this restriction, we're free to allow this to produce nested tuples.

$ rproduct "abc" [1..4]
[('a',1),('a',2),('a',3),('a',4),('b',1),('b',2),
 ('b',3),('b',4),('c',1),('c',2),('c',3),('c',4)]

Union: given two relations of the same type with m and n rows respectively, combine them by removing duplicates, resulting in a new relation of the *same* type with no more than (m + n) rows.

$ union [1..10] [5..15]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

Difference: given two relations of the same type, remove all elements found in the second relation from the first relation.

$ difference [1..10] [5..15]
[1,2,3,4]

Here are the Haskell implementations (note that these are by no means as efficient as possible):
import Data.List (nub)
import Control.Monad (liftM2)

project :: Eq b => (a -> b) -> [a] -> [b]
project f = nub . map f

rfilter :: (a -> Bool) -> [a] -> [a]
rfilter = filter
    
rproduct :: [a] -> [b] -> [(a, b)]
rproduct = liftM2 (,)

intersect :: Eq a => [a] -> [a] -> [a]
intersect xs ys = filter (\x -> x `elem` ys) xs

union :: Eq a => [a] -> [a] -> [a]
union xs ys = nub (xs ++ ys)

difference :: Eq a => [a] -> [a] -> [a]
difference r1 r2 = filter (\x -> not $ elem x r2) r1

Extending the library with some useful SQL operators

Inner, outer, and left (outer) joins: these combine a cartesian product operation with a filtering operation. Outer joins are augmented inner joins: there is one additional result row for each unmatched row on one or both sides. They require a default value of each row type to combine with the unmatched rows -- we're avoiding 'NULL's (although we could use Just/Nothing instead).

predicate l r = fst l == fst r
left = [(1, "Matt"), (2, "Jackie"), (3, "Gilligan")]
right = [(1, "hammer"), (1, "saw"), (1, "screwdriver"), (3, "boat"), (4, "wrench")]

$ innerJoin predicate left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((3,"Gilligan"), (3,"boat"))]

$ leftJoin predicate (0, "nothing") left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((2,"Jackie"),   (0,"nothing")),    <--- an extra row!!
 ((3,"Gilligan"), (3,"boat"))]

$ outerJoin predicate (0, "nobody") (0, "nothing") left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((2,"Jackie"),   (0,"nothing")),
 ((3,"Gilligan"), (3,"boat")),
 ((0,"nobody"),   (4,"wrench"))]    <--- another extra row!!

Grouping, group processing, and aggregation: it's often useful to separate a relation into groups based on values of certain attribute(s), and then to continue processing with the grouped data.

-- group some words by their length
$ let g1 = groupBy length $ words "this is an article for my blog that I hope is interesting"
[(1,["I"]),       (2,["is","my","an","is"]),
 (3,["for"]),     (4,["hope","that","blog","this"]),
 (7,["article"]), (11,["interesting"])]

-- the first letters of words, in each group
$ groupLift (project head) g1
[(1,"I"), (2,"ima"),
 (3,"f"), (4,"htb"),
 (7,"a"), (11,"i")]

-- the number of words, in each group
$ groupLift length g1
[(1,1), (2,4),
 (3,1), (4,4),
 (7,1), (11,1)]

-- the unique letters, in each group
$ groupLift (nub . concat) g1
[(1,"I"),       (2,"ismyan"),
 (3,"for"),     (4,"hopetablgis"),
 (7,"article"), (11,"intersg")]

-- aggregation ignoring groups
$ aggregate (length . snd) sum g1
12

-- aggregation within groups
$ groupLift (aggregate (ord . head) sum) g1
[(1,73),  (2,416),
 (3,102), (4,434),
 (7,97),  (11,105)]

And the implementations:
{-# LANGUAGE ScopedTypeVariables #-}  -- for the inner type signatures in leftJoin

import Data.Map (Map, fromList, insert, lookup, toList)
import Data.Tuple (swap)
import Prelude hiding (lookup)

innerJoin :: (a -> b -> Bool) -> [a] -> [b] -> [(a, b)]
innerJoin f ls rs = rfilter (uncurry f) (rproduct ls rs)

leftJoin :: forall a b. (a -> b -> Bool) -> b -> [a] -> [b] -> [(a, b)]
leftJoin p null rl rr = concatMap f rl
  where 
    -- go through all the a's 
    --   match each a with all b's
    --   if no matches, match it with the default
    --   otherwise keep all matches
    f :: a -> [(a, b)]
    f a = map ((,) a) $ addNull $ filter (p a) rr
      where
        addNull :: [b] -> [b]
        addNull [] = [null]
        addNull bs = bs

outerJoin :: (Eq a, Eq b) => (a -> b -> Bool) -> a -> b -> [a] -> [b] -> [(a, b)]
outerJoin p anull bnull as bs = union left right
  where 
    left = leftJoin p bnull as bs
    right = project swap $ leftJoin (flip p) anull bs as

groupBy :: (Ord b) => (a -> b) -> [a] -> [(b, [a])]
groupBy f rel = toList grouped
  where
    grouped = foldl f' (fromList []) rel
    f' mp next = addRow (f next) next mp
    -- check whether the key's already in the map:
    -- if it is, stick 'next' on the existing list
    -- if not, create a new, single-element list for that key
    addRow :: Ord k => k -> v -> Map k [v] -> Map k [v]
    addRow k v mp = case lookup k mp of    
                        (Just oldval) -> insert k (v:oldval) mp;  
                        _ -> insert k [v] mp;  

groupLift :: ([a] -> c) -> ([(b, [a])] -> [(b, c)])
groupLift f = map (fmap f)  

aggregate :: (a -> b) -> ([b] -> c) -> [a] -> c
aggregate proj f = f . map proj

Some notes about the design goals of the library

I was shooting for flexibility, simplicity, and minimality, not efficiency. Thus, the library lacks many of the 'composite operators' that SQL has. For instance, a SQL join both joins and projects in one operation; in this library, you'd do the join, then the projection separately (see the sketch below) ... maybe less efficient, but more expressive and compositional. Another example is that it lacks things like equi-joins that would make joining more efficient; it doesn't have them because that case is already covered by inner joins.
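As a small sketch of that 'join, then project' style, reusing 'innerJoin', 'project', and the 'predicate'/'left'/'right' relations from the join examples above (the name 'toolsByPerson' is purely illustrative):

-- join the two relations, then project each pair down to (person, tool)
toolsByPerson :: [(String, String)]
toolsByPerson = project (\(l, r) -> (snd l, snd r)) (innerJoin predicate left right)
-- [("Matt","hammer"),("Matt","saw"),("Matt","screwdriver"),("Gilligan","boat")]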

Tuesday, October 9, 2012

The limitations of typical relational algebra and SQL

A brief review of SQL and RA

Many programmers are familiar with some flavor of Structured Query Language (SQL), used for querying and administering many common database products, and with relational algebra (RA), which provides the mathematical underpinnings of SQL and relational databases as we currently know them. Relational algebra deals with relations composed of tuples, and with operations for transforming and combining them. Here is an example of a relation:
  name  |  weight
 -----------------
  Mary  |   110
  Kyle  |   165
  Kevin |   205
  Lucy  |   120
In this relation, there are two columns. Each of the four rows has a value for each column; the rows are tuples. Furthermore, RA defines six basic operators: projection, rename, product, selection, union, and difference. From these, many more complicated and more useful operators can be derived. SQL is a query language heavily based on RA that makes it very convenient to work with relationally structured data. However, both RA and SQL have a number of shortcomings which make some things difficult and obtuse to accomplish. I'll spend the rest of this post discussing some of these things.

Unable to step outside of SQL/RA

SQL and RA are great for working purely with relations, but are otherwise severely limited. For example, what if you need to generate a list of numbers from 1 to 10 for a filter predicate? It's possible, but it's not easy. Or what if you need to create a datatype with an API that disallows certain values? That's going to be nearly impossible. Or what if you have to rank all tuples in a relation? The standard, pure SQL solution involves an O(n^2) self-join, which is absurd considering it's at worst an O(n log n) problem. Unfortunately for SQL/RA, such situations -- in which they are not sufficient or particularly well-suited to the task -- are encountered very often, and the resulting solutions are necessarily work-arounds, which makes them harder to debug, understand, and maintain.

Tuples have to be flat

The relational model of data views tuples as being constructed out of n values of primitive types, where n >= 1, and the primitive types are typically numbers, booleans, strings, and times. This model is important for data storage, as its flat nature allows any data to be immediately obtained without any 'digging' through hierarchical layers. However, when creating complex queries, strict adherence to flat tuples is more painful than helpful. For example, a 'group by' query returns one row per group, each row typically including the grouped-on column(s) and aggregate values calculated for some other column(s). However, this is far less useful than allowing a list of all the rows in that group, as is clear when one tries to get the 'top n' rows from each group:

  student | score
 ------------------
   Jake   |  84
  Blake   |  93
   Jake   |  62
   Jake   |  79
  Blake   |  93
  Blake   |  84

-- goal: calculate the average of the top two scores for each student

-- first, let's group by 'student'
  student | scores
 ------------------
   Jake   | [84, 62, 79]
  Blake   | [93, 93, 84]

-- now take the top two
  student | top 2 scores
 ------------------
   Jake   | [84, 79]
  Blake   | [93, 93]

-- and average them
  student | average
 ------------------
   Jake   |   81.5
  Blake   |    93   

This is a straightforward problem to solve, if richer tuple types are allowed, but as they are not, the solutions become convoluted and obtuse. Again, I'd just like to emphasize that the flat model is great for data storage, but unnecessarily restricting when it comes to building queries.
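As a sketch of how little work this takes with richer tuples, here's roughly what it might look like using the 'groupBy' and 'groupLift' combinators from the relational-algebra library post above (the names and types here are illustrative):

import Data.List (sortBy)

-- average of each student's top two scores
topTwoAverage :: [(String, Double)] -> [(String, Double)]
topTwoAverage scores = groupLift avgTop2 (groupBy fst scores)
  where
    avgTop2 rows = average (take 2 (sortBy (flip compare) (map snd rows)))
    average xs   = sum xs / fromIntegral (length xs)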

User-defined abstractions are limited

If you've read SICP, you may remember the authors stating that the three main aspects of programming are primitives, means of abstraction, and means of composition. SQL and RA certainly have primitives, and composition isn't impossible, but they both fall flat on abstraction. Users are unable to create single-valued functions, aggregate functions, table-valued functions, or to encapsulate a table manipulation as a single procedure. (For example, most if not all database products lack a 'divide by' operator; relational division is a complex, multi-step procedure that is very difficult to get right. It can't be implemented as a user-defined function, and thus must be painstakingly reimplemented time and again ... if it were available as a simple library, imagine how easy it would be to use.) The lack of easily user-defined string-processing functions is often bothersome, and results in gigantic and unwieldy built-in libraries of obtuse string-processing functions; if users could define and share such functions, those built-in libraries would be unnecessary. The basic -- and only? -- unit of abstraction is a view, but views are not parameterizable, and thus can't be reused to perform a similar operation on multiple tables. SQL/RA also lack common organization features found in modern programming languages, such as modules, classes, and namespaces.

SQL and RA are typically first order

I've seen this mentioned in a few places, and the poster-child example is that you can't compute transitive closure in pure SQL/RA. I've also read that the lack of recursion is limiting, and that relations aren't first-class values. As I don't completely understand these points, I'll wait until later to flesh out this section.

Conclusion

To be fair, many of these problems have already been addressed by widely used database products. The solutions typically involve embedding SQL in some kind of general purpose procedural or imperative language, which is then harnessed to create useful abstractions and extend SQL's functionality. I think this is a good approach and will continue to become more popular as the benefits of tight integration between non-procedural SQL-like code and procedural code are better understood. SQL and RA are awesome tools for working with structured data and sets. But by themselves, they're not anywhere near as powerful as they can be when embedded in a general-purpose programming language.

Friday, May 25, 2012

Useful git commands: part 2

In this second part of the article, we'll see more of the git commands that I find most useful, covering remotes, tags, and rebasing:

Remote

# see remotes
git remote -v

# make a new remote
git remote add [name] [location]

# get data from a remote -- everything that you don't yet have locally
git fetch [remote]

# show info about a remote
git remote show [remote]

# rename a remote (i.e. change the local alias -- doesn't change anything on the actual remote server)
#   also renames remote branch names
git remote rename [old] [new]
 
# remove a remote
git remote rm [remote]

Tags

# see all tags
git tag

# see tags matching a pattern
#   the pattern is a shell glob and should be quoted, e.g. git tag -l 'v1.*'
git tag -l [pattern]

# make a lightweight tag
git tag [tagname]

# make an annotated tag
#   '-a' creates a full tag object that stores the tagger, date, and message
git tag -a [tagname] -m 'commit message'
 
# show a single tag
git show [tagname]

# make a signed tag
git tag -s [tagname] -m 'commit message'

# verify a signed tag
#   requires the signer's public key in your GPG keyring
git tag -v [tagname]
 
# make a tag of a previous commit
git tag -a [tagname] [commit's sha-1 checksum]

# push one tag to a remote
git push [remote] [tagname]

# push all tags to a remote
git push [remote] --tags

Rebasing

# rebase the current branch onto another branch
git rebase [base-branch]

# rebase a branch onto another branch
git rebase [base-branch] [rebasee-branch]

# typical rebasing workflow:
#   switch to rebasee
git checkout [rebasee-branch]
#   put rebasee on base
git rebase [base-branch]
#   switch to base
git checkout [base-branch]
#   fast-forward merge rebasee into base 
git merge [rebasee-branch]

# transplant commits: replay the commits [rebasee-branch] has since [upstream-branch] onto [base-branch]
git rebase --onto [base-branch] [upstream-branch] [rebasee-branch]

Tuesday, May 22, 2012

Useful git commands: part 1

Here are some of the git commands that I find most useful, covering configuration, staging, and history:

Configuration

# change settings for all users
#   affects file in /etc/gitconfig
git config --system [section].[attribute] [value]

# change settings for one user
#   affects file in $HOME/.gitconfig
git config --global [section].[attribute] [value]

# change settings for one project
#   affects file in [projectroot]/.git/config
git config [section].[attribute] [value]

# change editor
git config core.editor emacs

# check all settings
git config --list

# check specific setting
git config [section].[attribute]

Staging area

# remove file from staging area
git rm --cached [file]

# see diff of files in staging area
git diff --staged

# see tracked files
git ls-files

# see information about tree-ish objects
#   a tree-ish is anything that resolves to a tree: a tree, a commit, or a tag pointing to one
git ls-tree [sha-1] [path]

History

# see a "pretty" commit history
#   setting: (oneline|short|full|fuller|format:"format string").  note that the --pretty option can interfere with other log options
git log --pretty=[setting]

# see only a limited number of commits
#   accepts any positive integer
git log -[n]

# show diffs with each commit
git log -p
 
# see history of a specific file
#   unlike plain 'git log [file]', --follow continues the history across renames
git log --follow [file]

# see a branch graph
git log --graph

# display file statistics with each commit:  number of additions, deletions, etc. 
#   see also --shortstat ??
git log --stat

# more log options
git log (--name-only|--name-status|--abbrev-commit|--relative-date|--since|--until|--grep)

# make *all* criteria match
#   instead of *any* predicate causing a match, *all* predicates must be true to match
git log --all-match [... conditions ...]

# show only/don't show merge commits
git log --merges
git log --no-merges

Miscellaneous

# amend previous commit.  uses staging area for commit
#   replaces the previous commit with a new one (new sha-1); add -m to change the commit message
git commit --amend
 
# make a command alias
#   single quotes; they're needed when the aliased command contains spaces
git config --global alias.[thealias] ['the original command']
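# examples (both are standard, known-good aliases):
#   'git co' as shorthand for checkout
git config --global alias.co checkout
#   'git last' to show the most recent commit
git config --global alias.last 'log -1 HEAD'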

Monday, April 30, 2012

Why is Clojure hard to learn?

I've spent a good deal of time in the last few months using Clojure, and my experiences have been pleasant. As a Lisp dialect, it's automatically a very interesting programming language. But even better, it's also a JVM language, and benefits from very tight integration with the JVM and with Java code itself.

However, all is not perfect in the Cloverse. Clojure has a very steep learning curve. While for the masters, it can be a powerful, supple, flexible tool, for the rest of us, it's esoteric, foreign, and opaque.

Why is this?

My belief is that this difficulty is, to a large degree, caused by the natural structure and organization of Clojure code, which is very different from the natural structure and organization of projects in mainstream programming languages. I will differentiate between four models of code organization, and show how Clojure's is hardest for the novice (but not for the master).

language | type system | paradigm        | list of methods/functions an object supports | methods/functions are typed | interactive access to docs       | easy-to-use REPL
Java     | static      | object-oriented | yes: compile-time (IDE autocompletion)       | yes                         | yes: IDE feature                 | no
Python   | dynamic     | object-oriented | yes: run-time (limited IDE support possible) | no                          | yes: interactive `help` function | yes
Haskell  | static      | functional      | no                                           | yes                         | yes (limited)                    | yes
Clojure  | dynamic     | functional      | no                                           | no                          | yes                              | yes

Analysis/interpretation

What's missing? Note that with both Haskell and Clojure, since they're not object-oriented, it's not possible to easily find all of the functions that can be invoked on an object. Why not? In object-oriented languages, the most useful methods are members of a class or object, and can be found "through" the object. Functions, on the other hand, are not organized as "belonging" to an object; a function accepting an X could appear in any module or file (and indeed, it may make sense for it to). Note, though, that methods can also live in other classes (think 'util' or 'helper' classes), and such an organization can be difficult to grok.

This problem is mitigated to a certain extent in Haskell because functions are statically typed, thus, given the type of an object, all relevant functions can be looked up (check out Hoogle for an example).

There's no common REPL for Java (that I know of); I count this as a major negative for Java, because it makes it much harder to interact with code. However, this is more than offset by the fantastic IDEs, such as Eclipse and NetBeans, that have been created to help manage Java code bases. The key features of these programs are interactive access to lists of applicable methods, documentation, imports, automatic refactoring ...

Clojure enjoys neither Haskell's typing, which provides an important modicum of documentation, nor the luxury of specialized IDEs. Thus it's very difficult to find all the functions that an object supports.

How can this be fixed? I don't know, but I think the key missing piece is access to relevant information from within the programming environment, whether it be applicable functions or docs. When someone figures out an effective way to implement this, expect Clojure to become very popular.

Summary

The problem that I believe Clojure is facing, and that likely all languages face in their infancy, is how to lower the barrier to entry, and make it easy for newcomers to effectively learn to use the language. An important aspect of this is how the information contained in function, object, and module documentation is accessed, indexed, and searched by the programmer. Clojure is lacking in this area, and therefore presents difficulties to newcomers.

Thursday, April 26, 2012

Relational division

Intuitive meaning

Relational division is one of the core operators in relational algebra. Using it, one can answer questions such as "What are all of the Xs that have/did/associate with all of these Ys?"

Examples

Example 1: what are all the combinations of (n1, n2) that have values of (1, 1) and (1, 2) for (n3, n4)?
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  1 |  1 |  3 |
|  1 |  1 |  1 |  4 |
|  1 |  1 |  2 |  1 |     +----+----+     +----+----+
|  1 |  1 |  2 |  2 |     | n3 | n4 |     | n1 | n2 |
|  1 |  1 |  2 |  3 |  %  +----+----+  =  +----+----+
|  1 |  2 |  1 |  1 |     |  1 |  1 |     |  1 |  1 |
|  1 |  2 |  1 |  2 |     |  1 |  2 |     |  1 |  2 |
|  1 |  2 |  2 |  3 |     +----+----+     +----+----+
|  1 |  2 |  2 |  4 |
|  1 |  3 |  1 |  3 |
|  1 |  3 |  1 |  4 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
|  2 |  2 |  2 |  1 |
|  2 |  2 |  2 |  2 |
+----+----+----+----+
Example 2: what are all the combinations of (n1, n2) that have values of (1, 1) for (n3, n4)?
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  1 |  1 |  3 |
|  1 |  1 |  1 |  4 |
|  1 |  1 |  2 |  1 |     +----+----+     +----+----+
|  1 |  1 |  2 |  2 |     | n3 | n4 |     | n1 | n2 |
|  1 |  1 |  2 |  3 |  %  +----+----+  =  +----+----+
|  1 |  2 |  1 |  1 |     |  1 |  1 |     |  1 |  1 |
|  1 |  2 |  1 |  2 |     +----+----+     |  1 |  2 |
|  1 |  2 |  2 |  3 |                     |  2 |  1 |
|  1 |  2 |  2 |  4 |                     +----+----+
|  1 |  3 |  1 |  3 |
|  1 |  3 |  1 |  4 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
|  2 |  2 |  2 |  1 |
|  2 |  2 |  2 |  2 |
+----+----+----+----+

The code: template

Here's a template for relational division. Each numbered box must be filled according to the exact tables used and columns desired in the result:
select 1 from (
  select 
    2.*  -- don't forget the ".*" !!
  from 2
  inner join 3
  4
) as r
group by 5
having count(*) = (select count(*) from 3);
Explanation of the numbered boxes:
  1. columns in outermost select: set of dividend's columns that ARE NOT in the divisor. Also known as "quotient" columns
  2. dividend table: there should be no duplicates
  3. divisor table: its set of columns must form a PROPER subset of dividend's columns
  4. join condition: all divisor columns must compare equal to their corresponding dividend columns
  5. group by columns: same as 1
Note that the purpose of the "having" clause is to allow selection only of those groups with number of rows equal to the number of rows in divisor.

How it works: (Example 2 from above)

The tables below show, in order: the dividend; the divisor; Step 1: join dividend to divisor on divisor's columns (n3, n4); Step 2: group the join results by the quotient columns (n1, n2); Step 3: keep only the groups whose count equals the divisor's row count, and project onto the quotient columns.
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  1 |  1 |  3 |
|  1 |  1 |  1 |  4 |
|  1 |  1 |  2 |  1 | 
|  1 |  1 |  2 |  2 | 
|  1 |  1 |  2 |  3 | 
|  1 |  2 |  1 |  1 | 
|  1 |  2 |  1 |  2 | 
|  1 |  2 |  2 |  3 | 
|  1 |  2 |  2 |  4 |
|  1 |  3 |  1 |  3 |
|  1 |  3 |  1 |  4 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
|  2 |  2 |  2 |  1 |
|  2 |  2 |  2 |  2 |
+----+----+----+----+
+----+----+
| n3 | n4 |
+----+----+
|  1 |  1 |
|  1 |  2 |
+----+----+
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  2 |  1 |  1 |
|  1 |  2 |  1 |  2 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
+----+----+----+----+
+----+----+----------+
| n1 | n2 | count(*) |
+----+----+----------+
|  1 |  1 |        2 |
|  1 |  2 |        2 |
|  2 |  1 |        1 |
|  2 |  2 |        1 |
+----+----+----------+
+----+----+
| n1 | n2 |
+----+----+
|  1 |  1 |
|  1 |  2 |
+----+----+

The code: filled-in template

Using the tables shown in the examples above, and "dividend" and "divisor" as the names of the dividend and divisor tables, respectively, we can fill in the template:
select r.n1, r.n2 from (
  select 
    dividend.*
  from dividend
  inner join divisor
  using (n3, n4)
) as r 
group by r.n1, r.n2
having count(*) = (select count(*) from divisor);

Other methods

There is another popular way to do relational division, involving a double-negative ("... where not exists (where not in ...) ..."). Although others find it easier to understand, I do not. Also, one could use a code generator or stored procedure that parameterizes production of a SQL string, automatically filling in the template. I didn't use this approach because it opens the door to a host of SQL injection-related problems.
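For reference, here's a sketch of one nested-"not exists" version of that double-negative approach, using the same dividend and divisor tables as the filled-in template above:

select distinct d.n1, d.n2
from dividend as d
where not exists (
    select * from divisor as v
    where not exists (
        select * from dividend as d2
        where d2.n1 = d.n1 and d2.n2 = d.n2
          and d2.n3 = v.n3 and d2.n4 = v.n4));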

Wednesday, April 25, 2012

SQL join types

Introduction

There are many kinds of joins in SQL, and I often have trouble keeping them straight. This article divides the joins into pairs of opposites, giving a brief definition of each. For the rest of this article, we'll be joining two tables, which we'll call the "left table" and the "right table". We'll be using these tables:
create table rel1 (
  id int primary key, 
  letter varchar(1)
);

create table rel2 (
  id int primary key, 
  fk int, 
    foreign key (fk) references rel1(id),
  mychar varchar(1)
);

select * from rel1;
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  3 | c      |
|  4 | d      |
+----+--------+

select * from rel2;
+----+--------+--------+
| id |    fk  | mychar |
+----+--------+--------+
|  1 |      1 | z      |
|  3 |      4 | o      |
|  5 |      2 | m      |
|  8 |      2 | q      |
+----+--------+--------+
`rel1` will be used as the left table, and `rel2` as the right table unless otherwise noted.

Inner vs. outer

Inner joins return only matching pairs of rows between the two tables. Outer joins also return rows from one or both tables that match 0 rows from the other table. (Note that `on rel1.id = rel2.id` is the "join predicate" in these examples)
select * 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+

select * 
from rel1 
full outer join rel2 
  on rel1.id = rel2.id;
+------+--------+------+------+--------+
| id   | letter | id   | fk   | mychar |
+------+--------+------+------+--------+
|    1 | a      |    1 |    1 | z      |
|    2 | b      | NULL | NULL | NULL   |
|    3 | c      |    3 |    4 | o      |
|    4 | d      | NULL | NULL | NULL   |
| NULL | NULL   |    5 |    2 | m      |
| NULL | NULL   |    8 |    2 | q      |
+------+--------+------+------+--------+

Outer join: left and right outer join

Left outer joins return, in addition to the rows returned by the corresponding inner join, rows from the left table that don't match any rows in the right table. Right outer joins are identical to inner joins except that they also return unmatched rows from the right table.
select * 
from rel1 
left outer join rel2 
  on rel1.id = rel2.id;
+----+--------+------+------+--------+
| id | letter | id   | fk   | mychar |
+----+--------+------+------+--------+
|  1 | a      |    1 |    1 | z      |
|  2 | b      | NULL | NULL | NULL   |
|  3 | c      |    3 |    4 | o      |
|  4 | d      | NULL | NULL | NULL   |
+----+--------+------+------+--------+

select * 
from rel1 
right outer join rel2 
  on rel1.id = rel2.id;
+------+--------+----+------+--------+
| id   | letter | id | fk   | mychar |
+------+--------+----+------+--------+
|    1 | a      |  1 |    1 | z      |
|    3 | c      |  3 |    4 | o      |
| NULL | NULL   |  5 |    2 | m      |
| NULL | NULL   |  8 |    2 | q      |
+------+--------+----+------+--------+
Note that the union of a left join and a right join is a full outer join; their intersection is an inner join.

Semi vs. anti

"Semi"-joins return rows from the left table that match rows in the right table. "Anti"-joins return rows from the left table that don't match rows in the right table. Both of these join forms act as a sort of filter for the left table, since no columns from the right table are returned.
-- all rows in rel1 that have a matching row (by fk value) in rel2:
select * 
from rel1 
where id in (select fk from rel2);
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  4 | d      |
+----+--------+

-- all rel1 rows that don't have a matching row (by fk value) in rel2:
select * 
from rel1 
where id not in (select fk from rel2);
+----+--------+
| id | letter |
+----+--------+
|  3 | c      |
+----+--------+
Note that there are multiple ways to implement semi- and anti-joins in SQL:
-- "semi join" implemented using "inner join" and "distinct"
select distinct rel1.* 
from rel1 
inner join rel2 
  on rel1.id = rel2.fk;
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  4 | d      |
+----+--------+

-- "anti join" implemented using "left join" and "is null"
select distinct rel1.* 
from rel1 
left join rel2 
  on rel1.id = rel2.fk 
where rel2.id is null;
+----+--------+
| id | letter |
+----+--------+
|  3 | c      |
+----+--------+
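Another common formulation (not shown above) uses "exists"/"not exists", which also behaves more predictably than "not in" when the subquery can return NULLs; these return the same rows as the semi- and anti-join examples above:

-- "semi join" implemented using "exists"
select * 
from rel1 
where exists (select 1 from rel2 where rel2.fk = rel1.id);

-- "anti join" implemented using "not exists"
select * 
from rel1 
where not exists (select 1 from rel2 where rel2.fk = rel1.id);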

Equi vs. theta

"Equi"-joins are those where the join predicate is an equality. "Theta"-joins are those where the join predicate is an inequality.
-- equi-join
select * 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+

-- theta-join
select * 
from rel1 
inner join rel2 
  on rel1.id != rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  2 | b      |  1 |    1 | z      |
|  3 | c      |  1 |    1 | z      |
|  4 | d      |  1 |    1 | z      |
|  1 | a      |  3 |    4 | o      |
|  2 | b      |  3 |    4 | o      |
|  4 | d      |  3 |    4 | o      |
|  1 | a      |  5 |    2 | m      |
|  2 | b      |  5 |    2 | m      |
|  3 | c      |  5 |    2 | m      |
|  4 | d      |  5 |    2 | m      |
|  1 | a      |  8 |    2 | q      |
|  2 | b      |  8 |    2 | q      |
|  3 | c      |  8 |    2 | q      |
|  4 | d      |  8 |    2 | q      |
+----+--------+----+------+--------+
Note that the union of an equi-join and the corresponding theta-join on the negated predicate (here, = and !=) is a cartesian product, given the same tables.

Implicit vs. explicit

Explicit joins use one of the "join" keywords, such as "inner join" or "left join", combined with a join predicate:
select * from rel1 inner join rel2 on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+
Implicit joins do not use a "join" keyword, simply listing the tables in the "from" clause, and place the join predicate in the "where" clause:
select * from rel1, rel2 where rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+
Explicit joins are generally recommended, as they are thought to be more maintainable and more clearly express the programmer's intent. Note that this is purely a syntactic difference.

Cross join

This produces a cartesian product of the tables, joining every row in the left table with every row in the right table.
select * from rel1 cross join rel2;
-- equivalently: select * from rel1 inner join rel2 on 1;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  2 | b      |  1 |    1 | z      |
|  3 | c      |  1 |    1 | z      |
|  4 | d      |  1 |    1 | z      |
|  1 | a      |  3 |    4 | o      |
|  2 | b      |  3 |    4 | o      |
|  3 | c      |  3 |    4 | o      |
|  4 | d      |  3 |    4 | o      |
|  1 | a      |  5 |    2 | m      |
|  2 | b      |  5 |    2 | m      |
|  3 | c      |  5 |    2 | m      |
|  4 | d      |  5 |    2 | m      |
|  1 | a      |  8 |    2 | q      |
|  2 | b      |  8 |    2 | q      |
|  3 | c      |  8 |    2 | q      |
|  4 | d      |  8 |    2 | q      |
+----+--------+----+------+--------+

Natural join

The result is the set of rows, from the two tables, that have the same values for columns of the same name. If the tables share all column names, the result is an intersection. If the tables share no column names, the result is a cartesian product. The number of columns in the result is the number of distinct column names in the two tables.
select * 
from rel1 
natural join rel2;
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+

select * 
from rel1 
inner join rel2 
  using(id);
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+

select rel1.*, rel2.fk, rel2.mychar 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+
This join is not useful in practice, as it can be completely replaced by queries using (more easily understandable) inner joins. Additionally, it may produce non-intuitive results in that tables are not necessarily joined on foreign keys, which to my mind is the actual "natural" way to join tables. (Recall that rel1.id <=> rel2.fk was the foreign key relationship we defined).

Self join

This is just a special case of a join where the same table is used for both sides of the join. It can be combined with any of the preceding "join" forms.
select * 
from rel2 a 
left join rel2 b 
  on a.id < b.fk;
+----+------+--------+------+------+--------+
| id | fk   | mychar | id   | fk   | mychar |
+----+------+--------+------+------+--------+
|  1 |    1 | z      |    3 |    4 | o      |
|  1 |    1 | z      |    5 |    2 | m      |
|  1 |    1 | z      |    8 |    2 | q      |
|  3 |    4 | o      |    3 |    4 | o      |
|  5 |    2 | m      | NULL | NULL | NULL   |
|  8 |    2 | q      | NULL | NULL | NULL   |
+----+------+--------+------+------+--------+
Note the table aliases, used to distinguish between the two uses of the same table.

Summary

A convenient way to understand joins is in pairs, as I showed:
join type    opposite     major point of difference
inner        outer        do/don't keep non-matching rows
left outer   right outer  augment inner join with unmatched rows from left/right table
semi         anti         filter rows from a single table, keeping/dropping those that match rows in another table
equi         theta        join predicate is equality/inequality
implicit     explicit     syntax
Note that cross, natural, and self joins don't have opposites in this sense.

Sources

article about joins
Wikipedia's "join" article

Monday, April 23, 2012

MySQL: investigating performance with large tables

MySQL: performance of a large table

This article examines how well MySQL deals with a large (50 million row) table, covering indexes, inserts, selects, deletes, and updates.

The set-up

MySQL version:                  5.5.11
operating system:               Mac OS X 10.5.8
processor:                      2 x 2.8 GHz Quad-Core Intel Xeon
RAM:                            6 GB 800 MHz DDR2
hard drive:                     300 GB, about 75% free
programming language:           Clojure
MySQL driver:                   JDBC
number of rows:                 50 million
approximate average row size:   115 bytes
Here's the schema:
create table mediumtable (
 id                 int primary key auto_increment,
 decimalfield1      decimal(10,2),
 indexedstringfield varchar(10),
   index isf (indexedstringfield(4)),
 indexfield1        int,
 indexfield2        int,
   index if1if2 (indexfield1, indexfield2)
);


create table largetable (
 id             int primary key auto_increment,
 intfield1      int,
 indexedintfield int,
   index iif (indexedintfield),
 stringfield    varchar(50),
 foreignkeyfield int,
   foreign key (foreignkeyfield) references mediumtable (id),
 indexfield1    int,
 indexfield2    int,
 indexfield3    int,
   index if1if2if3 (indexfield1, indexfield2, indexfield3),
 notnullint     int not null,
 notnullstring  varchar(20) not null
);

Building the database

I used Clojure to generate and insert 50 million rows for "largetable" and 50000 rows for "mediumtable". For each column, a random string or integer was generated depending on the type. Here's the code that was used:
(defn make-random-record
  []
  (let [f1 (rand-int 10000000)
        f2 (rand-int 1000)
        f3 (get-rand-string 40)
        f4 (+ 1 (rand-int 50005)) ;; because the keys are 1 to 50005 
        f5 (rand-int 100000)
        f6 (rand-int 100000)
        f7 (rand-int 100000)
        f8 (rand-int 100000000)
        f9 (get-rand-string 15)]
    {:intfield1       f1 
     :indexedintfield f2
     :stringfield     f3
     :foreignkeyfield f4
     :indexfield1     f5
     :indexfield2     f6
     :indexfield3     f7
     :notnullint      f8
     :notnullstring   f9}))

(defn insert-random-record
  """inserts n randomly generated records into 'largetable'"""
  [n]
  (with-connection db
   (dotimes [_ n]
    (insert-records "largetable" (make-random-record)))))
Row insertion, done with `(time (insert-random-record 50000000))`, took 33.5 hours.

Index statistics

mysql> select * from information_schema.statistics where table_schema = 'mysqlclojure';
+---------------+--------------+-------------+------------+--------------+-----------------+--------------+--------------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME  | NON_UNIQUE | INDEX_SCHEMA | INDEX_NAME      | SEQ_IN_INDEX | COLUMN_NAME        | COLLATION | CARDINALITY | SUB_PART | PACKED | NULLABLE | INDEX_TYPE | COMMENT | INDEX_COMMENT |
+---------------+--------------+-------------+------------+--------------+-----------------+--------------+--------------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| def           | mysqlclojure | largetable  |          0 | mysqlclojure | PRIMARY         |            1 | id                 | A         |    50000169 |     NULL | NULL   |          | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | iif             |            1 | indexedintfield    | A         |          18 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | foreignkeyfield |            1 | foreignkeyfield    | A         |       82372 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | if1if2if3       |            1 | indexfield1        | A         |      217392 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | if1if2if3       |            2 | indexfield2        | A         |    50000169 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | if1if2if3       |            3 | indexfield3        | A         |    50000169 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          0 | mysqlclojure | PRIMARY         |            1 | id                 | A         |       50242 |     NULL | NULL   |          | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          1 | mysqlclojure | isf             |            1 | indexedstringfield | A         |       50242 |        4 | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          1 | mysqlclojure | if1if2          |            1 | indexfield1        | A         |       25121 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          1 | mysqlclojure | if1if2          |            2 | indexfield2        | A         |       50242 |     NULL | NULL   | YES      | BTREE      |         |               |
+---------------+--------------+-------------+------------+--------------+-----------------+--------------+--------------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
10 rows in set (1.65 sec)


mysql> show indexes from largetable;
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name        | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| largetable |          0 | PRIMARY         |            1 | id              | A         |    50000169 |     NULL | NULL   |      | BTREE      |         |               |
| largetable |          1 | iif             |            1 | indexedintfield | A         |          18 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | foreignkeyfield |            1 | foreignkeyfield | A         |       65876 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | if1if2if3       |            1 | indexfield1     | A         |      245098 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | if1if2if3       |            2 | indexfield2     | A         |    50000169 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | if1if2if3       |            3 | indexfield3     | A         |    50000169 |     NULL | NULL   | YES  | BTREE      |         |               |
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (1.54 sec)

Performance of "select" statements, part 1: using the primary key

Selecting 1 row:
select * from largetable where id = 45000000;
...
1 row in set (0.02 sec)
Selecting 15 non-consecutive rows:
select * from largetable where id in (900, 1000, 10000, 100000, 1000000, 10000000, 48000000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000);
...
15 rows in set (0.14 sec)
Selecting 10000 consecutive rows:
select * from largetable where id between 2000000 and 2010000;
...
10001 rows in set (0.06 sec)
Note that it's faster to select many consecutive rows than a few non-consecutive rows. This is because random I/O is more expensive than sequential I/O: each of the non-consecutive rows presumably sits on its own disk page, while the pages holding the consecutive rows are probably filled only with those rows. Also note that re-running each of these queries is much faster the second time, presumably due to page caching; since there are no concurrent inserts, the caching doesn't cause any concurrency issues.

Performance of "select" statements, part 2: using secondary indexes

Selecting a single row by secondary index value:
select * from largetable where indexedintfield = 155 limit 1;
...
1 row in set (0.11 sec)
Selecting 15 rows by secondary index value:
select * from largetable where indexfield1 = 14900 limit 15;
...
15 rows in set (0.13 sec)
Selecting many rows by secondary index value:
select * from largetable where indexfield1 = 149;
...
502 rows in set (5.95 sec)
Selecting a range of rows by secondary index value:
select * from largetable where indexfield1 between 149 and 152;
...
2058 rows in set (17.97 sec)
Note the similarity in execution time between query 2 and the second query of the primary key section: 0.14 secs vs 0.13 secs. Both have to read 15 rows that are essentially randomly distributed on disk. Also note that reading 2000 rows from a secondary index takes about 1000 times as long per row as reading from a primary index. Again, this is because InnoDB stores rows more or less consecutively by primary key value.
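One way to see that the per-row cost is the lookup of the full row (a hypothetical follow-up, not something timed above) is to select only columns that are part of the if1if2if3 index, so the query can be answered from the secondary index alone:

-- covering query: every selected column is in the if1if2if3 index,
-- so no per-row lookups into the clustered (primary key) index are needed
select indexfield1, indexfield2, indexfield3
from largetable
where indexfield1 between 149 and 152;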

Performance of "select" statements, part 3: aggregates

Grouping on a secondary index:
select indexedintfield, count(*) from largetable group by indexedintfield;
...
1000 rows in set (7 min 59.51 sec)
To answer this query, MySQL doesn't have to read the table itself, but it still has to read the entire index on "indexedintfield", which is on the order of hundreds of millions of bytes: at the leaf level there are 50 million entries, each holding the index value and the primary key value. Even allowing for compression of identical index values, that's still over 50 million integers, ignoring overhead.
Counting the number of rows in the table:
mysql> select count(*) from largetable;
...
1 row in set (5 min 48.61 sec)

mysql> select count(*) from largetable force index (primary);
...
1 row in set (2 min 22.84 sec)
The performance of this query depends on the storage engine: MyISAM keeps an exact row count, but InnoDB has to scan an entire index to count the rows. Also note that forcing use of the primary (clustered) index speeds the query up; it's unclear to me why MySQL chooses a secondary index when left to its own devices.
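To check which index MySQL picks when not forced (output not shown here), you can ask for the plan:

-- the chosen index appears in the "key" column of the plan
explain select count(*) from largetable;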

Performance of "insert" statements

Insert performance appears to have degraded as the table grew. When the table was nearly empty, roughly 500 rows/second could be inserted from Clojure; with 50 million rows already in the table, the same function inserts roughly 350 rows/second. I'm not sure whether this is due to the table size or to some other factor.

Performance of "update" statements

Updating by primary key value:
mysql> update largetable set notnullint = notnullint + 2 where id > 49950000;
Query OK, 50000 rows affected (0.61 sec)
Rows matched: 50000  Changed: 50000  Warnings: 0
Updating by secondary key value:
mysql> update largetable set notnullint = notnullint + 2 where indexedintfield = 999;
Query OK, 49744 rows affected (5 min 25.59 sec)
Rows matched: 49744  Changed: 49744  Warnings: 0
No surprises here: the I/O necessary to find the rows completely dwarfs all other costs.

Performance of "delete" statements

Deleting rows by primary key value:
mysql> delete from largetable where id = 50000001;
Query OK, 1 row affected (0.23 sec)
Deleting rows by secondary key value:
mysql> delete from largetable where indexedintfield = 500 and id > 50000000 limit 1;
Query OK, 1 row affected (1.50 sec)
Deleting ~170,000 rows by primary key range:
mysql> delete from largetable where id > 50000000;
Query OK, 1 row affected (2 min 17.28 sec)
As expected, it's faster to delete by primary key than secondary. What is surprising is how long it takes to delete many rows by primary key value.
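For large deletes like the last one, a standard technique (sketched here, not benchmarked in this test) is to delete in limited batches so each statement stays small:

-- removes at most 10000 rows per statement;
-- repeat until the statement reports 0 rows affected
delete from largetable where id > 50000000 limit 10000;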

Performance of "select" statements, part 4: joins

Since there is a one-to-many relation between "mediumtable" and "largetable", joins with the "where" clause restricting "largetable" are fast:
select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where l.id = 56;
...
1 row in set (0.00 sec)
while joins with the "where" clause restricting "mediumtable" are slow:
select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where m.id = 56;
...
982 rows in set (9.72 sec)
This is because each row of "mediumtable" corresponds to many rows in "largetable" (as shown in the query results), which are spread around in various disk pages.
mysql> select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where l.id <= 500;
...
500 rows in set (5.99 sec)

mysql> select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where l.id between 150000 and 150499;
...
500 rows in set (0.09 sec)
I'm not sure why the second query is so much faster than the first, but I assume it's because "mediumtable" is loaded into cache during execution of the first query, so that the second query doesn't need to do any (or very little) I/O on "mediumtable".

Performance of "select" statements, part 5: subqueries

Check out these two queries:
mysql> select max(indexedintfield), min(indexedintfield) from largetable;
...
1 row in set (0.00 sec)

mysql> explain select max(a), min(a) from (select indexedintfield a from largetable) q;
...
??
What's the difference? The second one uses a subquery to rename fields, but it takes much longer to execute. Looks like MySQL's historical problems with subqueries are still ongoing.

Conclusions

What we've seen leads me to conclude that MySQL works fine as an OLTP database with 50 million rows, as long as queries are restricted to simple joins and primary key selects (note that all of this was done on a PC, without any performance tuning -- by no means a high-end server). However, aggregate queries and subqueries do not perform well and would probably cause an unacceptable performance hit on a "live" database; in that case, denormalization might be necessary. I would therefore be very hesitant to do any data analysis with such a setup. It's also worth noting that, with a database this large, full backups and dumps take a very long time, so sharding and partitioning become important.

Wednesday, April 18, 2012

Understanding MySQL: execution plans, optimization, and table statistics

In my previous article, I showed statistics and tables drawn from almost 40 years of rankings data provided by the ATP website (http://www.atpworldtour.com). In this article, I'll look at the code that generated those numbers, and at how MySQL executes it efficiently.

MySQL query execution: brief introduction

  1. a query is sent to the MySQL server (somehow: from a program, from the console, etc.)
  2. the query is parsed. if any errors occur, execution aborts
  3. relevant table statistics and indexes are located
  4. the optimizer generates an execution plan
  5. the plan is executed and the result set is returned to the client

The data model

I'll look at queries covering three tables:
create table country (
  name  varchar(3) primary key
);

create table player (
  pid int primary key,
  fname varchar(30),
  lname varchar(30),
  country varchar(3),
  foreign key (country) references country(name)
);

create table ranking (
  monday date,
  pid int,
  primary key (monday, pid),
  foreign key (pid) references player(pid),
  rank int
);
So what we have is a set of rankings every week, where typically each rank from 1 to 100 has a player at that ranking. And each player is associated with a country. The only indexes are those created by default by MySQL for primary keys and foreign keys.

1: query sent to MySQL

I'll use this query for the rest of this article:
  select
    p.fname,
    p.lname,
    r.*,
    c.*
  from ranking  r
  inner join player p
    on r.pid = p.pid
  inner join country c
    on p.country = c.name;

2: MySQL parses the query, checking for syntax errors, etc.

Since this query happens to be syntactically correct, and doesn't mention non-existent tables, MySQL accepts it cheerfully.

3: statistics and indexes are located

I can find these for myself using this query (see the explanation of the information_schema.statistics table in the MySQL docs):
mysql> select * from information_schema.statistics where table_name in ('player', 'country', 'ranking');
+---------------+--------------+------------+------------+--------------+------------+--------------+--------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | NON_UNIQUE | INDEX_SCHEMA | INDEX_NAME | SEQ_IN_INDEX | COLUMN_NAME  | COLLATION | CARDINALITY | SUB_PART | PACKED | NULLABLE | INDEX_TYPE | COMMENT | INDEX_COMMENT |
+---------------+--------------+------------+------------+--------------+------------+--------------+--------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| def           | tennis       | country    |          0 | tennis       | PRIMARY    |            1 | name         | A         |         202 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | player     |          0 | tennis       | PRIMARY    |            1 | pid          | A         |        1069 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | player     |          1 | tennis       | country    |            1 | country      | A         |         152 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | ranking    |          0 | tennis       | PRIMARY    |            1 | monday       | A         |        3401 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | ranking    |          0 | tennis       | PRIMARY    |            2 | pid          | A         |      166696 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | ranking    |          1 | tennis       | pid        |            1 | pid          | A         |        1424 |     NULL | NULL   |          | BTREE      |         |               |
+---------------+--------------+------------+------------+--------------+------------+--------------+--------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
(Note the extra index on ranking.pid: MySQL creates such an index automatically because the column is a foreign key and needs one.) The important columns here are:
  • INDEX_NAME: it's "PRIMARY" for indexes on primary keys, and the column name for foreign keys. Note that I didn't give the indexes explicit names, which allowed MySQL to default to the column names.
  • SEQ_IN_INDEX: indexes can cover more than one column. Since ranking has a two-column primary key, its second column (pid) has SEQ_IN_INDEX = 2
  • CARDINALITY: this is a very important piece of information for performance reasons. Cardinality is MySQL's estimate of the number of distinct values for that index. Indexes with low cardinality will be less selective. Note that the cardinality of a primary key should always be equal to the number of rows in the table, by definition. Also note that these statistics are just estimates.
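Since the statistics are only estimates, they can drift as the data changes; MySQL can be asked to recompute them (a standard command, not something these queries required):

analyze table country, player, ranking;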

4: execution plan is generated

MySQL then uses the statistics to come up with a pretty good execution plan. It may not be the best, but it's quite often very good. Here's what it came up with in this case (see the MySQL docs for a more complete explanation of this output):
mysql> explain select p.fname,
    p.lname,
    r.*,
    c.*
  from ranking  r
  inner join player p
    on r.pid = p.pid
  inner join country c
    on p.country = c.name;
+----+-------------+-------+--------+-----------------+---------+---------+------------------+------+-------+
| id | select_type | table | type   | possible_keys   | key     | key_len | ref              | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+------------------+------+-------+
|  1 | SIMPLE      | p     | ALL    | PRIMARY,country | NULL    | NULL    | NULL             | 1097 |       |
|  1 | SIMPLE      | c     | eq_ref | PRIMARY         | PRIMARY | 5       | tennis.p.country |    1 |       |
|  1 | SIMPLE      | r     | ref    | pid             | pid     | 4       | tennis.p.pid     |  117 |       |
+----+-------------+-------+--------+-----------------+---------+---------+------------------+------+-------+
3 rows in set (0.00 sec)
This output shows one row for each table, view, or subquery that is accessed in the query. Important columns to pay attention to:
  • type: this is known as the "join type", but the name is somewhat deceiving because some queries don't involve joins. This column says how the table was accessed. "ALL" generally means that a tablescan is used to read all rows. "eq_ref" means that rows in the table match at most a single row from the previous tables. "ref" is similar to "eq_ref" but means that rows can match multiple rows from earlier tables.
  • ref: the column, from the previous tables, used for comparison when joining. For example, it's "tennis.p.country" for the second row, which means that rows from "country" need to match the "country" column of "player"
  • rows: MySQL's estimate of the number of rows it will examine in this table for each combination of rows from the previous tables. For the first table there are no previous tables, so the number is simply MySQL's estimate of the number of rows in "player".
Also note that the order the tables are accessed in does not have to match the order of the tables as given in the query. That's okay; MySQL is free to optimize the table access order, as long as it doesn't change the final result.
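If you ever do want the tables joined in exactly the order written (purely hypothetical here), MySQL's straight_join forces it:

-- straight_join behaves like an inner join, but always reads the
-- left table before the right one
select p.fname, p.lname, r.*, c.*
from ranking r
straight_join player p on r.pid = p.pid
straight_join country c on p.country = c.name;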

5: query evaluated and result set returned

The result is too big to show, but it takes less than a second:
166208 rows in set (0.79 sec)

Monday, April 16, 2012

Tennis rankings statistics

Introduction

Total number of weeks spent in the top 10
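A query along these lines (a sketch against the ranking/player schema shown in the April 18 post above; the exact query isn't included in this post) produces this table:

select p.fname, p.lname, count(*) as weeks
from ranking r
inner join player p on r.pid = p.pid
where r.rank <= 10
group by p.pid, p.fname, p.lname
order by weeks desc;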

+------------+------------+-----------------+
| first name | last name  | number of weeks |
+------------+------------+-----------------+
| Andre      | Agassi     |             747 |
| Pete       | Sampras    |             586 |
| Boris      | Becker     |             576 |
| Roger      | Federer    |             506 |
| Stefan     | Edberg     |             497 |
| Jimmy      | Connors    |             489 |
| Ivan       | Lendl      |             482 |
| Andy       | Roddick    |             440 |
| Yevgeny    | Kafelnikov |             390 |
| Michael    | Chang      |             369 |
| Rafael     | Nadal      |             364 |
| Goran      | Ivanisevic |             328 |
+------------+------------+-----------------+

Number of different years ranked in the top 100

+----------------+-------------------+-------------------------+
| first name     | last name         | years ranked in top 100 |
+----------------+-------------------+-------------------------+
| Andre          | Agassi            |                      21 |
| Fabrice        | Santoro           |                      21 |
| Jimmy          | Connors           |                      20 |
| Ivan           | Lendl             |                      17 |
| John           | McEnroe           |                      17 |
| Magnus         | Gustafsson        |                      16 |
| Jonas          | Bjorkman          |                      16 |
| Boris          | Becker            |                      16 |
| Pete           | Sampras           |                      16 |
| Carlos         | Moya              |                      15 |
| Nicolas        | Lapentti          |                      15 |
| Vincent        | Spadea            |                      15 |
| Michael        | Chang             |                      15 |
| Thomas         | Muster            |                      15 |
| Arnaud         | Clement           |                      15 |
| Guy            | Forget            |                      15 |
| Wayne          | Ferreira          |                      14 |
| Guillermo      | Vilas             |                      14 |
| Jason          | Stoltenberg       |                      14 |
| Lleyton        | Hewitt            |                      14 |
| Roger          | Federer           |                      14 |
| Mark           | Woodforde         |                      14 |
| Tommy          | Haas              |                      14 |
| Stefan         | Edberg            |                      14 |
| Marc           | Rosset            |                      14 |
| Anders         | Jarryd            |                      14 |
| Kenneth        | Carlsen           |                      14 |
| Jakob          | Hlasek            |                      14 |
| Ivan           | Ljubicic          |                      14 |
| Goran          | Ivanisevic        |                      14 |
| Juan Carlos    | Ferrero           |                      14 |
| Thomas         | Enqvist           |                      14 |
| Greg           | Rusedski          |                      14 |
+----------------+-------------------+-------------------------+

Longevity: number of years between first and last top 100 ranking

+----------------+-------------------+-----------+
| first name     | last name         | longevity |
+----------------+-------------------+-----------+
| Andre          | Agassi            |   19.9068 |
| Fabrice        | Santoro           |   19.8493 |
| Jimmy          | Connors           |   19.4849 |
| Ivan           | Lendl             |   16.1288 |
| John           | McEnroe           |   16.0137 |
| Ronald         | Agenor            |   14.9781 |
| Guillermo      | Vilas             |   14.9014 |
| Pete           | Sampras           |   14.7863 |
| Vincent        | Spadea            |   14.6904 |
| Boris          | Becker            |   14.6137 |
| Jonas          | Bjorkman          |   14.5753 |
| Mats           | Wilander          |   14.5370 |
| Magnus         | Gustafsson        |   14.4411 |
| Michael        | Chang             |   14.1918 |
| Guy            | Forget            |   14.1342 |
| Nicolas        | Lapentti          |   14.1151 |
+----------------+-------------------+-----------+

Top players by points

Points are awarded weekly. If a player is ranked n for one week, he gets (100 - n) points. Thus, the top-ranked player gets 99 points for that week.
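A sketch of the kind of query that computes these totals (again assuming the ranking/player schema from the April 18 post; the actual query isn't shown here):

select p.fname, p.lname,
       sum(100 - r.rank) as total,
       count(*)          as weeks,
       avg(100 - r.rank) as average
from ranking r
inner join player p on r.pid = p.pid
group by p.pid, p.fname, p.lname
order by total desc;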
+----------------+-------------------+-------+-------+---------+
| first name     | last name         | total | weeks | average |
+----------------+-------------------+-------+-------+---------+
| Andre          | Agassi            | 89112 |  1019 | 87.4504 |
| Pete           | Sampras           | 68204 |   767 | 88.9231 |
| Boris          | Becker            | 65467 |   762 | 85.9147 |
| Stefan         | Edberg            | 62604 |   690 | 90.7304 |
| Roger          | Federer           | 59964 |   655 | 91.5481 |
| Michael        | Chang             | 57769 |   736 | 78.4905 |
| Jimmy          | Connors           | 55662 |   636 | 87.5189 |
| Carlos         | Moya              | 55515 |   723 | 76.7842 |
| Ivan           | Lendl             | 54568 |   596 | 91.5570 |
| Goran          | Ivanisevic        | 53588 |   665 | 80.5835 |
| Thomas         | Muster            | 53352 |   716 | 74.5140 |
| Andy           | Roddick           | 52508 |   576 | 91.1597 |
| Lleyton        | Hewitt            | 51084 |   646 | 79.0774 |
| Wayne          | Ferreira          | 49854 |   699 | 71.3219 |
| Fabrice        | Santoro           | 49553 |   942 | 52.6040 |
| Juan Carlos    | Ferrero           | 48744 |   655 | 74.4183 |
| Yevgeny        | Kafelnikov        | 47749 |   542 | 88.0978 |
| Tim            | Henman            | 47517 |   618 | 76.8883 |
| Jim            | Courier           | 47256 |   625 | 75.6096 |
| John           | McEnroe           | 47115 |   537 | 87.7374 |
| Tommy          | Haas              | 45905 |   643 | 71.3919 |
| Richard        | Krajicek          | 44432 |   578 | 76.8720 |
| Brad           | Gilbert           | 44093 |   558 | 79.0197 |
+----------------+-------------------+-------+-------+---------+

Effect of time on height (cm) and weight (kg)

+------+----------+---------+
| year | height   | weight  |
+------+----------+---------+
| 1973 | 181.8135 | 76.3189 |
| 1974 | 181.7914 | 76.4164 |
| 1975 | 181.3225 | 75.8736 |
| 1976 | 181.7393 | 76.2832 |
| 1977 | 182.1372 | 76.9437 |
| 1978 | 182.5895 | 76.6278 |
| 1979 | 182.9831 | 76.5511 |
| 1980 | 182.8359 | 76.1083 |
| 1981 | 179.1964 | 72.9821 |
| 1982 | 181.6087 | 75.9022 |
| 1983 | 183.2094 | 76.4364 |
| 1984 | 183.6162 | 76.7500 |
| 1985 | 183.6465 | 76.1334 |
| 1986 | 183.2473 | 76.1448 |
| 1987 | 183.3163 | 76.3040 |
| 1988 | 183.2512 | 76.4948 |
| 1989 | 183.3446 | 76.6306 |
| 1990 | 183.5925 | 76.5753 |
| 1991 | 183.7840 | 76.7710 |
| 1992 | 184.2427 | 77.1044 |
| 1993 | 184.3413 | 77.7862 |
| 1994 | 184.5725 | 78.1500 |
| 1995 | 184.9135 | 78.4188 |
| 1996 | 185.4239 | 79.3361 |
| 1997 | 185.3892 | 79.6341 |
| 1998 | 185.0104 | 79.7397 |
| 1999 | 184.5593 | 79.7152 |
| 2000 | 184.7325 | 80.1408 |
| 2001 | 184.1270 | 79.5645 |
| 2002 | 184.2771 | 79.3858 |
| 2003 | 184.5173 | 79.7938 |
| 2004 | 185.2883 | 80.3637 |
| 2005 | 184.7863 | 80.2042 |
| 2006 | 184.5260 | 79.5190 |
| 2007 | 184.9772 | 79.6904 |
| 2008 | 185.4658 | 79.8740 |
| 2009 | 185.2888 | 79.6398 |
| 2010 | 185.7619 | 79.9013 |
| 2011 | 185.8758 | 79.0297 |
| 2012 | 186.1962 | 79.1639 |
+------+----------+---------+

Number of weeks players ranked #1 in a year

+---------------------+------+----------+
| name                | year | weeks #1 |
+---------------------+------+----------+
| Ilie Nastase        | 1973 |       20 |
| Jimmy Connors       | 1974 |       23 |
| Ilie Nastase        | 1974 |       21 |
| John Newcombe       | 1974 |        8 |
| Jimmy Connors       | 1975 |       47 |
| Jimmy Connors       | 1976 |       24 |
| Jimmy Connors       | 1977 |       37 |
| Bjorn Borg          | 1977 |        1 |
| Jimmy Connors       | 1978 |       10 |
| Jimmy Connors       | 1979 |       21 |
| Bjorn Borg          | 1979 |        7 |
| Bjorn Borg          | 1980 |       14 |
| John McEnroe        | 1980 |        4 |
| John McEnroe        | 1981 |        3 |
| Bjorn Borg          | 1981 |        2 |
| Jimmy Connors       | 1982 |        8 |
| John McEnroe        | 1982 |        6 |
| John McEnroe        | 1983 |       11 |
| Jimmy Connors       | 1983 |        6 |
| Ivan Lendl          | 1983 |        4 |
| John McEnroe        | 1984 |       27 |
| Ivan Lendl          | 1984 |       12 |
| John McEnroe        | 1985 |       34 |
| Ivan Lendl          | 1985 |       18 |
| Ivan Lendl          | 1986 |       52 |
| Ivan Lendl          | 1987 |       52 |
| Ivan Lendl          | 1988 |       36 |
| Mats Wilander       | 1988 |       16 |
| Ivan Lendl          | 1989 |       48 |
| Mats Wilander       | 1989 |        4 |
| Ivan Lendl          | 1990 |       32 |
| Stefan Edberg       | 1990 |       21 |
| Stefan Edberg       | 1991 |       40 |
| Boris Becker        | 1991 |       12 |
| Jim Courier         | 1992 |       41 |
| Stefan Edberg       | 1992 |       11 |
| Pete Sampras        | 1993 |       35 |
| Jim Courier         | 1993 |       17 |
| Pete Sampras        | 1994 |       52 |
| Andre Agassi        | 1995 |       30 |
| Pete Sampras        | 1995 |       22 |
| Pete Sampras        | 1996 |       45 |
| Thomas Muster       | 1996 |        6 |
| Andre Agassi        | 1996 |        2 |
| Pete Sampras        | 1997 |       52 |
| Pete Sampras        | 1998 |       46 |
| Marcelo Rios        | 1998 |        6 |
| Pete Sampras        | 1999 |       24 |
| Andre Agassi        | 1999 |       19 |
| Yevgeny Kafelnikov  | 1999 |        6 |
| Carlos Moya         | 1999 |        2 |
| Patrick Rafter      | 1999 |        1 |
| Andre Agassi        | 2000 |       36 |
| Pete Sampras        | 2000 |       10 |
| Gustavo Kuerten     | 2000 |        4 |
| Marat Safin         | 2000 |        2 |
| Gustavo Kuerten     | 2001 |       39 |
| Lleyton Hewitt      | 2001 |        7 |
| Marat Safin         | 2001 |        7 |
| Lleyton Hewitt      | 2002 |       52 |
| Lleyton Hewitt      | 2003 |       21 |
| Andre Agassi        | 2003 |       14 |
| Andy Roddick        | 2003 |        9 |
| Juan Carlos Ferrero | 2003 |        8 |
| Roger Federer       | 2004 |       48 |
| Andy Roddick        | 2004 |        4 |
| Roger Federer       | 2005 |       52 |
| Roger Federer       | 2006 |       52 |
| Roger Federer       | 2007 |       53 |
| Roger Federer       | 2008 |       32 |
| Rafael Nadal        | 2008 |       20 |
| Rafael Nadal        | 2009 |       26 |
| Roger Federer       | 2009 |       26 |
| Rafael Nadal        | 2010 |       30 |
| Roger Federer       | 2010 |       22 |
| Rafael Nadal        | 2011 |       26 |
| Novak Djokovic      | 2011 |       26 |
| Novak Djokovic      | 2012 |       15 |
+---------------------+------+----------+

Notes

All data was obtained from the ATP's website, where it is freely available. The data is incomplete in many cases, so while every effort was made to ensure correct results, their accuracy cannot be guaranteed.

Wednesday, March 14, 2012

Recipes for `git` branches

Recipes for working with branches in `git`

If you've ever used `git`, and had to create and manage branches, you may have been impressed by how flexible and accommodating it is, but discouraged by how hard it can be to remember all the incantations. That's why I've created a list of recipes for basic branching and merging operations. Enjoy!

I distilled these recipes from Scott Chacon's excellent book, Pro Git.

# create a branch
git branch [new-branch] [base-branch]

# switch to an existing branch
git checkout [branch]

# push branch to remote
git push [remote] [branch]

# fetch changes from remote
#   pulls down remote branches,
#   but doesn't set up tracking branches
git fetch [remote]

# create and switch to a new branch
git checkout -b [branch]

# create a new branch from a specific commit (starting point)
git branch [branch] [sha-1 hash]

# create new local branch from remote branch
git checkout -b [localbranch] [remote]/[remotebranch]
# or
git checkout --track [remote]/[branch]
#   see http://git-scm.com/book/en/Git-Branching-Remote-Branches#Tracking-Branches
#   for more information about tracking branches


# merge branch into master
#   have to first switch to master
git checkout master
git merge [branch]

# check that the merge worked
#   shows an "unmerged" section if any files failed
git status

# delete a branch (after merging it)
git branch -d newbranch

# see last commit on each branch
git branch -v

# see merged branches (to current branch)
git branch --merged

# see unmerged branches (to current branch)
git branch --no-merged

# delete remote branch
#   the syntax is widely acknowledged to be obtuse
git push origin :newbranch

Sunday, March 11, 2012

Analyzing data using pivot tables

Introduction

Pivot tables are a useful tool in data analysis, allowing a user to quickly and easily understand the relationship between two attributes. However, they aren't straightforward to create in many database systems. In this article, we'll look at:

  1. What is a pivot table?
  2. Why would I need a pivot table?
  3. How do I create a pivot table?
To answer these questions, we'll create and use pivot tables with MySQL.



Background

1. What is a pivot table? It's a summary table used for reporting, analysis, and quick data inspection. The name comes from the fact that the columns in the pivot table were row values in the base table; those values are pivoted into column headers.
Terminology used in this article:

  • base table: the table of interest that we'll analyze
  • columns of interest: the two columns from the base table, whose relationship is shown through the pivot table
  • y-values: values from the first column of interest. The pivot table will have one row for each distinct y-value
  • x-values: values from the second column of interest. The pivot table will typically have one column for each distinct x-value

2. Why would I need a pivot table? Pivot tables provide summaries of data that are much easier to grok than the full table. They typically focus on two columns, providing aggregate data which can help indicate the relationships between them.

3. How do I create a pivot table? There is a straightforward procedure for creating pivot tables:

  1. select the two columns of interest. Call them x and y: the y-values become values in the first column, while the x-values become column headers. The pivot table ends up with 1 y column, plus 1 column for each x-value
  2. create a view/query that adds extra columns to each row of the base table, indicating the type of the row
  3. aggregate the extended table using appropriate aggregate functions, grouping by the y-value column

Steps 2 and 3 are covered in more detail in the remainder of this article.



Example: pivoting investment data

In this example, we'll look at data showing shares purchased during a number of investments made over three years (note: these are hypothetical companies -- if the names bear any resemblance to real companies' names, it is purely coincidental :). We'll start with the table schema and some sample data:

create table investment (
  id       int primary key auto_increment, 
  year     int, 
  person   varchar(50), 
  stock    varchar(50), 
  shares   int
);

mysql> select * from investment;
+----+------+---------+--------+--------+
| id | year | person  | stock  | shares |
+----+------+---------+--------+--------+
|  1 | 2009 | Matt    | Google |     27 |
|  2 | 2009 | Jeffrey | Google |     13 |
|  3 | 2010 | Timothy | Yahoo  |     31 |
|  4 | 2009 | Matt    | Yahoo  |      4 |
|  5 | 2011 | Timothy | IBM    |    100 |
|  6 | 2011 | Timothy | Google |     34 |
|  7 | 2010 | Jeffrey | Yahoo  |     68 |
|  8 | 2010 | Jeffrey | Yahoo  |     18 |
|  9 | 2011 | Matt    | IBM    |     49 |
| 10 | 2011 | Matt    | Google |     22 |
| 11 | 2011 | Timothy | Yahoo  |     51 |
| 12 | 2009 | Jeffrey | Yahoo  |     63 |
+----+------+---------+--------+--------+

Requirements -- our final goal is queries showing:
  1. the number of shares of each company bought by each investor
  2. the number of times each investor has bought stock in each company
We'll do this by creating two separate, but related, pivot tables.


Step 1: choose columns of interest

According to the requirements, we need to see investor vs. stock. Thus, those will be our two columns. I'll choose `person` as the y-value column, and `stock` as the x-value column. (For this example, the choice is arbitrary, but see the "Limitations" section for a reason to choose more carefully).


Step 2: extend

Extend the base table with extra columns, one for each x-value of interest. Put an appropriate value in each of the columns, or a neutral value if the column doesn't apply:
create view investment_extended as (
  select
    investment.*, 
    case when stock = "Google"  then shares end as Google,
    case when stock = "IBM"     then shares end as IBM,
    case when stock = "Yahoo"   then shares end as Yahoo 
  from investment
);

mysql> select * from investment_extended;
+----+------+---------+--------+--------+--------+------+-------+
| id | year | person  | stock  | shares | Google | IBM  | Yahoo |
+----+------+---------+--------+--------+--------+------+-------+
|  1 | 2009 | Matt    | Google |     27 |     27 | NULL |  NULL |
|  2 | 2009 | Jeffrey | Google |     13 |     13 | NULL |  NULL |
|  3 | 2010 | Timothy | Yahoo  |     31 |   NULL | NULL |    31 |
|  4 | 2009 | Matt    | Yahoo  |      4 |   NULL | NULL |     4 |
|  5 | 2011 | Timothy | IBM    |    100 |   NULL |  100 |  NULL |
|  6 | 2011 | Timothy | Google |     34 |     34 | NULL |  NULL |
|  7 | 2010 | Jeffrey | Yahoo  |     68 |   NULL | NULL |    68 |
|  8 | 2010 | Jeffrey | Yahoo  |     18 |   NULL | NULL |    18 |
|  9 | 2011 | Matt    | IBM    |     49 |   NULL |   49 |  NULL |
| 10 | 2011 | Matt    | Google |     22 |     22 | NULL |  NULL |
| 11 | 2011 | Timothy | Yahoo  |     51 |   NULL | NULL |    51 |
| 12 | 2009 | Jeffrey | Yahoo  |     63 |   NULL | NULL |    63 |
+----+------+---------+--------+--------+--------+------+-------+
Note that:
  1. I'm using NULL as the neutral value
  2. whether a new column has a non-null value depends on its x-value: rows with `stock = "Google"` have a value for new column `Google` and `NULL`s for the other new columns
  3. this distinction "marks" each row with its type -- and is the key to getting Step 3 to work
  4. whenever a new column is non-null: its value is always equal to column `shares` -- so we can meet the requirements
  5. we didn't change the number of rows
  6. we added one column per x-value of interest


Step 3: group by y-values and apply aggregate function

Our first requirement was to total shares bought per investor at each company. To accomplish that, we start with the view from Step 2, and remember that column `person` provides the y-values, so we'll need to `group by person`; also, we need the `sum` aggregate function to total the rows.

create view pivot_company_shares as (
  select 
    person, 
    sum(Google)            as Google, 
    coalesce(sum(IBM), 0)  as IBM, 
    sum(Yahoo)             as Yahoo 
  from investment_extended 
  group by person
);

mysql> select * from pivot_company_shares;
+---------+--------+------+-------+
| person  | Google | IBM  | Yahoo |
+---------+--------+------+-------+
| Jeffrey |     13 |    0 |   149 |
| Matt    |     49 |   49 |     4 |
| Timothy |     34 |  100 |    82 |
+---------+--------+------+-------+
Look at that -- how beautiful! Note that in `pivot_company_shares`, we wrap the IBM sum in `coalesce`, which returns 0 when the sum is NULL and the sum itself otherwise. This is needed because one of the investors never bought IBM, and `sum` over nothing but NULLs returns NULL! We could have left it as NULL, but 0 looks a lot nicer.
To meet the second requirement, which is how many times each investor bought each stock, we just need to know how many rows from the extended view have non-null values. An easy way to do that is with the aggregate function `count(n)` -- it returns the number of rows with a non-null value for column `n`.
create view pivot_company_transactions as (
  select 
    person, 
    count(Google)  as Google, 
    count(IBM)     as IBM, 
    count(Yahoo)   as Yahoo 
  from investment_extended 
  group by person
);

mysql> select * from pivot_company_transactions;
+---------+--------+-----+-------+
| person  | Google | IBM | Yahoo |
+---------+--------+-----+-------+
| Jeffrey |      1 |   0 |     3 |
| Matt    |      2 |   1 |     1 |
| Timothy |      1 |   1 |     2 |
+---------+--------+-----+-------+

And ... voila! We didn't need to check for NULL this time because `count` returns 0 if it only finds NULLs.


Important points

  • what aggregate function to use when grouping. I used sum, but count and max are also common (max is handy when rebuilding one-row "objects" that had been spread across many rows)
  • what value to use in the extra columns -- the requirements dictated `shares` in this case
  • what "neutral" value to use in the extra columns. I used NULL, but it could also be 0 or ""
  • using multiple columns for y-values. This solution isn't limited to using a single column for the y-values -- just plug the extra columns into the group by clause (and don't forget to select them)
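For example, here's a sketch (not part of the original requirements) that breaks the share totals down by both person and year:

create view pivot_company_shares_by_year as (
  select
    person,
    year,
    coalesce(sum(Google), 0) as Google,
    coalesce(sum(IBM), 0)    as IBM,
    coalesce(sum(Yahoo), 0)  as Yahoo
  from investment_extended
  group by person, year
);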

Conclusion and limitations

Pivot tables are quite useful and relatively straightforward. However, one downside to the approach I've outlined is that you have to know in advance how many columns the pivot table will have. The frustration this causes ranges from mildly annoying (extra typing) to severely limiting (when there are hundreds of distinct values).
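One common workaround -- a standard MySQL trick, not something covered above or in the book -- is to build the pivot query dynamically, deriving the column list from the data with group_concat and running it as a prepared statement:

-- make room for a long generated column list (the default limit is 1024 bytes)
set session group_concat_max_len = 100000;

-- build one "coalesce(sum(case ...))" expression per distinct stock
select group_concat(distinct
         concat('coalesce(sum(case when stock = ''', stock,
                ''' then shares end), 0) as `', stock, '`'))
  into @cols
from investment;

-- assemble and run the pivot query
set @sql = concat('select person, ', @cols,
                  ' from investment group by person');
prepare stmt from @sql;
execute stmt;
deallocate prepare stmt;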

The inspiration for this post came from reading the SQL Cookbook by Anthony Molinaro, chapter 12. I highly recommend this book to anyone interested in improving their SQL chops, from beginner to advanced.