Wednesday, October 10, 2012

A Haskell library for relational algebra

Why Haskell? Why relational algebra?

If you've never heard of Haskell before, and you like programming languages, you might want to check it out. It's a functional, statically-typed, type-inferred, elegant language in the ML family. The most important benefit that I've received from working with Haskell is a much better understanding of the static typing discipline, and the practice of designing a program or a library by figuring out a few concepts, capturing them as types -- either data or functions -- and building the rest of the code around them.

Relational algebra (RA) is a powerful mathematical tool for working with sets and functions on sets. Most common database products implement some flavor of SQL, a practical and standard language which contains many constructs from RA. Using SQL, a programmer can easily accomplish many complex data querying, manipulation, and transformation operations.

Unfortunately, applying RA often necessitates the use of a database. For whatever reason, there seem to be very few 'pure' RA libraries available for common languages. However, there are enormous benefits to be gained from integrating RA directly into a program instead of going through a database: 1) the code isn't coupled to a database product or its failure modes; 2) there's one less dependency for a deployed program; 3) RA can be freely mixed with general-purpose code, making it both more effective and more convenient to use.

This post discusses the basic design and implementation of an RA library in Haskell.

Data model

The basic data types in RA are tuples, fixed-size units of n primitive values, and relations, or unordered sets composed of multiple named n-tuples of the same type. To distinguish RA tuples from Haskell tuples, I'll call them 'rows' for the rest of this post. Here's an example:

(first name: "Matthew", country: "USA", age: 25) <-- a named 3-tuple

a relation:

 first name | country | age   <-- the schema
-----------------------------
 "Matthew"  |  "USA"  |  25   <-- a tuple
  "Jimbo"   | "Spain" |  32   <-- another tuple of the same type
 "Jessica"  |  "USA"  |  27   <-- a third compatible tuple

However, the Haskell data model is a little bit different. Instead of restricting our row representation to just Haskell n-tuples, we'll let *any type* be a row, as long as we can compare any two instances of it for equality and ordering. And instead of using sets, we'll use lists -- Haskell has a large number of functions for working with lists, so they're a lot more convenient to use than sets, even though lists are intrinsically ordered and allow duplicates. We can ignore the first problem by never assuming the ordering is meaningful; the second is potentially more dangerous. So what we have is:

-- the type of a relation:
:: (Eq a, Ord a) => [a]

-- and some examples of relations:
r1 :: Num a => [a]
r1 = [1,2,3]

r2 :: Num a => [(String, a)]
r2 = [("Matt", 1), ("Kevin", 14)]

r3 :: [a]
r3 = [] 

r4 :: [[String]]
r4 = [["abc", "xyz"], ["ghi"], []]

Note how the types of the rows can be almost anything -- numbers, Haskell tuples, lists, or even fully polymorphic types.

Primitive RA operators

Rename: change attribute names without changing any values. This operator has no real counterpart in the Haskell version, since rows aren't restricted to named tuples.

(name: "Matt", total: 32) -> (first name: "Matt", sum: 32)

Projection: apply a function to each tuple in a relation. Unfortunately, this could result in duplicates for either of two reasons: 1) the input contained duplicates, or 2) the mapping function created duplicates. The second case, at least, has to be dealt with by the operator itself.

$ project (length . fst) [("Matt", 30), ("Bob", 22), ("Jimbo", 39), ("Sarah", 28)]
[4,3,5]

Row selection: select some rows, discarding the rest; none of the rows are changed.

$ rfilter (\(name, age) -> age >= 30) [("Matt", 30), ("Bob", 22), ("Jimbo", 39), ("Sarah", 28)]
[("Matt",30),("Jimbo",39)]

Cartesian product: combine two relations of m and n rows respectively, resulting in a relation of (m * n) rows, where each output row consists of a row from each input relation glued together. SQL and RA typically restrict result sets to *flat* tuples; since we've already gotten rid of this restriction, we're free to let this operator produce nested tuples.

$ rproduct "abc" [1..4]
[('a',1),('a',2),('a',3),('a',4),('b',1),('b',2),
 ('b',3),('b',4),('c',1),('c',2),('c',3),('c',4)]

Union: given two relations of the same type with m and n rows respectively, combine them by removing duplicates, resulting in a new relation of the *same* type with no more than (m + n) rows.

$ union [1..10] [5..15]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

Difference: given two relations of the same type, remove all elements found in the second relation from the first relation.

$ difference [1..10] [5..15]
[1,2,3,4]

Here are the Haskell implementations (note that these are by no means as efficient as they could be):

import Data.List (nub)
import Control.Monad (liftM2)

-- projection: map, then remove any duplicates the mapping introduced
project :: Eq b => (a -> b) -> [a] -> [b]
project f = nub . map f

-- row selection is just filter
rfilter :: (a -> Bool) -> [a] -> [a]
rfilter = filter

-- cartesian product: every pairing of a row from the first relation
-- with a row from the second
rproduct :: [a] -> [b] -> [(a, b)]
rproduct = liftM2 (,)

-- rows present in both relations
intersect :: Eq a => [a] -> [a] -> [a]
intersect xs ys = filter (\x -> x `elem` ys) xs

-- rows present in either relation, with duplicates removed
union :: Eq a => [a] -> [a] -> [a]
union xs ys = nub (xs ++ ys)

-- rows of the first relation that do not appear in the second
difference :: Eq a => [a] -> [a] -> [a]
difference r1 r2 = filter (\x -> not $ x `elem` r2) r1
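
To get a feel for how the primitives compose, here's a small made-up relation and an SQL-ish query over it (just an illustration; the names are arbitrary):

people :: [(String, String, Int)]
people = [("Matthew", "USA", 25), ("Jimbo", "Spain", 32), ("Jessica", "USA", 27)]

-- SELECT name FROM people WHERE country = 'USA' AND age > 25
usaOver25 :: [String]
usaOver25 = project (\(name, _, _) -> name)
          $ rfilter (\(_, country, age) -> country == "USA" && age > 25) people
-- ["Jessica"]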

Extending the library with some useful SQL operators

Inner, outer, and left (outer) joins: these combine a cartesian product with a filtering operation. Outer joins are augmented inner joins, in that there is one additional result row for each unmatched row on one or both sides. They require a default value of each row type to pair with the unmatched rows -- we're avoiding NULLs (although we could use Just/Nothing instead; a sketch of that variant follows the examples below).

predicate l r = fst l == fst r
left = [(1, "Matt"), (2, "Jackie"), (3, "Gilligan")]
right = [(1, "hammer"), (1, "saw"), (1, "screwdriver"), (3, "boat"), (4, "wrench")]

$ innerJoin predicate left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((3,"Gilligan"), (3,"boat"))]

$ leftJoin predicate (0, "nothing") left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((2,"Jackie"),   (0,"nothing")),    <--- an extra row!!
 ((3,"Gilligan"), (3,"boat"))]

$ outerJoin predicate (0, "nobody") (0, "nothing") left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((2,"Jackie"),   (0,"nothing")),
 ((3,"Gilligan"), (3,"boat")),
 ((0,"nobody"),   (4,"wrench"))]    <--- another extra row!!

Grouping, group processing, and aggregation: it's often useful to separate a relation into groups based on values of certain attribute(s), and then to continue processing with the grouped data.

-- group some words by their length
$ let g1 = groupBy length $ words "this is an article for my blog that I hope is interesting"
$ g1
[(1,["I"]),       (2,["is","my","an","is"]),
 (3,["for"]),     (4,["hope","that","blog","this"]),
 (7,["article"]), (11,["interesting"])]

-- the first letters of words, in each group
$ groupLift (project head) g1
[(1,"I"), (2,"ima"),
 (3,"f"), (4,"htb"),
 (7,"a"), (11,"i")]

-- the number of words, in each group
$ groupLift length g1
[(1,1), (2,4),
 (3,1), (4,4),
 (7,1), (11,1)]

-- the unique letters, in each group
$ groupLift (nub . concat) g1
[(1,"I"),       (2,"ismyan"),
 (3,"for"),     (4,"hopetablgis"),
 (7,"article"), (11,"intersg")]

-- aggregation ignoring groups
$ aggregate (length . snd) sum g1
12

-- aggregation within groups
$ groupLift (aggregate (ord . head) sum) g1
[(1,73),  (2,416),
 (3,102), (4,434),
 (7,97),  (11,105)]

And the implementations (the inner type signatures in leftJoin need the ScopedTypeVariables extension, and groupBy is built on Data.Map):

{-# LANGUAGE ScopedTypeVariables #-}

import Prelude hiding (lookup)
import Data.Tuple (swap)
import Data.Map (Map, fromList, toList, insert, lookup)
import Data.Char (ord)    -- used by the aggregation example above

-- inner join: cartesian product, then keep only the matching pairs
innerJoin :: (a -> b -> Bool) -> [a] -> [b] -> [(a, b)]
innerJoin f ls rs = rfilter (uncurry f) (rproduct ls rs)

leftJoin :: forall a b. (a -> b -> Bool) -> b -> [a] -> [b] -> [(a, b)]
leftJoin p null rl rr = concatMap f rl
  where
    -- go through all the a's
    --   match each a with all b's
    --   if no matches, match it with the default
    --   otherwise keep all matches
    f :: a -> [(a, b)]
    f a = map ((,) a) $ addNull $ filter (p a) rr
      where
        addNull :: [b] -> [b]
        addNull [] = [null]
        addNull bs = bs

-- full outer join: a left join in each direction, glued together with union
outerJoin :: (Eq a, Eq b) => (a -> b -> Bool) -> a -> b -> [a] -> [b] -> [(a, b)]
outerJoin p anull bnull as bs = union left right
  where
    left  = leftJoin p bnull as bs
    right = project swap $ leftJoin (flip p) anull bs as

groupBy :: Ord b => (a -> b) -> [a] -> [(b, [a])]
groupBy f rel = toList grouped
  where
    grouped = foldl f' (fromList []) rel
    f' mp next = addRow (f next) next mp
    -- check whether the key's already in the map:
    -- if it is, stick 'next' on the existing list
    -- if not, create a new, single-element list for that key
    addRow :: Ord k => k -> v -> Map k [v] -> Map k [v]
    addRow k v mp = case lookup k mp of
                      Just oldval -> insert k (v:oldval) mp
                      _           -> insert k [v] mp

-- apply a function to the rows of each group, keeping the group keys
groupLift :: ([a] -> c) -> ([(b, [a])] -> [(b, c)])
groupLift f = map (fmap f)

-- project each row to a single value, then fold those values together
aggregate :: (a -> b) -> ([b] -> c) -> [a] -> c
aggregate proj f = f . map proj

Some notes about the design goals of the library

I was shooting for flexibility, simplicity, and minimality, not efficiency. Thus, the library lacks many of the 'composite operators' that SQL has. For instance, a SQL join both joins and projects in a single operation; in this library, you'd do the join and then the projection as two separate steps -- maybe less efficient, but more expressive and compositional (see the sketch below). Another example: it lacks things like equi-joins that would make joining more efficient; it doesn't have them because that ground is already covered by inner joins.
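
For instance, SQL's combined join-and-project becomes two separate, composable steps. A quick sketch, reusing the predicate/left/right relations from the join examples above:

-- roughly: SELECT l.name, r.tool FROM left l JOIN right r ON l.id = r.id
namesAndTools :: [(String, String)]
namesAndTools = project (\((_, name), (_, tool)) -> (name, tool))
              $ innerJoin predicate left right
-- [("Matt","hammer"),("Matt","saw"),("Matt","screwdriver"),("Gilligan","boat")]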

Tuesday, October 9, 2012

The limitations of typical relational algebra and SQL

A brief review of SQL and RA

Many programmers are familiar with some flavor of Structured Query Language (SQL), used for querying and administering many common database products, and with relational algebra (RA), which provides the mathematical underpinnings of SQL and of relational databases as we currently know them. Relational algebra deals with relations composed of tuples, and with operations for transforming and combining them. Here is an example of a relation:
  name  |  weight
 -----------------
  Mary  |   110
  Kyle  |   165
  Kevin |   205
  Lucy  |   120

In this relation, there are two columns. Each of the four rows has a value for each column; the rows are tuples. Furthermore, RA defines six basic operators: projection, rename, product, selection, union, and difference. From these, many more complicated and more useful operators can be derived. SQL is a query language heavily based on RA that makes it very convenient to work with relationally structured data. However, both RA and SQL have a number of shortcomings which make some things difficult and awkward to accomplish. I'll spend the rest of this post discussing some of them.

Unable to step outside of SQL/RA

SQL and RA are great for working purely with relations, but are otherwise severely limited. For example, what if you need to generate a list of the numbers from 1 to 10 for a filter predicate? It's possible, but it's not easy. Or what if you need to create a datatype with an API that disallows certain values? That's going to be nearly impossible. Or what if you have to rank all the tuples in a relation? The standard, pure SQL approach involves an O(n^2) self-join, which is absurd considering that ranking is at worst an O(n log n) problem. Unfortunately for SQL/RA, such situations -- in which they are not sufficient or particularly well-suited to the task -- come up very often, and the resulting solutions are necessarily work-arounds, which makes them harder to debug, understand, and maintain.
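
For contrast, both the number list and the ranking are one-liners once you step outside SQL. A throwaway sketch (Haskell here, but any general-purpose language would do):

import Data.List (sortBy)
import Data.Ord (comparing)

oneToTen :: [Int]
oneToTen = [1..10]

-- rank rows by score in O(n log n): sort descending, then number them off
rank :: Ord b => [(a, b)] -> [(Int, (a, b))]
rank = zip [1..] . sortBy (flip (comparing snd))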

Tuples have to be flat

The relational model of data views tuples as being constructed out of n values of primitive types, where n >= 1, and the primitive types are typically numbers, booleans, strings, and times. This model is important for data storage, as its flat nature allows any data to be immediately obtained without any 'digging' through hierarchical layers. However, when creating complex queries, strict adherence to flat tuples is more painful than helpful. For example, a 'group by' query returns one row per group, each row typically including the grouped-on column(s) and aggregate values calculated for some other column(s). However, this is far less useful than allowing a list of all the rows in that group, as is clear when one tries to get the 'top n' rows from each group:

  student | score
 ------------------
   Jake   |  84
  Blake   |  93
   Jake   |  62
   Jake   |  79
  Blake   |  93
  Blake   |  84

-- goal: calculate the average of the top two scores for each student

-- first, let's group by 'student'
  student | scores
 ------------------
   Jake   | [84, 62, 79]
  Blake   | [93, 93, 84]

-- now take the top two
  student | top 2 scores
 ------------------
   Jake   | [84, 79]
  Blake   | [93, 93]

-- and average them
  student | scores
 ------------------
   Jake   |   81.5
  Blake   |    93   

This is a straightforward problem to solve if richer tuple types are allowed, but since they are not, the solutions become convoluted and hard to follow. Again, I'd just like to emphasize that the flat model is great for data storage, but unnecessarily restrictive when it comes to building queries.
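
For what it's worth, with richer tuple types the whole computation is only a few lines. A rough sketch with ordinary Haskell lists (relation and names made up for illustration):

import Data.List (nub, sort)

scores :: [(String, Int)]
scores = [("Jake",84),("Blake",93),("Jake",62),("Jake",79),("Blake",93),("Blake",84)]

-- group by student, take the top two scores per group, and average them
avgTopTwo :: [(String, Int)] -> [(String, Double)]
avgTopTwo rows =
  [ (student, average (take 2 (reverse (sort ss))))
  | student <- nub (map fst rows)
  , let ss = [s | (st, s) <- rows, st == student] ]
  where
    average xs = fromIntegral (sum xs) / fromIntegral (length xs)

-- avgTopTwo scores == [("Jake",81.5),("Blake",93.0)]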

User-defined abstractions are limited

If you've read SICP, you may remember the authors stating that the three main aspects of programming are primitives, means of combination, and means of abstraction. SQL and RA certainly have primitives, and composition isn't impossible, but they both fall flat on abstraction. Users are unable to create scalar functions, aggregate functions, or table-valued functions, or to encapsulate a table manipulation as a single procedure. (For example, most if not all database products lack a 'divide by' operator; it's a complex, multi-step procedure and very difficult to get right. It can't be implemented as a function, and thus must be painstakingly reimplemented time and again. If it were available as a simple library function, imagine how easy it would be to use -- a sketch of exactly that follows below.) The lack of easily user-defined string-processing functions is often bothersome, and results in gigantic, unwieldy built-in libraries of obscure string-processing functions; if users could define and share their own, such built-in libraries would be unnecessary. The basic -- and only? -- unit of abstraction is the view, but views are not parameterizable, and thus can't be reused to perform a similar operation on multiple tables. SQL/RA also lack the common organizational features found in modern programming languages, such as modules, classes, and namespaces.
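
To illustrate, once table manipulations can be ordinary functions, 'divide by' really is just a few lines of library code. A rough sketch over lists of pairs (names here are illustrative):

import Data.List (nub)

-- relational division: keep each 'a' that is paired (somewhere in r) with
-- every b appearing in s
divideBy :: (Eq a, Eq b) => [(a, b)] -> [b] -> [a]
divideBy r s = nub [ a | (a, _) <- r, all (\b -> (a, b) `elem` r) s ]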

SQL and RA are typically first order

I've seen this mentioned in a few places, and the poster-child example is that you can't compute the transitive closure of a relation in pure SQL/RA. I've also read that the lack of recursion is limiting, and that relations aren't first-class values. As I don't completely understand these points yet, I'll wait until later to flesh out this section.

Conclusion

To be fair, many of these problems have already been addressed by widely used database products. The solutions typically involve embedding SQL in some kind of general purpose procedural or imperative language, which is then harnessed to create useful abstractions and extend SQL's functionality. I think this is a good approach and will continue to become more popular as the benefits of tight integration between non-procedural SQL-like code and procedural code are better understood. SQL and RA are awesome tools for working with structured data and sets. But by themselves, they're not anywhere near as powerful as they can be when embedded in a general-purpose programming language.