Wednesday, October 10, 2012

A Haskell library for relational algebra

Why Haskell? Why relational algebra?

If you've never heard of Haskell before, and you like programming languages, you might want to check it out. It's a functional, statically-typed, type-inferred, elegant language in the ML family. The most important benefit that I've received from working with Haskell is a much better understanding of the static typing discipline, and the practice of designing a program or a library by figuring out a few concepts, capturing them as types -- either data or functions -- and building the rest of the code around them.

Relational algebra (RA) is a powerful mathematical tool for working with sets and functions on sets. Most common database products implement some flavor of SQL, a practical and standard language which contains many constructs from RA. Using SQL, a programmer can easily accomplish many complex data querying, manipulation, and transformation operations.

Unfortunately, applying RA often necessitates the use of a database. For whatever reason, there seem to be very few 'pure' RA libraries available for common languages. However, there are enormous benefits to be gained from integrating RA, instead of using a database: 1) not coupled to a database product, or its failure modes; 2) one less dependency for a deployed program; 3) code can mix RA with general purpose code, enhancing the effectiveness and efficiency of RA.

This post discusses the basic design and implementation of an RA library in Haskell.

Data model

The basic data types in RA are tuples, fixed-size units of n primitive values, and relations, or unordered sets composed of multiple named n-tuples of the same type. To distinguish RA tuples from Haskell tuples, I'll call them 'rows' for the rest of this post. Here's an example:

(first name: "Matthew", country: "USA", age: 25) <-- a named 3-tuple

a relation:

 first name | country | age   <-- the schema
-----------------------------
 "Matthew"  |  "USA"  |  25   <-- a tuple
  "Jimbo"   | "Spain" |  32   <-- another tuple of the same type
 "Jessica"  |  "USA"  |  27   <-- a third compatible tuple

However, the Haskell data model is a little bit different. Instead of restricting our row representation to just Haskell n-tuples, we'll let *any type* be a row, as long as we can compare any two instances of it for equality and ordering. And instead of using sets, we'll use lists -- Haskell has a large number of functions for working with lists, so it's a lot more convenient to use them than sets (although lists are intrinsically ordered, and do allow duplicates. We can ignore the first problem by never assuming the ordering is meaningful, but the second is more dangerous potentially). So what we have is:

-- the type of a relation:
:: (Eq a, Ord a) => [a]

-- and some examples of relations:
r1 :: Num a => [a]
r1 = [1,2,3]

r2 :: Num a => [(String, a)]
r2 = [("Matt", 1), ("Kevin", 14)]

r3 :: [a]
r3 = [] 

r4 :: [[String]]
r4 = [["abc", "xyz"], ["ghi"], []]
Note how the types of the rows can be almost anything -- numbers, Haskell tuples, polymorphic, or even lists.

Primitive RA operators

Rename: change attribute names, without changing any values. This operator is meaningless in the Haskell version, since it's not restricted to named tuples.

(name: "Matt", total: 32) -> (first name: "Matt", sum: 32)

Projection: apply a function to each tuple in a relation. Unfortunately, this could result in duplicates for either of two reasons: 1) the input contained duplicates, or 2) the mapping function created duplicates. The 2nd case has to be dealt with.

$ project (length . fst) [("Matt", 30), ("Bob", 22), ("Jimbo", 39), ("Sarah", 28)]
[4,3,5]

Row selection: select some rows, discarding the rest; none of the rows are changed.

$ rfilter (\(name, age) -> age >= 30) [("Matt", 30), ("Bob", 22), ("Jimbo", 39), ("Sarah", 28)]
[("Matt",30),("Jimbo",39)]

Cartesian product: combine two relations of size m and n rows respectively, resulting in a relation of size (m * n) rows, where an output row consists of a row from each of the input tables glued together. SQL and RA typically restrict result sets to *flat* tuples; since we've already gotten rid of this restriction, we're free to allow this to produce nested tuples.

$ rproduct "abc" [1..4]
[('a',1),('a',2),('a',3),('a',4),('b',1),('b',2),
 ('b',3),('b',4),('c',1),('c',2),('c',3),('c',4)]

Union: given two relations of the same type with m and n rows respectively, combine them by removing duplicates, resulting in a new relation of the *same* type with no more than (m + n) rows.

$ union [1..10] [5..15]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

Difference: given two relations of the same type, remove all elements found in the second relation from the first relation.

$ difference [1..10] [5..15]
[1,2,3,4]

Here are the Haskell implementations (note that these are by no means as efficient as possible):
project :: Eq b => (a -> b) -> [a] -> [b]
project f = nub . map f

rfilter :: (a -> Bool) -> [a] -> [a]
rfilter = filter
    
rproduct :: [a] -> [b] -> [(a, b)]
rproduct = liftM2 (,)

intersect :: Eq a => [a] -> [a] -> [a]
intersect xs ys = filter (\x -> x `elem` ys) xs

union :: Eq a => [a] -> [a] -> [a]
union xs ys = nub (xs ++ ys)

difference :: Eq a => [a] -> [a] -> [a]
difference r1 r2 = filter (\x -> not $ elem x r2) r1

Extending the library with some useful SQL operators

Inner, outer and left (outer) joins: these combine a cartesian product operation with a filtering operation. Outer joins are augmented inner joins, in that there is one additional result row for each unmatched row on one or both sides. Requires a default value of each row type to combine with the unmatched rows -- we're avoiding 'NULL's (although we could use Just/Nothing instead).

predicate l r = fst l == fst r
left = [(1, "Matt"), (2, "Jackie"), (3, "Gilligan")]
right = [(1, "hammer"), (1, "saw"), (1, "screwdriver"), (3, "boat"), (4, "wrench")]

$ innerJoin predicate left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((3,"Gilligan"), (3,"boat"))]

$ leftJoin predicate (0, "nothing") left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((2,"Jackie"),   (0,"nothing")),    <--- an extra row!!
 ((3,"Gilligan"), (3,"boat"))]

$ outerJoin predicate (0, "nobody") (0, "nothing") left right
[((1,"Matt"),     (1,"hammer")),
 ((1,"Matt"),     (1,"saw")),
 ((1,"Matt"),     (1,"screwdriver")),
 ((2,"Jackie"),   (0,"nothing")),
 ((3,"Gilligan"), (3,"boat")),
 ((0,"nobody"),   (4,"wrench"))]    <--- another extra row!!

Grouping, group processing, and aggregation: it's often useful to separate a relation into groups based on values of certain attribute(s), and then to continue processing with the grouped data.

-- group some words by their length
$ let g1 = groupBy length $ words "this is an article for my blog that I hope is interesting"
[(1,["I"]),       (2,["is","my","an","is"]),
 (3,["for"]),     (4,["hope","that","blog","this"]),
 (7,["article"]), (11,["interesting"])]

-- the first letters of words, in each group
$ groupLift (project head) g1
[(1,"I"), (2,"ima"),
 (3,"f"), (4,"htb"),
 (7,"a"), (11,"i")]

-- the number of words, in each group
$ groupLift length g1
[(1,1), (2,4),
 (3,1), (4,4),
 (7,1), (11,1)]

-- the unique letters, in each group
$ groupLift (nub . concat) g1
[(1,"I"),       (2,"ismyan"),
 (3,"for"),     (4,"hopetablgis"),
 (7,"article"), (11,"intersg")]

-- aggregation ignoring groups
$ aggregate (length . snd) sum g1
12

-- aggregation within groups
$ groupLift (aggregate (ord . head) sum) g1
[(1,73),  (2,416),
 (3,102), (4,434),
 (7,97),  (11,105)]

And the implementations:
innerJoin :: (a -> b -> Bool) -> [a] -> [b] -> [(a, b)]
innerJoin f ls rs = rfilter (uncurry f) (rproduct ls rs)

leftJoin :: forall a b. (a -> b -> Bool) -> b -> [a] -> [b] -> [(a, b)]
leftJoin p null rl rr = concatMap f rl
  where 
    -- go through all the a's 
    --   match each a with all b's
    --   if no matches, match it with the default
    --   otherwise keep all matches
    f :: a -> [(a, b)]
    f a = map ((,) a) $ addNull $ filter (p a) rr
      where
        addNull :: [b] -> [b]
        addNull [] = [null]
        addNull bs = bs

outerJoin :: (Eq a, Eq b) => (a -> b -> Bool) -> a -> b -> [a] -> [b] -> [(a, b)]
outerJoin p anull bnull as bs = union left right
  where 
    left = leftJoin p bnull as bs
    right = project swap $ leftJoin (flip p) anull bs as

groupBy :: (Ord b) => (a -> b) -> [a] -> [(b, [a])]
groupBy f rel = toList grouped
  where
    grouped = foldl f' (fromList []) rel
    f' mp next = addRow (f next) next mp
    -- check whether the key's already in the map:
    -- if it is, stick 'next' on the existing list
    -- if not, create a new, single-element list for that key
    addRow :: Ord k => k -> v -> Map k [v] -> Map k [v]
    addRow k v mp = case lookup k mp of    
                        (Just oldval) -> insert k (v:oldval) mp;  
                        _ -> insert k [v] mp;  

groupLift :: ([a] -> c) -> ([(b, [a])] -> [(b, c)])
groupLift f = map (fmap f)  

aggregate :: (a -> b) -> ([b] -> c) -> [a] -> c
aggregate proj f = f . map proj

Some notes about the design goals of the library

was shooting for flexibility, simplicity, and minimality, not efficiency. thus, it lacks many 'composite operators' that SQL has, for instance: in a SQL join, you both join and project all in one operation. in this library, instead, you'd do the join, then the projection separately ... maybe less efficient, but more expressive and compositional. another example, is that it lacks things like equi-joins that would make joining more efficient; it doesn't have them b/c that's covered by inner joins.

No comments:

Post a Comment