Wednesday, March 14, 2012

Recipes for `git` branches

Recipes for working with branches in `git`

If you've ever used `git`, and had to create and manage branches, you may have been impressed by how flexible and accomodating it is, but discouraged by how hard it can be to remember all the incantations. That's why I've created a list of recipes for basic branching and merging operations. Enjoy!

These recipes were distilled by me based on information found in an awesome book by Scott Chacon.

# create a branch
git branch [new-branch] [base-branch]

# switch to an existing branch
git checkout [branch]

# push branch to remote
git push [remote] [branch]

# pull changes from remote
#   pulls down remote branches,
#   but doesn't set up tracking branches
git fetch [remote]

# create and switch to a new branch
git checkout -b [branch]

# create a new branch from a specific commit (starting point)
git branch [branch] [sha-1 hash]

# create new local branch from remote branch
git checkout -b [localbranch] [remote]/[remotebranch]
# or
git checkout --track [remote]/[branch]
#   see http://git-scm.com/book/en/Git-Branching-Remote-Branches#Tracking-Branches
#   for more information about tracking branches


# merge branch into master
#   have to first switch to master
git checkout master
git merge [branch]

# check that the merge worked
#   shows an "unmerged" section if any files failed
git status

# delete a branch (after merging it)
git branch -d newbranch

# see last commit on each branch
git branch -v

# see merged branches (to current branch)
git branch --merged

# see unmerged branches (to current branch)
git branch --no-merged

# delete remote branch
#   the syntax is widely acknowledged to be obtuse
git push origin :newbranch

Sunday, March 11, 2012

Analyzing data using pivot tables

Introduction

Pivot tables are a useful tool in data analysis, allowing a user to quickly and easily understand the relationship between two attributes. However, they aren't straightforward to create in many database systems. In this article, we'll look at:

  1. What is a pivot table?
  2. Why would I need a pivot table?
  3. How do I create a pivot table?
To answer these questions, we'll create and use pivot tables with MySQL.



Background

1. What is a pivot table? It's a summary table used for reporting, analysis, and quick data inspection purposes. The name comes from the fact that the columns in the pivot table were row values in the base table; these values are pivoted into columns headers.
Terminology used in this article:

  • base table: the table of interest that we'll analyze
  • columns of interest: the two columns from the base table, whose relationship is shown through the pivot table
  • y-values: values from the first column of interest. The pivot table will have one row for each distinct y-value
  • x-values: values from the second column of interest. The pivot table will typically have one column for each distinct x-value

2. Why would I need a pivot table? Pivot tables provide summaries of data that are much easier to grok than the full table. They typically focus on two columns, providing aggregate data which can help indicate the relationships between them.

3. How do I create a pivot table? There is a straight-forward procedure for creating pivot tables: (indicate no. of rows, columns after each step)

  1. select the two columns of interest. Call them x and y: the y-values become values in the first column, while the x-values become column headers. The pivot table ends up with 1 y column, plus 1 column for each x-value
  2. create a view/query that adds extra columns to each row of the base table, indicating the type of the row
  3. aggregate the extended table using appropriate aggregate functions, grouping by the y-value column

2. and 3. will be covered in more detail in the remainder of this article.



Example: pivoting investment data

In this example, we'll look at data showing shares purchased during a number of investments made over three years (note: these are hypothetical companies -- if the names bear any resemblance to real companies' names, it is purely coincidental :). We'll start with the table schema and some sample data:

create table investment (
  id       int primary key auto_increment, 
  year     int, 
  person   varchar(50), 
  stock    varchar(50), 
  shares   int
);

mysql> select * from investment;
+----+------+---------+--------+--------+
| id | year | person  | stock  | shares |
+----+------+---------+--------+--------+
|  1 | 2009 | Matt    | Google |     27 |
|  2 | 2009 | Jeffrey | Google |     13 |
|  3 | 2010 | Timothy | Yahoo  |     31 |
|  4 | 2009 | Matt    | Yahoo  |      4 |
|  5 | 2011 | Timothy | IBM    |    100 |
|  6 | 2011 | Timothy | Google |     34 |
|  7 | 2010 | Jeffrey | Yahoo  |     68 |
|  8 | 2010 | Jeffrey | Yahoo  |     18 |
|  9 | 2011 | Matt    | IBM    |     49 |
| 10 | 2011 | Matt    | Google |     22 |
| 11 | 2011 | Timothy | Yahoo  |     51 |
| 12 | 2009 | Jeffrey | Yahoo  |     63 |
+----+------+---------+--------+--------+

Requirements -- our final goal is queries showing:
  1. the number of shares of each company bought by each investor
  2. the number of times each investor has bought stock in each company
We'll do this by creating two separate, but related, pivot tables.


Step 1: choose columns of interest

According to the requirements, we need to see investor vs. stock. Thus, those will be our two columns. I'll choose `person` as the y-value column, and `stock` as the x-value column. (For this example, the choice is arbitrary, but see the "Limitations" section for a reason to choose more carefully).


Step 2: extend

Extend the base table with extra columns, one for each x-value of interest. Put an appropriate value in each of the columns, or a neutral value if the column doesn't apply:
create view investment_extended as (
  select
    investment.*, 
    case when stock = "Google"  then shares end as Google,
    case when stock = "IBM"     then shares end as IBM,
    case when stock = "Yahoo"   then shares end as Yahoo 
  from investment
);

mysql> select * from investment_extended;
+----+------+---------+--------+--------+--------+------+-------+
| id | year | person  | stock  | shares | Google | IBM  | Yahoo |
+----+------+---------+--------+--------+--------+------+-------+
|  1 | 2009 | Matt    | Google |     27 |     27 | NULL |  NULL |
|  2 | 2009 | Jeffrey | Google |     13 |     13 | NULL |  NULL |
|  3 | 2010 | Timothy | Yahoo  |     31 |   NULL | NULL |    31 |
|  4 | 2009 | Matt    | Yahoo  |      4 |   NULL | NULL |     4 |
|  5 | 2011 | Timothy | IBM    |    100 |   NULL |  100 |  NULL |
|  6 | 2011 | Timothy | Google |     34 |     34 | NULL |  NULL |
|  7 | 2010 | Jeffrey | Yahoo  |     68 |   NULL | NULL |    68 |
|  8 | 2010 | Jeffrey | Yahoo  |     18 |   NULL | NULL |    18 |
|  9 | 2011 | Matt    | IBM    |     49 |   NULL |   49 |  NULL |
| 10 | 2011 | Matt    | Google |     22 |     22 | NULL |  NULL |
| 11 | 2011 | Timothy | Yahoo  |     51 |   NULL | NULL |    51 |
| 12 | 2009 | Jeffrey | Yahoo  |     63 |   NULL | NULL |    63 |
+----+------+---------+--------+--------+--------+------+-------+
Note that:
  1. I'm using NULL as the neutral value
  2. whether a new column has a non-null value depends on its x-value: rows with `stock = "Google"` have a value for new column `Google` and `NULL`s for the other new columns
  3. this distinction "marks" each row with its type -- and is the key to getting Step 3 to work
  4. whenever a new column is non-null: its value is always equal to column `shares` -- so we can meet the requirements
  5. we didn't change the number of rows
  6. we added one column per x-value of interest


Step 3: group by y-values and apply aggregate function

Our first requirement was to total shares bought per investor at each company. To accomplish that, we start with the view from Step 2, and remember that column `person` provides the y-values, so we'll need to `group by person`; also, we need the `sum` aggregate function to total the rows.

create view pivot_company_shares as (
  select 
    person, 
    sum(Google)            as Google, 
    coalesce(sum(IBM), 0)  as IBM, 
    sum(Yahoo)             as Yahoo 
  from investment_extended 
  group by person
);

mysql> select * from pivot_company_shares;
+---------+--------+------+-------+
| person  | Google | IBM  | Yahoo |
+---------+--------+------+-------+
| Jeffrey |     13 |    0 |   149 |
| Matt    |     49 |   49 |     4 |
| Timothy |     34 |  100 |    82 |
+---------+--------+------+-------+
Look at that -- how beautiful! Note that in `pivot_company_shares`, we use `coalesce` to check the IBM column for null, returning 0 if it is and the column value otherwise. This is because one of the investors didn't buy any IBM, and using `sum` on a bunch of null values returns null! We could have just left it as null, but the 0 looks a lot nicer.
To meet the second requirement, which is how many times each investor bought each stock, we just need to know how many rows from the extended view have non-null values. An easy way to do that is with the aggregate function `count(n)` -- it returns the number of rows with a non-null value for column `n`.
create view pivot_company_transactions as (
  select 
    person, 
    count(Google)  as Google, 
    count(IBM)     as IBM, 
    count(Yahoo)   as Yahoo 
  from investment_extended 
  group by person
);

mysql> select * from pivot_company_transactions;
+---------+--------+-----+-------+
| person  | Google | IBM | Yahoo |
+---------+--------+-----+-------+
| Jeffrey |      1 |   0 |     3 |
| Matt    |      2 |   1 |     1 |
| Timothy |      1 |   1 |     2 |
+---------+--------+-----+-------+

And ... voila! We didn't need to check for NULL this time because count returns 0 if it only find NULLs.


Important points

  • what aggregate function to use when grouping. I used sum, but count and max are also often used (max is often used when building one-row "objects" that had been spread across many rows)
  • what value to use in the extra columns -- the requirements dictated `shares` in this case
  • what "neutral" value to use in the extra columns. I used NULL, but it could also be 0 or ""
  • using multiple columns for y-values. This solution isn't limited to using a single column for the y-values -- just plug the extra columns into the group by clause (and don't forget to select them)

Conclusion and limitations

Pivot tables are quite useful and relatively straightforward. However, one downside to the approach I've outlined is that you have to know how many columns the pivot table will have in advance. The frustrations this causes can range from mildly annoying (adds extra typing) to severely limiting, if there are hundreds of values.

The inspiration for this post came from reading the SQL Cookbook by Anthony Molinaro, chapter 12. I highly recommend this book to anyone interested in improving their SQL chops, from beginner to advanced.

Tuesday, March 6, 2012

Breaking down the "Big 4" of tennis

Introduction

As many tennis fans may already know, the ATP tour is currently experiencing an unprecedented era of dominance from its top players. They -- Novak Djokovic, Rafael Nadal, Roger Federer, and Andy Murray -- have collectively occupied the top four spots for nearly all of the last 3 1/2 years, thus earning them the appellation "the Big 4". In fact, since the beginning of 2009, only two other players (Juan Martin del Potro and Robin Soderling) have managed to break into the top four, and have spent a total of 17 weeks there!

There are many ways to compare tennis players, both by on-court performances:

  • won/loss record
  • first serve percentage
  • winners vs. unforced errors
  • on-court time per match

and off-court characteristics:

  • height
  • weight
  • nationality
  • age

However, in this article, we'll look into some of the stats behind the performances of these players in tournament finals. Why? Because the data is easy available on Wikipedia!

Trends: expected, surprising

If you've paid much attention to tennis for the past few years, you'll probably know that Nadal has been one of the best ever on clay, while Federer has been equally as good on grass. You'll probably also know that Djokovic had one of the greatest seasons ever last year, and that Murray hasn't yet won a grand slam title.

But maybe you don't know how well each of the Big 4 performs indoors vs. outdoors, or how well they perform in various continents. And how about their toughest opponents (outside of the Big 4) or opponent's country? We'll see that and more in the next section!

Data analysis

Let's start by looking at the performances by continent. (Remember that we're only looking at finals). Here's the data:

+----------------+---------------+------+--------+-------+------------+
| player name    | continent     | wins | losses | total | percentage |
+----------------+---------------+------+--------+-------+------------+
| Andy Murray    | Oceania       |    1 |      2 |     3 | 33         |
| Andy Murray    | Middle East   |    2 |      2 |     4 | 50         |
| Andy Murray    | Asia          |    4 |      1 |     5 | 80         |
| Andy Murray    | Europe        |    8 |      1 |     9 | 89         |
| Andy Murray    | North America |    7 |      4 |    11 | 64         |
| Novak Djokovic | Middle East   |    3 |      0 |     3 | 100        |
| Novak Djokovic | Asia          |    3 |      1 |     4 | 75         |
| Novak Djokovic | Oceania       |    4 |      0 |     4 | 100        |
| Novak Djokovic | North America |    7 |      7 |    14 | 50         |
| Novak Djokovic | Europe        |   12 |      6 |    18 | 67         |
| Rafael Nadal   | South America |    1 |      0 |     1 | 100        |
| Rafael Nadal   | Middle East   |    1 |      1 |     2 | 50         |
| Rafael Nadal   | Oceania       |    1 |      2 |     3 | 33         |
| Rafael Nadal   | Asia          |    3 |      3 |     6 | 50         |
| Rafael Nadal   | North America |    6 |      5 |    11 | 55         |
| Rafael Nadal   | Europe        |   34 |     10 |    44 | 77         |
| Roger Federer  | Oceania       |    5 |      1 |     6 | 83         |
| Roger Federer  | Asia          |    5 |      2 |     7 | 71         |
| Roger Federer  | Middle East   |    8 |      2 |    10 | 80         |
| Roger Federer  | North America |   18 |      4 |    22 | 82         |
| Roger Federer  | Europe        |   36 |     21 |    57 | 63         |
+----------------+---------------+------+--------+-------+------------+

Here are some of the interesting things that I noticed:

  • while Federer and Nadal have played far more finals in Europe than in any other continent, Djokovic has hardly played more in Europe than in North America, and Murray has actually played more in North America. Part of this could be due to Murray's historic difficulties on clay (as we'll see later) -- he's never played in a clay-court final, and Europe's lineup is quite heavy in clay-court tournaments.
  • Nadal has played in 11 North American finals and won 6 of those -- perhaps disproving the notion that he is a clay-court specialist. In fact, he's won the US Open, Indian Wells twice, and the Canadian Masters twice (once each in Montreal in Toronto) -- all of which are hard court tournaments.
  • Federer's win percentage in North America is much higher than in Europe: 82% vs 63%. A big part of that difference can be explained by the fact that he's 5-11 against Nadal in European finals.
  • Nadal is the only one of the Big 4 to contest a final in South America.
  • both Federer and Nadal have contested more European finals (57 and 44) than the total number of finals contested by Djokovic or Murray (43 and 32).

Next up, let's look at performance against court surface. Again, here's the data:

+----------------+---------+------+--------+-------+------------+
| player name    | surface | wins | losses | total | percentage |
+----------------+---------+------+--------+-------+------------+
| Andy Murray    | carpet  |    1 |      0 |     1 | 100        |
| Andy Murray    | grass   |    2 |      0 |     2 | 100        |
| Andy Murray    | hard    |   19 |     10 |    29 | 66         |
| Novak Djokovic | clay    |    7 |      3 |    10 | 70         |
| Novak Djokovic | grass   |    1 |      2 |     3 | 33         |
| Novak Djokovic | hard    |   21 |      9 |    30 | 70         |
| Rafael Nadal   | clay    |   32 |      4 |    36 | 89         |
| Rafael Nadal   | grass   |    3 |      3 |     6 | 50         |
| Rafael Nadal   | hard    |   11 |     14 |    25 | 44         |
| Roger Federer  | carpet  |    2 |      5 |     7 | 29         |
| Roger Federer  | clay    |    9 |     12 |    21 | 43         |
| Roger Federer  | grass   |   11 |      2 |    13 | 85         |
| Roger Federer  | hard    |   50 |     11 |    61 | 82         |
+----------------+---------+------+--------+-------+------------+

No surprises there at first glance: Federer's great on grass, and not so much on clay; Nadal's fantastic on clay but not so much on grass. What else does the data have to say?

  • Is Federer really that bad on clay? Well, he's 2-10 against Nadal in clay-court finals, meaning he's 7-2 against other opponents. Not bad at all.
  • Murray's production has been nearly exclusively on hard courts -- no clay-court finals, and just three between carpet and grass. Surprising considering how strong of an all-around player he is.
  • The difference between Nadal's clay-court and hard-court prowess is astounding: 89% finals won vs just 44%. Nevertheless, he's still won 11 hard-court titles, many of them at big tournaments.
  • Carpet-courts seem to have gone out of style. Only Federer, the oldest of the Big 4, has played more than one carpet final (and none recently).

Next up, we'll see how results differ based on event type. We'll start with the data:

+----------------+---------------+------+--------+-------+------------+
| player name    | event type    | wins | losses | total | percentage |
+----------------+---------------+------+--------+-------+------------+
| Andy Murray    | 250           |   11 |      5 |    16 | 69         |
| Andy Murray    | 500           |    3 |      1 |     4 | 75         |
| Andy Murray    | Grand Slam    |    0 |      3 |     3 | 0          |
| Andy Murray    | Masters       |    8 |      1 |     9 | 89         |
| Novak Djokovic | 250           |    6 |      4 |    10 | 60         |
| Novak Djokovic | 500           |    7 |      1 |     8 | 88         |
| Novak Djokovic | Championships |    1 |      0 |     1 | 100        |
| Novak Djokovic | Grand Slam    |    5 |      2 |     7 | 71         |
| Novak Djokovic | Masters       |   10 |      7 |    17 | 59         |
| Rafael Nadal   | 250           |    5 |      3 |     8 | 63         |
| Rafael Nadal   | 500           |   11 |      2 |    13 | 85         |
| Rafael Nadal   | Championships |    0 |      1 |     1 | 0          |
| Rafael Nadal   | Grand Slam    |   10 |      5 |    15 | 67         |
| Rafael Nadal   | Masters       |   19 |     10 |    29 | 66         |
| Rafael Nadal   | Olympics      |    1 |      0 |     1 | 100        |
| Roger Federer  | 250           |   20 |      6 |    26 | 77         |
| Roger Federer  | 500           |   12 |      4 |    16 | 75         |
| Roger Federer  | Championships |    6 |      1 |     7 | 86         |
| Roger Federer  | Grand Slam    |   16 |      7 |    23 | 70         |
| Roger Federer  | Masters       |   18 |     12 |    30 | 60         |
+----------------+---------------+------+--------+-------+------------+
  • Murray has been near-perfect in Masters finals. When will he figure out how to transfer that success to Grand Slam finals?
  • Nadal is the only member of the Big 4 to have contested an Olympic final (2008).
  • Federer set a record in 2011 by winning his sixth year-end championships title (6-1 record in that event). Nadal, however, has yet to win -- this is the largest remaining hole on his resume.

I'll throw in a couple of extra data tables as food for thought:

+------+-----------------------+
| year | Grand Slam titles won |
+------+-----------------------+
| 2004 |                     3 |
| 2005 |                     3 |
| 2006 |                     4 |
| 2007 |                     4 |
| 2008 |                     4 |
| 2009 |                     3 |
| 2010 |                     4 |
| 2011 |                     4 |
| 2012 |                     1 |
+------+-----------------------+

+------+--------------------+
| year | Masters titles won |
+------+--------------------+
| 2004 |                  3 |
| 2005 |                  8 |
| 2006 |                  6 |
| 2007 |                  7 |
| 2008 |                  7 |
| 2009 |                  8 |
| 2010 |                  6 |
| 2011 |                  9 |
+------+--------------------+

The utter dominance of the Big 4 at Masters and Grand Slam events (there are 9 Masters tournaments per year, and 4 Grand Slams -- as of this writing, only 1 has been played so far this year).

+----------------+------+
| player name    | rank |
+----------------+------+
| Andy Murray    |  109 |
| Novak Djokovic |   36 |
| Rafael Nadal   |   48 |
| Roger Federer  |   67 |
+----------------+------+

Each player's ranking when he contested his first final.

Conclusion

With 5 big Masters tournaments coming up in the next 2 1/2 months, and two Grand Slams soon after, it will be interesting to see if the Big 4 can continue their domination. How long can they keep it up?