Monday, April 30, 2012

Why is Clojure hard to learn?

I've spent a good deal of time in the last few months using Clojure, and my experiences have been pleasant. As a Lisp dialect, it's automically a very interesting programming language. But even better, it's also a JVM language, and benefits from very tight integration with the JVM and with Java code itself.

However, all is not perfect in the Cloverse. Clojure has a very steep learning curve. While for the masters, it can be a powerful, supple, flexible tool, for the rest of us, it's esoteric, foreign, and opaque.

Why is this?

My belief is that this difficulty is, to a large degree, caused by the natural structure and organization of Clojure code, which is very different from the natural structure and organization of projects in mainstream programming languages. I will differentiate between four models of code organization, and show how Clojure's is hardest for the novice (but not for the master).

language type system paradigm list of methods/functions object supports methods/functions are typed interactive access to docs easy to use REPL
Java static object-oriented yes: compile-time (IDE autocompletion) yes yes: IDE feature no
Python dynamic object-oriented yes: run-time (limited IDE support possible) no yes: interactive `help` function yes
Haskell static functional no yes yes (limited) yes
Clojure dynamic function no no yes yes

Analysis/interpretation

What's missing? Note that with both Haskell and Clojure, since they're not object-oriented, it's not possible to easily find all of the functions that can be invoked on an object. Why? In object-oriented languages, the most useful methods are members of a class or object, and can be found "through" the object; whereas functions are not organized as "belonging" to an object; a function accepting an X could appear in any module or file (and indeed, it may make sense to do so). Although note that methods can be part of other classes (think 'util' or 'helper' classes), and indeed, such an organization can be difficult to grok.

This problem is mitigated to a certain extent in Haskell because functions are statically typed, thus, given the type of an object, all relevant functions can be looked up (check out Hoogle for an example).

There's no common REPL for Java (that I know of); I count this as a major negative for Java, because it makes it much harder to interact with code. However, this is more than offset by the fantastic IDEs, such as Eclipse and NetBeans, that have been created to help manage Java code bases. The key features of these programs are interactive access to lists of applicable methods, documentation, imports, automatic refactoring ...

Clojure enjoys neither Haskell's typing, which provides an important modicum of documentation, nor the luxury of specialized IDEs. Thus it's very difficult to find all the functions that an object supports.

How can this be fixed? I don't know, but I think the key lack is that of access to relevant information from within the programming environment, whether it applicable functions of docs. When someone figures out an effective way to implement this, expect Clojure to become very popular.

Summary

The problem that I believe Clojure is facing, and that likely all languages face in their infancy, is how to lower the barrier to entry, and make it easy for newcomers to effectively learn to use the language. An important aspect of this is how the information contained in function, object, and module documentation is accessed, indexed, and searched by the programmer. Clojure is lacking in this area, and therefore presents difficulties to newcomers.

Thursday, April 26, 2012

Relational division

Intuitive meaning

Relational division is one of the core operators in relational algebra. Using it, one can answer questions such as "What are all of the that have/did/associate with ?"

Examples

Example 1: what are all the combinations of (n1, n2) that have values of (1, 1) and (1, 2) for (n3, n4)?
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  1 |  1 |  3 |
|  1 |  1 |  1 |  4 |
|  1 |  1 |  2 |  1 |     +----+----+     +----+----+
|  1 |  1 |  2 |  2 |     | n3 | n4 |     | n1 | n2 |
|  1 |  1 |  2 |  3 |  %  +----+----+  =  +----+----+
|  1 |  2 |  1 |  1 |     |  1 |  1 |     |  1 |  1 |
|  1 |  2 |  1 |  2 |     |  1 |  2 |     |  1 |  2 |
|  1 |  2 |  2 |  3 |     +----+----+     +----+----+
|  1 |  2 |  2 |  4 |
|  1 |  3 |  1 |  3 |
|  1 |  3 |  1 |  4 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
|  2 |  2 |  2 |  1 |
|  2 |  2 |  2 |  2 |
+----+----+----+----+
Example 2: what are all the combinations of (n1, n2) that have values of (1, 1) for (n3, n4)?
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  1 |  1 |  3 |
|  1 |  1 |  1 |  4 |
|  1 |  1 |  2 |  1 |     +----+----+     +----+----+
|  1 |  1 |  2 |  2 |     | n3 | n4 |     | n1 | n2 |
|  1 |  1 |  2 |  3 |  %  +----+----+  =  +----+----+
|  1 |  2 |  1 |  1 |     |  1 |  1 |     |  1 |  1 |
|  1 |  2 |  1 |  2 |     +----+----+     |  1 |  2 |
|  1 |  2 |  2 |  3 |                     |  2 |  1 |
|  1 |  2 |  2 |  4 |                     +----+----+
|  1 |  3 |  1 |  3 |
|  1 |  3 |  1 |  4 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
|  2 |  2 |  2 |  1 |
|  2 |  2 |  2 |  2 |
+----+----+----+----+

The code: template

Here's a template for relational division. Each numbered box must be filled according to the exact tables used and columns desired in the result:
select 1 from (
  select 
    2.*  -- don't forget the ".*" !!
  from 2
  inner join 3
  4
) as r
group by 5
having count(*) = (select count(*) from 3);
Explanation of the numbered boxes:
  1. columns in outermost select: set of dividend's columns that ARE NOT in the divisor. Also known as "quotient" columns
  2. dividend table: there should be no duplicates
  3. divisor table: its set of columns must form a PROPER subset of dividend's columns*
  4. join condition: all divisor columns must compare equal to their corresponding dividend columns
  5. group by columns: same as 1
Note that the purpose of the "having" clause is to allow selection only of those groups with number of rows equal to the number of rows in divisor.

How it works: (Example 2 from above)

dividend divisor Step 1: join dividend to divisor on divisor's columns (n3, n4) Step 2: group join results by quotient columns Step 3: select groups that have the right "count", and project onto quotient columns
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  1 |  1 |  3 |
|  1 |  1 |  1 |  4 |
|  1 |  1 |  2 |  1 | 
|  1 |  1 |  2 |  2 | 
|  1 |  1 |  2 |  3 | 
|  1 |  2 |  1 |  1 | 
|  1 |  2 |  1 |  2 | 
|  1 |  2 |  2 |  3 | 
|  1 |  2 |  2 |  4 |
|  1 |  3 |  1 |  3 |
|  1 |  3 |  1 |  4 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
|  2 |  2 |  2 |  1 |
|  2 |  2 |  2 |  2 |
+----+----+----+----+
+----+----+
| n3 | n4 |
+----+----+
|  1 |  1 |
|  1 |  2 |
+----+----+
+----+----+----+----+
| n1 | n2 | n3 | n4 |
+----+----+----+----+
|  1 |  1 |  1 |  1 |
|  1 |  1 |  1 |  2 |
|  1 |  2 |  1 |  1 |
|  1 |  2 |  1 |  2 |
|  2 |  1 |  1 |  1 |
|  2 |  2 |  1 |  2 |
+----+----+----+----+
+----+----+----------+
| n1 | n2 | count(*) |
+----+----+----------+
|  1 |  1 |        2 |
|  1 |  2 |        2 |
|  2 |  1 |        1 |
|  2 |  2 |        1 |
+----+----+----------+
+----+----+
| n1 | n2 |
+----+----+
|  1 |  1 |
|  1 |  2 |
+----+----+

The code: filled-in template

Using the tables shown in the examples above, and "dividend" and "divisor" as the names of the dividend and divisor tables, respectively, we can fill in the template:
select r.n1, r.n2 from (
  select 
    dividend.*
  from dividend
  inner join divisor
  using (n3, n4)
) as r 
group by r.n1, r.n2
having count(*) = (select count(*) from divisor);

Other methods

There is another popular way to do relational division, involving a double-negative ("... where not exists (where not in ...) ..."). Although others find it easier to understand, I do not. Also, one code use a code generator or stored procedure that parameterizes production of a SQL string, automatically filling in the template. I didn't use this approach because that opens the door to a host of SQL injection-related problems.

Wednesday, April 25, 2012

SQL join types

Introduction

There are many kinds of joins in SQL. I often have trouble keeping them straight. This article divides the joins into pairs of opposites, giving brief definitions of each along with their opposites. For the rest of this article, we'll be joining two tables, which we'll call the "left table" and the "right table". We'll be using these tables:
create table rel1 (
  id int primary key, 
  letter varchar(1)
);

create table rel2 (
  id int primary key, 
  fk int, 
    foreign key (fk) references rel1(id),
  mychar varchar(1)
);

select * from rel1;
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  3 | c      |
|  4 | d      |
+----+--------+

select * from rel2;
+----+--------+--------+
| id |    fk  | mychar |
+----+--------+--------+
|  1 |      1 | z      |
|  3 |      4 | o      |
|  5 |      2 | m      |
|  8 |      2 | q      |
+----+--------+--------+
`rel1` will be used as the left table, and `rel2` as the right table unless otherwise noted.

Inner vs. outer

Inner joins return only matching pairs of rows between the two tables. Outer joins also return rows from one or both tables that match 0 rows from the other table. (Note that `on rel1.id = rel2.id` is the "join predicate" in these examples)
select * 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+

select * 
from rel1 
outer join rel2 
  on rel1.id = rel2.id;
+------+--------+------+------+--------+
| id   | letter | id   | fk   | mychar |
+------+--------+------+------+--------+
|    1 | a      |    1 |    1 | z      |
|    2 | b      | NULL | NULL | NULL   |
|    3 | c      |    3 |    4 | o      |
|    4 | d      | NULL | NULL | NULL   |
| NULL | NULL   |    5 |    2 | m      |
| NULL | NULL   |    8 |    2 | q      |
+------+--------+------+------+--------+

Outer join: left and right outer join

Left outer joins return, in addition to the rows returned by the corresponding inner join, rows from the left table that don't match any rows in the right table. Right outer joins are identical to inner joins except that they also return unmatched rows from the right table.
select * 
from rel1 
left outer join rel2 
  on rel1.id = rel2.id;
+----+--------+------+------+--------+
| id | letter | id   | fk   | mychar |
+----+--------+------+------+--------+
|  1 | a      |    1 |    1 | z      |
|  2 | b      | NULL | NULL | NULL   |
|  3 | c      |    3 |    4 | o      |
|  4 | d      | NULL | NULL | NULL   |
+----+--------+------+------+--------+

select * 
from rel1 
right outer join rel2 
  on rel1.id = rel2.id;
+------+--------+----+------+--------+
| id   | letter | id | fk   | mychar |
+------+--------+----+------+--------+
|    1 | a      |  1 |    1 | z      |
|    3 | c      |  3 |    4 | o      |
| NULL | NULL   |  5 |    2 | m      |
| NULL | NULL   |  8 |    2 | q      |
+------+--------+----+------+--------+
Note that the union of a left join and a right join is a full outer join; their intersection is an inner join.

Semi vs. anti

"Semi"-joins return rows from the left table that match rows in the right table. "Anti"-joins return rows from the left table that don't match rows in the right table. Both of these join forms act as a sort of filter for the left table, since no columns from the right table are returned.
-- all rows in rel1 that have a matching row (by fk value) in rel2:
select * 
from rel1 
where id in (select fk from rel2);
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  4 | d      |
+----+--------+

-- all rel1 rows that don't have a matching row (by fk value) in rel2:
select * 
from rel1 
where id not in (select fk from rel2);
+----+--------+
| id | letter |
+----+--------+
|  3 | c      |
+----+--------+
Note that there are multiple ways to implement semi- and anti-joins in SQL:
-- "semi join" implemented using "inner join" and "distinct"
select distinct rel1.* 
from rel1 
inner join rel2 
  on rel1.id = rel2.fk;
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  4 | d      |
+----+--------+

-- "anti join" implemented using "left join" and "is null"
select distinct rel1.* 
from rel1 
left join rel2 
  on rel1.id = rel2.fk 
where rel2.id is null;
+----+--------+
| id | letter |
+----+--------+
|  3 | c      |
+----+--------+

Equi vs. theta

"Equi"-joins are those where the join predicate is an equality. "Theta"-joins are those where the join predicate is an inequality.
-- equi-join
select * 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+

-- theta-join
select * 
from rel1 
inner join rel2 
  on rel1.id != rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  2 | b      |  1 |    1 | z      |
|  3 | c      |  1 |    1 | z      |
|  4 | d      |  1 |    1 | z      |
|  1 | a      |  3 |    4 | o      |
|  2 | b      |  3 |    4 | o      |
|  4 | d      |  3 |    4 | o      |
|  1 | a      |  5 |    2 | m      |
|  2 | b      |  5 |    2 | m      |
|  3 | c      |  5 |    2 | m      |
|  4 | d      |  5 |    2 | m      |
|  1 | a      |  8 |    2 | q      |
|  2 | b      |  8 |    2 | q      |
|  3 | c      |  8 |    2 | q      |
|  4 | d      |  8 |    2 | q      |
+----+--------+----+------+--------+
Note that the union of an equi-join and a theta-join is a cartesian product, if the same tables and join predicate are used.

Implicit vs. explicit

source Explicit joins use one of the "join" keywords, such as "inner join" or "left join", combined with a join predicate:
select * from rel1 inner join rel2 on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+
Implicit joins do not use a "join" keyword, simply listing the tables in the "from" clause, and place the join predicate in the "where" clause:
select * from rel1, rel2 where rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+
Explicit joins are generally recommended, as they are thought to be more maintainable and more clearly express the programmer's intent. Note that this is purely a syntactic difference.

Cross join

This produces a cartesian product of the tables, joining every row in the left table with every row in the right table.
select * from rel1 inner join rel2 on 1;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  2 | b      |  1 |    1 | z      |
|  3 | c      |  1 |    1 | z      |
|  4 | d      |  1 |    1 | z      |
|  1 | a      |  3 |    4 | o      |
|  2 | b      |  3 |    4 | o      |
|  3 | c      |  3 |    4 | o      |
|  4 | d      |  3 |    4 | o      |
|  1 | a      |  5 |    2 | m      |
|  2 | b      |  5 |    2 | m      |
|  3 | c      |  5 |    2 | m      |
|  4 | d      |  5 |    2 | m      |
|  1 | a      |  8 |    2 | q      |
|  2 | b      |  8 |    2 | q      |
|  3 | c      |  8 |    2 | q      |
|  4 | d      |  8 |    2 | q      |
+----+--------+----+------+--------+

Natural join

The result is the set of rows, from the two tables, that have the same values for columns of the same name. If the tables share all columns names, the result is an intersection. If the tables share no column names, the result is a cartesian product. The number of columns in the result is the number of distinct column names in the tables.
select * 
from rel1 
natural join rel2;
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+

select * 
from rel1 
inner join rel2 
  using(id);
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+

select rel1.*, rel2.fk, rel2.mychar 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+
This join is not useful in practice, as it can be completely replaced by queries using (more easily understandable) inner joins. Additionally, it may produce non-intuitive results in that tables are not necessarily joined on foreign keys, which to my mind is the actual "natural" way to join tables. (Recall that rel1.id <=> rel2.fk was the foreign key relationship we defined).

Self join

This is just a special case of a join where the same table is used for both sides of the join. It can be combined with any of the preceding "join" forms.
select * 
from rel2 a 
left join rel2 b 
  on a.id < b.fk;
+----+------+--------+------+------+--------+
| id | fk   | mychar | id   | fk   | mychar |
+----+------+--------+------+------+--------+
|  1 |    1 | z      |    3 |    4 | o      |
|  1 |    1 | z      |    5 |    2 | m      |
|  1 |    1 | z      |    8 |    2 | q      |
|  3 |    4 | o      |    3 |    4 | o      |
|  5 |    2 | m      | NULL | NULL | NULL   |
|  8 |    2 | q      | NULL | NULL | NULL   |
+----+------+--------+------+------+--------+
Note the table aliases, used to distinguish between the two uses of the same table.

Summary

A convenient way to understand joins is in pairs, as I showed:
join typeoppositemajor point of difference
innerouterdo/don't keep non-matching rows
left outerright outeraugment inner join with unmatched rows from left/right table
semiantifilter rows from a single table, keeping/dropping those that match rows in another table
equithetajoin predicate is equality/inequality
implicitexplicitsyntax
Note that cross, natural, and self joins don't have opposites in this sense.

Sources

article about joins
Wikipedia's "join" article

Monday, April 23, 2012

MySQL: investigating performance with large tables

MySQL: performance of a large table

This article examines how well MySQL deals with a large (50 million row) table, covering indexes, inserts, selects, deletes, and updates.

The set-up

MySQL version5.5.11
operating systemMac OS X 10.5.8
processor2 x 2.8 GHz Quad-Core Intel Xeon
RAM6 GB 800 MHz DDR2
hard drive300 GB, about 75% free
programming languageClojure
MySQL driverJDBC
number of rows50 million
approximate average row size115 bytes
Here's the schema:
create table mediumtable (
 id                 int primary key auto_increment,
 decimalfield1      decimal(10,2),
 indexedstringfield varchar(10),
   index isf (indexedstringfield(4)),
 indexfield1        int,
 indexfield2        int,
   index if1if2 (indexfield1, indexfield2)
);


create table largetable (
 id             int primary key auto_increment,
 intfield1      int,
 indexedintfield int,
   index iif (indexedintfield),
 stringfield    varchar(50),
 foreignkeyfield int,
   foreign key (foreignkeyfield) references mediumtable (id),
 indexfield1    int,
 indexfield2    int,
 indexfield3    int,
   index if1if2if3 (indexfield1, indexfield2, indexfield3),
 notnullint     int not null,
 notnullstring  varchar(20) not null
);

Building the database

I used Clojure to generate and insert 50 million rows for "largetable" and 50000 rows for "mediumtable". For each column, a random string or integer was generated depending on the type. Here's the code that was used:
(defn make-random-record
  []
  (let [f1 (rand-int 10000000)
        f2 (rand-int 1000)
        f3 (get-rand-string 40)
        f4 (+ 1 (rand-int 50005)) ;; because the keys are 1 to 50005 
        f5 (rand-int 100000)
        f6 (rand-int 100000)
        f7 (rand-int 100000)
        f8 (rand-int 100000000)
        f9 (get-rand-string 15)]
    {:intfield1       f1 
     :indexedintfield f2
     :stringfield     f3
     :foreignkeyfield f4
     :indexfield1     f5
     :indexfield2     f6
     :indexfield3     f7
     :notnullint      f8
     :notnullstring   f9}))

(defn insert-random-record
  """inserts n randomly generated records into 'largetable'"""
  [n]
  (with-connection db
   (dotimes [_ n]
    (insert-records "largetable" (make-random-record)))))
Row insertion, done with `(time (insert-random-record 50000000))`, took 33.5 hours.

Index statistics

mysql> select * from information_schema.statistics where table_schema = 'mysqlclojure';
+---------------+--------------+-------------+------------+--------------+-----------------+--------------+--------------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME  | NON_UNIQUE | INDEX_SCHEMA | INDEX_NAME      | SEQ_IN_INDEX | COLUMN_NAME        | COLLATION | CARDINALITY | SUB_PART | PACKED | NULLABLE | INDEX_TYPE | COMMENT | INDEX_COMMENT |
+---------------+--------------+-------------+------------+--------------+-----------------+--------------+--------------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| def           | mysqlclojure | largetable  |          0 | mysqlclojure | PRIMARY         |            1 | id                 | A         |    50000169 |     NULL | NULL   |          | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | iif             |            1 | indexedintfield    | A         |          18 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | foreignkeyfield |            1 | foreignkeyfield    | A         |       82372 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | if1if2if3       |            1 | indexfield1        | A         |      217392 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | if1if2if3       |            2 | indexfield2        | A         |    50000169 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | largetable  |          1 | mysqlclojure | if1if2if3       |            3 | indexfield3        | A         |    50000169 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          0 | mysqlclojure | PRIMARY         |            1 | id                 | A         |       50242 |     NULL | NULL   |          | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          1 | mysqlclojure | isf             |            1 | indexedstringfield | A         |       50242 |        4 | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          1 | mysqlclojure | if1if2          |            1 | indexfield1        | A         |       25121 |     NULL | NULL   | YES      | BTREE      |         |               |
| def           | mysqlclojure | mediumtable |          1 | mysqlclojure | if1if2          |            2 | indexfield2        | A         |       50242 |     NULL | NULL   | YES      | BTREE      |         |               |
+---------------+--------------+-------------+------------+--------------+-----------------+--------------+--------------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
10 rows in set (1.65 sec)


mysql> show indexes from largetable;
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name        | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| largetable |          0 | PRIMARY         |            1 | id              | A         |    50000169 |     NULL | NULL   |      | BTREE      |         |               |
| largetable |          1 | iif             |            1 | indexedintfield | A         |          18 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | foreignkeyfield |            1 | foreignkeyfield | A         |       65876 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | if1if2if3       |            1 | indexfield1     | A         |      245098 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | if1if2if3       |            2 | indexfield2     | A         |    50000169 |     NULL | NULL   | YES  | BTREE      |         |               |
| largetable |          1 | if1if2if3       |            3 | indexfield3     | A         |    50000169 |     NULL | NULL   | YES  | BTREE      |         |               |
+------------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (1.54 sec)

Performance of "select" statements, part 1: using the primary key

Selecting 1 row:
select * from largetable where id = 45000000;
...
1 row in set (0.02 sec)
Select 15 non-consecutive rows:
select * from largetable where id in (900, 1000, 10000, 100000, 1000000, 10000000, 48000000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000);
...
15 rows in set (0.14 sec)
Selecting 10000 consecutive rows:
select * from largetable where id between 2000000 and 2010000;
...
10001 rows in set (0.06 sec)
Note that it's faster to select many consecutive rows than few non-consecutive rows. This is because random I/Os are more expensive than sequential I/Os. Also, whereas presumably each of the non-consecutive rows is on a separate disk page, the pages filled by the consecutive rows are probably filled *only* with those rows. Also note that re-running each of these queries results in much faster execution the second time, presumably due to page caching. Since there are no concurrent inserts, this doesn't cause any concurrency issues.

Performance of "select" statements, part 2: using secondary indexes

Selecting a single row by secondary index value:
select * from largetable where indexedintfield = 155 limit 1;
...
1 row in set (0.11 sec)
Selecting 15 rows by secondary index value:
select * from largetable where indexfield1 = 14900 limit 15;
...
15 rows in set (0.13 sec)
Selecting many rows by secondary index value:
select * from largetable where indexfield1 = 149;
...
502 rows in set (5.95 sec)
Select a range of rows by secondary index value:
select * from largetable where indexfield1 between 149 and 152;
...
2058 rows in set (17.97 sec)
Note the similarity in execution time between query 2 and the second query of the primary key section: 0.14 secs vs 0.13 secs. Both have to read 15 rows that are essentially randomly distributed on disk. Also note that reading 2000 rows from a secondary index takes about 1000 times as long per row as reading from a primary index. Again, this is because InnoDB stores rows more or less consecutively by primary key value.

Performance of "select" statements, part 3: aggregates

Grouping on a secondary index:
select indexedintfield, count(*) from largetable group by indexedintfield;
...
1000 rows in set (7 min 59.51 sec)
To answer this query, even though MySQL doesn't have to read the table itself, it still has to read the entire index on "indexedintfield", which is on the order of hundreds of millions of bytes (at the leaf level of the index, there are 50 million entries, each with the index value and the primary key value -- even allowing for compression of identical index values, there are still over 50 million integers, and this is ignoring overhead.
Counting the number of rows in the table:
mysql> select count(*) from largetable;
...
1 row in set (5 min 48.61 sec)

mysql> select count(*) from largetable force index (primary);
...
1 row in set (2 min 22.84 sec)
Apparently, the performance of this query depends on which storage engine is in use. InnoDB has to read the entire table. Also note that forcing use of the primary (clustered) index speeds up the query. It's unclear to me why MySQL uses a secondary index when given the choice.

Performance of "insert" statements

The speed of insert operations may have slowed down at this size. When the table was nearly empty, approximately 500 rows/second could be inserted from Clojure. With 50 million rows already in the table, approximately 350 rows/second are inserted using the same function. I am not sure if this is due to the table size or to some other factor.

Performance of "update" statements

Updating by primary key value:
mysql> update largetable set notnullint = notnullint + 2 where id > 49950000;
Query OK, 50000 rows affected (0.61 sec)
Rows matched: 50000  Changed: 50000  Warnings: 0
Updating by secondary key value:
mysql> update largetable set notnullint = notnullint + 2 where indexedintfield = 999;
Query OK, 49744 rows affected (5 min 25.59 sec)
Rows matched: 49744  Changed: 49744  Warnings: 0
No surprises here: the I/O necessary to find the rows completely dwarfs all other costs.

Performance of "delete" statements

Deleting rows by primary key value:
mysql> delete from largetable where id = 50000001;
Query OK, 1 row affected (0.23 sec)
Deleting rows by secondary key value:
mysql> delete from largetable where indexedintfield = 500 and id > 50000000 limit 1;
Query OK, 1 row affected (1.50 sec)
Deleting ~170,000 rows by primary key range:
mysql> delete from largetable where id > 50000000;
Query OK, 1 row affected (2 min 17.28 sec)
As expected, it's faster to delete by primary key than secondary. What is surprising is how long it takes to delete many rows by primary key value.

Performance of "select" statements, part 4: joins

Since there is a one-to-many relation between "mediumtable" and "largetable", joins with the "where" clause restricting "largetable" are fast:
select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where l.id = 56;
...
1 row in set (0.00 sec)
while joins with the "where" clause restricting "mediumtable" are slow:
select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where m.id = 56;
...
982 rows in set (9.72 sec)
This is because each row of "mediumtable" corresponds to many rows in "largetable" (as shown in the query results), which are spread around in various disk pages.
mysql> select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where l.id <= 500;
...
500 rows in set (5.99 sec)

mysql> select * from largetable l inner join mediumtable m on l.foreignkeyfield = m.id where l.id between 150000 and 150499;
...
500 rows in set (0.09 sec)
I'm not sure why the second query is so much faster than the first, but I assume it's because "mediumtable" is loaded into cache during execution of the first query, so that the second query doesn't need to do any (or very little) I/O on "mediumtable".

Performance of "select" statements, part 5: subqueries

Check out these two queries:
mysql> select max(indexedintfield), min(indexedintfield) from largetable;
...
1 row in set (0.00 sec)

mysql> explain select max(a), min(a) from (select indexedintfield a from largetable) q;
...
??
What's the difference? The second one uses a subquery to rename fields, but it takes much longer to execute. Looks like MySQL's historical problems with subqueries are still ongoing.

Conclusions

What we've seen leads me to the conclusions that MySQL works fine as an OLTP database with 50 million rows, if the queries are restricted to simple joins primary key selects (note that all of this was done on a PC, without any performance tuning -- by no means a high-end server). However, aggregate queries and subqueries do not seem to work well and would probably cause an unacceptable performance hit if this were a "live" database. If that were the case, denormalization might be necessary. Therefore, I would be very hesitant to do any data analysis with such a setup. It's also to note that, with such a large database, full backups and dumps take a very long time; sharding and partitioning become important.

Wednesday, April 18, 2012

Understanding MySQL: execution plans, optimization, and table statistics

In my previous article, I showed some statistics and tables of rankings data from almost 40 years of rankings data provided by the ATP website (http://www.atpworldtour.com). In this article, I'll look at the code that generated those numbers, and how MySQL executes that code efficiently.

MySQL query execution: brief introduction

  1. a query is sent to the MySQL server (somehow: from a program, from the console, etc.)
  2. the query is parsed. if any errors occur, execution aborts
  3. relevant table statistics and indexes are located
  4. the optimizer generates an execution plan
  5. the plan is executed and the result set is returned to the client

The data model

I'll look at queries covering three tables:
create table country (
  name  varchar(3) primary key
);

create table player (
  pid int primary key,
  fname varchar(30),
  lname varchar(30),
  country varchar(3),
  foreign key (country) references country(name)
);

create table ranking (
  monday date,
  pid int,
  primary key (monday, pid),
  foreign key (pid) references player(pid),
  rank int
);
So what we have is a set of rankings every week, where typically each rank from 1 to 100 has a player at that ranking. And each player is associated with a country. The only indexes are those created by default by MySQL for primary keys and foreign keys.

1: query sent to MySQL

I'll use this query for the rest of this article:
  select
    p.fname,
    p.lname,
    r.*,
    c.*
  from ranking  r
  inner join player p
    on r.pid = p.pid
  inner join country c
    on p.country = c.abbreviation;

2: MySQL parses the query, checking for syntax errors, etc.

Since this query happens to be syntactically correct, and doesn't mention non-existent tables, MySQL accepts it cheerfully.

3: statistics and indexes are located

I can find these for myself using this query (see explanation of information_schema.statistics table in the docs here):
mysql> select * from information_schema.statistics where table_name in ('player', 'country', 'ranking');
+---------------+--------------+------------+------------+--------------+------------+--------------+--------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | NON_UNIQUE | INDEX_SCHEMA | INDEX_NAME | SEQ_IN_INDEX | COLUMN_NAME  | COLLATION | CARDINALITY | SUB_PART | PACKED | NULLABLE | INDEX_TYPE | COMMENT | INDEX_COMMENT |
+---------------+--------------+------------+------------+--------------+------------+--------------+--------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
| def           | tennis       | country    |          0 | tennis       | PRIMARY    |            1 | name         | A         |         202 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | player     |          0 | tennis       | PRIMARY    |            1 | pid          | A         |        1069 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | player     |          1 | tennis       | country    |            1 | country      | A         |         152 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | ranking    |          0 | tennis       | PRIMARY    |            1 | monday       | A         |        3401 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | ranking    |          0 | tennis       | PRIMARY    |            2 | pid          | A         |      166696 |     NULL | NULL   |          | BTREE      |         |               |
| def           | tennis       | ranking    |          1 | tennis       | pid        |            1 | pid          | A         |        1424 |     NULL | NULL   |          | BTREE      |         |               |
+---------------+--------------+------------+------------+--------------+------------+--------------+--------------+-----------+-------------+----------+--------+----------+------------+---------+---------------+
(TODO: figure out why there is an extra index on ranking.pid. I think it's because it's a foreign key) The important columns here are:
  • INDEX_NAME: it's "PRIMARY" for indexes on primary keys, and the column name for foreign keys. Note that I didn't specify column names, which allowed MySQL to default to this.
  • SEQ_IN_INDEX: indexes can cover more than one column. Since ranking has a two-column primary key, its second column (pid) has SEQ_IN_INDEX = 2
  • CARDINALITY: this is a very important piece of information for performance reasons. Cardinality is MySQL's estimate of the number of distinct values for that index. Indexes with low cardinality will be less selective. Note that the cardinality of a primary key should always be equal to the number of rows in the table, by definition. Also note that these statistics are just estimates.

4: execution plan is generated

MySQL will then use the statistics to come up with a pretty good execution plan. It may not be the best, but it's quite often very good. Here's what it came up with in this case (see the docs here for a more complete explanation of this output):
mysql> explain select p.fname,
    p.lname,
    r.*,
    c.*
  from ranking  r
  inner join player p
    on r.pid = p.pid
  inner join country c
    on p.country = c.abbreviation;
+----+-------------+-------+--------+-----------------+---------+---------+------------------+------+-------+
| id | select_type | table | type   | possible_keys   | key     | key_len | ref              | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+------------------+------+-------+
|  1 | SIMPLE      | p     | ALL    | PRIMARY,country | NULL    | NULL    | NULL             | 1097 |       |
|  1 | SIMPLE      | c     | eq_ref | PRIMARY         | PRIMARY | 5       | tennis.p.country |    1 |       |
|  1 | SIMPLE      | r     | ref    | pid             | pid     | 4       | tennis.p.pid     |  117 |       |
+----+-------------+-------+--------+-----------------+---------+---------+------------------+------+-------+
3 rows in set (0.00 sec)
This output shows one row for each table, view, or subquery that is accessed in the query. Important columns to pay attention to:
  • type: this is known as the "join type", but the name is somewhat deceiving because some queries don't involve joins. This column says how the table was accessed. "ALL" generally means that a tablescan is used to read all rows. "eq_ref" means that rows in the table match at most a single row from the previous tables. "ref" is similar to "eq_ref" but means that rows can match multiple rows from earlier tables.
  • ref: the column, from the previous tables, used for comparison when joining. For example, it's "tennis.p.country" for the second row, which means that rows from "country" need to match the "country" column of "player"
  • rows: this is MySQL's estimate for the number of rows of the previous tables that each row of the current table will match. Note that for the first table, there are no previous tables to match, so the number if simply MySQL's estimate of the number of rows in "player".
Also note that the order the tables are accessed in does not have to match the order of the tables as given in the query. That's okay; MySQL is free to optimize the table access order, as long as it doesn't change the final result.

5: query evaluated and result set returned

The result is too big to show, but it takes less than a second:
166208 rows in set (0.79 sec)

Monday, April 16, 2012

Tennis rankings statistics

Introduction

Total number of weeks spent in the top 10

+------------+------------+-----------------+
| first name | last name  | number of weeks |
+------------+------------+-----------------+
| Andre      | Agassi     |             747 |
| Pete       | Sampras    |             586 |
| Boris      | Becker     |             576 |
| Roger      | Federer    |             506 |
| Stefan     | Edberg     |             497 |
| Jimmy      | Connors    |             489 |
| Ivan       | Lendl      |             482 |
| Andy       | Roddick    |             440 |
| Yevgeny    | Kafelnikov |             390 |
| Michael    | Chang      |             369 |
| Rafael     | Nadal      |             364 |
| Goran      | Ivanisevic |             328 |
+------------+------------+-----------------+

Number of different years ranked in top 10

+----------------+-------------------+-------------------------+
| first name     | last name         | years ranked in top 100 |
+----------------+-------------------+-------------------------+
| Andre          | Agassi            |                      21 |
| Fabrice        | Santoro           |                      21 |
| Jimmy          | Connors           |                      20 |
| Ivan           | Lendl             |                      17 |
| John           | McEnroe           |                      17 |
| Magnus         | Gustafsson        |                      16 |
| Jonas          | Bjorkman          |                      16 |
| Boris          | Becker            |                      16 |
| Pete           | Sampras           |                      16 |
| Carlos         | Moya              |                      15 |
| Nicolas        | Lapentti          |                      15 |
| Vincent        | Spadea            |                      15 |
| Michael        | Chang             |                      15 |
| Thomas         | Muster            |                      15 |
| Arnaud         | Clement           |                      15 |
| Guy            | Forget            |                      15 |
| Wayne          | Ferreira          |                      14 |
| Guillermo      | Vilas             |                      14 |
| Jason          | Stoltenberg       |                      14 |
| Lleyton        | Hewitt            |                      14 |
| Roger          | Federer           |                      14 |
| Mark           | Woodforde         |                      14 |
| Tommy          | Haas              |                      14 |
| Stefan         | Edberg            |                      14 |
| Marc           | Rosset            |                      14 |
| Anders         | Jarryd            |                      14 |
| Kenneth        | Carlsen           |                      14 |
| Jakob          | Hlasek            |                      14 |
| Ivan           | Ljubicic          |                      14 |
| Goran          | Ivanisevic        |                      14 |
| Juan Carlos    | Ferrero           |                      14 |
| Thomas         | Enqvist           |                      14 |
| Greg           | Rusedski          |                      14 |
+----------------+-------------------+-------------------------+

Longevity: number of years between first and last top 100 ranking

+----------------+-------------------+-----------+
| first name     | last name         | longevity |
+----------------+-------------------+-----------+
| Andre          | Agassi            |   19.9068 |
| Fabrice        | Santoro           |   19.8493 |
| Jimmy          | Connors           |   19.4849 |
| Ivan           | Lendl             |   16.1288 |
| John           | McEnroe           |   16.0137 |
| Ronald         | Agenor            |   14.9781 |
| Guillermo      | Vilas             |   14.9014 |
| Pete           | Sampras           |   14.7863 |
| Vincent        | Spadea            |   14.6904 |
| Boris          | Becker            |   14.6137 |
| Jonas          | Bjorkman          |   14.5753 |
| Mats           | Wilander          |   14.5370 |
| Magnus         | Gustafsson        |   14.4411 |
| Michael        | Chang             |   14.1918 |
| Guy            | Forget            |   14.1342 |
| Nicolas        | Lapentti          |   14.1151 |
+----------------+-------------------+-----------+

Top players by points

Points are awarded weekly. If a player is ranked n for one week, he gets (100 - n) points. Thus, the top-ranked player gets 99 points for that week.
+----------------+-------------------+-------+-------+---------+
| first name     | last name         | total | weeks | average |
+----------------+-------------------+-------+-------+---------+
| Andre          | Agassi            | 89112 |  1019 | 87.4504 |
| Pete           | Sampras           | 68204 |   767 | 88.9231 |
| Boris          | Becker            | 65467 |   762 | 85.9147 |
| Stefan         | Edberg            | 62604 |   690 | 90.7304 |
| Roger          | Federer           | 59964 |   655 | 91.5481 |
| Michael        | Chang             | 57769 |   736 | 78.4905 |
| Jimmy          | Connors           | 55662 |   636 | 87.5189 |
| Carlos         | Moya              | 55515 |   723 | 76.7842 |
| Ivan           | Lendl             | 54568 |   596 | 91.5570 |
| Goran          | Ivanisevic        | 53588 |   665 | 80.5835 |
| Thomas         | Muster            | 53352 |   716 | 74.5140 |
| Andy           | Roddick           | 52508 |   576 | 91.1597 |
| Lleyton        | Hewitt            | 51084 |   646 | 79.0774 |
| Wayne          | Ferreira          | 49854 |   699 | 71.3219 |
| Fabrice        | Santoro           | 49553 |   942 | 52.6040 |
| Juan Carlos    | Ferrero           | 48744 |   655 | 74.4183 |
| Yevgeny        | Kafelnikov        | 47749 |   542 | 88.0978 |
| Tim            | Henman            | 47517 |   618 | 76.8883 |
| Jim            | Courier           | 47256 |   625 | 75.6096 |
| John           | McEnroe           | 47115 |   537 | 87.7374 |
| Tommy          | Haas              | 45905 |   643 | 71.3919 |
| Richard        | Krajicek          | 44432 |   578 | 76.8720 |
| Brad           | Gilbert           | 44093 |   558 | 79.0197 |
+----------------+-------------------+-------+-------+---------+

Effect of time on height and weight

+------+----------+---------+
| year | height   | weight  |
+------+----------+---------+
| 1973 | 181.8135 | 76.3189 |
| 1974 | 181.7914 | 76.4164 |
| 1975 | 181.3225 | 75.8736 |
| 1976 | 181.7393 | 76.2832 |
| 1977 | 182.1372 | 76.9437 |
| 1978 | 182.5895 | 76.6278 |
| 1979 | 182.9831 | 76.5511 |
| 1980 | 182.8359 | 76.1083 |
| 1981 | 179.1964 | 72.9821 |
| 1982 | 181.6087 | 75.9022 |
| 1983 | 183.2094 | 76.4364 |
| 1984 | 183.6162 | 76.7500 |
| 1985 | 183.6465 | 76.1334 |
| 1986 | 183.2473 | 76.1448 |
| 1987 | 183.3163 | 76.3040 |
| 1988 | 183.2512 | 76.4948 |
| 1989 | 183.3446 | 76.6306 |
| 1990 | 183.5925 | 76.5753 |
| 1991 | 183.7840 | 76.7710 |
| 1992 | 184.2427 | 77.1044 |
| 1993 | 184.3413 | 77.7862 |
| 1994 | 184.5725 | 78.1500 |
| 1995 | 184.9135 | 78.4188 |
| 1996 | 185.4239 | 79.3361 |
| 1997 | 185.3892 | 79.6341 |
| 1998 | 185.0104 | 79.7397 |
| 1999 | 184.5593 | 79.7152 |
| 2000 | 184.7325 | 80.1408 |
| 2001 | 184.1270 | 79.5645 |
| 2002 | 184.2771 | 79.3858 |
| 2003 | 184.5173 | 79.7938 |
| 2004 | 185.2883 | 80.3637 |
| 2005 | 184.7863 | 80.2042 |
| 2006 | 184.5260 | 79.5190 |
| 2007 | 184.9772 | 79.6904 |
| 2008 | 185.4658 | 79.8740 |
| 2009 | 185.2888 | 79.6398 |
| 2010 | 185.7619 | 79.9013 |
| 2011 | 185.8758 | 79.0297 |
| 2012 | 186.1962 | 79.1639 |
+------+----------+---------+

Number of weeks players ranked #1 in a year

+---------------------+------+----------+
| name                | year | weeks #1 |
+---------------------+------+----------+
| Ilie Nastase        | 1973 |       20 |
| Jimmy Connors       | 1974 |       23 |
| Ilie Nastase        | 1974 |       21 |
| John Newcombe       | 1974 |        8 |
| Jimmy Connors       | 1975 |       47 |
| Jimmy Connors       | 1976 |       24 |
| Jimmy Connors       | 1977 |       37 |
| Bjorn Borg          | 1977 |        1 |
| Jimmy Connors       | 1978 |       10 |
| Jimmy Connors       | 1979 |       21 |
| Bjorn Borg          | 1979 |        7 |
| Bjorn Borg          | 1980 |       14 |
| John McEnroe        | 1980 |        4 |
| John McEnroe        | 1981 |        3 |
| Bjorn Borg          | 1981 |        2 |
| Jimmy Connors       | 1982 |        8 |
| John McEnroe        | 1982 |        6 |
| John McEnroe        | 1983 |       11 |
| Jimmy Connors       | 1983 |        6 |
| Ivan Lendl          | 1983 |        4 |
| John McEnroe        | 1984 |       27 |
| Ivan Lendl          | 1984 |       12 |
| John McEnroe        | 1985 |       34 |
| Ivan Lendl          | 1985 |       18 |
| Ivan Lendl          | 1986 |       52 |
| Ivan Lendl          | 1987 |       52 |
| Ivan Lendl          | 1988 |       36 |
| Mats Wilander       | 1988 |       16 |
| Ivan Lendl          | 1989 |       48 |
| Mats Wilander       | 1989 |        4 |
| Ivan Lendl          | 1990 |       32 |
| Stefan Edberg       | 1990 |       21 |
| Stefan Edberg       | 1991 |       40 |
| Boris Becker        | 1991 |       12 |
| Jim Courier         | 1992 |       41 |
| Stefan Edberg       | 1992 |       11 |
| Pete Sampras        | 1993 |       35 |
| Jim Courier         | 1993 |       17 |
| Pete Sampras        | 1994 |       52 |
| Andre Agassi        | 1995 |       30 |
| Pete Sampras        | 1995 |       22 |
| Pete Sampras        | 1996 |       45 |
| Thomas Muster       | 1996 |        6 |
| Andre Agassi        | 1996 |        2 |
| Pete Sampras        | 1997 |       52 |
| Pete Sampras        | 1998 |       46 |
| Marcelo Rios        | 1998 |        6 |
| Pete Sampras        | 1999 |       24 |
| Andre Agassi        | 1999 |       19 |
| Yevgeny Kafelnikov  | 1999 |        6 |
| Carlos Moya         | 1999 |        2 |
| Patrick Rafter      | 1999 |        1 |
| Andre Agassi        | 2000 |       36 |
| Pete Sampras        | 2000 |       10 |
| Gustavo Kuerten     | 2000 |        4 |
| Marat Safin         | 2000 |        2 |
| Gustavo Kuerten     | 2001 |       39 |
| Lleyton Hewitt      | 2001 |        7 |
| Marat Safin         | 2001 |        7 |
| Lleyton Hewitt      | 2002 |       52 |
| Lleyton Hewitt      | 2003 |       21 |
| Andre Agassi        | 2003 |       14 |
| Andy Roddick        | 2003 |        9 |
| Juan Carlos Ferrero | 2003 |        8 |
| Roger Federer       | 2004 |       48 |
| Andy Roddick        | 2004 |        4 |
| Roger Federer       | 2005 |       52 |
| Roger Federer       | 2006 |       52 |
| Roger Federer       | 2007 |       53 |
| Roger Federer       | 2008 |       32 |
| Rafael Nadal        | 2008 |       20 |
| Rafael Nadal        | 2009 |       26 |
| Roger Federer       | 2009 |       26 |
| Rafael Nadal        | 2010 |       30 |
| Roger Federer       | 2010 |       22 |
| Rafael Nadal        | 2011 |       26 |
| Novak Djokovic      | 2011 |       26 |
| Novak Djokovic      | 2012 |       15 |
+---------------------+------+----------+

Notes

All data was obtained from the ATP's website, where it is freely available. The data is incomplete in many cases, so while all efforts were made to ensure correctness of results, their accuracy can not be guaranteed.