Tuesday, January 10, 2012

Analyzing college football's biggest games: part 1

Are you a college football fan?  Ever wondered whether the games are more exciting this year than last?  Whether higher-ranked teams perform better than lower-ranked teams?  Whether this year's games were the highest scoring?

Background


Every year following the college football regular season since 1998, the Bowl Championship Series (or BCS) has held 5 (originally 4) bowl games between some of the top teams in the country.  We're going to look at some of the numbers behind the games.

In this post, we'll see the data schema and how it's analyzed.  In the second part, we'll look at the data itself.

The requirements


I wanted to analyze the statistics based on a few different criteria, including:
  1. team
  2. conference
  3. bowl
  4. year
  5. team rank
  6. qualification
  7. score
So I needed to have all these captured as attributes somehow.

The schema


After a few rounds of normalization/head-banging/repeat, there were two main entities:  bcsgame and bcsteam.  A "bcsgame" occurred at a specific location on a specific date, between two "bcsteam"s, and ended with each "bcsteam" have a number of points.  Also, each "bcsteam" was ranked (using the pre-bowl BCS poll) and associated with a specific conference at the start of the game.

Here's what that looked like translated into MySQL code:

create table if not exists `bcs_game` (
  `id`              int primary key auto_increment,
  `date`            date,
  `city`            varchar(45),
  `bowl`            varchar(45),
  `isChampionship`  boolean
);

create table if not exists `bcs_team` (
  `id`         int primary key auto_increment,
  `name`       varchar(45),
  `conf`       varchar(45),
  `points`     int,
  `rank`       int,
  `qual`       varchar(45),
  `id_game`    int,
    foreign key (`id_game` )
      references `football`.`bcs_game` (`id` )
);

Well, that looks pretty straightforward:  we have "bcs_game" with its attributes, and "bcs_team" with its attributes, and a foreign key to the game that the team participated in.

Analyzing the schema: strengths and weaknesses

 - strengths
  • all teams, winners and losers, in one table.  This makes it much easier to query than what I had originally tried:  two foreign keys from "bcs_game" to "bcs_team"
  • winner/loser is determined by score only -- there's no need to store duplicate information with a winner/loser column.

 - weaknesses
  • problem with national championship game.  After the 2006 season, the BCS expanded to 5 games each year by creating a special national championship game.  Previously it had rotated between the four big bowls.  This was awkward to deal with in the schema.
  • a programmer's (mine!) mistake could result in a game between three or more teams (or 0 or 1)!  There's nothing enforcing the requirement that exactly two teams participate in a game.
  • "conference" and "bowl" could/should be entities in their own right.

Populating the database

All the data used for this project was taken from the Wikipedia BCS page.

Analyzing the data

An array of views were created to analyze the data.  They were structured in three tiers.  First, the "end-user" views, or those views returning result sets of interest to somebody perusing the data.  In the middle, the second layer was the glue layer that tied the end-user views to the third layer, which was the "base" views which did the basic joins required by almost every end-user query.

For example, here's a "base" view:

create view joined_teams as
  select
    l.name    as w_name,
    l.points  as w_points,
    l.rank    as w_rank,
    l.conf    as w_conf,
    l.qual    as w_qual,
    r.name    as l_name,
    r.points  as l_points,
    r.rank    as l_rank,
    r.conf    as l_conf,
    r.qual    as l_qual,
    l.id_game
  from
    bcs_team l
  inner join
    bcs_team r
   on
     l.id_game = r.id_game and l.points > r.points; -- the winner is always on the left

What this does is join every pair of teams that played in the same game by equating "id_game".  The rest of the join condition -- "l.points > r.points" ensures that the winner is on the left and the loser is on the right!  Also, it keeps us from seeing the same row twice.

Here's a typical "glue-layer" view:

create view team_wins as
  select 
    w_name, 
    count(*) as wins 
  from 
    joined_teams 
  group by 
    w_name;

This view returns the name and the number of wins for each team that had at least one win.

And here's an "end-user" view (the difference between this and the previous view is that this one combines lots of "glue-layer" views):

create view conf_report as
  select 
    a.conf, 
    appearances, 
    coalesce(wins, 0)         as `wins`, 
    coalesce(losses, 0)       as `losses`,
    apoints                   as `points scored`,
    opoints                   as `points given up`,
    wpoints                   as `avg points, wins`,
    lpoints                   as `avg points, losses`,
    coalesce(rose, 0)         as `rose`,
    coalesce(orange, 0)       as `orange`,
    coalesce(fiesta, 0)       as `fiesta`,
    coalesce(sugar, 0)        as `sugar`,
    coalesce(championship, 0) as `championship`,
    teams   
  from 
    conf_appearances a 
  left join
    conf_wins b 
    on a.conf = b.w_conf 
  left join 
    conf_losses c 
    on a.conf = c.l_conf
  left join
    conf_rose e
    on a.conf = e.conf
  left join
    conf_orange f
    on a.conf = f.conf
  left join
    conf_fiesta g
    on a.conf = g.conf
  left join 
    conf_sugar h
    on a.conf = h.conf
  left join
    conf_champ i
    on a.conf = i.conf
  left join
    conf_teams j
    on a.conf = j.conf;

Ugh!  That's nasty!  What's going on here with all these left joins?

  • we're combining columns from lots of views -- that's why we need so many joins
  • some views might not have values for some conferences -- using "left join" lets us keep that row, giving it a null
  • we use coalesce at the end to replace null's with 0's

In part 2 we'll look at some pretty charts!

2 comments:

  1. Damn thats some good SQL but your right it is alot of work ... the joins are necessary if all your data is in tables ... But maybe an ordered term-vector might allow you to do alot of predictive stuff without forcing you to model each and every field.

    ReplyDelete
  2. Thanks for the suggestion! I'll have to look in to how to do that.

    ReplyDelete