Background
Every year following the college football regular season since 1998, the Bowl Championship Series (or BCS) has held 5 (originally 4) bowl games between some of the top teams in the country. We're going to look at some of the numbers behind the games.
In this post, we'll see the data schema and how it's analyzed. In the second part, we'll look at the data itself.
The requirements
I wanted to analyze the statistics based on a few different criteria, including:
- team
- conference
- bowl
- year
- team rank
- qualification
- score
The schema
After a few rounds of normalization/head-banging/repeat, there were two main entities: bcsgame and bcsteam. A "bcsgame" occurred at a specific location on a specific date, between two "bcsteam"s, and ended with each "bcsteam" have a number of points. Also, each "bcsteam" was ranked (using the pre-bowl BCS poll) and associated with a specific conference at the start of the game.
Here's what that looked like translated into MySQL code:
create table if not exists `bcs_game` ( `id` int primary key auto_increment, `date` date, `city` varchar(45), `bowl` varchar(45), `isChampionship` boolean ); create table if not exists `bcs_team` ( `id` int primary key auto_increment, `name` varchar(45), `conf` varchar(45), `points` int, `rank` int, `qual` varchar(45), `id_game` int, foreign key (`id_game` ) references `football`.`bcs_game` (`id` ) );
Well, that looks pretty straightforward: we have "bcs_game" with its attributes, and "bcs_team" with its attributes, and a foreign key to the game that the team participated in.
Analyzing the schema: strengths and weaknesses
- strengths- all teams, winners and losers, in one table. This makes it much easier to query than what I had originally tried: two foreign keys from "bcs_game" to "bcs_team"
- winner/loser is determined by score only -- there's no need to store duplicate information with a winner/loser column.
- weaknesses
- problem with national championship game. After the 2006 season, the BCS expanded to 5 games each year by creating a special national championship game. Previously it had rotated between the four big bowls. This was awkward to deal with in the schema.
- a programmer's (mine!) mistake could result in a game between three or more teams (or 0 or 1)! There's nothing enforcing the requirement that exactly two teams participate in a game.
- "conference" and "bowl" could/should be entities in their own right.
Populating the database
All the data used for this project was taken from the Wikipedia BCS page.Analyzing the data
An array of views were created to analyze the data. They were structured in three tiers. First, the "end-user" views, or those views returning result sets of interest to somebody perusing the data. In the middle, the second layer was the glue layer that tied the end-user views to the third layer, which was the "base" views which did the basic joins required by almost every end-user query.
For example, here's a "base" view:
create view joined_teams as select l.name as w_name, l.points as w_points, l.rank as w_rank, l.conf as w_conf, l.qual as w_qual, r.name as l_name, r.points as l_points, r.rank as l_rank, r.conf as l_conf, r.qual as l_qual, l.id_game from bcs_team l inner join bcs_team r on l.id_game = r.id_game and l.points > r.points; -- the winner is always on the left
What this does is join every pair of teams that played in the same game by equating "id_game". The rest of the join condition -- "l.points > r.points" ensures that the winner is on the left and the loser is on the right! Also, it keeps us from seeing the same row twice.
Here's a typical "glue-layer" view:
create view team_wins as select w_name, count(*) as wins from joined_teams group by w_name;
This view returns the name and the number of wins for each team that had at least one win.
And here's an "end-user" view (the difference between this and the previous view is that this one combines lots of "glue-layer" views):
create view conf_report as select a.conf, appearances, coalesce(wins, 0) as `wins`, coalesce(losses, 0) as `losses`, apoints as `points scored`, opoints as `points given up`, wpoints as `avg points, wins`, lpoints as `avg points, losses`, coalesce(rose, 0) as `rose`, coalesce(orange, 0) as `orange`, coalesce(fiesta, 0) as `fiesta`, coalesce(sugar, 0) as `sugar`, coalesce(championship, 0) as `championship`, teams from conf_appearances a left join conf_wins b on a.conf = b.w_conf left join conf_losses c on a.conf = c.l_conf left join conf_rose e on a.conf = e.conf left join conf_orange f on a.conf = f.conf left join conf_fiesta g on a.conf = g.conf left join conf_sugar h on a.conf = h.conf left join conf_champ i on a.conf = i.conf left join conf_teams j on a.conf = j.conf;
Ugh! That's nasty! What's going on here with all these left joins?
- we're combining columns from lots of views -- that's why we need so many joins
- some views might not have values for some conferences -- using "left join" lets us keep that row, giving it a null
- we use coalesce at the end to replace null's with 0's
In part 2 we'll look at some pretty charts!
Damn thats some good SQL but your right it is alot of work ... the joins are necessary if all your data is in tables ... But maybe an ordered term-vector might allow you to do alot of predictive stuff without forcing you to model each and every field.
ReplyDeleteThanks for the suggestion! I'll have to look in to how to do that.
ReplyDelete