NFLdb Part 1: Creating an NFL Stats Database

In the spirit of Thanksgiving, I thought I would start making an NFL stats database.  I have been getting more interested in professional football with each passing season and the amount of data that is collected from every game is enormous (and is reportedly growing.  The NFL has been playing with RFID chips in player’s uniforms to track all kinds of information!).  This will be my largest database to date by far.  Additionally, until now all of my posts have been about things I’ve already done.  This series will be in ‘real time’.  I will track my thought process, what I’ve learned, and the specifics of how I’m making the database.

Overview

Programming Language

I know from the beginning that I don’t want to have to type any of these stats in by hand, so it without a doubt needs to be done programmatically.  Since I am fairly familiar with Python, this is an easy choice.

Database Design

I first started thinking about what the tables should be, how they are related, what data will be in them, etc.  This is a mistake.  When doing a large project such as this, we need to first think what data do I want and what will I do with it?  In order to answer the first, it may be easier to answer the second:

What will I do with the data I collect?

My primary goal of this is to make my first, large-scale database and to learn the ins and outs of what that really means.  After that, I think it would be fun to be able to come up with the types of stats you hear the announcers say during games such as When the Colts are away and up by X points at the half, they have won A/B games in the last Y years.  I am genuinely excited to get all of this data and to come up with those silly stats.

What data do I want?

I would like to eventually collect as much data as possible, but in order to keep this project realistic I need to prioritize and do it in chunks.  I want to start with basic stats.  Football stats are typically separated into these categories (this is not all inclusive but seems like a good start):

  • Passing
  • Rushing
  • Receiving
  • Fumbles
  • Kicking
  • Punting
  • Kick Returns
  • Punt Returns
  • Defense
  • Penalty

I suspect it will also be easy to get static data about players such as:

  • Full name
  • Height
  • Weight
  • Position
  • Team
  • College

It should also be pretty easy to get data on teams:

  • Team Name
  • Stadium
  • Same stats as above for full seasons (total passing, total rushing, etc)

I am using this as a general starting point or template.  I’m not holding myself to all of this information; rather, I’m using it as a goal to reach for.

 

Python

I want to see what data I can realistically get with Python before I start to design the databases (this way I know exactly what data is going into the database instead of just guessing).  After a little bit of searching, I found a Python module called nflgame.

This module has better than average documentation but I still found myself asking lots of questions.  I would typically skip writing about the steps of figuring things like this out but I want to try to be as thorough as possible for this series.

To get this module, in the command prompt type:
pip install nflgame
The example on the module’s website shows how to get the top five running backs by rushing yards for the first week in 2013.

 

[python]
import nflgame

games = nflgame.games(2013, week=1)
players = nflgame.combine_game_stats(games)
for p in players.rushing().sort(‘rushing_yds’).limit(5):
msg = ‘%s %d carries for %d yards and %d TDs’
print msg % (p, p.rushing_att, p.rushing_yds, p.rushing_tds)
[/python]

From this, I can tell that rushing() is an attribute of nflgame.combine_game_stats() and that I can easily get rushing_att (rushing attempts), rushing_yds (rushing yards), and rushing_tds (rushing touchdowns).  Lots of good information here but it brings up many questions.  What are the other attributes of nflgame.combine_game_stats besides rushing() and does rushing() have any more than the three listed here?

Checking the attributes of combine_game_stats is easy within python.

[python]

import nflgame

games = nflgame.games(2015, week=9)
players = nflgame.combine_game_stats(games)

dir(players)

[/python]

First, we import the module and set the week for games we want to look at (the week doesn’t matter, so I chose the most recent games).  Then, we get the players from that week.  In order to see the attribute of players, we simply use the line

Dir() is a built-in function of Python.  If you give it an argument (in other words, put something in between the parentheses) it will try to return a list of the attributes for that object.  Without an argument, it returns a list of names in the current local scope.

Our output that looks like this:

It is readable but just barely so.  In order to make this more manageable, we need to use something called pretty print.  To install it, use

Then, within python we will do this:

[python]

from pprintpp import pprint as pp
pp(dir(players))

[/python]

The output of pprint is much more manageable:

This is looking good!  We can see the typical attributes such as __init__ at the top.  The ones I am intrigued by, however, are towards the bottom.  I see rushing is in there, which I already knew, but I can also find defense, fumbles, kicking, kickret, passing, etc.  This list looks a LOT like the stats categories that I had already established at the first part of this post.

Now, I just need to know all of the attributes of everything in this list.  Unfortunately, dir() does not work here so I started looking at the source code for the module.  After a little bit of digging (I searched the source files for ‘rushing_att’ since I know this is an attribute and it is something that I didn’t expect to be repeated a lot throughout the code), I found what I believed to be the list of all of the stats.  This file is located in C:\Python27\Lib\site-packages\nflgame\statmap.py.

I started with passing stats.  There were close to 10 sections of the statmap file that looked like this, where ‘cat’ was equal to ‘passing’:

[python]

14: {
‘cat’: ‘passing’,
‘fields’: [‘passing_att’, ‘passing_incmp’],
‘yds’: ”,
‘desc’: ‘Pass incomplete’,
‘long’: ‘Pass atempt, incomplete.’,
},

[/python]

From here, I am making an assumption that passing().passing_att will give the passing attempts in the same way that rushing().rushing_att gave rushing attempts.  If this is the case, then passing().passing_incmp will probably also give the number of incomplete passes.

I didn’t want to assume this is how the code works, so I began to test it with a simple script.  I first went through and found all of the things that I believed to be attributes of ‘passing’ from the statmap.py file and put them into a script to print out the numbers of the top 5 passers:

[python]
for p in players.passing().sort(‘passing_yds’).limit(5):

msg = ‘%s made %d passing attempts with %d incompletions and %d completions.  He had a total of %d yards, %d touchdowns, %d interceptions, %d sacks, and lost a total of %d from the sacks’

print msg % (p, p.passing_att, p.passing_incmp, p.passing_cmp, p.passing_yds, p.passing_tds, p.passing_int, p.passing_sk, p.passing_sk_yds)

[/python]

The output of this is:

Right off the bat, I noticed a lot of zeros which is unexpected.  Upon closer look, all five quarterbacks have 0 incompletions, interceptions, sacks, and lost 0 yards from sacks in week 9 of the 2015 season.  This seemed too odd to be true but I double checked Drew Brees’ week 9 stats just to be sure.  According to Pro Football Reference, Brees did indeed have 28 completions and he also had 39 attempts (meaning 19 incompletions).  He had 3 touchdowns for 387 yards (a difference of 2, which I think is reasonable), but he had 1 interception.  Sacks aren’t listed in Pro Football Reference, but I think it is a safe assumption that something is up with the incompletions, interceptions, sacks, and sack yardage.

I noted these attributes and repeated this process for rushing, receiving, fumbles, kicking, punting, kick returns, punt returns, defense, and penalties.  The table below summarizes the findings.

Category Attribute Working
Passing passing_att
passig_incmp N
passing_cmp
passing_yds
passing_tds
passing_int N
passing_sk N
passing_sk_yds N
Rushing rushing_att
rushing_tds
rushing_yds
Receiving receiving_rec
receiving_yds
receiving_tds N
receiving_tar N
receiving_yac_yds N
Fumbles fumbles_tot
fumbles_forced N
fumbles_notforced N
fumbles_oob N
fumbles_rec N
fumbles_rec_yds N
fumbles_rec_tds N
fumbles_lost
Kicking kicking_tot N
kicking_yds N
kicking_i20 N
kicking_touchback N
Punting punting_tot N
punting_yds
punting_i20
punting_touchback N
Kickret kickret_ret
kickret_yds N
kickret_tds
kickret_oob N
kickret_fair N
kickret_touchback N
Puntret Puntret_tot N
Puntret_tds N
Puntret_yds N
Puntret_oob N
Puntret_downed N
Puntret_fair N
Puntret_touchback N

It is pretty obvious when looking at this list that the majority of the stats are producing 0 when I attempt to print out the top five players for the stat.  I know that this isn’t an abandoned project since the latest commit on github was only 19 days ago which makes me have a feeling that these stats should be working.

I will continue to troubleshoot and perhaps get on the creator’s IRC and chat with him.

 

One thought on “NFLdb Part 1: Creating an NFL Stats Database

  1. I am starting play with this project, can this be migrated to SQL Server rather than using python and postgres?

Leave a Reply

Your email address will not be published. Required fields are marked *