Board Game Geek Database

I mentioned a couple of weeks ago that I was interested in pulling data from every board game on BGG and putting it into my own database.  The end goal was to have a database that I could add to every month and see the top 100 games that had the largest increase in ownership over the course of 1 month (or some other time span).  The initial post is here.

This project has turned out to be incredibly fun and fulfilling.  First, I wrote the Python script that I described in that initial post.  This ran through every BGGID on www.boardgamegeek.com (at that time, there were 200k) and checked to see if it was a board game (there were 83k on BGG).  If it was a board game, it first grabbed the name and how many BGG users currently own the game.  This script took way longer to run than I thought and hoped.  In total, it ran for about four days.

I could hardly wait for it to finish.  When it was over halfway, I realized that I wanted to also add the current rank of each game to the database.  I decided to let it run instead of starting over (and I’m glad I did!).  In the next few days, I was thinking a lot about the queries I would do when I finally got all of my data.  I realized it would be a lot more interesting if I had more information about each game; specifically, I wanted to also grab all of the publishers and all of the mechanics.  I’m glad I had decided to wait for the script to run through before restarting it!

While the script was still running, I made a copy of the database and prepared it for the new data.  The final schema was:

LqO9U

Running through the second time was a little over twice as fast.  This is because I already had a list of the valid BGGIDs for all of the board games.  Instead of looping through 200k objects, I only had to go through about 83k.  This was good, but two days is not acceptable for me if I intend on running this every month.

Results

I am really excited about generating new queries and learning more about the data and the board game industry.  I have created a Twitter account where I intend on regularly posting interesting tidbits and charts.  If you are interested, follow me @bggstats.

There was one query I had a particularly hard time figuring out.  After a couple of days struggling with it, I asked on StackOverflow looking for help.  Pretty immediately afterwards figured out the answer.

The Problem

I was interested in getting a list of the most popular mechanics for a particular publisher over time.  In other words, I wanted the result to have the columns date, mechanic, number of copies owned in bgg.  The date column would have one result per year, with mechanic being the most owned (from number of copies).

I pretty quickly came up with the query:

This returned:

year mechanic num owned
1974 Commodity Speculation 1460
1974 Dice Rolling 1460
1974 Tile Placement 1460
1982 Action Point Allowance System 16111
1982 Dice Rolling 16111
1982 Modular Board 16111
1982 Secret Unit Deployment 16111
1985 PaperandPencil 1949
1991 Auction/Bidding 1266
1992 Grid Movement 1704
1992 Pickup and Deliver 1704
2011 Action Point Allowance System 7943
2011 Area Control / Area Influence 174
2011 Area Movement 3607
2011 Auction/Bidding 174
2011 Card Drafting 5133
2011 Deck / Pool Building 3768
2011 Dice Rolling 2385
2011 Hand Management 5663
2011 Line Drawing 2141

This is really close.  The data is IN here, but I don’t want all of that.  I really only want one result per year – specifically, I want the row that has the max numowned value for that particular year.  I want something that looks more like this:

year mechanic num owned
1974 Commodity Speculation 1460
1982 Action Point Allowance System 16111
1985 PaperandPencil 1949
1991 Auction/Bidding 1266
1992 Grid Movement 1704
2011 Set Collection 13714

The key here is that I want to select from THAT select.  So, getting that select query was probably the hardest part.  From there, I want to grab the year, mechanic name, and max num owned value.  The trick here is that the initial query (the one showed above) needs to have the columns named.  I named mine yr, mech, and sm.  The final query is:

This returns what I wanted.  I think understanding how this query works and being able to come up with it on my own shows how much I’ve grown in my database work.  I’m excited for whatever is next.

Next Steps

There are several things I want to do or at least try to do.  In no particular order, they are:

  • Try to speed the script up.  I have a few things I want to try for this:
    • Try to use the BGG XML API directly, instead of using the Python module that someone wrote.
    • Next month (June), run through all 83k games.  After that (beginning in July), have the script look at the ownership values for each game for the last two months.  If there was no change, skip the game (and keep track of the games it skipped).  Do this every month.  Starting the following month (August), do the same thing.  In addition, check the ownership of the first X games that we skipped in July.  In September, do the same for the games that are changing but now do the next X games that we skipped in July.  This will allow the database to keep updating games, but not every game every month.
  • Add more data to the database.  This includes:
    • Artists
    • Designers
    • Category
  • Create a Reddit bot to allow Reddit users to interact with the data.  They could ask it something like “What is the most owned game from 2003” and it could respond with an answer.

Again, if you are interested in the results of this, find me on Twitter @bggstats.  If you want to see any charts from the data or want to know any fun facts such as the most owned game from the year you were born, let me know!

Have questions or suggestions?  Please feel free to comment below or contact me.

2 thoughts on “Board Game Geek Database

    1. This is something I don’t quite have figured out. Unfortunately, the Python module gives the same error if it reaches a BGG id that doesn’t exist and if it reaches a BGG id for an item that isn’t board game related (such as a video game item). This makes it difficult to know when you reach the end of the valid BGG ids.

      I have since switched to using the XML API directly. I haven’t looked yet, but hopefully I’ll be able to know when I reach the end of valid ids.

Leave a Reply

Your email address will not be published. Required fields are marked *