Filling the NCAA void: Using BigQuery to simulate March Madness

April 2, 2020

Alok Pattani

Data Science Developer Advocate

As COVID-19 continues to have an enormous impact around the world, we’ve focused on supporting customers and making public data available to help research efforts, among various other initiatives. Beyond the essential issues at hand, it’s been a truly strange time for sports fans, with virtually every league shut down across the globe. Even though sports may be non-essential, they are one of our greatest distractions and forms of entertainment.

In particular, the recent American sports calendar has been missing an annual tradition that excites millions: March Madness®. The moniker represents the exciting postseason of college basketball, with both men’s and women’s teams competing to be crowned champions in the annual NCAA® Tournaments. Along with watching these fun, high-stakes games, sports fans fill out brackets to predict who will win in each stage of the tournament.

In our third year as partners with the NCAA, we had planned for a lot of data analysis related to men’s and women’s basketball before the cancellation of all remaining conference tournaments and both NCAA tournaments on March 12. It took us a few days to process a world with no tournament selections, no brackets, no upsets, and no shining moments, but we used Google Cloud tools and our data science skills to make the best of the situation by simulating March Madness.

Simulation is a key tool in the data science toolkit for many forecasting problems. Using Monte Carlo methods, which rely on repeated random sampling from probability distributions, you can model real-world scenarios in science, engineering, finance, and of course, sports. In this post, we’ll demonstrate how to use BigQuery to set up, run, and explore tens of thousands of NCAA basketball bracket simulations. We hope the example code and explanation can serve as inspiration for your own analyses that could use similar techniques. (Or you can skip ahead to play around with thousands of simulated brackets right now on Data Studio.)

Predicting a virtual tournament

To project any NCAA Tournament, the first thing needed is a bracket, which specifies which teams make the field and determines who could play whom in each tournament round. The NCAA basketball committees didn’t release 2020 brackets, but we felt pretty good about using the final “projected” brackets from well-known bracketologists as proxies, since games stopped only a couple of days short of selections. Specifically, we used bracket projections from Joe Lunardi at ESPN and Jerry Palm at CBS for the men, and Charlie Creme at ESPN and Michelle Smith at the NCAA for the women. These take into account a lot of different factors related to selection, seeding, and bracketing, and are fairly representative of the type of fields we might have seen from the committees.

The next step was finding a way to get win probabilities for any given matchup in a tournament field—i.e., if Team X played Team Y, how likely is it that Team X would win? To estimate these, we used past NCAA Tournament games for training data and created a logistic regression model that took into account three factors for each matchup:

The difference between the teams’ seeds. 1-seeds are generally better than 2-seeds, which are better than 3-seeds, and so on, down to 16-seeds.
The difference between the teams’ pre-tournament schedule-adjusted net efficiency. Think of these as team performance-based power ratings similar to the popular KenPom or Sagarin ratings, also applied to women’s teams (this post has further details on the calculations).
Home-court advantage. This is applicable for early-round women’s games that are often held at a top seed’s home stadium; almost all men’s games are at “neutral” sites.
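To make the shape of such a model concrete, here is a minimal Python sketch of a logistic function over the three factors listed above. It is not the BigQuery ML model from this post: the coefficient values, the sign conventions, and the function name `win_probability` are all assumptions chosen for illustration, and the fitted weights of the real models are not published here.

```python
import math

# Hypothetical coefficients, for illustration only (not the fitted weights).
COEF_SEED_DIFF = -0.05     # per seed line; seed_diff = team seed - opponent seed
COEF_NET_EFF_DIFF = 0.10   # per point of adjusted net efficiency difference
COEF_HOME = 0.50           # home indicator: +1 home, 0 neutral site, -1 away


def win_probability(seed_diff, net_eff_diff, home=0):
    """Return an illustrative P(team beats opponent) from the three factors.

    There is no intercept, so swapping the two teams (negating every input)
    gives exactly the complementary probability.
    """
    z = (COEF_SEED_DIFF * seed_diff
         + COEF_NET_EFF_DIFF * net_eff_diff
         + COEF_HOME * home)
    return 1.0 / (1.0 + math.exp(-z))


# A 1-seed hosting a 16-seed, with a 20-point net efficiency edge.
favorite = win_probability(seed_diff=-15, net_eff_diff=20.0, home=1)
# Two evenly matched teams on a neutral court.
even = win_probability(seed_diff=0, net_eff_diff=0.0)
```

Leaving out the intercept is a design choice in this sketch: it guarantees that the two teams’ win probabilities in any matchup sum to 1 regardless of which team is listed first.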

BigQuery enables us to prepare our data so that each of those predictors is aligned with the results from past games. Then, we used BigQuery ML to create a logistic regression model with minimal code and without having to move our data outside the warehouse. Separate models were created for men’s and women’s tournament games, using the factors mentioned above. The code for the women’s tournament game model is shown here:

Both models had solid accuracy and log loss metrics, with sensible weights on each of the factors. The models then had to be applied to all possible team matchups in the projected 2020 brackets, which were generated along with each team’s seed, adjusted net efficiency, and home-court advantage using BigQuery. Then, we generated predictions from our saved models with BigQuery ML, again with minimal code and from within the data warehouse, as shown here:

The resulting table contains win probabilities for every potential tournament matchup, and sets us up for the real payoff: using the bracket structure to calculate each team’s probability of advancing to each round. For first-round matchups that are already set (e.g., 1-seed South Carolina facing 16-seed Jackson State in Charlie Creme’s bracket), this is simply a lookup of the predicted win probability for the matchup in the table. But in later rounds, there’s more to consider: the team may not get there at all, and if it does, it could face more than one possible opponent. For example, a 1-seed could face either the 8- or 9-seed in the Round of 32, the 4-, 5-, 12-, or 13-seed in the Sweet 16, and so on.

So, a team’s chance of advancing out of a given round is the chance they get to that round in the first place, multiplied by a weighted average of win probabilities—their chances of beating each possible opponent they might face, weighted by how likely they are to face them. Consider the example of an 8-seed advancing to the Sweet 16:

They are usually something like 50-50 to beat the 9-seed in the Round of 64
They are likely a sizable underdog in a potential matchup against a 1-seed
They likely have a very good chance of beating the 16-seed if they play them
But the 1-seed is the much more likely opponent in the Round of 32, so the lower matchup win probability gets weighted much higher in the advance calculation

Putting it all together, an 8-seed’s projected chance of making the Sweet 16 is usually well below 20%, since they have a (very likely) uphill battle against a top seed to get there.
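The 8-seed example can be worked through numerically. The short Python sketch below uses made-up probabilities (a 50% chance against the 9-seed, 15% against the 1-seed, 85% against the 16-seed, and a 99% chance that the 1-seed wins its opener); these numbers and the helper name `advance_probability` are assumptions for illustration, not outputs of the models in this post.

```python
def advance_probability(reach_prob, opponents):
    """Chance of winning in a round.

    reach_prob: probability the team gets to this round at all.
    opponents:  list of (probability the opponent is there,
                         probability of beating that opponent).
    """
    weighted_win_prob = sum(p_opp * p_win for p_opp, p_win in opponents)
    return reach_prob * weighted_win_prob


# Round of 64: the 8-seed is certainly there and certainly plays the 9-seed.
p_reach_r32 = advance_probability(1.0, [(1.0, 0.50)])

# Round of 32: the 1-seed is the opponent 99% of the time, the 16-seed 1%.
p_reach_s16 = advance_probability(p_reach_r32, [(0.99, 0.15), (0.01, 0.85)])

print(round(p_reach_s16, 4))  # 0.0785
```

With these illustrative inputs, the 8-seed reaches the Sweet 16 a little under 8% of the time, dominated by the unfavorable matchup against the 1-seed.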

Running this type of calculation for the entire bracket is naturally iterative. First, we use matchup win probabilities for all possible matchups in a given round to calculate the chances of all teams making it to the next round. Then, we use those chances to weight how likely each team is to meet each possible opponent in that next round, and repeat the first step using the matchup win probabilities for that round’s possible matchups.
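As a rough illustration of that iteration outside the warehouse, the Python sketch below runs it on a toy four-team bracket. The function name `advancement_probabilities`, the team ratings, and the ratio-based win probabilities are all hypothetical; the sketch also assumes teams are listed in bracket order and that the field size is a power of two.

```python
def advancement_probabilities(win_prob):
    """Round-by-round advance probabilities for a single-elimination bracket.

    win_prob[i][j] is P(team i beats team j). Teams are in bracket order, so
    teams 0 and 1 meet first, then the winner meets the winner of 2 vs 3, etc.
    Returns one list per round: result[r][i] = P(team i wins in round r).
    """
    n = len(win_prob)
    reach = [1.0] * n          # everyone is in the first round
    results = []
    block = 1                  # size of the sub-bracket each team has won so far
    while block < n:
        adv = []
        for i in range(n):
            # Possible opponents: the other half of team i's 2*block sub-bracket.
            start = (i // (2 * block)) * (2 * block)
            my_half = (i - start) // block
            opp_start = start + (1 - my_half) * block
            # Win probability given being here, weighted by who shows up.
            if_here = sum(reach[j] * win_prob[i][j]
                          for j in range(opp_start, opp_start + block))
            adv.append(reach[i] * if_here)
        results.append(adv)
        reach = adv            # advancing from this round = reaching the next
        block *= 2
    return results


# Toy ratings; P(i beats j) = rating_i / (rating_i + rating_j).
ratings = [4.0, 1.0, 2.0, 1.0]
win_prob = [[ri / (ri + rj) for rj in ratings] for ri in ratings]
rounds = advancement_probabilities(win_prob)
print([round(p, 3) for p in rounds[-1]])  # title chances for each team
```

A useful sanity check on any implementation of this loop is that the advance probabilities in each round sum to the number of games played in that round, so the title chances in the last round sum to 1.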

Doing this for all tournament rounds might typically be done using tools like Python or R, which would require moving data out of BigQuery, doing the calculations in one of those languages, and then perhaps writing results back to the database. But this particular problem is a great use case for BigQuery scripting, a feature that allows you to send multiple statements in one request, using variables and control statements (such as loops). This provides the same kind of iterative functionality you would get in Python or R, while still using SQL and without having to leave the warehouse. In this case, as shown below, we’re using a WHILE loop that cycles through each tournament round and writes each team’s advance probabilities to a table that is referenced back in the script on the next pass (“[...]” represents code left out for clarity in this case):

DECLARE GAME_ROUND_NUM INT64;
SET GAME_ROUND_NUM = 0;

WHILE GAME_ROUND_NUM <= 7 DO

  INSERT `tournaments.ncaa_tourney_round_by_round_probs`

  WITH
    Matchups AS [...],

    TourneyTms AS [...],

    PrevRdAdvProbs AS
    (
      SELECT *
      FROM
        `tournaments.ncaa_tourney_round_by_round_probs`
      WHERE
        game_round = (GAME_ROUND_NUM - 1)
    ),

    ThisRdAdvProbs AS
    (
      SELECT
        TourneyTms.sport,
        TourneyTms.tourney_year,
        TourneyTms.bracket_type,
        TourneyTms.bracket_date,

        GAME_ROUND_NUM AS game_round,
        TourneyTms.tm_id,
        TourneyTms.tm,

        IFNULL(TmPrevRdAdvProb.round_adv_prob, IF(GAME_ROUND_NUM = 0, 1, NULL))
          AS round_reach_prob,

        # Win prob given the team is in this round: weight each matchup win prob
        # by the opponent's chance of being there only
        SUM(
          IFNULL(OppPrevRdAdvProb.round_adv_prob, IF(GAME_ROUND_NUM = 0, 1, NULL)) *
          Matchups.tm1_win_prob
          ) AS ifhere_win_prob,

        # Weight chance of team winning matchup by probability of matchup occurring
        IFNULL(SUM(
          IFNULL(TmPrevRdAdvProb.round_adv_prob, IF(GAME_ROUND_NUM = 0, 1, NULL)) *
          IFNULL(OppPrevRdAdvProb.round_adv_prob, IF(GAME_ROUND_NUM = 0, 1, NULL)) *
          Matchups.tm1_win_prob
          ),
          IF(GAME_ROUND_NUM = 0, 1, NULL)
          ) AS round_adv_prob

      FROM
        TourneyTms

      [...]
    )

  SELECT
    *
  FROM
    ThisRdAdvProbs
  ;

  SET GAME_ROUND_NUM = GAME_ROUND_NUM + 1;

END WHILE;

We collected the results and put them into this interactive Data Studio report, which lets you filter and sort every tournament team’s chances (in each projected bracket). Our results show Kansas would’ve been title favorites in the men’s bracket, with around a 15% to 16% chance to win it all. Oregon was the most likely women’s champion at either 27% or 31% (depending on projected bracket chosen). Keep in mind that this is NOT saying Kansas or Oregon was going to win—the probabilistic forecasts actually show a 5-in-6 chance of a champion other than the Jayhawks on the men’s side and a greater than 2-in-3 chance of the Ducks not winning the women’s title.

While fun to play around with, these results are not particularly novel. Companies like ESPN, FiveThirtyEight, and TeamRankings have provided probabilistic NCAA Tournament forecasts for years. The probabilities are fairly accurate gauges of each specific team’s chances, but filling out a bracket using the most likely team in each slot ends up looking very chalky, with the better seeds almost always advancing. “Real” March Madness isn’t exactly like this: it’s only one tournament, with 63 slots on the bracket that each get filled in with a specific winner. While top seeds and better teams generally advance in aggregate, there are always upsets, Cinderella runs, and unexpected results.

Simulating thousands of NCAA Tournaments

Fortunately, our procedure for the model and projections accounts for that randomness. To demonstrate this, we can simulate the bracket many times and look at the actual results. The procedure is similar to the one we used to create the projections, using BigQuery scripting and the matchup win probabilities to loop round-by-round through the tournament. The differences are that we use random number generation to simulate an actual winner for each matchup (based on the win probability), and that we do so across many simulations to generate not just one possible bracket but thousands of them: true Monte Carlo simulations. See the code below for details (again, “[...]” used as a placeholder for code removed to simplify presentation):

DECLARE NUM_SIMULATIONS INT64;
DECLARE GAME_ROUND_NUM INT64;
SET NUM_SIMULATIONS = 10000;
SET GAME_ROUND_NUM = 0;

WHILE GAME_ROUND_NUM <= 6 DO

  INSERT `tournaments.ncaa_tourney_round_by_round_sims`

  WITH
    PrevRdsSims AS
    (
      SELECT *
      FROM
        `tournaments.ncaa_tourney_round_by_round_sims`
      WHERE
        game_round < GAME_ROUND_NUM
    ),

    ThisRdMatchupSims AS
    (
      SELECT
        Matchups.*,
        Simulations AS simulation_num,
        RAND() AS rand_num

      FROM
        `tournaments.ncaa_tourney_all_possible_matchups_with_pred_win_pct` Matchups,
        UNNEST(GENERATE_ARRAY(1, NUM_SIMULATIONS, 1)) Simulations

      [...]
    )

  SELECT
    [...],

    winner_bracket_slot,

    IF(rand_num <= tm_pred_win_pct, tm1_seed, tm2_seed) AS win_tm_seed,
    IF(rand_num <= tm_pred_win_pct, tm1_id, tm2_id) AS win_tm_id,
    IF(rand_num <= tm_pred_win_pct, tm1, tm2) AS win_tm,

    IF(rand_num <= tm_pred_win_pct, tm_pred_win_pct, (1 - tm_pred_win_pct))
      AS win_tm_round_matchup_pred_win_pct

  FROM
    ThisRdMatchupSims
  ;

  SET GAME_ROUND_NUM = GAME_ROUND_NUM + 1;

END WHILE;

Let this run for a few minutes and we wind up with not just one completed NCAA Tournament bracket per gender, but 20,000 brackets each for men and women (10,000 for each projected bracket we started with). We’ve made all of these brackets available in this interactive Data Studio dashboard, accelerated using BigQuery BI Engine. Use “Pick A Sim #” to flip through many of them, and use the dropdowns up top to filter by gender or starting bracket. Within the bracket, the percentage next to each team is the probability of them making it to that round, given the specific matchup in the previous round (blue represents an expected result, red an upset, and yellow a more 50/50 outcome). You can use “Thru Round” to mimic progressing through each round of the tournament, one at a time.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/gcp_ncaa_bracket.gif

Feel free to go through a few (dozen, hundred, …) simulations until you find the one you like the best...there are some wild ones in there. Check out Men’s Lunardi bracket simulation 108, where Boston University (the author’s alma mater) pulls three upsets and makes the Elite Eight as a 16-seed!

https://storage.googleapis.com/gweb-cloudblog-publish/images/gcp_ncaa_bracket.max-1900x1900.jpg

Perhaps one upside of having no tournaments is that we can pick a favorable simulation and convince ourselves that if the tournament had taken place, this is how it would've turned out!

Of course, these brackets aren’t just based on random coin flipping, where total chaos brackets are as likely as more plausible ones with fewer upsets. BU doesn’t get to the Final Four in any simulated bracket (though we could use the easy scalability of BigQuery to run more simulations), while the top seeds get there much more often. The simulations reflect accurate advancement chances for each matchup based on the modeling described above, so the resulting corpus of brackets reflects the proper amount of madness that typifies college basketball in March. Capturing the randomness appropriately is a good general point to keep in mind when creating these types of simulations to help solve non-basketball data science problems.
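One way to check that a simulation captures randomness appropriately is to compare simulated frequencies against probabilities that can be worked out by hand. The Python sketch below does this for a toy four-team bracket, drawing a random number per game and comparing it to the matchup win probability, in the same spirit as the `rand_num <= tm_pred_win_pct` test in the script above. The ratings, the ratio-based win probabilities, the fixed seed, and the function names are hypothetical choices made for illustration.

```python
import random


def simulate_bracket(win_prob, rng):
    """Play one single-elimination bracket and return the champion's index.

    win_prob[a][b] is P(team a beats team b); teams are in bracket order.
    """
    alive = list(range(len(win_prob)))
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[0::2], alive[1::2]):
            # Team a wins when the random draw falls at or below its win prob.
            next_round.append(a if rng.random() <= win_prob[a][b] else b)
        alive = next_round
    return alive[0]


def champion_frequencies(win_prob, num_simulations, seed=0):
    """Share of simulated brackets won by each team."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    counts = [0] * len(win_prob)
    for _ in range(num_simulations):
        counts[simulate_bracket(win_prob, rng)] += 1
    return [c / num_simulations for c in counts]


# Toy ratings; P(i beats j) = rating_i / (rating_i + rating_j).
ratings = [4.0, 1.0, 2.0, 1.0]
win_prob = [[ri / (ri + rj) for rj in ratings] for ri in ratings]

freqs = champion_frequencies(win_prob, num_simulations=20000)

# Analytic title chance for team 0: win the opener (0.8), then beat team 2
# (there with prob 2/3, beaten with prob 4/6) or team 3 (1/3, beaten 4/5).
analytic_team0 = 0.8 * ((2 / 3) * (4 / 6) + (1 / 3) * (4 / 5))
print(round(analytic_team0, 3), round(freqs[0], 3))
```

With enough simulations, the simulated share of titles for each team should settle close to its analytic probability, while any individual simulated bracket can still contain upsets.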

With no actual national semifinals and title games taking place over the next couple of days, we hope the ability to play with thousands of simulated Final Fours provides some small bit of consolation to those of you missing the NCAA basketball tournaments in 2020. You can also check out our Medium NCAA blog for all of our past basketball data analysis using Google Cloud. Here’s to hoping that we’ll be watching and celebrating the real March Madness in future years.
