Major League Baseball Data

How to query public data sets using BigQuery

BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.

To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.

    Enable the API

Dataset overview

This public data includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016.

This dataset contains the following tables:

Table Name Description
games_wide Every pitch, steal, or lineup event for each at bat in the 2016 regular season.*
games_post_wide Every pitch, steal, or lineup event for each at-bat in the 2016 post season.*
schedules The schedule for every team in the regular season.

*The schemas for the games_wide and games_post_wide tables are identical.

With this data you can effectively replay a game and rebuild basic statistics for players and teams.

You can start exploring this data in the BigQuery console:

Go to the Baseball public dataset

Sample Queries

Here are some examples of SQL queries you can run on this data in BigQuery.

These samples use BigQuery’s support for standard SQL. Use the #standardSQL tag to let BigQuery know you want to use standard SQL. For more information about the #standardSQL prefix, see Setting a query prefix.

What types of pitches were thrown, how many, top speeds?

#standardSQL
SELECT
  stats.pitchTypeDescription as ptDescription,
  COUNT(stats.pitchType) as thrown,
  MAX(stats.pitchSpeed) as speed_max,
  SUM(is_hit) as hits,
  SUM(out_on_strikes) as strike_outs,
  SUM(is_ground_out) as ground_outs,
  ROUND(SUM(is_ground_out)/ COUNT(stats.pitchType),4) as gdPerPitch,
  SUM(is_walk) as walks
FROM (
  SELECT
    inningNumber,
    inningHalf,
    inningHalfEventSequenceNumber as isn,
    inningEventType,
    atBatEventSequenceNumber as absn,
    description,
    atBatEventType,
    pitchType,
    pitchTypeDescription,
    pitchSpeed,
    pitchZone,
    balls,
    strikes,
    outs,
    status,
    outcomeId,
    outcomeDescription,
    is_hit,
    is_double_play,
    (if (strikes = 3, 1, 0)) as out_on_strikes,
    (if (balls = 4, 1, 0)) as is_walk,
    (if (pitchZone = 1 OR pitchZone = 2 or pitchZone = 3 or pitchZone = 10, 1, 0)) as high_zone,
    (if (pitchZone = 7 OR pitchZone = 8 or pitchZone = 9 or pitchZone = 12, 1, 0)) as low_zone,
    (if (outcomeId  = "oGO", 1, 0)) as is_ground_out
  FROM
    `bigquery-public-data.baseball.games_wide`
  WHERE
    atBatEventType = "PITCH"
  ORDER BY
    inningNumber ASC,
    inningHalf DESC,
    inningHalfEventSequenceNumber ASC,
    atBatEventSequenceNumber  ASC) as stats
GROUP BY
  ptDescription
ORDER BY
  thrown DESC

Only the top ten results are shown here, along with the first several columns:

pitch type query results

What was the at bat and pitching sequence for the fastest pitch(es) in the 2016 season?

#standardSQL
SELECT
  pitcherLastName,
  startTime,
  inningNumber,
  inningHalf as half,
  inningHalfEventSequenceNumber as seq,
  hitterPitchCount,
  pitcherPitchCount,
  pitchSpeed,
  pitchTypeDescription,
  pitchZone,
  hitterLastName,
  outcomeDescription
FROM
  `bigquery-public-data.baseball.games_wide`
WHERE
  atBatEventType = "PITCH"
  AND year = 2016
ORDER BY
  pitchSpeed DESC
LIMIT 10

The results are shown below:

fastest pitches query results

About the dataset

Dataset Source: Sportradar LLC

Category: Sports

Use: Copyright Sportradar LLC. Access to data is intended solely for internal research and testing purposes, and is not to be used for any business or commercial purpose. Data are not to be exploited in any manner without express approval from Sportradar. Display of data must include the phrase, “Data provided by Sportradar LLC,” and be hyperlinked to www.sportradar.com.

Update Frequency: Monthly (as needed)

View in BigQuery: Go to baseball data

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.

Send feedback about...