How to query public data sets using BigQuery
BigQuery is a fully managed data warehouse and analytics platform. Public datasets are available for you to analyze using SQL queries. You can access BigQuery public data sets using the web UI, the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, .NET, or Python.
Currently, BigQuery public datasets are stored in the
location. When you query a public dataset, supply
--location=US flag on the command line, choose
US as the
processing location in the BigQuery web UI, or specify the
location property in the
jobReference section of the job resource
when you use the API. Because the public datasets are stored in the US, you cannot write public
data query results to a table in another region, and you cannot join tables in public datasets
with tables in another region.
To get started using a BigQuery public dataset, create or select a project. The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing. If you intend to go beyond the free tier, you should also enable billing.
Sign in to your Google Account.
If you don't already have one, sign up for a new account.
Select or create a GCP project.
Make sure that billing is enabled for your project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, Enable the BigQuery API.
This public data includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016.
This dataset contains the following tables:
|games_wide||Every pitch, steal, or lineup event for each at bat in the 2016 regular season.*|
|games_post_wide||Every pitch, steal, or lineup event for each at-bat in the 2016 post season.*|
|schedules||The schedule for every team in the regular season.|
*The schemas for the games_wide and games_post_wide tables are identical.
With this data you can effectively replay a game and rebuild basic statistics for players and teams.
You can start exploring this data in the BigQuery console:
Here are some examples of SQL queries you can run on this data in BigQuery.
These samples use BigQuery’s support for standard SQL. Use the
tag to let BigQuery know you want to use standard SQL. For more information
#standardSQL prefix, see
Setting a query prefix.
What types of pitches were thrown, how many, top speeds?
#standardSQL SELECT stats.pitchTypeDescription as ptDescription, COUNT(stats.pitchType) as thrown, MAX(stats.pitchSpeed) as speed_max, SUM(is_hit) as hits, SUM(out_on_strikes) as strike_outs, SUM(is_ground_out) as ground_outs, ROUND(SUM(is_ground_out)/ COUNT(stats.pitchType),4) as gdPerPitch, SUM(is_walk) as walks FROM ( SELECT inningNumber, inningHalf, inningHalfEventSequenceNumber as isn, inningEventType, atBatEventSequenceNumber as absn, description, atBatEventType, pitchType, pitchTypeDescription, pitchSpeed, pitchZone, balls, strikes, outs, status, outcomeId, outcomeDescription, is_hit, is_double_play, (if (strikes = 3, 1, 0)) as out_on_strikes, (if (balls = 4, 1, 0)) as is_walk, (if (pitchZone = 1 OR pitchZone = 2 or pitchZone = 3 or pitchZone = 10, 1, 0)) as high_zone, (if (pitchZone = 7 OR pitchZone = 8 or pitchZone = 9 or pitchZone = 12, 1, 0)) as low_zone, (if (outcomeId = "oGO", 1, 0)) as is_ground_out FROM `bigquery-public-data.baseball.games_wide` WHERE atBatEventType = "PITCH" ORDER BY inningNumber ASC, inningHalf DESC, inningHalfEventSequenceNumber ASC, atBatEventSequenceNumber ASC) as stats GROUP BY ptDescription ORDER BY thrown DESC
Only the top ten results are shown here, along with the first several columns:
What was the at bat and pitching sequence for the fastest pitch(es) in the 2016 season?
#standardSQL SELECT pitcherLastName, startTime, inningNumber, inningHalf as half, inningHalfEventSequenceNumber as seq, hitterPitchCount, pitcherPitchCount, pitchSpeed, pitchTypeDescription, pitchZone, hitterLastName, outcomeDescription FROM `bigquery-public-data.baseball.games_wide` WHERE atBatEventType = "PITCH" AND year = 2016 ORDER BY pitchSpeed DESC LIMIT 10
The results are shown below:
About the dataset
Dataset Source: Sportradar LLC
Use: Copyright Sportradar LLC. Access to data is intended solely for internal research and testing purposes, and is not to be used for any business or commercial purpose. Data are not to be exploited in any manner without express approval from Sportradar. Display of data must include the phrase, “Data provided by Sportradar LLC,” and be hyperlinked to www.sportradar.com.
Update Frequency: Monthly (as needed)
View in BigQuery: Go to baseball data