Developers & Practitioners

BigQuery explained: How to query your data

Previously in BigQuery Explained, we reviewed BigQuery architecture, storage management, and ingesting data into BigQuery. In this post, we will cover querying datasets in BigQuery using SQL, saving and sharing queries, and creating views and materialized views. Let’s get started!

Standard SQL

BigQuery supports two SQL dialects: standard SQL and legacy SQL. Standard SQL is preferred for querying data stored in BigQuery because it’s compliant with the ANSI SQL 2011 standard. It has other advantages over legacy SQL, such as automatic predicate pushdown for JOIN operations and support for correlated subqueries. Refer to Standard SQL highlights for more information.

When you run a SQL query in BigQuery, it automatically creates, schedules and runs a query job. BigQuery runs query jobs in two modes: interactive (default) and batch.

  • Interactive (on-demand) queries are executed as soon as possible, and they count toward your concurrent rate limit and daily query limit.

  • Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually happens within a few minutes. If BigQuery hasn’t started the query within 24 hours, the job’s priority is changed to interactive. Batch queries don’t count toward your concurrent rate limit, and they use the same resources as interactive queries.

The queries in this post use the Standard SQL dialect and run in interactive mode, unless otherwise mentioned.


BigQuery Table Types

Every table in BigQuery is defined by a schema describing the column names, data types, and other metadata. BigQuery supports the following table types:

BigQuery Table Types

BigQuery Schemas

In BigQuery, schemas are defined at the table level and provide structure to the data. A schema describes column definitions: their name, data type, description, and mode.

  • Data types can be simple data types, such as integers, or more complex, such as ARRAY and STRUCT for nested and repeated values.
  • Column modes can be NULLABLE, REQUIRED, or REPEATED.

The table schema is specified when loading data into the table or when creating an empty table. Alternatively, when loading data, you can use schema auto-detection for self-describing source data formats such as Avro, Parquet, ORC, Cloud Firestore export files, or Cloud Datastore export files. The schema can be defined manually or in a JSON file, as shown:

  [
    {
      "description": "[DESCRIPTION]",
      "name": "[NAME]",
      "type": "[TYPE]",
      "mode": "[MODE]"
    },
    {
      "description": "[DESCRIPTION]",
      "name": "[NAME]",
      "type": "[TYPE]",
      "mode": "[MODE]"
    }
  ]
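The same kinds of column definitions can also be expressed in standard SQL DDL when creating an empty table. This is a minimal sketch; the dataset, table, and column names below are hypothetical:

```sql
CREATE TABLE mydataset.players (
  -- REQUIRED mode: enforced with NOT NULL; description set via column OPTIONS
  name STRING NOT NULL OPTIONS (description = 'Player name'),
  -- NULLABLE is the default mode
  jersey_number INT64,
  -- REPEATED mode with nested values: an ARRAY of STRUCTs
  seasons ARRAY<STRUCT<year INT64, team STRING>>
);
```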

Using SQL for Analysis

Now let’s analyze one of the public BigQuery datasets, related to NCAA Basketball games and players, using SQL. The game data covers play-by-play events and box scores dating back to 2009. We will look at a specific game from the 2014 season between Kentucky’s Wildcats and Notre Dame’s Fighting Irish. This game had an exciting finish. Let’s find out what made it exciting!

In your BigQuery Sandbox, open the NCAA Basketball dataset from the public datasets. Click the “VIEW DATASET” button to open the dataset in the BigQuery web UI.

Navigate to the table mbb_pbp_sr under the ncaa_basketball dataset to look at its schema. This table has play-by-play information for all men’s basketball games in the 2013–2014 season, and each row in the table represents a single event in a game.

Navigating BigQuery UI

Check the Details section of table mbb_pbp_sr. There are ~4 million game events with a total volume of ~3GB.


Let’s run a query to filter the events for the game we are interested in. The query selects the following columns from the mbb_pbp_sr table:

  • game_clock: Time left in the game before the finish

  • points_scored: Points scored in the event

  • team_name: Name of the team who scored the points

  • event_description: Description about the event

  • timestamp: Time when the event occurred (used to order the results)

SELECT
  game_clock,
  points_scored,
  team_name,
  event_description
FROM
  `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
WHERE
  season = 2014
  AND home_name = 'Wildcats'
  AND away_name = 'Fighting Irish'
  AND points_scored IS NOT NULL
ORDER BY
  timestamp DESC
LIMIT 10;

A breakdown of what this query is doing:

  • The SELECT statement retrieves the rows and the specified columns FROM the table

  • The WHERE clause filters the rows returned by SELECT. This query filters to return rows for the specific game we are interested in.

  • The ORDER BY clause controls the order of rows in the result set. This query sorts the rows returned by SELECT by timestamp in descending order.

  • Finally, LIMIT constrains the amount of data returned from the query. This query returns 10 events from the result set after the rows are sorted. Note that adding LIMIT does not reduce the amount of data processed by the query engine.

Now let’s look at the results.



From the results, it appears the player Andrew Harrison made two free throws, scoring 2 points, with only 6 seconds remaining in the game. This doesn’t tell us much except that points were scored toward the very end of the game.

Tip: Avoid using SELECT * in queries. Instead, query only the columns you need. To exclude only certain columns, use SELECT * EXCEPT.
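For example, using columns from the play-by-play table above:

```sql
-- Preferred: name only the columns you need.
SELECT game_clock, points_scored, team_name
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`;

-- If you need almost every column, exclude the rest explicitly.
SELECT * EXCEPT (event_description)
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`;
```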

Let’s modify the query to include a cumulative sum of the score for each team, rolled up to the event time, using analytic (window) functions. Analytic functions compute a value for each row over a group of rows defined by a window, whereas aggregate functions compute a single value over a group of rows.

Run the query below, which adds two new columns, wildcats_score and fighting_irish_score, calculated on the fly from the points_scored column.


SELECT
  game_clock,
  SUM(
    CASE
      WHEN team_name = 'Wildcats' THEN points_scored
    END
  ) OVER(ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS wildcats_score,
  SUM(
    CASE
      WHEN team_name = 'Fighting Irish' THEN points_scored
    END
  ) OVER(ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS fighting_irish_score,
  team_name,
  event_description
FROM
  `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
WHERE
  season = 2014
  AND home_name = 'Wildcats'
  AND away_name = 'Fighting Irish'
  AND points_scored IS NOT NULL
ORDER BY
  timestamp DESC
LIMIT 10;

A breakdown of what this query is doing:

  • Calculate the cumulative SUM of points scored by each team in the game, selected via the CASE statement

  • The SUM is calculated over the scores in the window defined within the OVER clause

  • The OVER clause references the window (group of rows) over which the SUM is applied

  • ORDER BY is part of the window specification and defines the sort order within a partition. This query orders rows by timestamp

  • The window frame spans from the start of the game, specified by UNBOUNDED PRECEDING, to the CURRENT ROW, over which the analytic function SUM() is evaluated


From the results, we can see how the game ended. The Fighting Irish held a four-point lead with 04:28 remaining. Karl-Anthony Towns of the Wildcats tied the game on a layup with 01:12 remaining, and Andrew Harrison made two free throws with 00:06 seconds remaining, setting the stage for the Wildcats’ win. That was a nail-biting finish indeed!

In addition to aggregate and analytic functions, BigQuery supports functions and operators for string manipulation, date/time handling, mathematical operations, JSON extraction, and more, as shown.


BigQuery SQL Functions
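As a small illustration, the query below exercises one function from each category; the literal values are made up, but the functions are from the Standard SQL reference:

```sql
SELECT
  CONCAT('Team: ', 'Wildcats') AS label,                                    -- string manipulation
  FORMAT_TIMESTAMP('%Y-%m-%d', TIMESTAMP '2014-12-01 00:00:00+00') AS day,  -- date/time
  ROUND(3.14159, 2) AS rounded,                                             -- mathematical
  JSON_EXTRACT_SCALAR('{"team": "Wildcats"}', '$.team') AS team;            -- JSON extraction
```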

Refer to the BigQuery SQL function reference for the complete list. In upcoming posts, we will cover other advanced query features in BigQuery, such as User Defined Functions, spatial functions, and more.

The Life of a BigQuery SQL Query

Under the hood, when you run a SQL query, BigQuery does the following:

  • A QueryJob is submitted to the BigQuery service. As reviewed in the BigQuery architecture, the BigQuery compute is decoupled from the BigQuery storage, and they are designed to work together to organize the data to make queries efficient over huge datasets.

  • Each query is broken up into stages, which are processed by workers (slots) and written back out to shuffle. Shuffle provides resilience to failures within the workers themselves, such as a worker having an issue during query processing.

The Life of a BigQuery SQL query
  • The BigQuery engine utilizes the columnar storage format to scan only the columns required by the query. One of the best practices to control costs is to query only the columns you need.

  • After query execution is completed, the query service persists the results into a temporary table, and the web UI displays that data. You can also request that results be written to a permanent table.

Saving and Sharing Queries

Saving Queries

Now that you have run SQL queries to perform analysis, how would you save those results? BigQuery writes all query results to a table. The results are written either to a destination table explicitly identified by the user or to a temporary, cached results table. The temporary table is stored for 24 hours, so if you run the exact same query again (an exact string match) and the results would not be different, BigQuery simply returns a pointer to the cached results. Queries that can be served from the cache do not incur any charges. Refer to the documentation to understand the limitations of and exceptions to query caching.

You can view cached query results from the Query History tab on BigQuery UI. This history includes all queries submitted by you to the service, not just those submitted via the web UI.

Query History

You can disable retrieval of cached results in the query settings when executing a query. This forces BigQuery to compute the query result, which incurs charges for the query. This is typically used in benchmarking scenarios, such as in the previous post comparing the performance of partitioned and clustered tables against non-partitioned tables.

You can also request that the query write to a destination table, which gives you control over when the table is deleted. Because the destination table is permanent, you will be charged for storing the results.
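One way to persist results into a table you control is a CREATE TABLE ... AS SELECT statement; you can also set a destination table in the query settings. A sketch, with a hypothetical dataset and table name:

```sql
CREATE TABLE mydataset.wildcats_events AS
SELECT game_clock, points_scored, team_name, event_description
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
WHERE season = 2014
  AND home_name = 'Wildcats';
```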

Sharing Queries

BigQuery allows you to share queries with others. When you save a query, it can be private (visible only to you), shared at the project level (visible to project members), or public (anyone can view it).

Check out this video to learn how to save and share your queries in BigQuery.


Standard Views

A view is a virtual table defined by a SQL query. A view has properties similar to a table and can be queried as a table. The schema of the view is the schema that results from running the query. The query results from the view contain data only from the tables and fields specified in the query that defines the view.

You can create a view by saving a query from the BigQuery UI using the “Save View” button, or by using the BigQuery DDL CREATE VIEW statement. Saving a query as a view does not persist the results, aside from caching to a temporary table that expires within a 24-hour window; the behavior is the same as running the query directly against the tables.
Save query as a view
Save query as a view
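A minimal sketch of the DDL form, with a hypothetical dataset and view name:

```sql
CREATE VIEW mydataset.scoring_events AS
SELECT game_clock, points_scored, team_name, event_description
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
WHERE points_scored IS NOT NULL;
```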

When to use standard Views?

  • Let’s say you want to expose queries with complex logic to your users without requiring them to remember that logic. You can roll those queries into a view.
  • Views can also be placed into datasets and offer fine-grained access controls, letting you share a dataset with specific users and groups without giving them access to the underlying tables. These views are called authorized views. We will look into securing and accessing datasets in detail in a future post.
Refer to BigQuery documentation for creating and managing standard views.

Materialized Views

BigQuery supports Materialized Views (MVs), currently a beta feature. MVs are precomputed views that periodically cache the results of a query for increased performance and efficiency. Queries that use MVs are generally faster and consume fewer resources than queries retrieving the same data only from the base table. They can significantly boost the performance of workloads with common and repeated queries.

BigQuery Materialized Views

Following are the key features of materialized views:

Zero Maintenance

  • BigQuery leverages precomputed results from MVs and, whenever possible, reads only the delta changes from the base table to compute up-to-date results. It automatically synchronizes data refreshes with changes in the base tables; no user input is required. You also have the option to trigger a manual refresh of the views to control the cost of refresh jobs.

Always fresh

  • MVs are always consistent with the source table. They can be queried directly or used by the BigQuery optimizer to process queries against the base tables.

Smart Tuning

  • MVs support query rewriting. If a query against the source table can instead be resolved by querying the MV, BigQuery rewrites (reroutes) the query to use the MV for better performance and/or efficiency.

You create an MV using the BigQuery DDL CREATE MATERIALIZED VIEW statement.
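A minimal sketch, assuming a hypothetical base table mydataset.game_events that you own; per the limitations described later in this section, the query uses a supported aggregation rather than a join:

```sql
CREATE MATERIALIZED VIEW mydataset.team_points_mv AS
SELECT team_name, SUM(points_scored) AS total_points
FROM mydataset.game_events
GROUP BY team_name;
```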

When to use Materialized Views?

  • MVs are suited to cases where you need to query the latest data while cutting down latency and cost by reusing previously computed results. MVs act as pseudo-indexes, accelerating queries to the base table without requiring updates to any existing workflows.

Limitations of Materialized Views

  • As of this writing, joins are not supported in MVs. However, you can leverage an MV in a query that performs aggregation on top of joins, which reduces the cost and latency of the query.

  • MVs support a limited set of aggregation functions and a restricted subset of SQL. Refer to the query patterns supported by materialized views.

Refer to BigQuery documentation for working with materialized views, and best practices.

What Next?

In this post, we reviewed the lifecycle of a SQL query in BigQuery, worked with window functions, created standard and materialized views, and saved and shared queries.

References

Codelab

  • Try this codelab to query a large public dataset based on Github archives.

In the next post, we will dive into joins, optimizing join patterns and denormalizing data with nested and repeated data structures.

Stay tuned. Thank you for reading! Have a question or want to chat? Find me on Twitter or LinkedIn.

Thanks to Yuri Grinshteyn and Alicia Williams for helping with the post.