Managing Query Outputs

When evaluating your output data, consider the number of bytes written by your query. How many bytes are written for your result set? Are you properly limiting the amount of data written? Are you repeatedly writing the same data? The amount of data written by a query impacts query performance (I/O). If you are writing results to a permanent (destination) table, the amount of data written also has a cost.

The following best practices provide guidance on controlling your output data.

Avoid repeated joins and subqueries

Best practice: Avoid repeatedly joining the same tables and using the same subqueries.

If you are repeatedly joining the same tables, consider revisiting your schema. Instead of repeatedly joining the data, it might be more performant to use nested repeated data to represent the relationships. Nested repeated data saves you the performance impact of the communication bandwidth that a join requires. It also saves you the I/O costs that are incurred by repeatedly reading and writing the same data. For more information, see using nested and repeated fields.
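For illustration, a query over nested repeated data can read parent and child records together without a join. The table and field names below (a mydataset.orders table with a repeated line_items record) are hypothetical:

SELECT
  order_id,
  item.product,
  item.quantity
FROM `mydataset.orders`, UNNEST(line_items) AS item

The UNNEST operator flattens the repeated line_items field, so the relationship is resolved in a single table scan rather than through a join.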

Similarly, repeating the same subqueries impacts performance through repetitive query processing. If you are using the same subqueries in multiple queries, consider materializing the subquery results in a table. Then consume the materialized data in your queries.

Materializing your subquery results improves performance and reduces the overall amount of data that is read and written by BigQuery. The small cost of storing the materialized data outweighs the performance impact of repeated I/O and query processing.
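As a sketch, you might materialize a shared subquery once with a CREATE TABLE ... AS SELECT statement and then read the stored results from later queries. The dataset and table names here are hypothetical:

-- Materialize the shared subquery results once.
CREATE OR REPLACE TABLE `mydataset.customer_totals` AS
SELECT customer_id, SUM(amount) AS total_spend
FROM `mydataset.transactions`
GROUP BY customer_id;

-- Later queries read the materialized table instead of
-- re-running the subquery.
SELECT customer_id, total_spend
FROM `mydataset.customer_totals`
WHERE total_spend > 1000;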

Carefully consider materializing large result sets

Best practice: Carefully consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts.

BigQuery limits cached results to approximately 128 MB compressed. Queries that return larger results exceed this limit and frequently result in the following error: Response too large.

This error often occurs when you select a large number of fields from a table with a considerable amount of data. Issues writing cached results can also occur in ETL-style queries that normalize data without reduction or aggregation.

You can overcome the limitation on cached result size by:

  • Using filters to limit the result set
  • Using a LIMIT clause to reduce the result set, especially if you are using an ORDER BY clause
  • Writing the output data to a destination table

Be aware that writing very large result sets to destination tables impacts query performance (I/O). In addition, you will incur a small cost for storing the destination table. You can automatically delete a large destination table by using the dataset's default table expiration. For more information, see Use the expiration settings in the storage best practices.
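For example, you can write a large result set to a destination table with a CREATE TABLE ... AS SELECT statement rather than returning it as a cached query result. The table names and filter below are hypothetical:

-- Results larger than the cached result limit are written
-- to a destination table instead.
CREATE TABLE `mydataset.normalized_events` AS
SELECT *
FROM `mydataset.raw_events`
WHERE event_date >= '2024-01-01';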

Use a LIMIT clause with large sorts

Best practice: If you are sorting a very large number of values, use a LIMIT clause.

Writing results for a query with an ORDER BY clause can result in Resources exceeded errors. Because the final sorting must be done on a single slot, if you are attempting to order a very large result set, the final sorting can overwhelm the slot that is processing the data. If you are using an ORDER BY clause, also use a LIMIT clause.

For example, the following query orders a very large result set and throws a Resources exceeded error. The query sorts by the title column in the Wiki1B table. The title column contains millions of values.

SELECT title
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
ORDER BY title

To avoid the error, use a query like the following:

SELECT title
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
ORDER BY title
LIMIT 1000
