When evaluating your output data, consider the number of bytes written by your query. How many bytes are written for your result set? Are you properly limiting the amount of data written? Are you repeatedly writing the same data? The amount of data written by a query affects query performance (I/O). If you are writing results to a permanent (destination) table, the amount of data written also has a cost.
The following best practices provide guidance on controlling your output data.
Avoid repeated joins and subqueries
Best practice: Avoid repeatedly joining the same tables and using the same subqueries.
If you are repeatedly joining the same tables, consider revisiting your schema. Instead of repeatedly joining the data, it might be more performant for you to use nested repeated data to represent the relationships. Nested repeated data saves you the performance impact of the communication bandwidth that a join requires. It also saves you the I/O costs that you incur by repeatedly reading and writing the same data. For more information, see using nested and repeated fields.
Similarly, repeating the same subqueries affects performance through repetitive query processing. If you are using the same subqueries in multiple queries, consider materializing the subquery results in a table. Then consume the materialized data in your queries.
Materializing your subquery results improves performance and reduces the overall amount of data that BigQuery reads and writes. The small cost of storing the materialized data outweighs the performance impact of repeated I/O and query processing.
Carefully consider materializing large result sets
Best practice: Carefully consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts.
BigQuery limits cached results to approximately 10 GB
compressed. Queries that return larger results overtake this limit and
frequently result in the following error:
Response too large.
This error often occurs when you select a large number of fields from a table with a considerable amount of data. Issues writing cached results can also occur in ETL-style queries that normalize data without reduction or aggregation.
You can overcome the limitation on cached result size by:
- Using filters to limit the result set
- Using a
LIMITclause to reduce the result set, especially if you are using an
- Writing the output data to a destination table
You can page through the results using the BigQuery REST API. For more information, see Paging through table data.
Be aware that writing very large result sets to destination tables impacts query performance (I/O). In addition, you incur a small cost for storing the destination table. You can automatically delete a large destination table by using the dataset's default table expiration. For more information, see Use the expiration settings in the storage best practices.
LIMIT clause with large sorts
Best practice: If you are sorting a very large number of values, use a
Writing results for a query with an
ORDER BY clause can result in
errors. The final sorting is done on a single slot. If you attempt to order a very large result set, the final sorting can overwhelm the
slot that is processing the data. If you are using an
ORDER BY clause, also
For example, the following query orders a very large result set and throws a
Resources exceeded error. The query sorts by the
title column in
title column contains millions of values.
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title
To remove the error, use a query like the following:
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title DESC LIMIT 1000