Optimize storage in BigQuery
This page provides best practices for optimizing BigQuery storage. BigQuery stores data in a columnar format. Column-oriented databases are optimized for analytic workloads that aggregate data over a very large number of records. Because columns typically contain more redundancy than rows, this layout allows for greater data compression through techniques such as run-length encoding. For more information about how BigQuery stores data, see Overview of BigQuery storage. Optimizing BigQuery storage improves query performance and controls cost.
BigQuery provides details about the storage consumption of your
resources.
To view the table storage metadata, query the following INFORMATION_SCHEMA
views:
INFORMATION_SCHEMA.TABLE_STORAGE
INFORMATION_SCHEMA.TABLE_STORAGE_BY_ORGANIZATION
INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE
INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE_BY_ORGANIZATION
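For example, a query along the following lines lists the largest tables in a project by logical storage. The region qualifier and unit conversion are illustrative; replace region-us with the region that contains your datasets:

```sql
-- List the ten largest tables in the project by total logical
-- (uncompressed) bytes. Replace `region-us` with your region.
SELECT
  project_id,
  table_schema,
  table_name,
  total_logical_bytes / POW(1024, 3) AS logical_gib
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY
  logical_gib DESC
LIMIT 10;
```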
Cluster table data
Best practice: Create clustered tables.
To optimize storage for queries, start by clustering table data. By clustering frequently used columns, you can reduce the total volume of data scanned by the query. For information about how to create clusters, see Create and use clustered tables.
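As a sketch, the following DDL creates a table clustered by the columns that queries are assumed to filter on most often. The table and column names are hypothetical:

```sql
-- Cluster by customer_id first, then order_date, so queries that
-- filter on those columns scan fewer blocks.
CREATE TABLE mydataset.orders (
  customer_id STRING,
  order_date DATE,
  amount NUMERIC
)
CLUSTER BY customer_id, order_date;
```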
Partition table data
Best practice: Divide large tables with partitions.
With partitions, you can group and sort your data by a set of defined column characteristics, such as an integer column, a time-unit column, or the ingestion time. Partitioning improves query performance and controls costs by reducing the number of bytes read by a query.
For more information about partitions, see Introduction to partitioned tables.
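For example, the following DDL creates a table partitioned by day on a DATE column, so queries that filter on that column read only the matching partitions. The table and column names are hypothetical:

```sql
-- Partition by transaction_date; a filter such as
-- WHERE transaction_date = '2024-01-01' scans a single partition.
CREATE TABLE mydataset.transactions (
  transaction_id STRING,
  transaction_date DATE,
  amount NUMERIC
)
PARTITION BY transaction_date;
```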
Use the table and partition expiration settings
Best practice: To optimize storage, configure the default expiration settings for datasets, tables, and partitioned tables.
You can control storage costs and optimize storage usage by setting the default table expiration for newly created tables in a dataset. When a table expires, it is deleted along with all of the data that it contains. If you set the property when the dataset is created, any table created in the dataset is deleted after the expiration period. If you set the property after the dataset is created, only new tables are deleted after the expiration period.
For example, if you set the default table expiration to seven days, older data is automatically deleted after one week.
This option is useful if you need access to only the most recent data. It is also useful if you are experimenting with data and don't need to preserve it.
If your tables are partitioned by date, the dataset's default table expiration
applies to the individual partitions. You can also control partition expiration
using the time_partitioning_expiration
flag in the bq command-line tool or
the expirationMs
configuration setting in the API. When a partition expires,
data in the partition is deleted but the partitioned table is not dropped even
if the table is empty.
For example, the following command creates an ingestion-time partitioned table whose partitions expire after three days. The --time_partitioning_expiration flag takes a value in seconds (259,200 seconds = 3 days):

bq mk \
    --time_partitioning_type=DAY \
    --time_partitioning_expiration=259200 \
    project_id:dataset.table
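The same expiration can be expressed in SQL DDL. Note that the partition_expiration_days option takes a value in days rather than seconds; the table name is hypothetical:

```sql
-- Ingestion-time partitioned table whose partitions expire
-- after three days.
CREATE TABLE mydataset.events (
  event_id STRING,
  payload STRING
)
PARTITION BY _PARTITIONDATE
OPTIONS (partition_expiration_days = 3);
```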
Use BigQuery long-term storage
Best practice: Use BigQuery long-term storage to reduce cost of older data.
When you load data into BigQuery storage, the data is subject to BigQuery storage pricing. Rather than exporting your older data to another storage option (such as Cloud Storage), you can take advantage of BigQuery long-term storage pricing.
If you have a table that is not modified for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. If you have a partitioned table, each partition is considered separately for eligibility for long-term pricing, subject to the same rules as non-partitioned tables.
Configure the storage billing model
Best practice: Optimize the storage billing model based on your usage patterns.
BigQuery supports storage billing using logical (uncompressed) or physical (compressed) bytes, or a combination of both. The storage billing model configured for each dataset determines your storage pricing, but it does not impact query performance.
You can use the INFORMATION_SCHEMA
views to determine the storage billing model
that works best based on your usage patterns.
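As a sketch, a query like the following compares logical and physical bytes per dataset; a high compression ratio suggests that the physical (compressed) billing model may be cheaper for that dataset. The region qualifier is illustrative:

```sql
-- Compare logical (uncompressed) and physical (compressed) storage
-- per dataset to inform the billing model choice.
SELECT
  table_schema AS dataset,
  SUM(total_logical_bytes) / POW(1024, 3) AS logical_gib,
  SUM(total_physical_bytes) / POW(1024, 3) AS physical_gib,
  SAFE_DIVIDE(SUM(total_logical_bytes),
              SUM(total_physical_bytes)) AS compression_ratio
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
GROUP BY
  dataset
ORDER BY
  logical_gib DESC;
```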
Identify long-term or short-term data
Best practice: Identify whether row-level data needs to be stored long term, and store only aggregated data long term.
In many cases, details contained in transactional or row-level data are useful in the short term, but are referenced less over the long term. In these situations, you can build aggregation queries to compute and store the metrics associated with this data, and then use table or partition expiration to systematically remove the row-level data. This reduces storage charges while keeping metrics available for long-term consumption.
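For example, an aggregation query along these lines stores daily metrics in a separate table, while the row-level events table can use partition expiration to drop old data. The table and column names are hypothetical:

```sql
-- Keep compact daily aggregates long term; let partition expiration
-- remove the underlying row-level events.
CREATE TABLE mydataset.daily_metrics AS
SELECT
  DATE(event_timestamp) AS event_date,
  COUNT(*) AS event_count,
  SUM(amount) AS total_amount
FROM
  mydataset.events
GROUP BY
  event_date;
```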
Reduce the time travel window
Best practice: Lower the time travel window if you don't need the full default retention period.
Reducing the time travel window from the default value of 7 days reduces the retention period for data deleted from or changed in a table. You are billed for time travel storage only when using the physical (compressed) storage billing model.
The time travel window is set at the dataset level.
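For example, the following statement reduces a dataset's time travel window from the default 168 hours (7 days) to 48 hours. Valid values are multiples of 24 between 48 and 168 hours; the dataset name is hypothetical:

```sql
-- Reduce the time travel window for the dataset to 2 days.
ALTER SCHEMA mydataset
SET OPTIONS (max_time_travel_hours = 48);
```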
Archive data to Cloud Storage
Best practice: Consider archiving data in Cloud Storage.
You can move data from BigQuery to Cloud Storage based on the business need for archival. As a best practice, consider BigQuery long-term pricing before exporting data out of BigQuery.
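As a sketch, an EXPORT DATA statement like the following writes older rows to Cloud Storage in Parquet format; the bucket, table, and cutoff are hypothetical:

```sql
-- Export rows older than two years to Cloud Storage for archival.
EXPORT DATA
OPTIONS (
  uri = 'gs://my-archive-bucket/transactions/*.parquet',
  format = 'PARQUET'
)
AS
SELECT *
FROM mydataset.transactions
WHERE transaction_date < DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR);
```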
What's next
- Learn how to optimize cost.
- Learn how to optimize queries.
- Learn how to optimize functions.