Optimize storage in BigQuery

This page provides best practices for optimizing BigQuery storage. BigQuery stores data in columnar format. Column-oriented databases are optimized for analytic workloads that aggregate data over a very large number of records. Because columns typically have more redundancy than rows, column-oriented data can be compressed more effectively using techniques such as run-length encoding. For more information about how BigQuery stores data, see Overview of BigQuery storage. Optimizing BigQuery storage improves query performance and controls cost.

BigQuery provides details about the storage consumption of your resources. To view table storage metadata, query the INFORMATION_SCHEMA table storage views, such as INFORMATION_SCHEMA.TABLE_STORAGE.
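
For example, a query along the following lines lists the logical and physical bytes that each table in a dataset consumes. This is a minimal sketch; region-us and mydataset are placeholders for your own region and dataset:

SELECT
  table_name,
  total_logical_bytes,
  total_physical_bytes
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  table_schema = 'mydataset'
ORDER BY
  total_logical_bytes DESC;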

Cluster table data

Best practice: Create clustered tables.

To optimize storage for queries, start by clustering table data. By clustering on frequently used columns, you can reduce the total volume of data scanned by a query. For information about how to create clusters, see Create and use clustered tables.
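
For example, the following statement sketches a table clustered on two commonly filtered columns. The dataset, table, and column names are placeholders, not part of any existing schema:

CREATE TABLE mydataset.orders
(
  customer_id STRING,
  order_date DATE,
  amount NUMERIC
)
CLUSTER BY customer_id, order_date;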

Partition table data

Best practice: Divide large tables with partitions.

With partitions, you can group and sort your data by a set of defined column characteristics, such as an integer column, a time-unit column, or the ingestion time. Partitioning improves query performance and controls costs by reducing the number of bytes read by a query.
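
For example, the following statement sketches a table partitioned on a date column and clustered on a frequently filtered column. The dataset, table, and column names are placeholders:

CREATE TABLE mydataset.partitioned_orders
(
  customer_id STRING,
  order_date DATE,
  amount NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id;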

For more information about partitions, see Introduction to partitioned tables.

Use the table and partition expiration settings

Best practice: To optimize storage, configure the default expiration settings for datasets, tables, and partitioned tables.

You can control storage costs and optimize storage usage by setting the default table expiration for newly created tables in a dataset. When a table expires, it is deleted along with all of the data it contains. If you set the property when the dataset is created, any table created in the dataset is deleted after the expiration period. If you set the property after the dataset is created, only new tables are deleted after the expiration period.

For example, if you set the default table expiration to seven days, older data is automatically deleted after one week.

This option is useful if you need access to only the most recent data. It is also useful if you are experimenting with data and don't need to preserve it.
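
For example, the following statement sets a seven-day default table expiration on an existing dataset. This is a sketch; mydataset is a placeholder for your dataset name:

ALTER SCHEMA mydataset
SET OPTIONS (
  default_table_expiration_days = 7
);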

If your tables are partitioned by date, the dataset's default table expiration applies to the individual partitions. You can also control partition expiration using the time_partitioning_expiration flag in the bq command-line tool or the expirationMs configuration setting in the API. When a partition expires, data in the partition is deleted but the partitioned table is not dropped, even if the table is empty. For example, the following command creates a day-partitioned table whose partitions expire after three days (the time_partitioning_expiration value is specified in seconds):

bq mk \
--time_partitioning_type=DAY \
--time_partitioning_expiration=259200 \
project_id:dataset.table
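
For an existing partitioned table, you can set the same kind of expiration with a DDL statement. In this sketch, mydataset.mytable is a placeholder for your table:

ALTER TABLE mydataset.mytable
SET OPTIONS (
  partition_expiration_days = 3
);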

Store data in BigQuery

Best practice: Store your data in BigQuery.

When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage. After the data is loaded into BigQuery, it is subject to BigQuery storage pricing. You are charged for either the physical or the logical storage that your table consumes, including the time travel storage blocks.

Rather than exporting your older data to another storage option (such as Cloud Storage), take advantage of BigQuery long-term storage pricing.

If you have a table that is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. If you have a partitioned table, each partition is considered separately for long-term pricing eligibility, subject to the same rules as non-partitioned tables.
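
To see how much of your storage is already billed at the long-term rate, you can query the table storage metadata. The following sketch assumes the region-us region and a dataset named mydataset, both placeholders:

SELECT
  table_name,
  active_logical_bytes,
  long_term_logical_bytes
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  table_schema = 'mydataset';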

Identify long-term or short-term data

Best practice: Identify if row-level data needs to be stored long term, and only store aggregated data long term.

In many cases, details contained in transactional or row-level data are useful in the short term, but are referenced less over the long term. In these situations, you can build aggregation queries to compute and store the metrics associated with this data, and then use table or partition expiration to systematically remove the row-level data. This reduces storage charges while keeping metrics available for long-term consumption.
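
The following statements sketch this pattern with placeholder dataset, table, and column names: they materialize daily aggregates into a summary table and then let the row-level partitions expire after 90 days:

-- Compute and store the aggregated metrics.
CREATE TABLE IF NOT EXISTS mydataset.daily_sales_summary AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount
FROM
  mydataset.orders
GROUP BY
  order_date;

-- Let the row-level partitions expire after 90 days.
ALTER TABLE mydataset.orders
SET OPTIONS (
  partition_expiration_days = 90
);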

Reduce the time travel window

Best practice: If your requirements allow, lower the time travel window.

Reducing the time travel window from the default of seven days reduces the number of time travel storage blocks retained for an object. The time travel window is set at the dataset level.
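
The window is configured in hours. For example, the following statement reduces the time travel window for a dataset to two days (48 hours); mydataset is a placeholder:

ALTER SCHEMA mydataset
SET OPTIONS (
  max_time_travel_hours = 48
);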

Archive data to Cloud Storage

Best practice: Consider archiving data in Cloud Storage.

You can move data from BigQuery to Cloud Storage based on the business need for archival. As a best practice, consider BigQuery long-term pricing before exporting data out of BigQuery.
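
If you do decide to archive, you can export data directly to Cloud Storage with an EXPORT DATA statement. The following sketch writes older rows to Parquet files; the bucket, dataset, table, and column names are placeholders:

EXPORT DATA
OPTIONS (
  uri = 'gs://my-archive-bucket/orders/*.parquet',
  format = 'PARQUET',
  overwrite = true
)
AS
SELECT *
FROM mydataset.orders
WHERE order_date < DATE '2020-01-01';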

What's next