Partitioned Tables

This page explains the rationale behind partitioned tables, and describes a few best practices. By dividing a large table into smaller partitions, you can improve query performance and reduce the number of bytes billed by restricting the amount of data scanned. BigQuery enables you to partition tables by date. Tables partitioned by date are called date-partitioned tables.

To create a partitioned table, see Creating and Updating Date-Partitioned Tables. For an example of querying a partitioned table, see Querying Date-Partitioned Tables.

Background

Before partitioned tables became available, BigQuery users would often divide large datasets into separate tables organized by time period; usually daily tables, where each table represented data loaded on that particular date.

Dividing a dataset into daily tables helped to reduce the amount of data scanned when querying a specific date range. For example, if you have a a year's worth of data in a single table, a query that involves the last seven days of data still requires a full scan of the entire table to determine which data to return. However, if your table is divided into daily tables, you can restrict the query to the seven most recent daily tables.

Daily tables, however, have several disadvantages. You must manually, or programmatically, create the daily tables. SQL queries are often more complex because your data can be spread across hundreds of tables. Performance degrades as the number of referenced tables increases. There is also a limit of 1,000 tables that can be referenced in a single query. Partitioned tables have none of these disadvantages.

Partitioned tables

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date. To create a date-partitioned table, see Creating Date-Partitioned Tables.

You can use either legacy SQL or standard SQL to query partitioned tables.

Schema of partitions

A partition does not have a schema by itself. All partitions in a partitioned table share the schema of the table. If you need to load data into a partition with a schema that is not the same as the schema of the table, you need to update the schema of the table before loading the data. Alternatively, you can use schema update options to update the schema of the table as part of a load or query job. See configuration.load.schemaUpdateOptions.

The _PARTITIONTIME pseudo column

Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. The timestamp is based on UTC time and represents the number of microseconds since the unix epoch. For example, if data is appended to a table on April 15, 2016, all of the rows of data appended on that day contain the value TIMESTAMP("2016-04-15") in the _PARTITIONTIME column.

The _PARTITIONTIME column name is reserved, which means that you cannot create a column by that name in any of your tables.

To select the value of _PARTITIONTIME, you must use an alias. For example, the following query selects _PARTITIONTIME by assigning the alias pt to the pseudo column:

SELECT
  _PARTITIONTIME AS pt,
  field1
FROM
  mydataset.table1

Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.

Addressing table partitions

To make it easier to reference data from a specific partition, BigQuery provides partition decorators. Partition decorators take the form:

[TABLE_NAME]$YYYYMMDD

where [TABLE_NAME] is the name of a partitioned table, and YYYYMMDD represents a date. For example, the partition for May 19, 2016 in a table named mydataset:table can be referenced using:

mydataset.table$20160519

BigQuery treats a table name with a partition decorator as a table, which means that a table name with a partition decorator can be used as a table in any api that receives a table name.

Copying partitioned tables

If you copy a partitioned table to a new table, all of the partitioning information is copied with the table. The new table and the old table will have identical partitions.

Long-term storage pricing

Each partition of a partitioned table is considered separately for long-term storage pricing. If a partition hasn’t been modified in the last 90 days, the data in that partition is considered long term storage price and is charged at the discounted price.

Best Practices

Viewing most recent partitions

To limit the amount of data scanned to a set of partitions, create a view that contains a filter on _PARTITIONTIME. For example, the following query can be used to create a view that includes only the most recent seven days of data from a table named mydataset.partitioned_table:

Legacy SQL

SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 7 * 60 * 60 * 24 * 1000000))
  AND TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));

Standard SQL

SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);

For information about creating views, see Creating Views.

Dealing with timezone issues

The value of _PARTITIONTIME is based on the UTC date when the data is loaded, which means that partitions are divided based on 12:00 AM UTC. If you want to query data based on a timezone other than UTC, you should choose one of the following options before you start loading data into your table.

There are two ways to query data in a partitioned table using a custom, non-UTC, timezone. You can either create a separate timestamp column or you can use partition decorators to load data into a specific partition.

If you use a timestamp column, you can use the default UTC-based partitioning and account for timezone differences in your SQL queries. Alternately, if you prefer to have partitions that are grouped by a timezone other than UTC, use partition decorators to load data into partitions based on a different timezone.

Tracking timezones using a timestamp

To adjust for timezones using a timestamp, create a separate column to store a timestamp that enables you to address rows by the hour or minute.

To query for data based on a timezone other than UTC, use both the _PARTITIONTIME pseudo column and your custom timestamp column. Using _PARTITIONTIME limits the table scan to the relevant partitions, and your custom timestamp further limits the results to your timezone. For example, to query data from a partitioned table (mydataset.partitioned_table) with a timestamp field [MY_TIMESTAMP_FIELD] for data added to the table between 2016-05-01 08:00:00 PST and 2016-05-05 14:00:00 PST:

legacy SQL

SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND DATE_ADD([MY_TIMESTAMP_FIELD], 8, ‘HOUR’) BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

standard SQL

SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND TIMESTAMP_ADD([MY_TIMESTAMP_FIELD], INTERVAL 8 HOUR) BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

Loading data using partition decorators

Partition decorators enable you to load data into a specific partition. To adjust for timezones, use a partition decorator to load data into a partition based on your preferred timezone. For example, if you are on Pacific Standard Time (PST), load all data generated on May 1, 2016 PST into the partition for that date by using the corresponding partition decorator:

[TABLE_NAME]$20160501

Better performance with the pseudo column

To improve query performance, use the _PARTITIONTIME pseudo column by itself on the left side of a comparison. For example, the following queries process the same amount of data, but the second example can provide better performance.

Example 1. The following query can be slower because it combines the pseudo column value with other operations in the WHERE filter.

Legacy SQL

# Can be slower
SELECT
  field1
FROM
  mydataset.table1
WHERE
  DATE_ADD(_PARTITIONTIME, 5, "DAY") > TIMESTAMP("2016-04-15")

Standard SQL

# Can be slower
SELECT
  field1
FROM
  mydataset.table1
WHERE
  TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 5 DAY) > TIMESTAMP("2016-04-15")

Example 2. The following query can perform better because it places the pseudo column by itself on the left side of the filter comparison.

Legacy SQL

# Often performs better
SELECT
  field1
FROM
  mydataset.table1
WHERE
  _PARTITIONTIME > DATE_ADD(TIMESTAMP(‘2016-04-15’), -5, "DAY")

Standard SQL

# Often performs better
SELECT
  field1
FROM
  mydataset.table1
WHERE
  _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP(‘2016-04-15’), INTERVAL 5 DAY)

Depending on the table size, the second query, which places _PARTITIONTIME by itself on the left side of the > comparison operator, can provide better performance than the first query. Because the queries process the same amount of data, the number of bytes billed is the same in both cases.

What's next

Send feedback about...

BigQuery Documentation