This page explains the rationale behind partitioned tables, and describes a few best practices. By dividing a large table into smaller partitions, you can improve query performance and reduce the number of bytes billed by restricting the amount of data scanned. BigQuery enables you to partition tables by date. Tables partitioned by date are called date-partitioned tables.
Before partitioned tables became available, BigQuery users would often divide large datasets into separate tables organized by time period; usually daily tables, where each table represented data loaded on that particular date.
Dividing a dataset into daily tables helped to reduce the amount of data scanned when querying a specific date range. For example, if you have a a year's worth of data in a single table, a query that involves the last seven days of data still requires a full scan of the entire table to determine which data to return. However, if your table is divided into daily tables, you can restrict the query to the seven most recent daily tables.
Daily tables, however, have several disadvantages. You must manually, or programmatically, create the daily tables. SQL queries are often more complex because your data can be spread across hundreds of tables. Performance degrades as the number of referenced tables increases. There is also a limit of 1,000 tables that can be referenced in a single query. Partitioned tables have none of these disadvantages.
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date. To create a date-partitioned table, see Creating Date-Partitioned Tables.
You can use either legacy SQL or standard SQL to query partitioned tables.
Schema of partitions
A partition does not have a schema by itself. All partitions in a partitioned table share the schema of the table. If you need to load data into a partition with a schema that is not the same as the schema of the table, you need to update the schema of the table before loading the data. Alternatively, you can use schema update options to update the schema of the table as part of a load or query job. See configuration.load.schemaUpdateOptions.
The _PARTITIONTIME pseudo column
Partitioned tables include a pseudo column named
_PARTITIONTIME that contains
a date-based timestamp for data loaded into the table. The timestamp is based
on UTC time and represents the number of microseconds since the unix epoch.
For example, if data is appended to a table on April 15, 2016, all of the rows
of data appended on that day contain the value
_PARTITIONTIME column name is reserved, which means that you cannot
create a column by that name in any of your tables.
To select the value of
_PARTITIONTIME, you must use an alias. For example,
the following query selects
_PARTITIONTIME by assigning the alias
the pseudo column:
SELECT _PARTITIONTIME AS pt, field1 FROM mydataset.table1
Data in the streaming buffer has
a NULL value for the
Addressing table partitions
To make it easier to reference data from a specific partition, BigQuery provides partition decorators. Partition decorators take the form:
where [TABLE_NAME] is the name of a partitioned table, and
a date. For example, the partition for May 19, 2016 in a table named
mydataset:table can be referenced using:
BigQuery treats a table name with a partition decorator as a table, which means that a table name with a partition decorator can be used as a table in any api that receives a table name.
Copying partitioned tables
If you copy a partitioned table to a new table, all of the partitioning information is copied with the table. The new table and the old table will have identical partitions.
Long-term storage pricing
Each partition of a partitioned table is considered separately for long-term storage pricing. If a partition hasn’t been modified in the last 90 days, the data in that partition is considered long term storage price and is charged at the discounted price.
Viewing most recent partitions
To limit the amount of data scanned to a set of partitions, create a view that
contains a filter on
_PARTITIONTIME. For example, the following query can be
used to create a view that includes only the most recent seven days of data from
a table named
SELECT * FROM mydataset.partitioned_table WHERE _PARTITIONTIME BETWEEN TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 7 * 60 * 60 * 24 * 1000000)) AND TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));
For information about creating views, see Creating Views.
Dealing with timezone issues
The value of
_PARTITIONTIME is based on the UTC date when the data is loaded,
which means that partitions are divided based on 12:00 AM UTC. If you want to
query data based on a timezone other than UTC, you should choose one of the
following options before you start loading data into your table.
There are two ways to query data in a partitioned table using a custom, non-UTC, timezone. You can either create a separate timestamp column or you can use partition decorators to load data into a specific partition.
If you use a timestamp column, you can use the default UTC-based partitioning and account for timezone differences in your SQL queries. Alternately, if you prefer to have partitions that are grouped by a timezone other than UTC, use partition decorators to load data into partitions based on a different timezone.
Tracking timezones using a timestamp
To adjust for timezones using a timestamp, create a separate column to store a timestamp that enables you to address rows by the hour or minute.
To query for data based on a timezone other than UTC, use both the
_PARTITIONTIME pseudo column and your custom timestamp column.
_PARTITIONTIME limits the table scan to the relevant partitions, and
your custom timestamp further limits the results to your timezone. For example,
to query data from a partitioned table (
with a timestamp field
for data added to the table between
2016-05-01 08:00:00 PST and
2016-05-05 14:00:00 PST:
SELECT field1 FROM mydataset:partitioned_table WHERE _PARTITIONTIME BETWEEN TIMESTAMP(“2016-05-01”) AND TIMESTAMP(“2016-05-06”) AND DATE_ADD(my_timestamp_field, 8, ‘HOUR’) BETWEEN TIMESTAMP(“2016-05-01 12:00:00”) AND TIMESTAMP(“2016-05-05 14:00:00”);
Loading data using partition decorators
Partition decorators enable you to load data into a specific partition. To adjust for timezones, use a partition decorator to load data into a partition based on your preferred timezone. For example, if you are on Pacific Standard Time (PST), load all data generated on May 1, 2016 PST into the partition for that date by using the corresponding partition decorator:
Better performance with the pseudo column
To improve query performance, use the
_PARTITIONTIME pseudo column by itself
on the left side of a comparison. For example, the following queries process
the same amount of data, but the second example can provide better performance.
The following query can be slower because it combines the pseudo column value with other operations in the
# Can be slower SELECT field1 FROM mydataset.table1 WHERE DATE_ADD(_PARTITIONTIME, 5, “DAY”) > TIMESTAMP(‘2016-04-15’)
The following query can perform better because it places the pseudo column by itself on the left side of the filter comparison.
# Often performs better SELECT field1 FROM mydataset.table1 WHERE _PARTITIONTIME > DATE_ADD(TIMESTAMP(‘2016-04-15’), -5, “DAY”)
Depending on the table size, the second query, which places
by itself on the left side of the
> comparison operator, can provide better
performance than the first query. Because the queries process the same amount of
data, the number of bytes billed is the same in both cases.