Querying partitioned tables

Querying partitioned tables

You can query partitioned tables by:

  • Using the Cloud Console
  • Using the bq command-line tool's bq query command
  • Calling the jobs.insert API method and configuring a query job
  • Using the client libraries

For more information on running queries, see Running interactive and batch queries.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document.

Required permissions

To query partitioned tables, you need IAM permissions to query tables and run query jobs.

Permissions to query a table

To query a table, you need the bigquery.tables.getData IAM permission.

Each of the following predefined IAM roles includes the permissions that you need in order to create a view:

  • roles/bigquery.dataViewer
  • roles/bigquery.dataOwner
  • roles/bigquery.dataEditor
  • roles/bigquery.admin

Additionally, if you have the bigquery.datasets.create permission, you can query tables and views in the datasets that you create.

Permissions to run a query job

To run a query job, you need the bigquery.jobs.create IAM permission.

Each of the following predefined IAM roles includes the permissions that you need in order to run a query job:

  • roles/bigquery.user
  • roles/bigquery.jobUser
  • roles/bigquery.admin

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Ingestion-time partitioned table pseudo columns

When you create an ingestion-time partitioned table, two pseudo columns are added to the table: a _PARTITIONTIME pseudo column and a _PARTITIONDATE pseudo column. The _PARTITIONTIME pseudo column contains a date-based timestamp for data that is loaded into the table. The _PARTITIONDATE pseudo column contains a date representation. Both pseudo column names are reserved, which means that you cannot create a column with either name in any of your tables.

_PARTITIONTIME and _PARTITIONDATE are available only in ingestion-time partitioned tables. Partitioned tables do not have pseudo columns. For information on querying partitioned tables, see Querying partitioned tables.

The _PARTITIONTIME pseudo column

The _PARTITIONTIME pseudo column contains a timestamp that is based on UTC time and represents the number of microseconds since the unix epoch. For example, if data is appended to a table on April 15, 2016, 08:15:00 UTC, all of the rows of data that are appended on that day have the _PARTITIONTIME column that contains one of the following values: + TIMESTAMP("2016-04-15 08:00:00") for hourly partitioned tables. + TIMESTAMP("2016-04-15") for daily partitioned tables. + TIMESTAMP("2016-04-01") for monthly partitioned tables. + TIMESTAMP("2016-01-01") for yearly partitioned tables.

To query the _PARTITIONTIME pseudo column, you must use an alias. For example, the following query selects _PARTITIONTIME by assigning the alias pt to the pseudo column:

SELECT
  _PARTITIONTIME AS pt,
  column
FROM
  dataset.table

Where:

  • column is the name of a column to query. You can specify multiple columns as a comma-separated list.
  • dataset is the dataset containing the partitioned table.
  • table is the partitioned table.

Data in the streaming buffer has NULL values in the _PARTITIONTIME column.

The _PARTITIONDATE pseudo column

The _PARTITIONDATE pseudo column contains the UTC date corresponding to the value in the _PARTITIONTIME pseudo column. This column is not supported in hourly, monthly, or yearly partitioned tables.

To query the _PARTITIONDATE pseudo column, you must use an alias. For example, the following query selects _PARTITIONDATE by assigning the alias pd to the pseudo column:

SELECT
  _PARTITIONDATE AS pd,
  column
FROM
  dataset.table

Where:

  • column is the name of a column to query. You can specify multiple columns as a comma-separated list.
  • dataset is the dataset containing the partitioned table.
  • table is the partitioned table.

Data in the streaming buffer has NULL values in the _PARTITIONDATE column.

Querying ingestion-time partitioned tables using pseudo columns

When you query data in ingestion-time partitioned tables, you reference specific partitions by specifying the values in the _PARTITIONTIME or _PARTITIONDATE pseudo columns. For example:

  • _PARTITIONTIME >= "2018-01-29 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00"
  • _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

or

  • _PARTITIONDATE >= "2018-01-29" AND _PARTITIONDATE < "2018-01-30"
  • _PARTITIONDATE BETWEEN '2016-01-01' AND '2016-01-02'

Limiting partitions queried using pseudo columns

You can use the _PARTITIONTIME and _PARTITIONDATE pseudo columns to limit the number of partitions scanned during a query. This is also referred to as pruning partitions. Partition pruning is the mechanism BigQuery uses to eliminate unnecessary partitions from the input scan. The pruned partitions are not included when calculating the bytes scanned by the query, reducing the on-demand analysis cost. In general, partition pruning helps reduce query cost when the filters consist of constant expressions, which can be evaluated at the outset of the query.

For example, the following query scans only the partitions between the dates January 1, 2016 and January 2, 2016 from the partitioned table:

_PARTITIONTIME

SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02')

_PARTITIONDATE

SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONDATE BETWEEN '2016-01-01'
  AND '2016-01-02'

Partition pruning examples

This example demonstrates limiting the number of scanned partitions using a pseudo column filter in a subquery:

_PARTITIONTIME

SELECT
  column1,
  column2
FROM (
  SELECT
    column1,
    column2
  FROM
    dataset.table
  WHERE
    _PARTITIONTIME = TIMESTAMP('2016-03-28')) t1
CROSS JOIN
  dataset.table t2
WHERE
  t1.column2 = "one"

_PARTITIONDATE

SELECT
  column1,
  column2
FROM (
  SELECT
    column1,
    column2
  FROM
    dataset.table
  WHERE
    _PARTITIONDATE = '2016-03-28') t1
CROSS JOIN
  dataset.table t2
WHERE
  t1.column2 = "one"

The following query limits the partitions that are queried, based on the first filter condition in the WHERE clause. However, the second filter condition in the WHERE clause does not limit the partitions that are queried because it uses table values, which are dynamic.

_PARTITIONTIME

SELECT
  column
FROM
  dataset.table2
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01')
  AND _PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)

_PARTITIONDATE

SELECT
  column
FROM
  dataset.table2
WHERE
  _PARTITIONDATE BETWEEN '2017-01-01' AND '2017-03-01'
  AND _PARTITIONDATE = (SELECT MAX(date) from dataset.table1)

Pseudo column queries that scan all partitions

The following examples use pseudo columns but scan all the partitions in a time-unit partitioned table.

In legacy SQL, the _PARTITIONTIME filter works only when the filter is specified as closely as possible to the table name. For example, the following query scans all partitions in table1 despite the presence of the _PARTITIONTIME filter:

#legacySQL
# Scans all partitions on t1
SELECT
  t1.field1,
  t2.field1
FROM
  dataset.table1 t1
CROSS JOIN
  dataset.table2 t2
WHERE
  table1._PARTITIONTIME = TIMESTAMP('2016-03-28')
  AND t1.field1 = "one"

Do not include any other columns in a _PARTITIONTIME filter. For example, the following query does not limit the partitions that are scanned because field1 is a column in the table and BigQuery can't determine in advance which partitions to select. For more information, see Pruning (limiting) partitions.

# Scans all partitions of table2
SELECT
  field1
FROM
  dataset.table2
WHERE
  _PARTITIONTIME + field1 = TIMESTAMP('2016-03-28');

Better performance with pseudo columns

To improve query performance, use the _PARTITIONTIME pseudo column by itself on the left side of a comparison. For example, the following queries process the same amount of data, but the second example can provide better performance.

Example 1: The following query can be slower because it combines the pseudo column value with other operations in the WHERE filter.

Standard SQL

#standardSQL
/* Can be slower */
SELECT
  field1
FROM
  dataset.table1
WHERE
  TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 5 DAY) > TIMESTAMP("2016-04-15")

Legacy SQL

#legacySQL
/* Can be slower */
SELECT
  field1
FROM
  dataset.table1
WHERE
  DATE_ADD(_PARTITIONTIME, 5, "DAY") > TIMESTAMP("2016-04-15")

Example 2: The following query can perform better because it places the pseudo column by itself on the left side of the filter comparison.

Standard SQL

#standardSQL
/* Often performs better */
SELECT
  field1
FROM
  dataset.table1
WHERE
  _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP('2016-04-15'), INTERVAL 5 DAY)

Legacy SQL

#legacySQL
/* Often performs better */
SELECT
  field1
FROM
  dataset.table1
WHERE
  _PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")

Depending on the table size, the second query, which places _PARTITIONTIME by itself on the left side of the > comparison operator, can provide better performance than the first query. Because the queries process the same amount of data, the number of bytes billed is the same in both cases.

Querying ingestion-time partitioned tables using a wildcard table

In addition to using the pseudo columns to limit the number of partitions scanned during a query, you can also use the pseudo columns to query a range of partitioned tables using a wildcard table. For information on using a wildcard table with partitioned tables, see Scanning a range of partitioned tables using _PARTITIONTIME.

Querying ingestion-time partitioned tables using time zones

The value of _PARTITIONTIME is based on the UTC date when the field is populated, which means that partitions are divided based on 12:00 AM UTC. If you want to query data based on a time zone other than UTC, you should choose one of the following options before you start loading data into your table.

There are two ways to query data in a partitioned table using a custom, non-UTC, time zone. You can either create a separate timestamp column or you can use partition decorators to load data into a specific partition.

If you use a timestamp column, you can use the default UTC-based partitioning and account for time zone differences in your SQL queries. Alternately, if you prefer to have partitions that are grouped by a time zone other than UTC, use partition decorators to load data into partitions based on a different time zone.

Querying time zones using a timestamp column

To adjust for timezones using a timestamp, create a separate column to store a timestamp that enables you to address rows by the hour or minute.

To query for data based on a timezone other than UTC, use both the _PARTITIONTIME pseudo column and your custom timestamp column. Using _PARTITIONTIME limits the table scan to the relevant partitions, and your custom timestamp further limits the results to your timezone. For example, to query data from a partitioned table (mydataset.partitioned_table) with a timestamp field MY_TIMESTAMP_FIELD for data added to the table between 2016-05-01 12:00:00 PST and 2016-05-05 14:00:00 PST:

Standard SQL

#standardSQL
SELECT
  field1
FROM
  dataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND TIMESTAMP_ADD(MY_TIMESTAMP_FIELD, INTERVAL 8 HOUR) BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

Legacy SQL

#legacySQL
SELECT
  field1
FROM
  dataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND DATE_ADD(MY_TIMESTAMP_FIELD, 8, 'HOUR') BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

Querying an integer-range partitioned table

Integer-range partitioned tables can only be queried through Standard SQL. When an integer-range partitioned table is queried, if there are filters on the integer partitioning column, the filters will be used to prune the partitions and reduce the query cost.

The following query scans the 3 partitions that start with 30, 40, and 50, for an integer-range partitioned tables with the partitioning specification of customer_id:0:100:10.

bq query --nouse_legacy_sql \
'SELECT * FROM mydataset.mytable WHERE customer_id BETWEEN 30 AND 50'

Here's example output from the bq query command:

+-------------+-------+
| customer_id | value |
+-------------+-------+
|          40 |    41 |
|          45 |    46 |
|          30 |    31 |
|          35 |    36 |
|          50 |    51 |
+-------------+-------+

For this example, each partition has 2 rows, and each row has 2 integer columns, so the query scans 3 * 2 * 2 * 8 = 96 bytes. You can examine the job information:

bq show -j bqjob_r4fce65fa3381528e_000001670994aeb6_1

Here's example output from the bq show command:

  Job Type    State      Start Time      Duration       User Email        Bytes Processed   Bytes Billed   Billing Tier   Labels
 ---------- --------- ----------------- ---------- --------------------- ----------------- -------------- -------------- --------
  query      SUCCESS   24 Sep 12:19:58   0:00:01    joe@example.com      96                10485760       1

DML statements are supported. For example:

bq query --nouse_legacy_sql \
'DELETE FROM mydataset.mytable WHERE customer_id = 30'

Currently, partition pruning is not supported for functions over an integer range partitioned column. As an example, the following query scans the entire table.

bq query --nouse_legacy_sql \
'SELECT * FROM mydataset.mytable WHERE customer_id+1 BETWEEN 30 AND 50'

Table decorators on integer-range partitioned tables

Legacy SQL supports using table decorators to address a partition in an integer range partitioned table. The key to address a range partition is the start of the range.

The following example queries the range partition that starts with 0. The partitioning column is customer_id and the partition has two rows.

bq query --use_legacy_sql=true 'SELECT * FROM mydataset.mytable$0'

+-------------+-------+
| customer_id | value |
+-------------+-------+
|           0 |     1 |
|           5 |     6 |
+-------------+-------+

Creating a view using an ingestion-time partitioned table's pseudo columns

To limit the amount of data read by a query to a set of partitions, create a view that contains a filter on the _PARTITIONTIME or _PARTITIONDATE pseudo column. For example, the following query can be used to create a view that includes only the most recent seven days of data from a table named dataset.partitioned_table:

Standard SQL

#standardSQL
SELECT
  *
FROM
  dataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY);

Legacy SQL

#legacySQL
SELECT
  *
FROM
  dataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 7 * 60 * 60 * 24 * 1000000))
  AND TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));

For information about creating views, see Creating views.

Ingestion-time partitioned tables' _UNPARTITIONED_ partition

The __UNPARTITIONED__ partition temporarily holds data that is streamed to a partitioned table while it is in the streaming buffer. Data that is streamed directly to a specific partition of a partitioned table does not use the __UNPARTITIONED__ partition. Instead, the data is streamed directly to the partition. See streaming into partitioned tables for more information.

To query data in the __UNPARTITIONED__ partition, use the _PARTITIONTIME pseudo column with the NULL value. For example:

SELECT
  column
FROM
  dataset.table
WHERE
  _PARTITIONTIME IS NULL

Where:

  • column is the name of a column to query. You can specify multiple columns as a comma-separated list.
  • dataset is the dataset containing the partitioned table.
  • table is the partitioned table.

Querying partitioned tables

Tables partitioned based on a TIMESTAMP, DATE, DATETIME, or INTEGER column do not have pseudo columns. To limit the number of partitions scanned when querying partitioned tables, use a predicate filter (a WHERE clause). Filters on the partitioning column will be used to prune the partitions and reduce the query cost.

Hourly, monthly, and yearly partitioned tables can only be queried through Standard SQL.

When you create a partitioned table, you can require the use of predicate filters by enabling the Require partition filter option. When this option is applied, attempts to query the partitioned table without specifying a WHERE clause produce the following error: Cannot query over table 'project_id.dataset.table' without a filter that can be used for partition elimination.

For more information about adding the Require partition filter option when you create a partitioned table, see Creating partitioned tables.

If you do not enable the Require partition filter option when you create a partitioned table, you can update the table to add the option.

Pruning (limiting) partitions

To limit the partitions that are scanned in a query, use a constant expression in your filter. If you use dynamic expressions in your query filter, BigQuery must scan all of the partitions.

For example, the following query prunes partitions because the filter, WHERE t1.ts=CURRENT_TIMESTAMP(), contains a constant expression:

#standardSQL
SELECT
  t1.name,
  t2.category
FROM
  table1 t1
INNER JOIN
  table2 t2
ON t1.id_field = t2.field2
WHERE
  t1.ts = CURRENT_TIMESTAMP()

However, the following query doesn't prune partitions, because the filter, WHERE t1.ts = (SELECT timestamp from table where key = 2), is not a constant expression; it depends on the dynamic values of the timestamp and key fields:

#standardSQL
SELECT
  t1.name,
  t2.category
FROM
  table1 t1
INNER JOIN
  table2 t2
ON
  t1.id_field = t2.field2
WHERE
  t1.ts = (SELECT timestamp from table3 where key = 2)

Isolate the partition column in your filter

Isolate the partition column when expressing a filter. Filters that require data from multiple fields to compute will not prune partitions. For example, a query with a date comparison using the partitioning column and a second field, or queries containing some field concatenations will not prune partitions.

For example, the following filter does not prune partitions because it requires a computation based on the partitioning ts field and a second field ts2:

WHERE TIMESTAMP_ADD(ts, INTERVAL 6 HOUR) > ts2

Next steps