Querying Partitioned Tables

Querying partitioned tables

You can query partitioned tables by using the BigQuery web UI, the command-line tool's bq query command, or by calling the jobs.insert API method and configuring a query job.

For more information on running queries, see Running Interactive and Batch Queries.

Required permissions

At the dataset level, querying a partitioned table requires READER access to the dataset that contains the table.

Instead of using dataset-level permissions, you can leverage a project-level IAM role that includes bigquery.tables.getData permissions. bigquery.tables.getData permissions are required to read the data in the table being queried.

All predefined, project-level IAM roles include bigquery.tables.getData permissions except for bigquery.user, bigquery.jobUser, and bigquery.metadataViewer.

You must also be granted bigquery.jobs.create permissions to run query jobs. The following predefined, project-level IAM roles include bigquery.jobs.create permissions:

For more information on IAM roles and permissions in BigQuery, see access control. For more information on dataset-level roles, see Primitive roles for datasets.

Ingestion-time partitioned table pseudo columns

When you create an ingestion-time partitioned table, two pseudo columns are added to the table: a _PARTITIONTIME pseudo column and a _PARTITIONDATE pseudo column. The _PARTITIONTIME pseudo column contains a date-based timestamp for data that is loaded into the table. The _PARTITIONDATE pseudo column contains a date representation. Both pseudo column names are reserved, which means that you cannot create a column with either name in any of your tables.

_PARTITIONTIME and _PARTITIONDATE are available only in ingestion-time partitioned tables. Partitioned tables do not have pseudo columns. For information on querying partitioned tables, see Querying partitioned tables.

The _PARTITIONTIME pseudo column

The _PARTITIONTIME pseudo column contains a timestamp that is based on UTC time and represents the number of microseconds since the unix epoch. For example, if data is appended to a table on April 15, 2016, all of the rows of data that are appended on that day contain the value TIMESTAMP("2016-04-15") in the _PARTITIONTIME column.

To query the _PARTITIONTIME pseudo column, you must use an alias. For example, the following query selects _PARTITIONTIME by assigning the alias pt to the pseudo column:

SELECT
  _PARTITIONTIME AS pt,
  [COLUMN]
FROM
  [DATASET].[TABLE]

Where:

  • [COLUMN] is the name of a column to query. You can specify multiple columns as a comma-separated list.
  • [DATASET] is the dataset containing the partitioned table.
  • [TABLE] is the partitioned table.

Data in the streaming buffer has NULL values in the _PARTITIONTIME column.

The _PARTITIONDATE pseudo column

The _PARTITIONDATE pseudo column contains the UTC date corresponding to the value in the _PARTITIONTIME pseudo column.

To query the _PARTITIONDATE pseudo column, you must use an alias. For example, the following query selects _PARTITIONDATE by assigning the alias pd to the pseudo column:

SELECT
  _PARTITIONDATE AS pd,
  [COLUMN]
FROM
  [DATASET].[TABLE]

Where:

  • [COLUMN] is the name of a column to query. You can specify multiple columns as a comma-separated list.
  • [DATASET] is the dataset containing the partitioned table.
  • [TABLE] is the partitioned table.

Data in the streaming buffer has NULL values in the _PARTITIONDATE column.

Querying ingestion-time partitioned tables using pseudo columns

When you query data in ingestion-time partitioned tables, you reference specific partitions by specifying the values in the _PARTITIONTIME or _PARTITIONDATE pseudo columns. For example:

  • _PARTITIONTIME >= "2018-01-29 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00"
  • _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')

or

  • _PARTITIONDATE >= "2018-01-29" AND _PARTITIONDATE < "2018-01-30"
  • _PARTITIONDATE BETWEEN '2016-01-01' AND '2016-01-02'

Limiting partitions queried using pseudo columns

Use the _PARTITIONTIME and _PARTITIONDATE pseudo columns to limit the number of partitions scanned during a query. This is also referred to as pruning partitions. Partition pruning is the mechanism BigQuery uses to eliminate unnecessary partitions from the input scan. The pruned partitions are not included when calculating the bytes scanned by the query, reducing the on-demand analysis cost. In general, partition pruning will reduce query cost when the filters can be evaluated at the outset of the query without requiring any subquery evaluations or data scans.

For example, the following query scans only the partitions between the dates January 1, 2016 and January 2, 2016 from the partitioned table:

_PARTITIONTIME

SELECT
  [COLUMN]
FROM
  [DATASET].[TABLE]
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02')

_PARTITIONDATE

SELECT
  [COLUMN]
FROM
  [DATASET].[TABLE]
WHERE
  _PARTITIONDATE BETWEEN '2016-01-01'
  AND '2016-01-02'

Partition pruning examples

This example demonstrates limiting the number of scanned partitions using a pseudo column filter in a subquery:

_PARTITIONTIME

SELECT
  [COLUMN1],
  [COLUMN2]
FROM (
  SELECT
    [COLUMN1],
    [COLUMN2]
  FROM
    [DATASET].[TABLE]
  WHERE
    _PARTITIONTIME = TIMESTAMP('2016-03-28')) t1
CROSS JOIN
  [DATASET].[TABLE2] t2
WHERE
  t1.[COLUMN2] = "one"

_PARTITIONDATE

SELECT
  [COLUMN1],
  [COLUMN2]
FROM (
  SELECT
    [COLUMN1],
    [COLUMN2]
  FROM
    [DATASET].[TABLE]
  WHERE
    _PARTITIONDATE = '2016-03-28') t1
CROSS JOIN
  [DATASET].[TABLE2] t2
WHERE
  t1.[COLUMN2] = "one"

The following query limits some partitions that are based on part of the filter condition, _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01'). It does not limit it based on the condition that involves the subquery.:

_PARTITIONTIME

SELECT
  [COLUMN]
FROM
  [DATASET].[TABLE2]
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01')
  AND _PARTITIONTIME = (SELECT MAX(timestamp) from [DATASET].[TABLE1])

_PARTITIONDATE

SELECT
  [COLUMN]
FROM
  [DATASET].[TABLE2]
WHERE
  _PARTITIONDATE BETWEEN '2017-01-01' AND '2017-03-01'
  AND _PARTITIONDATE = (SELECT MAX(date) from [DATASET].[TABLE1])

Pseudo column queries that scan all partitions

The following examples use pseudo columns but scan all the partitions in a time-partitioned table.

In legacy SQL, the _PARTITIONTIME filter works only when the filter is specified as closely as possible to the table name. For example, the following query scans all partitions in table1 despite the presence of the _PARTITIONTIME filter:

#legacySQL
# Scans all partitions on t1
SELECT
  t1.field1,
  t2.field1
FROM
  mydataset.table2 t1
CROSS JOIN
  mydataset.table2 t2
WHERE
  t1._PARTITIONTIME = TIMESTAMP('2016-03-28')
  AND t1.field2 = "one"

Do not include any other columns in a _PARTITIONTIME filter. For example, the following query does not limit the partitions that are scanned because field2 is a column in the table and BigQuery can't determine in advance which partitions to select.

# Scans all partitions of table2
SELECT
  field1
FROM
  mydataset.table2
WHERE
  _PARTITIONTIME + field2 = TIMESTAMP('2016-03-28');

Filters on _PARTITIONTIME that include subqueries can't be used to limit the number of partitions scanned for a partitioned table. For example, the following query does not limit the partitions scanned on table mydataset.table2:

# Scans all partitions of table2
SELECT
  field1
FROM
  mydataset.table2
WHERE
  _PARTITIONTIME = (SELECT MAX(timestamp) FROM mydataset.table1)

Better performance with pseudo columns

To improve query performance, use the _PARTITIONTIME pseudo column by itself on the left side of a comparison. For example, the following queries process the same amount of data, but the second example can provide better performance.

Example 1: The following query can be slower because it combines the pseudo column value with other operations in the WHERE filter.

standard SQL

#standardSQL
/* Can be slower */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 5 DAY) > TIMESTAMP("2016-04-15")

legacy SQL

#legacySQL
/* Can be slower */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  DATE_ADD(_PARTITIONTIME, 5, "DAY") > TIMESTAMP("2016-04-15")

Example 2: The following query can perform better because it places the pseudo column by itself on the left side of the filter comparison.

standard SQL

#standardSQL
/* Often performs better */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP('2016-04-15'), INTERVAL 5 DAY)

legacy SQL

#legacySQL
/* Often performs better */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  _PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")

Depending on the table size, the second query, which places _PARTITIONTIME by itself on the left side of the > comparison operator, can provide better performance than the first query. Because the queries process the same amount of data, the number of bytes billed is the same in both cases.

Querying ingestion-time partitioned tables using a wildcard table

In addition to using the pseudo columns to limit the number of partitions scanned during a query, you can also use the pseudo columns to query a range of partitioned tables using a wildcard table. For information on using a wildcard table with partitioned tables, see Scanning a range of partitioned tables using _PARTITIONTIME.

Querying ingestion-time partitioned tables using time zones

The value of _PARTITIONTIME is based on the UTC date when the field is populated, which means that partitions are divided based on 12:00 AM UTC. If you want to query data based on a time zone other than UTC, you should choose one of the following options before you start loading data into your table.

There are two ways to query data in a partitioned table using a custom, non-UTC, time zone. You can either create a separate timestamp column or you can use partition decorators to load data into a specific partition.

If you use a timestamp column, you can use the default UTC-based partitioning and account for time zone differences in your SQL queries. Alternately, if you prefer to have partitions that are grouped by a time zone other than UTC, use partition decorators to load data into partitions based on a different time zone.

Querying time zones using a timestamp column

To adjust for timezones using a timestamp, create a separate column to store a timestamp that enables you to address rows by the hour or minute.

To query for data based on a timezone other than UTC, use both the _PARTITIONTIME pseudo column and your custom timestamp column. Using _PARTITIONTIME limits the table scan to the relevant partitions, and your custom timestamp further limits the results to your timezone. For example, to query data from a partitioned table (mydataset.partitioned_table) with a timestamp field [MY_TIMESTAMP_FIELD] for data added to the table between 2016-05-01 08:00:00 PST and 2016-05-05 14:00:00 PST:

standard SQL

#standardSQL
SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND TIMESTAMP_ADD([MY_TIMESTAMP_FIELD], INTERVAL 8 HOUR) BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

legacy SQL

#legacySQL
SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND DATE_ADD([MY_TIMESTAMP_FIELD], 8, 'HOUR') BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

Creating a view using an ingestion-time partitioned table's pseudo columns

To limit the amount of data read by a query to a set of partitions, create a view that contains a filter on the _PARTITIONTIME or _PARTITIONDATE pseudo column. For example, the following query can be used to create a view that includes only the most recent seven days of data from a table named mydataset.partitioned_table:

standard SQL

#standardSQL
SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);

legacy SQL

#legacySQL
SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 7 * 60 * 60 * 24 * 1000000))
  AND TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));

For information about creating views, see Creating a view.

Ingestion-time partitioned tables' __UNPARTITIONED__ partition

The __UNPARTITIONED__ partition temporarily holds data that is streamed to a partitioned table while it is in the streaming buffer. Data that is streamed directly to a specific partition of a partitioned table does not use the __UNPARTITIONED__ partition. Instead, the data is streamed directly to the partition. See streaming into partitioned tables for more information.

To query data in the __UNPARTITIONED__ partition, use the _PARTITIONTIME pseudo column with the NULL value. For example:

SELECT
  [COLUMN]
FROM
  [DATASET].[TABLE]
WHERE
  _PARTITIONTIME IS NULL

Where:

  • [COLUMN] is the name of a column to query. You can specify multiple columns as a comma-separated list.
  • [DATASET] is the dataset containing the partitioned table.
  • [TABLE] is the partitioned table.

Querying partitioned tables

Tables partitioned based on a TIMESTAMP or DATE column do not have pseudo columns. To limit the number of partitions scanned when querying partitioned tables, use a predicate filter (a WHERE clause).

Pruning (limiting) partitions

Express the predicate filter as closely as possible to the table identifier. Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.

For example, the following query prunes partitions:

#standardSQL
SELECT
  t1.name,
  t2.category
FROM
  table1 t1
INNER JOIN
  table2 t2
ON t1.id_field = t2 field2
WHERE
  t1.ts = CURRENT_TIMESTAMP()

The following query does not prune partitions (note the use of a subquery):

#standardSQL
SELECT
  t1.name,
  t2.category
FROM
  table1 t1
INNER JOIN
  table2 t2
ON
  t1.id_field = t2.field2
WHERE
  t1.ts = (SELECT timestamp from table3 where key = 2)

Isolate the partition column in your filter

Isolate the partition column when expressing a filter. Filters that require data from multiple fields to compute will not prune partitions. For example, a query with a date comparison using the partitioning column and a second field, or queries containing some field concatenations will not prune partitions.

For example, the following filter does not prune partitions because it requires a computation based on the partitioning ts field and a second field ts2:

WHERE TIMESTAMP_ADD(ts, INTERVAL 6 HOUR) > ts2

Next steps

¿Te sirvió esta página? Envíanos tu opinión:

Enviar comentarios sobre…

¿Necesitas ayuda? Visita nuestra página de asistencia.