Querying partitioned tables
You can query partitioned tables by:
- Using the Cloud Console
- Using the
bq
command-line tool'sbq query
command - Calling the jobs.insert API method and configuring a query job
- Using the client libraries
For more information on running queries, see Running interactive and batch queries.
Required permissions
At a minimum, to query a table, you must be granted
bigquery.tables.getData
permissions.
The following predefined IAM roles include
bigquery.tables.getData
permissions:
bigquery.dataViewer
bigquery.dataOwner
bigquery.dataEditor
bigquery.admin
In addition, if a user has bigquery.datasets.create
permissions, when that
user creates a dataset, they are granted bigquery.dataOwner
access to it.
bigquery.dataOwner
access gives the user the ability to query tables and views
in the dataset.
You must also be granted bigquery.jobs.create
permissions to run query jobs.
The following predefined IAM roles include bigquery.jobs.create
permissions:
bigquery.user
bigquery.jobUser
bigquery.admin
For more information on IAM roles and permissions in BigQuery, see Access control.
Ingestion-time partitioned table pseudo columns
When you create an ingestion-time partitioned table,
two pseudo columns are added to the table: a _PARTITIONTIME
pseudo column and
a _PARTITIONDATE
pseudo column. The _PARTITIONTIME
pseudo column contains a
date-based timestamp for data that is loaded into the table. The
_PARTITIONDATE
pseudo column contains a date representation. Both pseudo
column names are reserved, which means that you cannot create a column with
either name in any of your tables.
_PARTITIONTIME
and _PARTITIONDATE
are available only in ingestion-time
partitioned tables. Partitioned tables do not have pseudo columns. For
information on querying partitioned tables, see
Querying partitioned tables.
The _PARTITIONTIME pseudo column
The _PARTITIONTIME
pseudo column contains a timestamp that is based on UTC
time and represents the number of microseconds since the unix epoch. For
example, if data is appended to a table on April 15, 2016, 08:15:00 UTC, all of
the rows of data that are appended on that day have the _PARTITIONTIME
column
that contains one of the following values:
+ TIMESTAMP("2016-04-15 08:00:00")
for hourly partitioned tables.
+ TIMESTAMP("2016-04-15")
for daily partitioned tables.
+ TIMESTAMP("2016-04-01")
for monthly partitioned tables.
+ TIMESTAMP("2016-01-01")
for yearly partitioned tables.
To query the _PARTITIONTIME
pseudo column, you must use an alias. For example,
the following query selects _PARTITIONTIME
by assigning the alias pt
to
the pseudo column:
SELECT _PARTITIONTIME AS pt, column FROM dataset.table
Where:
- column is the name of a column to query. You can specify multiple columns as a comma-separated list.
- dataset is the dataset containing the partitioned table.
- table is the partitioned table.
Data in the streaming buffer has
NULL
values in the _PARTITIONTIME
column.
The _PARTITIONDATE pseudo column
The _PARTITIONDATE
pseudo column contains the UTC date corresponding to the
value in the _PARTITIONTIME
pseudo column. This column is not supported in
hourly, monthly, or yearly partitioned tables.
To query the _PARTITIONDATE
pseudo column, you must use an alias. For example,
the following query selects _PARTITIONDATE
by assigning the alias pd
to
the pseudo column:
SELECT _PARTITIONDATE AS pd, column FROM dataset.table
Where:
- column is the name of a column to query. You can specify multiple columns as a comma-separated list.
- dataset is the dataset containing the partitioned table.
- table is the partitioned table.
Data in the streaming buffer has
NULL
values in the _PARTITIONDATE
column.
Querying ingestion-time partitioned tables using pseudo columns
When you query data in ingestion-time partitioned tables, you reference specific
partitions by specifying the values in the _PARTITIONTIME
or _PARTITIONDATE
pseudo columns. For example:
_PARTITIONTIME >= "2018-01-29 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00"
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
or
_PARTITIONDATE >= "2018-01-29" AND _PARTITIONDATE < "2018-01-30"
_PARTITIONDATE BETWEEN '2016-01-01' AND '2016-01-02'
Limiting partitions queried using pseudo columns
You can use the _PARTITIONTIME
and _PARTITIONDATE
pseudo columns to limit the number
of partitions scanned during a query. This is also referred to as pruning
partitions. Partition pruning is the mechanism BigQuery uses to
eliminate unnecessary partitions from the input scan. The pruned partitions are
not included when calculating the bytes scanned by the query, reducing the
on-demand analysis cost. In general, partition pruning helps reduce query cost
when the filters consist of constant expressions, which can be evaluated at the outset of the query.
For example, the following query scans only the partitions between the dates January 1, 2016 and January 2, 2016 from the partitioned table:
_PARTITIONTIME
SELECT column FROM dataset.table WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
_PARTITIONDATE
SELECT column FROM dataset.table WHERE _PARTITIONDATE BETWEEN '2016-01-01' AND '2016-01-02'
Partition pruning examples
This example demonstrates limiting the number of scanned partitions using a pseudo column filter in a subquery:
_PARTITIONTIME
SELECT column1, column2 FROM ( SELECT column1, column2 FROM dataset.table WHERE _PARTITIONTIME = TIMESTAMP('2016-03-28')) t1 CROSS JOIN dataset.table t2 WHERE t1.column2 = "one"
_PARTITIONDATE
SELECT column1, column2 FROM ( SELECT column1, column2 FROM dataset.table WHERE _PARTITIONDATE = '2016-03-28') t1 CROSS JOIN dataset.table t2 WHERE t1.column2 = "one"
The following query limits the partitions that are queried, based on the first filter
condition in the WHERE
clause. However, the second filter condition in the WHERE
clause does
not limit the partitions that are queried because it uses table values, which are dynamic.
_PARTITIONTIME
SELECT column FROM dataset.table2 WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01') AND _PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)
_PARTITIONDATE
SELECT column FROM dataset.table2 WHERE _PARTITIONDATE BETWEEN '2017-01-01' AND '2017-03-01' AND _PARTITIONDATE = (SELECT MAX(date) from dataset.table1)
Pseudo column queries that scan all partitions
The following examples use pseudo columns but scan all the partitions in a time-unit partitioned table.
In legacy SQL, the _PARTITIONTIME
filter works only when the filter is
specified as closely as possible to the table name. For example, the following
query scans all partitions in table1
despite the presence of the
_PARTITIONTIME
filter:
#legacySQL # Scans all partitions on t1 SELECT t1.field1, t2.field1 FROM dataset.table1 t1 CROSS JOIN dataset.table2 t2 WHERE table1._PARTITIONTIME = TIMESTAMP('2016-03-28') AND t1.field1 = "one"
Do not include any other columns in a _PARTITIONTIME
filter. For example, the
following query does not limit the partitions that are scanned because field1
is a column in the table and BigQuery can't determine in advance
which partitions to select. For more information, see
Pruning (limiting) partitions.
# Scans all partitions of table2 SELECT field1 FROM dataset.table2 WHERE _PARTITIONTIME + field1 = TIMESTAMP('2016-03-28');
Better performance with pseudo columns
To improve query performance, use the _PARTITIONTIME
pseudo column by itself
on the left side of a comparison. For example, the following queries process
the same amount of data, but the second example can provide better performance.
Example 1: The following query can be slower because it combines the pseudo
column value with other operations in the WHERE
filter.
Standard SQL
#standardSQL /* Can be slower */ SELECT field1 FROM dataset.table1 WHERE TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 5 DAY) > TIMESTAMP("2016-04-15")
Legacy SQL
#legacySQL /* Can be slower */ SELECT field1 FROM dataset.table1 WHERE DATE_ADD(_PARTITIONTIME, 5, "DAY") > TIMESTAMP("2016-04-15")
Example 2: The following query can perform better because it places the pseudo column by itself on the left side of the filter comparison.
Standard SQL
#standardSQL /* Often performs better */ SELECT field1 FROM dataset.table1 WHERE _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP('2016-04-15'), INTERVAL 5 DAY)
Legacy SQL
#legacySQL /* Often performs better */ SELECT field1 FROM dataset.table1 WHERE _PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
Depending on the table size, the second query, which places _PARTITIONTIME
by itself on the left side of the >
comparison operator, can provide better
performance than the first query. Because the queries process the same amount of
data, the number of bytes billed is the same in both cases.
Querying ingestion-time partitioned tables using a wildcard table
In addition to using the pseudo columns to limit the number of partitions scanned during a query, you can also use the pseudo columns to query a range of partitioned tables using a wildcard table. For information on using a wildcard table with partitioned tables, see Scanning a range of partitioned tables using _PARTITIONTIME.
Querying ingestion-time partitioned tables using time zones
The value of _PARTITIONTIME
is based on the UTC date when the field is
populated, which means that partitions are divided based on 12:00 AM UTC. If you
want to query data based on a time zone other than UTC, you should choose one of
the following options before you start loading data into your table.
There are two ways to query data in a partitioned table using a custom, non-UTC, time zone. You can either create a separate timestamp column or you can use partition decorators to load data into a specific partition.
If you use a timestamp column, you can use the default UTC-based partitioning and account for time zone differences in your SQL queries. Alternately, if you prefer to have partitions that are grouped by a time zone other than UTC, use partition decorators to load data into partitions based on a different time zone.
Querying time zones using a timestamp column
To adjust for timezones using a timestamp, create a separate column to store a timestamp that enables you to address rows by the hour or minute.
To query for data based on a timezone other than UTC, use both the
_PARTITIONTIME
pseudo column and your custom timestamp column.
Using _PARTITIONTIME
limits the table scan to the relevant partitions, and
your custom timestamp further limits the results to your timezone. For example,
to query data from a partitioned table (mydataset.partitioned_table
)
with a timestamp field MY_TIMESTAMP_FIELD
for data added to the table between 2016-05-01 12:00:00 PST
and
2016-05-05 14:00:00 PST
:
Standard SQL
#standardSQL SELECT field1 FROM dataset.partitioned_table WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01") AND TIMESTAMP("2016-05-06") AND TIMESTAMP_ADD(MY_TIMESTAMP_FIELD, INTERVAL 8 HOUR) BETWEEN TIMESTAMP("2016-05-01 12:00:00") AND TIMESTAMP("2016-05-05 14:00:00");
Legacy SQL
#legacySQL SELECT field1 FROM dataset.partitioned_table WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01") AND TIMESTAMP("2016-05-06") AND DATE_ADD(MY_TIMESTAMP_FIELD, 8, 'HOUR') BETWEEN TIMESTAMP("2016-05-01 12:00:00") AND TIMESTAMP("2016-05-05 14:00:00");
Creating a view using an ingestion-time partitioned table's pseudo columns
To limit the amount of data read by a query to a set of partitions, create a
view that contains a filter on the _PARTITIONTIME
or _PARTITIONDATE
pseudo
column. For example, the following query can be used to create a view that
includes only the most recent seven days of data from a table named
dataset.partitioned_table
:
Standard SQL
#standardSQL SELECT * FROM dataset.partitioned_table WHERE _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY);
Legacy SQL
#legacySQL SELECT * FROM dataset.partitioned_table WHERE _PARTITIONTIME BETWEEN TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 7 * 60 * 60 * 24 * 1000000)) AND TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));
For information about creating views, see Creating views.
Ingestion-time partitioned tables' _UNPARTITIONED_ partition
The __UNPARTITIONED__
partition temporarily holds data that is streamed to a
partitioned table while it is in the streaming buffer. Data that is streamed
directly to a specific partition of a partitioned table does not use the
__UNPARTITIONED__
partition. Instead, the data is streamed directly to the
partition. See streaming into partitioned tables
for more information.
To query data in the __UNPARTITIONED__
partition, use the _PARTITIONTIME
pseudo column with the NULL
value. For example:
SELECT column FROM dataset.table WHERE _PARTITIONTIME IS NULL
Where:
- column is the name of a column to query. You can specify multiple columns as a comma-separated list.
- dataset is the dataset containing the partitioned table.
- table is the partitioned table.
Querying partitioned tables
Tables partitioned based on a TIMESTAMP
, DATE
, DATETIME
, or INTEGER
column do not have pseudo columns. To limit the number of partitions scanned
when querying partitioned tables, use a predicate filter (a WHERE
clause).
Filters on the partitioning column will be used to prune the partitions and
reduce the query cost.
Hourly, monthly, and yearly partitioned tables can only be queried through Standard SQL.
When you create a partitioned table, you can require the use of predicate
filters by enabling the Require partition filter option. When this option
is applied, attempts to query the partitioned table without specifying a WHERE
clause produce the following error:
Cannot query over table 'project_id.dataset.table' without a
filter that can be used for partition elimination
.
For more information on adding the Require partition filter option when you create a partitioned table, see Creating partitioned tables.
If you do not enable the Require partition filter option when you create a partitioned table, you can update the table to add the option.
Pruning (limiting) partitions
To limit the partitions that are scanned in a query, use a constant expression in your filter. If you use dynamic expressions in your query filter, BigQuery must scan all of the partitions.
For example, the following query prunes partitions because the
filter,
WHERE t1.ts=CURRENT_TIMESTAMP()
, contains a constant expression:
#standardSQL SELECT t1.name, t2.category FROM table1 t1 INNER JOIN table2 t2 ON t1.id_field = t2.field2 WHERE t1.ts = CURRENT_TIMESTAMP()
However, the following query doesn't prune partitions, because the filter,
WHERE t1.ts = (SELECT timestamp from table where key = 2)
, is not
a constant expression; it depends on the dynamic values of the timestamp
and
key
fields:
#standardSQL SELECT t1.name, t2.category FROM table1 t1 INNER JOIN table2 t2 ON t1.id_field = t2.field2 WHERE t1.ts = (SELECT timestamp from table3 where key = 2)
Isolate the partition column in your filter
Isolate the partition column when expressing a filter. Filters that require data from multiple fields to compute will not prune partitions. For example, a query with a date comparison using the partitioning column and a second field, or queries containing some field concatenations will not prune partitions.
For example, the following filter does not prune partitions because it
requires a computation based on the partitioning ts
field and a second field
ts2
:
WHERE TIMESTAMP_ADD(ts, INTERVAL 6 HOUR) > ts2
Next steps
- For an overview of partitioned tables, see Introduction to partitioned tables.
- To learn more about ingestion-time partitioned tables, see Creating and using ingestion-time partitioned tables.
- To learn more about column-based time partitioned tables, see Creating and using date/timestamp/datetime partitioned tables.
- To learn more about integer-range partitioned tables, see Creating and using integer-range partitioned tables.