BigQuery supports querying externally partitioned data in Avro, Parquet, ORC, JSON, and CSV formats that is stored on Cloud Storage using a default hive partitioning layout. Hive partitioning support is enabled by setting the appropriate options in the table definition file. You can create and modify tables using the Cloud Console, the bq command-line tool, and the BigQuery API.
For instructions on querying managed partitioned tables, see Introduction to partitioned tables.
Limitations
- Hive partitioning support is built assuming a common source URI prefix for all URIs that ends immediately before partition encoding, as follows: gs://BUCKET/PATH_TO_TABLE/.
- The directory structure of a hive partitioned table is assumed to have the same partitioning keys appear in the same order, with a maximum of ten partition keys per table.
- The data must follow a default hive partitioning layout.
- The hive partitioning keys and the columns in the underlying files cannot overlap.
- All limitations for querying external data sources stored on Cloud Storage apply.
- Support is for Standard SQL only.
Supported data layouts
The data must follow a default hive partitioning layout. For example, the following files follow the default layout: the key-value pairs are configured as directories with an = sign as a separator, and the partition keys are always in the same order:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar
The common source URI prefix in this example is gs://myBucket/myTable.
Unsupported data layouts
If the partition key names are not encoded in the directory path, partition schema detection fails. For example, consider the following path, which does not encode the partition key names:
gs://myBucket/myTable/2019-10-31/en/foo
Files in which the partition keys do not appear in a consistent order also fail detection. For example, consider the following two files with inverted partition key encodings:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/lang=fr/dt=2018-10-31/bar
For these files, partition schema detection would fail.
Partition schema detection modes
Hive partition keys appear as normal columns when querying data from Cloud Storage. BigQuery supports three modes of hive partition schema detection:
- AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
- STRINGS: Key names are automatically converted to STRING type.
- CUSTOM: The partition key schema is encoded as specified in the source URI prefix.
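Because detected keys surface as ordinary columns, you can select, group, and filter on them directly. The following is a minimal sketch using the BigQuery Java client library, assuming a hypothetical externally partitioned table myDataset.myExternalTable with the dt and lang keys from the layout example above:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class QueryPartitionKeys {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // dt and lang are hive partition keys, queried here like ordinary columns.
    String query =
        "SELECT dt, lang, COUNT(*) AS row_count "
            + "FROM `myDataset.myExternalTable` " // hypothetical table name
            + "GROUP BY dt, lang";
    for (FieldValueList row :
        bigquery.query(QueryJobConfiguration.newBuilder(query).build()).iterateAll()) {
      System.out.printf("%s %s %d%n",
          row.get("dt").getStringValue(),
          row.get("lang").getStringValue(),
          row.get("row_count").getLongValue());
    }
  }
}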
Providing a custom partition key schema
To use a CUSTOM schema, you must specify the schema in the source URI prefix field. Using a CUSTOM schema lets you specify the type for each partition key. The values must validly parse as the specified type or the query fails.
For example, if you set the source_uri_prefix flag to gs://myBucket/myTable/{dt:DATE}/{val:STRING}, BigQuery treats val as a STRING, dt as a DATE, and uses gs://myBucket/myTable as the source URI prefix for the matched files.
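In the Java client library, the same encoding goes into the sourceUriPrefix of a HivePartitioningOptions object. The following is a minimal sketch, assuming Parquet source files and the example bucket above; the wildcard source URI and the helper class name are assumptions for illustration:
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.HivePartitioningOptions;

public class CustomPartitionSchema {
  public static ExternalTableDefinition define() {
    // CUSTOM mode reads key names and types from the source URI prefix itself.
    HivePartitioningOptions options =
        HivePartitioningOptions.newBuilder()
            .setMode("CUSTOM")
            .setSourceUriPrefix("gs://myBucket/myTable/{dt:DATE}/{val:STRING}")
            .build();
    // The wildcard URI matches the partitioned files under the prefix.
    return ExternalTableDefinition.newBuilder("gs://myBucket/myTable/*", FormatOptions.parquet())
        .setAutodetect(true)
        .setHivePartitioningOptions(options)
        .build();
  }
}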
For detailed instructions for the Cloud Console, bq command-line tool, and BigQuery API, see Setting HivePartitioningOptions.
Partition pruning
BigQuery prunes partitions when possible using query predicates on the partition keys. This lets BigQuery avoid reading unnecessary files, which helps improve performance.
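For example, with the dt and lang layout shown earlier, a filter on dt lets BigQuery skip every file outside the matching dt=... directory. The following is a minimal sketch in Java, again assuming the hypothetical table myDataset.myExternalTable:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class PrunedQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // The predicate on the partition key dt restricts the scan to files
    // under gs://myBucket/myTable/dt=2019-10-31/.
    String query = "SELECT * FROM `myDataset.myExternalTable` WHERE dt = '2019-10-31'";
    bigquery.query(QueryJobConfiguration.newBuilder(query).build())
        .iterateAll()
        .forEach(System.out::println);
  }
}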
Requiring predicate filters on partition keys in queries
When you create an externally partitioned table, you can require the use of predicate filters on partition keys by enabling the requirePartitionFilter option under HivePartitioningOptions.
When this option is enabled, attempts to query the externally partitioned table without specifying a WHERE clause produce the following error:
Cannot query over table <table_name> without a filter over column(s) <partition key names> that can be used for partition elimination.
For more information about adding the requirePartitionFilter option when you create a partitioned table, see the instructions for the bq command-line tool and BigQuery API in Setting HivePartitioningOptions.
Setting HivePartitioningOptions
You can set hive partitioning options using the Cloud Console, the bq command-line tool, or the BigQuery API, as follows:
Console
To set hive partitioning options in the Cloud Console, do the following:
- Open the BigQuery page in the Cloud Console.
- In the Explorer panel, expand your project and select a dataset.
- Click Create table to create a new table.
- Select Cloud Storage from the Create table from drop-down list.
- In the Select file from Cloud Storage bucket field, enter your dataset's path, then select its file type from the drop-down list.
- Check the Source data partitioning box, then enter the source URI prefix in the provided field.
- Under Partition inference mode, choose "Automatically infer types" for AUTO or "All columns are strings" for STRINGS.
- Choose "External table" from the Table type drop-down.
bq
- To set automatic partition key detection, set the --hive_partitioning_mode flag to AUTO. For example:
bq mkdef --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
GCS_URIS > TABLE_DEF_FILE
- To set string-typed partition key detection, set the --hive_partitioning_mode flag to STRINGS. For example:
bq mkdef --source_format=PARQUET --hive_partitioning_mode=STRINGS \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
GCS_URIS > TABLE_DEF_FILE
- To specify a custom partition key schema using the --source_uri_prefix flag, set the --hive_partitioning_mode flag to CUSTOM. For example:
bq mkdef --source_format=NEWLINE_DELIMITED_JSON --hive_partitioning_mode=CUSTOM \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX/CUSTOM_SCHEMA_ENCODING \
GCS_URIS FILE_SCHEMA > TABLE_DEF_FILE
For the --hive_partitioning_source_uri_prefix flag, specify the CUSTOM_SCHEMA_ENCODING portion of the argument immediately following the GCS_URI_SHARED_PREFIX portion, as follows:
GCS_URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...
For example:
--hive_partitioning_source_uri_prefix=gs://myBucket/myTable/{dt:DATE}/{val:STRING}
- To require the use of a predicate filter at query time, add --require_hive_partition_filter=True to the bq mkdef command; for example:
bq mkdef --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
--require_hive_partition_filter=True \
GCS_URIS > TABLE_DEF_FILE
API
To set hive partitioning using the BigQuery API, include a hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file.
If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows:
gs://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...
To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
Setting HivePartitioningOptions using the client libraries
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
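The sample itself is not reproduced in this excerpt; the following is a minimal sketch of the flow it covers, assuming AUTO mode, Parquet source files, and placeholder bucket, dataset, and table names:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.HivePartitioningOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateHivePartitionedTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // AUTO mode: key names and types are detected from the directory layout.
    // requirePartitionFilter forces queries to filter on a partition key.
    HivePartitioningOptions hivePartitioningOptions =
        HivePartitioningOptions.newBuilder()
            .setMode("AUTO")
            .setRequirePartitionFilter(true)
            .setSourceUriPrefix("gs://myBucket/myTable") // placeholder prefix
            .build();

    // The wildcard matches every file under the shared prefix.
    ExternalTableDefinition tableDefinition =
        ExternalTableDefinition.newBuilder("gs://myBucket/myTable/*", FormatOptions.parquet())
            .setAutodetect(true)
            .setHivePartitioningOptions(hivePartitioningOptions)
            .build();

    // myDataset and myExternalTable are placeholders; the dataset must already exist.
    bigquery.create(TableInfo.of(TableId.of("myDataset", "myExternalTable"), tableDefinition));
    System.out.println("External hive-partitioned table created.");
  }
}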