BigQuery supports querying externally partitioned data in Avro, Parquet, ORC, JSON, and CSV formats that is stored on Cloud Storage using a default hive partitioning layout. Hive partitioning support is enabled by setting the appropriate options in the table definition file. You can create and modify tables using the Cloud Console, the bq command-line tool, and the BigQuery API.
For instructions on querying managed partitioned tables, see Introduction to partitioned tables.
Limitations
- Hive partitioning support is built assuming a common source URI prefix for all URIs that ends immediately before the partition encoding, as follows: gs://BUCKET/PATH_TO_TABLE/.
- The directory structure of a hive partitioned table is assumed to have the same partitioning keys appear in the same order, with a maximum of ten partition keys per table.
- The data must follow a default hive partitioning layout.
- The hive partitioning keys and the columns in the underlying files cannot overlap.
- All limitations for querying external data sources stored on Cloud Storage apply.
- Support is for Standard SQL only.
Supported data layouts
The data must follow a default hive partitioning layout. For example, the following files follow the default layout. The key-value pairs are configured as directories with an = sign as a separator, and the partition keys are always in the same order:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar
The common source URI prefix in this example is gs://myBucket/myTable.
Unsupported data layouts
If the partition key names are not encoded in the directory path, partition schema detection fails. For example, consider the following path, which does not encode the partition key names:
gs://myBucket/myTable/2019-10-31/en/foo
Files whose partition keys are not encoded in a consistent order also fail detection. For example, consider the following two files with inverted partition key encodings:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/lang=fr/dt=2018-10-31/bar
For these files, partition schema detection would fail.
Partition schema detection modes
Hive partition keys appear as normal columns when querying data from Cloud Storage. BigQuery supports three modes of hive partition schema detection:
- AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
- STRINGS: Key names are automatically converted to STRING type.
- CUSTOM: The partition key schema is encoded as specified in the source URI prefix.
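For example, given the path gs://myBucket/myTable/dt=2019-10-31/lang=en/foo from the layout example above:
- AUTO detects dt as a DATE key and lang as a STRING key.
- STRINGS treats both dt and lang as STRING keys.
- CUSTOM uses whatever types you encode in the source URI prefix, for example gs://myBucket/myTable/{dt:DATE}/{lang:STRING}.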
Providing a custom partition key schema
To use a CUSTOM schema, you must specify the schema in the source URI prefix field. Using a CUSTOM schema lets you specify the type for each partition key. The values must validly parse as the specified type, or the query fails.
For example, if you set the source_uri_prefix flag to gs://myBucket/myTable/{dt:DATE}/{val:STRING}, BigQuery treats val as a STRING, dt as a DATE, and uses gs://myBucket/myTable as the source URI prefix for the matched files.
Partition pruning
BigQuery prunes partitions when possible using query predicates on the partition keys. This lets BigQuery avoid reading unnecessary files, which helps improve performance.
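For example, suppose the files from the layout example above back an external table named mydataset.myTable (a hypothetical name) with partition keys dt and lang. In a query like the following sketch, the predicate on dt lets BigQuery skip every file outside the dt=2019-10-31 directories:

-- Only files under dt=2019-10-31/ need to be read;
-- all other dt=... partitions are pruned.
SELECT *
FROM mydataset.myTable
WHERE dt = DATE '2019-10-31';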
Requiring predicate filters on partition keys in queries
When you create an externally partitioned table, you can require the use of predicate filters on partition keys by enabling the requirePartitionFilter option under HivePartitioningOptions.
When this option is enabled, attempts to query the externally partitioned table without specifying a WHERE clause produce the following error:
Cannot query over table <table_name> without a filter over column(s) <partition key names> that can be used for partition elimination.
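For example, with this option enabled on a hypothetical table mydataset.myTable partitioned on dt, the first query below fails with that error, while the second succeeds:

-- Fails: no filter on the partition key dt.
SELECT * FROM mydataset.myTable;

-- Succeeds: the dt predicate can be used for partition elimination.
SELECT * FROM mydataset.myTable
WHERE dt = DATE '2019-10-31';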
Creating an external table for hive-partitioned data
You can create an external table for hive-partitioned data in the following ways:
- Using the Cloud Console.
- Using the bq command-line tool.
- Using the client libraries.
Console
Open the BigQuery page in the Cloud Console.
In the Explorer panel, expand your project and select a dataset.
Click Create table to create a new table.
On the Create table page, in the Source section, do the following:
- In the Create table from drop-down list, select Cloud Storage.
- In the Select file from Cloud Storage bucket field, enter the path to the Cloud Storage folder, using wildcard format. For example, my_bucket/my_files*.
- In the File format drop-down list, select the file type.
- Select the Source data partitioning checkbox.
- In the Select Source URI Prefix field, enter the Cloud Storage URI prefix. For example, gs://my_bucket/my_files.
- Select a Partition inference mode. If you select Provide my own, then enter schema information for the partition keys.
- Optional: To require that all queries against this table must use a predicate filter, check Require partition filter. For more information, see Requiring predicate filters on partition keys in queries.
On the Create table page, in the Destination section, do the following:
- Select the project name and the dataset name.
- In the Table type drop-down list, select External table.
- In the Table name field, enter a name for the external table.
On the Create table page, in the Schema section, enter the schema information. BigQuery supports schema auto-detection for some formats. For more information, see Using schema auto-detection.
Click Create table.
bq
- To set automatic partition key detection, set the --hive_partitioning_mode flag to AUTO. For example:

bq mkdef --source_format=ORC --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
  GCS_URIS > TABLE_DEF_FILE
- To set string-typed partition key detection, set the --hive_partitioning_mode flag to STRINGS. For example:

bq mkdef --source_format=PARQUET --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
  GCS_URIS > TABLE_DEF_FILE
- To specify a custom partition key schema using the --hive_partitioning_source_uri_prefix flag, set the --hive_partitioning_mode flag to CUSTOM. For example:

bq mkdef --source_format=NEWLINE_DELIMITED_JSON --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX/CUSTOM_SCHEMA_ENCODING \
  GCS_URIS FILE_SCHEMA > TABLE_DEF_FILE
For the --hive_partitioning_source_uri_prefix flag, specify the CUSTOM_SCHEMA_ENCODING portion of the argument immediately following the GCS_URI_SHARED_PREFIX portion, as follows:
GCS_URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...
For example:
--hive_partitioning_source_uri_prefix=gs://myBucket/myTable/{dt:DATE}/{val:STRING}
- To require the use of a predicate filter at query time, add --require_hive_partition_filter=True to the bq mkdef command. For example:

bq mkdef --source_format=ORC --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
  --require_hive_partition_filter=True \
  GCS_URIS > TABLE_DEF_FILE
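A table definition file on its own does not create a table. After generating it with one of the bq mkdef commands above, you can create the external table from it with the bq mk command, for example (DATASET.TABLE_NAME is a placeholder for your target table):

bq mk --external_table_definition=TABLE_DEF_FILE DATASET.TABLE_NAME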
API
To set hive partitioning using the BigQuery API, include a hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file.
If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows:
gs://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...
To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
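Putting these fields together, a minimal ExternalDataConfiguration sketch for a CUSTOM schema might look like the following; the bucket, path, and key names are placeholders:

{
  "sourceFormat": "PARQUET",
  "sourceUris": ["gs://myBucket/myTable/*"],
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://myBucket/myTable/{dt:DATE}/{lang:STRING}",
    "requirePartitionFilter": true
  }
}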
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
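The following is a minimal sketch using the google-cloud-bigquery Java client. The dataset, table, and Cloud Storage values are placeholders, and the sample assumes the HivePartitioningOptions and ExternalTableDefinition classes from com.google.cloud.bigquery:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.HivePartitioningOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateHivePartitionedExternalTable {
  public static void main(String[] args) {
    // Placeholder values; replace with your own dataset, table, and bucket.
    String datasetName = "mydataset";
    String tableName = "myTable";
    String sourceUri = "gs://myBucket/myTable/*";
    String sourceUriPrefix = "gs://myBucket/myTable/{dt:DATE}/{lang:STRING}";

    try {
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // CUSTOM mode: the partition key schema is encoded in the source URI prefix.
      HivePartitioningOptions hivePartitioningOptions =
          HivePartitioningOptions.newBuilder()
              .setMode("CUSTOM")
              .setRequirePartitionFilter(true)
              .setSourceUriPrefix(sourceUriPrefix)
              .build();

      // External table definition over the Cloud Storage files.
      ExternalTableDefinition tableDefinition =
          ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.parquet())
              .setAutodetect(true)
              .setHivePartitioningOptions(hivePartitioningOptions)
              .build();

      bigquery.create(TableInfo.of(TableId.of(datasetName, tableName), tableDefinition));
      System.out.println("External table created using hive partitioning options");
    } catch (BigQueryException e) {
      System.out.println("External table was not created: " + e);
    }
  }
}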