BigQuery supports loading externally partitioned data in Avro, Parquet, ORC, CSV, and JSON formats stored on Cloud Storage using a default hive partitioning layout.
Limitations
- Hive partitioning support is built assuming a common source URI prefix for all URIs that ends immediately before partition encoding, which looks like gs://<bucket>/<path_to_table>/.
- The directory structure of a hive partitioned table is assumed to have the same partitioning keys appear in the same order, with a maximum of ten partition keys per table.
- The data must follow a default hive partitioning layout.
- The hive partitioning keys and the columns in the underlying files cannot overlap.
- Support is currently limited to the CLI and REST API.
- All limitations for loading from Cloud Storage apply.
Supported data layouts
The data must follow a default hive partitioning layout. For instance, the following files follow the default layout: the key/value pairs are laid out as directories with an = sign as a separator, and the partition keys are always in the same order.
gs://bucket/table/dt=2019-10-31/lang=en/foo
gs://bucket/table/dt=2018-10-31/lang=fr/bar
Unsupported data layouts
If the partition key names are not encoded in the directory path, partition schema detection will fail. For example, consider the following path, which does not encode the partition key names:
gs://bucket/table/2019-10-31/en/foo
Files whose partition keys are not encoded in a consistent order also fail detection. For example, consider the following two files with inverted partition key encodings:
gs://bucket/table/dt=2019-10-31/lang=en/foo
gs://bucket/table/lang=fr/dt=2018-10-31/bar
Partition schema detection modes
Hive partition keys appear as normal columns when loading data from Cloud Storage. We support three modes of hive partition schema detection:
- AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
- STRINGS: Key names are automatically inferred with STRING type.
- CUSTOM: Partition key schema is encoded in the source URI prefix.
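For example, given the file gs://bucket/table/dt=2019-10-31/lang=en/foo with the source URI prefix gs://bucket/table/, AUTO detection would infer dt as a DATE and lang as a STRING, while STRINGS detection would infer both keys as STRING columns.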
Providing a custom partition key schema
CUSTOM detection requires encoding the schema in the source URI prefix field by providing partition key name/type pairs. The values must validly parse as the specified type or the load fails.
For example, setting source_uri_prefix to:
gs://my_bucket/my_table/{dt:DATE}/{val:STRING}
treats val as a STRING and dt as a DATE, and extracts gs://my_bucket/my_table as the source URI prefix for the matched files.
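As a sketch, a full load command using this custom schema might look like the following; the destination my_dataset.my_table and the Parquet source format are assumptions for illustration:

bq load --source_format=PARQUET --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=gs://my_bucket/my_table/{dt:DATE}/{val:STRING} \
  my_dataset.my_table gs://my_bucket/my_table/*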
See the detailed instructions below for the CLI and REST API.
Performing incremental loads
Consider the following data layout:
gs://my_bucket/my_table/dt=2019-10-31/val=1/file1
gs://my_bucket/my_table/dt=2018-10-31/val=2/file2
gs://my_bucket/my_table/dt=2017-10-31/val=3/file3
gs://my_bucket/my_table/dt=2016-10-31/val=4/file4
To load only data from 2019-10-31:
- Set the hive partitioning mode to AUTO, STRINGS, or CUSTOM.
- Set the source URI prefix to gs://my_bucket/my_table/ for the AUTO or STRINGS hive partitioning modes. For CUSTOM, provide gs://my_bucket/my_table/{dt:DATE}/{val:INTEGER}.
- Use the URI gs://my_bucket/my_table/dt=2019-10-31/*.
- Data is loaded with the dt and val columns included, with values 2019-10-31 and 1, respectively (see the example command below).
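For instance, with AUTO mode the steps above might translate into the following command; the destination my_dataset.my_table and the Parquet source format are hypothetical:

bq load --source_format=PARQUET --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://my_bucket/my_table/ \
  my_dataset.my_table gs://my_bucket/my_table/dt=2019-10-31/*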
To load only data from specific files:
- Set the hive partitioning mode to AUTO, STRINGS, or CUSTOM.
- Set the source URI prefix to gs://my_bucket/my_table/ for the AUTO or STRINGS hive partitioning modes. For CUSTOM, provide gs://my_bucket/my_table/{dt:DATE}/{val:INTEGER}.
- Use the URIs gs://my_bucket/my_table/dt=2017-10-31/val=3/file3 and gs://my_bucket/my_table/dt=2016-10-31/val=4/file4.
- Data is loaded from both files with the dt and val columns filled in (see the example command below).
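A sketch using CUSTOM mode and the two explicit file URIs; the destination table and Parquet source format are assumptions:

bq load --source_format=PARQUET --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=gs://my_bucket/my_table/{dt:DATE}/{val:INTEGER} \
  my_dataset.my_table \
  gs://my_bucket/my_table/dt=2017-10-31/val=3/file3,gs://my_bucket/my_table/dt=2016-10-31/val=4/file4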
Setting HivePartitioningOptions using the CLI
Requesting automatic partition key type detection
bq load --source_format=ORC --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix \
  dataset.table gcs_uris
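For example, with the dt/lang layout shown under "Supported data layouts", the placeholders might be filled in as follows; the destination my_dataset.my_table is hypothetical:

bq load --source_format=ORC --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://bucket/table/ \
  my_dataset.my_table gs://bucket/table/*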
Requesting string-typed partition key detection
bq load --source_format=PARQUET --autodetect \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix \
  dataset.table gcs_uris
Providing a custom partition key schema encoded via the source_uri_prefix field
bq load --source_format=JSON --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix/custom_schema_encoding \
  dataset.table gcs_uris file_schema
The partition key schema is encoded immediately following the source URI prefix. Use the following format to specify --hive_partitioning_source_uri_prefix:
--hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix/{key1:TYPE1}/{key2:TYPE2}/{key3:TYPE3}
Setting HivePartitioningOptions using the REST API
Hive partitioning is supported by setting HivePartitioningOptions on the JobConfigurationLoad.
Note that when hivePartitioningOptions.mode is set to CUSTOM, you must encode the partition key schema within the source URI prefix by setting hivePartitioningOptions.sourceUriPrefix to gs://bucket/path_to_table/{key1:TYPE1}/{key2:TYPE2}/{key3:TYPE3}.
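For reference, here is a minimal sketch of the load configuration passed to jobs.insert with CUSTOM mode; the project, dataset, table, and bucket names and the PARQUET source format are placeholders, not prescribed values:

{
  "configuration": {
    "load": {
      "sourceUris": ["gs://my_bucket/my_table/*"],
      "destinationTable": {
        "projectId": "my_project",
        "datasetId": "my_dataset",
        "tableId": "my_table"
      },
      "sourceFormat": "PARQUET",
      "hivePartitioningOptions": {
        "mode": "CUSTOM",
        "sourceUriPrefix": "gs://my_bucket/my_table/{dt:DATE}/{val:INTEGER}"
      }
    }
  }
}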