Loading externally partitioned data

BigQuery can load data that is stored in Cloud Storage using a Hive partitioning layout. Hive partitioning means that the external data is organized into multiple files, with a naming convention to separate files into different partitions. For more information, see Supported data layouts on this page.

Limitations

  • Hive partitioning support assumes a common source URI prefix for all URIs that ends immediately before the partition encoding, which looks like gs://<bucket>/<path_to_table>/.
  • The directory structure of a Hive partitioned table is assumed to present the same partitioning keys in the same order, with a maximum of ten partition keys for each table.
  • The data must follow a default Hive partitioning layout.
  • The Hive partitioning keys and the columns in the underlying files cannot overlap.
  • All limitations for loading from Cloud Storage apply.

Supported data layouts

The data must follow a default Hive partitioning layout. For example, the following files follow the default layout: the key-value pairs are laid out as directories with an = sign as a separator, and the partition keys are always in the same order.

gs://bucket/table/dt=2019-10-31/lang=en/foo
gs://bucket/table/dt=2018-10-31/lang=fr/bar

The common source URI prefix in this example is gs://bucket/table.

Unsupported data layouts

If the partition key names are not encoded in the directory path, partition schema detection fails. For example, consider the following path, which does not encode the partition key names:

gs://bucket/table/2019-10-31/en/foo

Files whose partition keys are not encoded in a consistent order also fail detection. For example, consider the following two files with inverted partition key encodings:

gs://bucket/table/dt=2019-10-31/lang=en/foo
gs://bucket/table/lang=fr/dt=2018-10-31/bar

Partition schema detection modes

Hive partition keys appear as normal columns when you load data from Cloud Storage. BigQuery supports three modes of Hive partition schema detection:

  • AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
  • STRINGS: Key names are automatically inferred with STRING type.
  • CUSTOM: Partition key schema is encoded in the source URI prefix.

Provide a custom partition key schema

CUSTOM detection requires you to encode the schema in the source URI prefix field as partition key name/type pairs. The values must parse correctly as the specified type, or the load fails.

For example, setting source_uri_prefix to: gs://my_bucket/my_table/{dt:DATE}/{val:STRING} treats val as a STRING and dt as a DATE, and extracts gs://my_bucket/my_table as the source URI prefix for the matched files.

For detailed instructions, see the Google Cloud console, bq command-line tool, and API sections later on this page.

Perform incremental loads

Consider the following data layout:

gs://my_bucket/my_table/dt=2019-10-31/val=1/file1
gs://my_bucket/my_table/dt=2018-10-31/val=2/file2
gs://my_bucket/my_table/dt=2017-10-31/val=3/file3
gs://my_bucket/my_table/dt=2016-10-31/val=4/file4

To load only data from 2019-10-31, do the following (a bq command sketch follows this list):

  • Set the Hive partitioning mode to AUTO, STRINGS, or CUSTOM.
  • Set the source URI prefix to gs://my_bucket/my_table/ for AUTO or STRINGS Hive partitioning modes. For CUSTOM, provide gs://my_bucket/my_table/{dt:DATE}/{val:INTEGER}.
  • Use the URI gs://my_bucket/my_table/dt=2019-10-31/*.
  • Data is loaded with dt and val columns included, with values 2019-10-31 and 1, respectively.
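
For example, the first scenario could look like the following bq command sketch. AUTO mode detects the dt and val keys; the destination table mydataset.mytable and the ORC source format are assumptions, not values defined on this page.

# Sketch: load only the dt=2019-10-31 partition; partition keys are detected automatically.
bq load --source_format=ORC \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://my_bucket/my_table/ \
  mydataset.mytable \
  "gs://my_bucket/my_table/dt=2019-10-31/*"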

To load only data from specific files, do the following (a bq command sketch follows this list):

  • Set the Hive partitioning mode to AUTO, STRINGS, or CUSTOM.
  • Set the source URI prefix to gs://my_bucket/my_table/ for AUTO or STRINGS Hive partitioning modes. For CUSTOM, provide gs://my_bucket/my_table/{dt:DATE}/{val:INTEGER}.
  • Use the URIs gs://my_bucket/my_table/dt=2017-10-31/val=3/file3 and gs://my_bucket/my_table/dt=2016-10-31/val=4/file4.
  • Data is loaded from both files with the dt and val columns filled in.
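
Similarly, the second scenario could look like the following sketch with STRINGS mode, in which case dt and val are loaded as STRING columns. The destination table mydataset.mytable and the PARQUET source format are assumptions, not values defined on this page.

# Sketch: load two specific files; partition keys are inferred as STRING columns.
bq load --source_format=PARQUET --autodetect \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://my_bucket/my_table/ \
  mydataset.mytable \
  "gs://my_bucket/my_table/dt=2017-10-31/val=3/file3,gs://my_bucket/my_table/dt=2016-10-31/val=4/file4"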

Load Hive partitioned data

To load Hive partitioned data, choose one of the following options:

Console

  1. In the Google Cloud console, go to BigQuery.

    Go to BigQuery

  2. In the Explorer panel, expand your project and select a dataset.
  3. Click View actions, and then click Open.
  4. In the details panel, click Create table.
  5. On the Create table page, specify the following details:
    1. In the Source section, for Create table from, select Google Cloud Storage. Then, do the following:
      1. For Select file from Cloud Storage bucket, enter the path to the Cloud Storage folder, using wildcards. For example, my_bucket/my_files*. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
      2. From the File format list, select the file type.
      3. Select the Source data partitioning checkbox, and then for Select Source URI Prefix, enter the Cloud Storage URI prefix. For example, gs://my_bucket/my_files.
      4. In the Partition inference mode section, select one of the following options:
        • Automatically infer types: set the partition schema detection mode to AUTO.
        • All columns are strings: set the partition schema detection mode to STRINGS.
        • Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Provide a custom partition key schema.
      5. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
    2. In the Destination section, specify the following details:
      1. For Dataset name, select the dataset in which you want to create the table.
      2. In the Table field, enter the name of the table that you want to create.
      3. Verify that the Table type field is set to Native table.
    3. In the Schema section, enter the schema definition.
    4. To enable schema auto-detection, select Auto detect.
    5. Click Create table.

bq

Load Hive partitioned data using automatic partition key type detection:

bq load --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix \
dataset.table gcs_uris

Load Hive partitioned data using string-typed partition key detection:

bq load --source_format=PARQUET --autodetect \
--hive_partitioning_mode=STRINGS \
--hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix \
dataset.table gcs_uris

Load Hive partitioned data using a custom partition key schema that is encoded using the source_uri_prefix field:

bq load --source_format=NEWLINE_DELIMITED_JSON --hive_partitioning_mode=CUSTOM \
--hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix/partition_key_schema \
dataset.table gcs_uris file_schema

The partition key schema is encoded immediately following the source URI prefix. Use the following format to specify --hive_partitioning_source_uri_prefix:

--hive_partitioning_source_uri_prefix=gcs_uri_shared_prefix/{key1:TYPE1}/{key2:TYPE2}/{key3:TYPE3}
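
For example, a CUSTOM load for the layout described earlier on this page could look like the following sketch. The destination table mydataset.mytable, the CSV source format, and the trailing file schema are assumptions, not values defined on this page.

# Sketch: CUSTOM mode with the partition key schema appended to the shared prefix.
bq load --source_format=CSV \
  --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix="gs://my_bucket/my_table/{dt:DATE}/{val:STRING}" \
  mydataset.mytable \
  "gs://my_bucket/my_table/*" \
  field1:STRING,field2:INTEGER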

API

To load Hive partitioned data by using the REST API, set HivePartitioningOptions in the JobConfigurationLoad object when you create a load job.
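
As a minimal sketch, a load job created with jobs.insert might set these options as follows. The project, dataset, table, bucket paths, and source format are assumptions, not values defined on this page.

# Sketch: create a load job with hivePartitioningOptions through the REST API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs" \
  -d '{
    "configuration": {
      "load": {
        "sourceFormat": "PARQUET",
        "sourceUris": ["gs://my_bucket/my_table/*"],
        "hivePartitioningOptions": {
          "mode": "AUTO",
          "sourceUriPrefix": "gs://my_bucket/my_table/"
        },
        "destinationTable": {
          "projectId": "PROJECT_ID",
          "datasetId": "mydataset",
          "tableId": "mytable"
        }
      }
    }
  }'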