
Querying externally partitioned data

BigQuery supports querying externally partitioned data in Avro, Parquet, ORC, JSON, and CSV formats that is stored on Cloud Storage using a default Hive partitioning layout. Hive partitioning support is enabled by setting the appropriate options in the table definition file. You can create and modify tables using the Google Cloud console, bq command-line tool, and BigQuery API.

For instructions on querying managed partitioned tables, see Introduction to partitioned tables.

Limitations

  • Hive partitioning support is built assuming a common source URI prefix for all URIs that ends immediately before partition encoding, as follows: gs://BUCKET/PATH_TO_TABLE/.
  • All directories of a Hive partitioned table are assumed to use the same partitioning keys in the same order, with a maximum of ten partition keys per table.
  • The data must follow a default Hive partitioning layout.
  • The Hive partitioning keys and the columns in the underlying files cannot overlap.
  • All limitations for querying external data sources stored on Cloud Storage apply.
  • Support is for Google Standard SQL only.

Supported data layouts

The data must follow a default Hive partitioned layout. For example, the following files follow the default layout—the key-value pairs are configured as directories with an = sign as a separator, and the partition keys are always in the same order:

gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar

The common source URI prefix in this example is gs://myBucket/myTable.

Unsupported data layouts

If the partition key names are not encoded in the directory path, partition schema detection fails. For example, consider the following path, which does not encode the partition key names:

gs://myBucket/myTable/2019-10-31/en/foo

Files whose partition keys are not encoded in a consistent order also fail detection. For example, consider the following two files with inverted partition key encodings:

gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/lang=fr/dt=2018-10-31/bar

For these files, partition schema detection would fail.

Partition schema detection modes

Hive partition keys appear as normal columns when querying data from Cloud Storage. BigQuery supports three modes of Hive partition schema detection:

  • AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
  • STRINGS: Key names are automatically detected, and all keys are treated as STRING type.
  • CUSTOM: Partition key schema is encoded as specified in the source URI prefix.
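
For example, with the supported layout shown earlier (partition keys dt and lang), the detected keys can be referenced like any other column. The following is a minimal sketch that assumes a hypothetical external table mydataset.myTable was created over gs://myBucket/myTable with AUTO detection:

-- Sketch: the partition keys dt and lang surface as regular columns.
-- mydataset.myTable is a hypothetical table name.
SELECT dt, lang, COUNT(*) AS row_count
FROM mydataset.myTable
GROUP BY dt, lang;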

Providing a custom partition key schema

To use a CUSTOM schema, you must specify the schema in the source URI prefix field. Using a CUSTOM schema lets you specify the type for each partition key. The values must validly parse as the specified type or the query fails.

For example, if you set the source_uri_prefix flag to gs://myBucket/myTable/{dt:DATE}/{val:STRING}, BigQuery treats val as a STRING, dt as a DATE, and uses gs://myBucket/myTable as the source URI prefix for the matched files.

Partition pruning

BigQuery prunes partitions when possible using query predicates on the partition keys. This lets BigQuery avoid reading unnecessary files, which helps improve performance.
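
For example, with partition keys dt and lang from the earlier layout, a predicate on dt restricts the scan to the matching dt=... directories. This is a minimal sketch that assumes a hypothetical external table mydataset.myTable:

-- Sketch: the filter on the dt partition key lets BigQuery skip files
-- under directories whose dt value doesn't match.
-- mydataset.myTable is a hypothetical table name.
SELECT *
FROM mydataset.myTable
WHERE dt = DATE '2019-10-31'
  AND lang = 'en';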

Requiring predicate filters on partition keys in queries

When you create an externally partitioned table, you can require the use of predicate filters on partition keys by enabling the requirePartitionFilter option under HivePartitioningOptions.

When this option is enabled, attempts to query the externally partitioned table without specifying a WHERE clause produce the following error: Cannot query over table <table_name> without a filter over column(s) <partition key names> that can be used for partition elimination.
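
For example, if a hypothetical table mydataset.myTable was created with a dt partition key and the partition filter requirement enabled, a query like the following sketch satisfies the requirement:

-- Sketch: the WHERE clause filters on the dt partition key, which
-- satisfies the partition filter requirement and also enables pruning.
SELECT *
FROM mydataset.myTable
WHERE dt BETWEEN DATE '2019-10-01' AND DATE '2019-10-31';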

Creating an external table for Hive partitioned data

To create an external table for Hive partitioned data, choose one of the following options:

Console

  1. In the Google Cloud console, go to BigQuery.


  2. In the Explorer panel, expand your project and select a dataset.
  3. Click View actions, and then click Open.
  4. In the details panel, click Create table.
  5. On the Create table page, specify the following details:
    1. In the Source section, for Create table from, select Google Cloud Storage. Then, do the following:
      1. For Select file from Cloud Storage bucket, enter the path to the Cloud Storage folder, using wildcards. For example, my_bucket/my_files*. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
      2. From the File format list, select the file type.
      3. Select the Source data partitioning checkbox, and then for Select Source URI Prefix, enter the Cloud Storage URI prefix. For example, gs://my_bucket/my_files.
      4. In the Partition inference mode section, select one of the following options:
        • Automatically infer types: set the partition schema detection mode to AUTO.
        • All columns are strings: set the partition schema detection mode to STRINGS.
        • Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Provide a custom partition key schema.
      5. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
    2. In the Destination section, specify the following details:
      1. For Dataset name, select the dataset in which you want to create the table.
      2. In the Table field, enter the name of the table that you want to create.
      3. Verify that the Table type field is set to External table.
    3. In the Schema section, enter the schema definition.
    4. To enable schema auto-detection, select Auto detect.
    5. Click Create table.

SQL

Use the CREATE EXTERNAL TABLE DDL statement.

The following example uses automatic detection of Hive partition keys:

CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH PARTITION COLUMNS
OPTIONS (
  format = 'SOURCE_FORMAT',
  uris = ['GCS_URIS'],
  hive_partition_uri_prefix = 'GCS_URI_SHARED_PREFIX',
  require_hive_partition_filter = BOOLEAN);

Replace the following:

  • SOURCE_FORMAT: the format of the external data source, such as PARQUET
  • GCS_URIS: the path to the Cloud Storage folder, using wildcard format
  • GCS_URI_SHARED_PREFIX: the source URI prefix without the wildcard
  • BOOLEAN: whether to require a predicate filter at query time. This flag is optional. The default value is false.

The following example uses custom Hive partition keys and types by listing them in the WITH PARTITION COLUMNS clause:

CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
WITH PARTITION COLUMNS (PARTITION_COLUMN_LIST)
OPTIONS (
  format = 'SOURCE_FORMAT',
  uris = ['GCS_URIS'],
  hive_partition_uri_prefix = 'GCS_URI_SHARED_PREFIX',
  require_hive_partition_filter = BOOLEAN);

Replace the following:

  • PARTITION_COLUMN_LIST: a list of partition columns, in the same order in which they appear in the Cloud Storage folder path, in the following format:
KEY1 TYPE1, KEY2 TYPE2

The following example creates an externally partitioned table. It uses schema auto-detection to detect both the file schema and the Hive partitioning layout. If the external path is gs://bucket/path/field_1=first/field_2=1/data.parquet, the partition columns are detected as field_1 (STRING) and field_2 (INT64).

CREATE EXTERNAL TABLE dataset.AutoHivePartitionedTable
WITH PARTITION COLUMNS
OPTIONS (
  uris = ['gs://bucket/path/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://bucket/path',
  require_hive_partition_filter = false);

The following example creates an externally partitioned table by explicitly specifying the partition columns. This example assumes that the external file path has the pattern gs://bucket/path/field_1=first/field_2=1/data.parquet.

CREATE EXTERNAL TABLE dataset.CustomHivePartitionedTable
WITH PARTITION COLUMNS (
  field_1 STRING, -- column order must match the external path
  field_2 INT64)
OPTIONS (
  uris = ['gs://bucket/path/*'],
  format = 'PARQUET',
  hive_partition_uri_prefix = 'gs://bucket/path',
  require_hive_partition_filter = false);

bq

First, use the bq mkdef command to create a table definition file:

bq mkdef \
  --source_format=SOURCE_FORMAT \
  --hive_partitioning_mode=PARTITIONING_MODE \
  --hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
  --require_hive_partition_filter=BOOLEAN \
  GCS_URIS > DEFINITION_FILE

Where:

  • SOURCE_FORMAT is the format of the external data source. For example, CSV.
  • PARTITIONING_MODE is the Hive partitioning mode. Use one of the following values:
    • AUTO: Automatically detect the key names and types.
    • STRINGS: Automatically detect the key names and treat all keys as strings.
    • CUSTOM: Encode the key schema in the source URI prefix.
  • GCS_URI_SHARED_PREFIX is the source URI prefix.
  • BOOLEAN specifies whether to require a predicate filter at query time. This flag is optional. The default value is false.
  • GCS_URIS is the path to the Cloud Storage folder, using wildcard format.
  • DEFINITION_FILE is the path to the table definition file on your local machine.

If PARTITIONING_MODE is CUSTOM, include the partition key schema in the source URI prefix, using the following format:

--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...

After you create the table definition file, use the bq mk command to create the external table:

bq mk --external_table_definition=DEFINITION_FILE \
  DATASET_NAME.TABLE_NAME \
  SCHEMA

Where:

  • DEFINITION_FILE is the path to the table definition file.
  • DATASET_NAME is the name of the dataset that contains the table.
  • TABLE_NAME is the name of the table you're creating.
  • SCHEMA specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.

Examples

The following example uses AUTO Hive partitioning mode:

bq mkdef --source_format=CSV \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://myBucket/myTable \
  gs://myBucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses STRINGS Hive partitioning mode:

bq mkdef --source_format=CSV \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://myBucket/myTable \
  gs://myBucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses CUSTOM Hive partitioning mode:

bq mkdef --source_format=CSV \
  --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=gs://myBucket/myTable/{dt:DATE}/{val:STRING} \
  gs://myBucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

API

To set Hive partitioning using the BigQuery API, include a hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: gs://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.HivePartitioningOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create external table using hive partitioning
public class SetHivePartitioningOptions {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/customlayout/*";
    String sourceUriPrefix =
        "gs://cloud-samples-data/bigquery/hive-partitioning-samples/customlayout/{pkey:STRING}/";
    setHivePartitioningOptions(datasetName, tableName, sourceUriPrefix, sourceUri);
  }

  public static void setHivePartitioningOptions(
      String datasetName, String tableName, String sourceUriPrefix, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Configuring partitioning options
      HivePartitioningOptions hivePartitioningOptions =
          HivePartitioningOptions.newBuilder()
              .setMode("CUSTOM")
              .setRequirePartitionFilter(true)
              .setSourceUriPrefix(sourceUriPrefix)
              .build();

      TableId tableId = TableId.of(datasetName, tableName);
      ExternalTableDefinition customTable =
          ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.parquet())
              .setAutodetect(true)
              .setHivePartitioningOptions(hivePartitioningOptions)
              .build();
      bigquery.create(TableInfo.of(tableId, customTable));
      System.out.println("External table created using hivepartitioningoptions");
    } catch (BigQueryException e) {
      System.out.println("External table was not created" + e.toString());
    }
  }
}

Updating externally partitioned tables

You cannot change the partition keys when you update a permanent external table.