Querying externally partitioned data

BigQuery supports querying externally partitioned data in Avro, Parquet, ORC, JSON, and CSV formats that is stored on Cloud Storage and uses a default hive partitioning layout. You enable hive partitioning support by setting the appropriate options in the table definition file. You can create and modify tables by using the Cloud Console, the bq command-line tool, or the BigQuery API.

For instructions on querying managed partitioned tables, see Introduction to partitioned tables.

Limitations

  • Hive partitioning support assumes a common source URI prefix for all URIs that ends immediately before the partition encoding, as follows: gs://BUCKET/PATH_TO_TABLE/.
  • The directory structure of a hive partitioned table is assumed to use the same partitioning keys in the same order throughout, with a maximum of ten partition keys per table.
  • The data must follow a default hive partitioning layout.
  • The hive partitioning keys and the columns in the underlying files cannot overlap.
  • All limitations for querying external data sources stored on Cloud Storage apply.
  • Support is for Standard SQL only.

Supported data layouts

The data must follow a default hive partitioning layout. For example, the following files follow the default layout: the key-value pairs are configured as directories with an = sign as a separator, and the partition keys are always in the same order:

gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar

The common source URI prefix in this example is gs://myBucket/myTable.

Unsupported data layouts

If the partition key names are not encoded in the directory path, partition schema detection fails. For example, consider the following path, which does not encode the partition key names:

gs://myBucket/myTable/2019-10-31/en/foo

Files whose partition keys are not encoded in a consistent order also fail detection. For example, partition schema detection fails for the following two files because their partition key encodings are inverted:

gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/lang=fr/dt=2018-10-31/bar


Partition schema detection modes

Hive partition keys appear as normal columns when querying data from Cloud Storage. BigQuery supports three modes of hive partition schema detection:

  • AUTO: Key names and types are automatically detected. The following types can be detected: STRING, INTEGER, DATE, and TIMESTAMP.
  • STRINGS: Key names are automatically converted to STRING type.
  • CUSTOM: Partition key schema is encoded as specified in the source URI prefix.

Providing a custom partition key schema

To use a CUSTOM schema, you must specify the schema in the source URI prefix field. Using a CUSTOM schema lets you specify the type for each partition key. The values must validly parse as the specified type or the query fails.

For example, if you set the source_uri_prefix flag to gs://myBucket/myTable/{dt:DATE}/{val:STRING}, BigQuery treats val as a STRING, dt as a DATE, and uses gs://myBucket/myTable as the source URI prefix for the matched files.

Partition pruning

BigQuery prunes partitions when possible using query predicates on the partition keys. This lets BigQuery avoid reading unnecessary files, which helps improve performance.
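For example, using the dt and lang keys from the earlier layout, a query like the following sketch lets BigQuery read only the files under dt=2019-10-31 (mydataset.myTable is a hypothetical external table defined over gs://myBucket/myTable):

bq query --nouse_legacy_sql \
'SELECT * FROM `mydataset.myTable` WHERE dt = DATE "2019-10-31" AND lang = "en"'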

Requiring predicate filters on partition keys in queries

When you create an externally partitioned table, you can require the use of predicate filters on partition keys by enabling the requirePartitionFilter option under HivePartitioningOptions.

When this option is enabled, attempts to query the externally partitioned table without specifying a WHERE clause produce the following error: Cannot query over table <table_name> without a filter over column(s) <partition key names> that can be used for partition elimination.
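To satisfy the requirement, include a filter on at least one partition key in the WHERE clause. For example, assuming the hypothetical table and dt partition key from the earlier sketch, the following query succeeds because dt can be used for partition elimination:

bq query --nouse_legacy_sql \
'SELECT COUNT(*) FROM `mydataset.myTable` WHERE dt >= DATE "2019-01-01"'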

Creating an external table for hive-partitioned data

You can create an external table for hive-partitioned data in the following ways:

  • Using the Cloud Console.
  • Using the bq command-line tool.
  • Using the client libraries.

Console

  1. Open the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. In the Explorer panel, expand your project and select a dataset.

  3. Expand the Actions option and click Open.

  4. In the details panel, click Create table.

  5. On the Create table page, in the Source section, do the following:

    1. In the Create table from drop-down list, select Cloud Storage.
    2. In the Select file from Cloud Storage bucket field, enter the path to the Cloud Storage folder, using wildcard format. For example, my_bucket/my_files*.
    3. In the File format drop-down list, select the file type.
    4. Select the Source data partitioning checkbox.
    5. In the Select Source URI Prefix field, enter the Cloud Storage URI prefix. For example, gs://my_bucket/my_files.
    6. Select a Partition inference mode. If you select Provide my own, then enter schema information for the partition keys.
    7. Optional: To require that all queries against this table must use a predicate filter, check Require partition filter. For more information, see Requiring predicate filters on partition keys in queries.
  6. On the Create table page, in the Destination section, do the following:

    1. Select the project name and the dataset name.
    2. In the Table type drop-down list, select External table.
    3. In the Table name field, enter a name for the external table.
  7. On the Create table page, in the Schema section, enter the schema information. BigQuery supports schema auto-detection for some formats. For more information, see Using schema auto-detection.

  8. Click Create table.

bq

  • To set automatic partition key detection, set the --hive_partitioning_mode flag to AUTO. For example:
bq mkdef --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
GCS_URIS > TABLE_DEF_FILE
  • To set string-typed partition key detection, set the --hive_partitioning_mode flag to STRINGS. For example:
bq mkdef --source_format=PARQUET --hive_partitioning_mode=STRINGS \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
GCS_URIS > TABLE_DEF_FILE
  • To specify a custom partition key schema using the --source_uri_prefix flag, set the --hive_partitioning_mode flag to CUSTOM. For example:
bq mkdef --source_format=NEWLINE_DELIMITED_JSON --hive_partitioning_mode=CUSTOM \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX/CUSTOM_SCHEMA_ENCODING \
GCS_URIS FILE_SCHEMA > TABLE_DEF_FILE

For the --hive_partitioning_source_uri_prefix flag, specify the CUSTOM_SCHEMA_ENCODING portion of the argument immediately after the GCS_URI_SHARED_PREFIX portion, as follows:

GCS_URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...

For example:

--hive_partitioning_source_uri_prefix=gs://myBucket/myTable/{dt:DATE}/{val:STRING}

  • To require the use of a predicate filter at query time, add --require_hive_partition_filter=True to the bq mkdef command; for example:
bq mkdef --source_format=ORC --hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=GCS_URI_SHARED_PREFIX \
--require_hive_partition_filter=True \
GCS_URIS > TABLE_DEF_FILE
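
After you generate the table definition file, you can create the external table with the bq mk command. The following is a minimal sketch, assuming a dataset named mydataset and a table named my_table:

bq mk --external_table_definition=TABLE_DEF_FILE \
mydataset.my_table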

API

To set hive partitioning using the BigQuery API, include a hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: gs://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
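For example, a table definition that uses a CUSTOM partition key schema and requires a partition filter might include a hivePartitioningOptions object like the following sketch; the bucket, path, and key names are placeholders:

{
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://myBucket/myTable/{dt:DATE}/{val:STRING}",
    "requirePartitionFilter": true
  }
}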

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.HivePartitioningOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create an external table using hive partitioning
public class SetHivePartitioningOptions {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_TABLE_NAME";
    String sourceUri = "gs://cloud-samples-data/bigquery/hive-partitioning-samples/customlayout/*";
    String sourceUriPrefix =
        "gs://cloud-samples-data/bigquery/hive-partitioning-samples/customlayout/{pkey:STRING}/";
    setHivePartitioningOptions(datasetName, tableName, sourceUriPrefix, sourceUri);
  }

  public static void setHivePartitioningOptions(
      String datasetName, String tableName, String sourceUriPrefix, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Configuring partitioning options
      HivePartitioningOptions hivePartitioningOptions =
          HivePartitioningOptions.newBuilder()
              .setMode("CUSTOM")
              .setRequirePartitionFilter(true)
              .setSourceUriPrefix(sourceUriPrefix)
              .build();

      TableId tableId = TableId.of(datasetName, tableName);
      ExternalTableDefinition customTable =
          ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.parquet())
              .setAutodetect(true)
              .setHivePartitioningOptions(hivePartitioningOptions)
              .build();
      bigquery.create(TableInfo.of(tableId, customTable));
      System.out.println("External table created using hivepartitioningoptions");
    } catch (BigQueryException e) {
      System.out.println("External table was not created" + e.toString());
    }
  }
}
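
Because this sample calls setRequirePartitionFilter(true), queries against the resulting table must include a filter on the pkey partition key; otherwise they fail with the partition elimination error described in Requiring predicate filters on partition keys in queries.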