Stay organized with collections Save and categorize content based on your preferences.

Querying Cloud Bigtable data

This page describes how to use BigQuery to query data stored in Cloud Bigtable.

Bigtable is Google's sparsely populated NoSQL database which can scale to billions of rows, thousands of columns, and petabytes of data. Bigtable has a data model similar to Apache HBase and provides an HBase-compatible client library. For information on the Bigtable data model, see Storage model.

Supported regions and zones

Querying data in Bigtable is available in all supported Bigtable zones. You can find the list of zones here. For multi-cluster instances, BigQuery routes traffic based on Bigtable app profile settings.

Limitations

For information about limitations that apply to external tables, see External table limitations.

Retrieve the Bigtable URI

To create an external table for a Bigtable data source, you must provide the Bigtable URI. To retrieve the Bigtable URI:

  1. Open the Bigtable console.

    Open the Bigtable console

  2. Retrieve the following details about your Bigtable data source:

    • Your project ID
    • Your Bigtable instance ID
    • (Optional) The ID of your Bigtable app profile
    • The name of your Bigtable table
  3. Compose the Bigtable URI using the following format, where:

    • project_id is the project containing your Bigtable instance
    • instance_id is the Bigtable instance ID
    • (Optional) app_profile is the app profile ID that you want to use
    • table_name is the name of the table you're querying

    https://googleapis.com/bigtable/projects/project_id/instances/instance_id[/appProfiles/app_profile]/tables/table_name

Required permissions

To create and query an external table in BigQuery, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.getData
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to create and query an external table in BigQuery:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • roles/bigquery.user (includes the bigquery.jobs.create permission)
  • roles/bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and access external tables in the datasets that you create. But, you still require the bigquery.jobs.create permission to query the data.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

End user permissions

You can share access to a permanent external table with users (including service accounts) or groups. You cannot share access to a temporary external table. To query the external table, your users or groups need to be granted (at a minimum):

  • The roles/bigquery.dataViewer role at the dataset level or higher to access the dataset that contains the external table
  • The roles/bigquery.user role at the project level or higher in order to run query jobs
  • The roles/bigtable.reader role in Bigtable provides read-only access to metadata and tables

Scopes for Compute Engine instances

When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud products, including Bigtable. Applications running on the VM use the service account to call Google Cloud APIs.

If you set up a Compute Engine instance to run as a service account, and that service account accesses an external table linked to a Bigtable data source, you must add the Bigtable read-only data access scope (https://www.googleapis.com/auth/bigtable.data.readonly) to the instance. For more information, see Creating a Compute Engine instance for Bigtable.

For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.

Using Bigtable app profiles

A Bigtable application profile, or app profile, stores settings that tell your Bigtable instance how to handle incoming requests from an application. When BigQuery connects to a Bigtable instance, it can specify an app profile, and Bigtable uses that app profile for requests from BigQuery.

To learn how Bigtable app profiles work, see About app profiles.

Create a permanent external table

When you create a permanent external table in BigQuery that is linked to a Bigtable data source, there are three ways to specify the format of the external table:

  • If you are using the API or the bq command-line tool, you create a table definition file that defines the schema and metadata for the external table.
  • If you are using SQL, you use the uri option of the CREATE EXTERNAL TABLE statement to specify the Bigtable table to pull data from, and the bigtable_options option to specify the table schema.
  • If you are using the Google Cloud console, you enter the Bigtable column family and qualifiers manually.

The external table data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying Bigtable data source.

Before you create an external table, you must create a dataset to contain the table.

To create a permanent table:

Console

Currently it is not possible to create a Bigtable external table using the Google Cloud console.

SQL

You can create a permanent external table by running the CREATE EXTERNAL TABLE DDL statement. You must specify the table schema explicitly as part of the statement options.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE DATASET.NEW_TABLE
    OPTIONS (
      format = 'CLOUD_BIGTABLE',
      uris = ['URI'],
      bigtable_options = BIGTABLE_OPTIONS );
    

    Replace the following:

    • DATASET: the dataset in which to create the Bigtable external table.
    • NEW_TABLE: the name for the Bigtable external table.
    • URI: the URI for the Bigtable table you want to use as a data source. This URI must follow the format described in Retrieving the Bigtable URI.
    • BIGTABLE_OPTIONS: the schema for the Bigtable table in JSON format. For a list of Bigtable table definition options, see BigtableOptions in the REST API reference.

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

A statement to create an external Bigtable table might look similar to the following:

CREATE EXTERNAL TABLE mydataset.BigtableTable
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/tables/table1'],
  bigtable_options =
    """
    {
      bigtableColumnFamilies: [
        {
          "familyId": "familyId1",
          "type": "INTEGER",
          "encoding": "BINARY"
        }
      ],
      readRowkeyAsString: true
    }
    """
);

bq

You create a table in the bq command-line tool using the bq mk command. When you use the bq command-line tool to create a table linked to an external data source, you identify the table's schema using a table definition file.

  1. Use the bq mk command to create a permanent table.

    bq mk \
    --external_table_definition=definition_file \
    dataset.table
    

    Where:

    • definition_file is the path to the table definition file on your local machine.
    • dataset is the name of the dataset that contains the table.
    • table is the name of the table you're creating.

API

Use the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

For the sourceUris property in the Table resource, specify only one Bigtable URI. It must be a valid HTTPS URL.

For the sourceFormatproperty, specify "BIGTABLE".

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.BigtableColumn;
import com.google.cloud.bigquery.BigtableColumnFamily;
import com.google.cloud.bigquery.BigtableOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TableResult;
import com.google.common.collect.ImmutableList;
import org.apache.commons.codec.binary.Base64;

// Sample to queries an external bigtable data source using a permanent table
public class QueryExternalBigtablePerm {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String bigtableInstanceId = "MY_INSTANCE_ID";
    String bigtableTableName = "MY_BIGTABLE_NAME";
    String bigqueryDatasetName = "MY_DATASET_NAME";
    String bigqueryTableName = "MY_TABLE_NAME";
    String sourceUri =
        String.format(
            "https://googleapis.com/bigtable/projects/%s/instances/%s/tables/%s",
            projectId, bigtableInstanceId, bigtableTableName);
    String query = String.format("SELECT * FROM %s ", bigqueryTableName);
    queryExternalBigtablePerm(bigqueryDatasetName, bigqueryTableName, sourceUri, query);
  }

  public static void queryExternalBigtablePerm(
      String datasetName, String tableName, String sourceUri, String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      BigtableColumnFamily.Builder statsSummary = BigtableColumnFamily.newBuilder();

      // Configuring Columns
      BigtableColumn connectedCell =
          BigtableColumn.newBuilder()
              .setQualifierEncoded(Base64.encodeBase64String("connected_cell".getBytes()))
              .setFieldName("connected_cell")
              .setType("STRING")
              .setEncoding("TEXT")
              .build();
      BigtableColumn connectedWifi =
          BigtableColumn.newBuilder()
              .setQualifierEncoded(Base64.encodeBase64String("connected_wifi".getBytes()))
              .setFieldName("connected_wifi")
              .setType("STRING")
              .setEncoding("TEXT")
              .build();
      BigtableColumn osBuild =
          BigtableColumn.newBuilder()
              .setQualifierEncoded(Base64.encodeBase64String("os_build".getBytes()))
              .setFieldName("os_build")
              .setType("STRING")
              .setEncoding("TEXT")
              .build();

      // Configuring column family and columns
      statsSummary
          .setColumns(ImmutableList.of(connectedCell, connectedWifi, osBuild))
          .setFamilyID("stats_summary")
          .setOnlyReadLatest(true)
          .setEncoding("TEXT")
          .setType("STRING")
          .build();

      // Configuring BigtableOptions is optional.
      BigtableOptions options =
          BigtableOptions.newBuilder()
              .setIgnoreUnspecifiedColumnFamilies(true)
              .setReadRowkeyAsString(true)
              .setColumnFamilies(ImmutableList.of(statsSummary.build()))
              .build();

      TableId tableId = TableId.of(datasetName, tableName);
      // Create a permanent table linked to the Bigtable table
      ExternalTableDefinition externalTable =
          ExternalTableDefinition.newBuilder(sourceUri, options).build();
      bigquery.create(TableInfo.of(tableId, externalTable));

      // Example query
      TableResult results = bigquery.query(QueryJobConfiguration.of(query));

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on external permanent table performed successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Query a permanent external table

Once you have created an external Bigtable table, you can run a query against it exactly as if it were a native BigQuery table, subject to the limitations on external data sources.

Query Bigtable data using temporary external tables

To query an external data source without creating a permanent table, you run a command to combine:

  • A table definition file with a query
  • An inline schema definition with a query
  • A JSON schema definition file with a query

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the bq command-line tool and the API.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Create and query a temporary external table

To query Bigtable data using a temporary external table, you:

Creating and querying a temporary external table is currently supported by the bq command-line tool and the API.

bq

To query a temporary table using a table definition file, enter the bq query command with the --external_table_definition flag.

(Optional) Supply the --location flag and set the value to your location.

bq --location=location query \
--use_legacy_sql=false \
--external_table_definition=table::definition_file \
'query'

Where:

  • location is the name of your location. The --location flag is optional.
  • table is the name of the temporary table you're creating.
  • definition_file is the path to the table definition file on your local machine.
  • query is the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named follows using a table definition file named follows_def.

bq query \
--use_legacy_sql=false \
--external_table_definition=follows::/tmp/follows_def \
'SELECT
  COUNT(rowkey)
 FROM
   follows'

API

  • Create a query. See Querying data for information about creating a query job.

  • (Optional) Specify your location in the location property in the jobReference section of the job resource.

  • Specify the external data source properties by setting the ExternalDataConfiguration for the table resource.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.BigtableColumn;
import com.google.cloud.bigquery.BigtableColumnFamily;
import com.google.cloud.bigquery.BigtableOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import com.google.common.collect.ImmutableList;
import org.apache.commons.codec.binary.Base64;

// Sample to queries an external bigtable data source using a temporary table
public class QueryExternalBigtableTemp {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String bigtableInstanceId = "MY_INSTANCE_ID";
    String bigtableTableName = "MY_BIGTABLE_NAME";
    String bigqueryTableName = "MY_TABLE_NAME";
    String sourceUri =
        String.format(
            "https://googleapis.com/bigtable/projects/%s/instances/%s/tables/%s",
            projectId, bigtableInstanceId, bigtableTableName);
    String query = String.format("SELECT * FROM %s ", bigqueryTableName);
    queryExternalBigtableTemp(bigqueryTableName, sourceUri, query);
  }

  public static void queryExternalBigtableTemp(String tableName, String sourceUri, String query) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      BigtableColumnFamily.Builder statsSummary = BigtableColumnFamily.newBuilder();

      // Configuring Columns
      BigtableColumn connectedCell =
          BigtableColumn.newBuilder()
              .setQualifierEncoded(Base64.encodeBase64String("connected_cell".getBytes()))
              .setFieldName("connected_cell")
              .setType("STRING")
              .setEncoding("TEXT")
              .build();
      BigtableColumn connectedWifi =
          BigtableColumn.newBuilder()
              .setQualifierEncoded(Base64.encodeBase64String("connected_wifi".getBytes()))
              .setFieldName("connected_wifi")
              .setType("STRING")
              .setEncoding("TEXT")
              .build();
      BigtableColumn osBuild =
          BigtableColumn.newBuilder()
              .setQualifierEncoded(Base64.encodeBase64String("os_build".getBytes()))
              .setFieldName("os_build")
              .setType("STRING")
              .setEncoding("TEXT")
              .build();

      // Configuring column family and columns
      statsSummary
          .setColumns(ImmutableList.of(connectedCell, connectedWifi, osBuild))
          .setFamilyID("stats_summary")
          .setOnlyReadLatest(true)
          .setEncoding("TEXT")
          .setType("STRING")
          .build();

      // Configuring BigtableOptions is optional.
      BigtableOptions options =
          BigtableOptions.newBuilder()
              .setIgnoreUnspecifiedColumnFamilies(true)
              .setReadRowkeyAsString(true)
              .setColumnFamilies(ImmutableList.of(statsSummary.build()))
              .build();

      // Configure the external data source and query job.
      ExternalTableDefinition externalTable =
          ExternalTableDefinition.newBuilder(sourceUri, options).build();
      QueryJobConfiguration queryConfig =
          QueryJobConfiguration.newBuilder(query)
              .addTableDefinition(tableName, externalTable)
              .build();

      // Example query
      TableResult results = bigquery.query(queryConfig);

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on external temporary table performed successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

Performance considerations

The performance of queries against Bigtable external data sources depends on three factors:

  • The number of rows
  • The amount of data read
  • The extent of parallelization

BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Bigtable cluster and how many splits you have for your table.

Note that Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information on splitting a table by row key, see Managing tables.

Querying Bigtable from BigQuery consumes Bigtable CPU cycles. CPU consumption by BigQuery might affect latency and throughput for other concurrent requests such as live user traffic serving. For example, high CPU usage on Bigtable affects long-tail queries and increases latency at the 99th percentile.

You should monitor Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Bigtable monitoring dashboard in the Google Cloud console. Increasing the number of nodes for your instance enables you to better handle both BigQuery traffic and traffic from other concurrent requests.

Generated schema

By default, BigQuery exposes the values in a column family as an array of columns and within that, an array of values written at different timestamps. This schema preserves the natural layout of data in Bigtable, but SQL queries can be challenging. It is possible to promote columns to subfields within the parent column family and to read only the latest value from each cell. This represents both of the arrays in the default schema as scalar values.

Example

You are storing user profiles for a fictional social network. One data model for this might be a profile column family with individual columns for gender, age and email:

rowkey | profile:gender| profile:age| profile:email
-------| --------------| -----------| -------------
alice  | female        | 30         | alice@gmail.com

Using the default schema, a Google Standard SQL query to count the number of male users over 30 is:

SELECT
  COUNT(1)
FROM
  `dataset.table`
OMIT
  RECORD IF NOT SOME(profile.column.name = "gender"
    AND profile.column.cell.value = "male")
  OR NOT SOME(profile.column.name = "age"
    AND INTEGER(profile.column.cell.value) > 30)

Querying the data is less challenging if gender and age are exposed as sub- fields. To expose them as sub-fields, list gender and age as named columns in the profile column family when defining the table. You can also instruct BigQuery to expose the latest values from this column family since. Typically, only the latest value (and possibly the only value) is of interest.

After exposing the columns as sub-fields, the Google Standard SQL query to count the number of male users over 30 is:

SELECT
  COUNT(1)
FROM
  `dataset.table`
WHERE
  profile.gender.cell.value="male"
  AND profile.age.cell.value > 30

Notice how gender and age are referenced directly as fields. The JSON configuration for this setup is:

  "bigtableOptions": {
    "readRowkeyAsString": "true",
    "columnFamilies": [
      {
          "familyId": "profile",
          "onlyReadLatest": "true",
          "columns": [
              {
                  "qualifierString": "gender",
                  "type": "STRING"
              },
              {
                  "qualifierString": "age",
                  "type": "INTEGER"
              }
          ]
      }
    ]
  }

Value encoding

Bigtable stores data as raw bytes, independent to data encoding. However, byte values are of limited use in SQL query analysis. Bigtable provides two basic types of scalar decoding: text and HBase-binary.

The text format assumes that all values are stored as alphanumeric text strings. For example, an integer 768 will be stored as the string "768". The binary encoding assumes that HBase's Bytes.toBytes class of methods were used to encode the data and applies an appropriate decoding method.

Query filters

Queries with a row equality filter only read that specific row. For example, in Google Standard SQL syntax:

SELECT
  COUNT(follows.column.name)
FROM
  `dataset.table`
WHERE
  rowkey = "alice";

Range filters such as rowkey > '1' and rowkey < '8' are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.