Querying Google Cloud Bigtable Data

This page describes how to use BigQuery to query data stored in Google Cloud Bigtable.

Cloud Bigtable is Google's sparsely populated NoSQL database that can scale to billions of rows, thousands of columns, and petabytes of data. Cloud Bigtable has a data model similar to Apache HBase and provides an HBase-compatible client library. For information on the Cloud Bigtable data model, see Storage Model.

Supported regions and zones

Querying data in Cloud Bigtable is currently only available in the following regions and zones:

Region       | Zone(s)
-------------| -------------------------------
us-central1  | us-central1-b, us-central1-c
europe-west1 | europe-west1-b, europe-west1-c

Retrieving the Google Cloud Bigtable URI

To create an external table for a Cloud Bigtable data source, you must provide the Cloud Bigtable URI. To retrieve the Cloud Bigtable URI:

  1. Open the Cloud Bigtable console.

  2. Retrieve the following details about your Cloud Bigtable data source:

    • Your project ID
    • Your Cloud Bigtable instance ID
    • The name of your Cloud Bigtable table
  3. Compose the Cloud Bigtable URI using the following format, where:

    • [PROJECT_ID] is the project containing your Cloud Bigtable instance
    • [INSTANCE_ID] is the Cloud Bigtable instance ID
    • [TABLE_NAME] is the name of the table you're querying

    https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME]
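
    For example, if a hypothetical project named my-project contains an instance my-instance with a table named follows, the URI is:

    https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows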

Access controls and scopes

Access controls for permanent external tables

You can share access to a permanent external table linked to a Google Cloud Bigtable data source. You cannot share access to a temporary external table.

You can share access to a permanent external table with users (including service accounts) or groups. To query the external table, your users or groups need (at a minimum):

  • READER or bigquery.dataViewer access to the dataset containing the external table
  • bigquery.user access to the project containing the dataset (in order to run query jobs)
  • bigtable.reader access to Cloud Bigtable which provides read-only access to metadata and tables
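
As a sketch, the following gcloud commands grant these roles at the project level. The project ID and user email are hypothetical, and you can grant dataset-level access through BigQuery dataset permissions instead:

    # Read-only access to Cloud Bigtable metadata and tables
    gcloud projects add-iam-policy-binding my-project \
        --member=user:analyst@example.com \
        --role=roles/bigtable.reader

    # Permission to run BigQuery query jobs in the project
    gcloud projects add-iam-policy-binding my-project \
        --member=user:analyst@example.com \
        --role=roles/bigquery.user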

Scopes for Compute Engine instances

When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud Platform products, including Cloud Bigtable. Applications running on the VM use the service account to call Google Cloud APIs.

If you set up a Compute Engine instance to run as a service account, and that service account accesses an external table linked to a Cloud Bigtable data source, you must add the Cloud Bigtable read-only data access scope (https://www.googleapis.com/auth/bigtable.data.readonly) to the instance. For more information, see Creating a Compute Engine Instance for Cloud Bigtable.

For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service Accounts.
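
For example, the following sketch creates a hypothetical instance with the BigQuery and Cloud Bigtable read-only data access scopes (the instance and service account names are placeholders):

    gcloud compute instances create my-query-vm \
        --service-account=my-sa@my-project.iam.gserviceaccount.com \
        --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/bigtable.data.readonly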

Permanent versus temporary external tables

You can query an external data source in BigQuery using a permanent table or a temporary table. When you use a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Cloud Bigtable data using permanent external tables

To query a Cloud Bigtable data source using a permanent table, you create a table in a BigQuery dataset that is linked to your Cloud Bigtable data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying Cloud Bigtable data source.

When you create a permanent external table in BigQuery that is linked to a Cloud Bigtable data source, there are two ways to specify the table's schema:

  • If you are using the API or the CLI, you create a table definition file that defines the schema and metadata for the external data source.
  • If you are using the web UI, you enter the Cloud Bigtable column family and qualifiers manually.

To query Cloud Bigtable data using a permanent external table, you:

  • Create a table definition file (for the CLI or API)
  • Create a table in BigQuery linked to the external data source
  • Query the data using the permanent table

Creating and querying a permanent external table

To create and query a permanent table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover on a dataset, click the down arrow icon, and click Create new table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Bigtable and, in the source field, enter the Cloud Bigtable URI in the following format: https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME], where [PROJECT_ID] is the project containing your Cloud Bigtable instance, [INSTANCE_ID] is the Cloud Bigtable instance ID, and [TABLE_NAME] is the name of the table you're querying.
    • For File format, verify Cloud Bigtable is selected.
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the permanent table you're creating in BigQuery.
    • Verify that Table type is set to External table.
  5. In the Column Family and Qualifiers box, use Add Family to specify column families and qualifiers or click Edit as text and enter a JSON array of column families and qualifiers. This list restricts the column families that can be referenced in queries and specifies their value types. For example:

          [
            {
              "familyId": "family_int",
              "type": "INTEGER",
              "encoding": "BINARY",
              "onlyReadLatest": "true",
              "columns": [
                {
                  "qualifierString": "foo",
                  "onlyReadLatest": "true"
                }
              ]
            }
          ]
    You can use the column qualifier list to do type conversions. If you leave the list empty, all column families are included in the table schema and their values are read as BYTES. For more information, see the Cloud Bigtable options in the externalDataConfiguration API documentation.

  6. Select applicable items in the Options section, then click Create Table.

After the permanent table is created, you can run a query against the table as if it were a native BigQuery table, subject to the limits on external data sources. To query the table in the web UI, click Compose Query, enter your query in the New Query box, and then click Run Query.
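
For example, assuming a permanent external table named follows in a dataset named mydataset (both names are hypothetical), a simple standard SQL query might look like this:

#standardSQL
SELECT COUNT(rowkey) FROM `mydataset.follows`;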

After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

You create a table in the BigQuery command-line tool using the bq mk command. When you use the CLI to create a table linked to an external data source, you identify the table's schema using a table definition file.
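
For a Cloud Bigtable source, the table definition file is a JSON document that specifies the source format, the table URI, and the Cloud Bigtable options. A minimal sketch, reusing the hypothetical URI from earlier (an empty columnFamilies list means all column families are included and values are read as BYTES):

      {
        "sourceFormat": "BIGTABLE",
        "sourceUris": [
          "https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows"
        ],
        "bigtableOptions": {
          "readRowkeyAsString": "true",
          "columnFamilies": []
        }
      }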

  1. Use the bq mk command to create a permanent table.

    bq mk --external_table_definition=[DEFINITION_FILE] [DATASET].[TABLE_NAME]
    

    Where:

    • [DEFINITION_FILE] is the path to the table definition file on your local machine
    • [DATASET] is the name of the dataset that contains the table
    • [TABLE_NAME] is the name of the table you're creating
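
    For example, with the hypothetical definition file above saved as /tmp/follows_def:

    bq mk --external_table_definition=/tmp/follows_def mydataset.follows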

You can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.
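
For example, a hypothetical query against that table:

bq query --use_legacy_sql=false 'SELECT COUNT(rowkey) FROM mydataset.follows;'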

API
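
When you use the API, you call the tables.insert method and populate the Table resource's externalDataConfiguration property instead of a schema. A sketch of the request body, reusing the hypothetical names from the CLI example:

      {
        "tableReference": {
          "projectId": "my-project",
          "datasetId": "mydataset",
          "tableId": "follows"
        },
        "externalDataConfiguration": {
          "sourceFormat": "BIGTABLE",
          "sourceUris": [
            "https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows"
          ]
        }
      }

You can then query the table with the jobs.query or jobs.insert methods, just as you would a native table.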

Querying Cloud Bigtable data using temporary external tables

To query an external data source without creating a permanent table, you run a command to combine:

  • A table definition file with a query
  • An inline schema definition with a query
  • A JSON schema definition file with a query

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Creating and querying a temporary external table

To query Cloud Bigtable data using a temporary external table, you:

  • Create a table definition file (for the CLI or API)
  • Submit both a query and a table definition file

Creating and querying a temporary external table is currently supported by the CLI and API.

CLI

To query a temporary table using a table definition file, enter the following command.

bq query --use_legacy_sql=false --external_table_definition=[TABLE_NAME]::[DEFINITION_FILE] '[QUERY]'

Where:

  • [TABLE_NAME] is the name of the temporary table you're creating
  • [DEFINITION_FILE] is the path to the table definition file on your local machine
  • [QUERY] is the query you're submitting to the temporary table

For example, the following command creates and queries a temporary table named follows using a table definition file named follows_def.

bq query --use_legacy_sql=false --external_table_definition=follows::/tmp/follows_def 'SELECT COUNT(rowkey) FROM follows;'

API
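
When you use the API, you submit a query job whose configuration maps the temporary table name to an external data configuration through the tableDefinitions property. A sketch of the relevant portion of a jobs.insert request, using the hypothetical follows table:

      "configuration": {
        "query": {
          "query": "SELECT COUNT(rowkey) FROM follows;",
          "useLegacySql": false,
          "tableDefinitions": {
            "follows": {
              "sourceFormat": "BIGTABLE",
              "sourceUris": [
                "https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows"
              ]
            }
          }
        }
      }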

Performance considerations

The performance of queries against Cloud Bigtable external data sources depends on three factors:

  • The number of rows
  • The amount of data read
  • The extent of parallelization

BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Cloud Bigtable cluster and how many splits you have for your table.

Note that Cloud Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information on splitting a table by row key, see Managing tables.

Querying Cloud Bigtable from BigQuery consumes Cloud Bigtable CPU cycles. CPU consumption by BigQuery may affect latency and throughput for other concurrent requests, such as serving live user traffic. For example, high CPU usage on Cloud Bigtable affects long-tail queries and increases latency at the 99th percentile.

You should monitor Cloud Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Cloud Bigtable monitoring dashboard in the Cloud Platform Console. Increasing the number of nodes for your instance enables you to better handle both BigQuery traffic and traffic from other concurrent requests.

Generated schema

By default, BigQuery exposes the values in a column family as an array of columns and, within that, an array of values written at different timestamps. This schema preserves the natural layout of data in Cloud Bigtable, but SQL queries against it can be challenging. You can instead promote columns to subfields within the parent column family and read only the latest value from each cell, which represents both of the arrays in the default schema as scalar values.

Example

You are storing user profiles for a fictional social network. One data model for this might be a profile column family with individual columns for gender, age, and email:

rowkey | profile:gender | profile:age | profile:email
-------| ---------------| ------------| ----------------
alice  | female         | 30          | alice@gmail.com

Using the default schema, a legacy SQL query to count the number of male users over 30 is:

SELECT COUNT(1) FROM [DATASET.TABLE_NAME]
OMIT RECORD IF NOT SOME(profile.column.name = "gender" AND profile.column.cell.value = "male")
OR NOT SOME(profile.column.name = "age" AND INTEGER(profile.column.cell.value) > 30);

Querying the data is less challenging if gender and age are exposed as subfields. To expose them as subfields, list gender and age as named columns in the profile column family when defining the table. You can also instruct BigQuery to expose only the latest values from this column family, since typically only the latest value (and possibly the only value) is of interest.

After exposing the columns as sub-fields, the standard SQL query to count the number of male users over 30 is:

SELECT COUNT(1) FROM `[DATASET].[TABLE_NAME]`
WHERE profile.gender.cell.value="male" AND profile.age.cell.value > 30;

Notice how gender and age are referenced directly as fields. The JSON configuration for this setup is:

      "bigtableOptions": {
        "readRowkeyAsString": "true",
        "columnFamilies": [
          {
              "familyId": "profile",
              "onlyReadLatest": "true",
              "columns": [
                  {
                      "qualifierString": "gender",
                      "type": "STRING"
                  },
                  {
                      "qualifierString": "age",
                      "type": "INTEGER"
                  }
              ]
          }
        ]
      }
    

Value encoding

Cloud Bigtable stores data as raw bytes, agnostic to data encoding. However, byte values are of limited use in SQL query analysis. BigQuery provides two basic types of scalar decoding: text and HBase-binary.

The text format assumes that all values are stored as alphanumeric text strings. For example, an integer 768 will be stored as the string "768". The binary encoding assumes that HBase's Bytes.toBytes class of methods were used to encode the data and applies an appropriate decoding method.
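
You choose the decoding with the encoding field in the Cloud Bigtable options, at the column family level or per column. For example, a sketch of a hypothetical column family whose integer values were written as text strings:

      {
        "familyId": "stats",
        "type": "INTEGER",
        "encoding": "TEXT",
        "onlyReadLatest": "true"
      }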

Query filters

Queries with a row equality filter only read that specific row. For example, in standard SQL syntax:

SELECT COUNT(follows.column.name) FROM `[DATASET].[TABLE_NAME]` WHERE rowkey = "alice";

Range filters such as rowkey > '1' and rowkey < '8' are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.
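
For example, assuming the rowkey is read as a string:

SELECT COUNT(rowkey) FROM `[DATASET].[TABLE_NAME]` WHERE rowkey > "1" AND rowkey < "8";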
