Querying Google Cloud Bigtable Data

This page describes how to use BigQuery to query data stored in Google Cloud Bigtable.

Cloud Bigtable is Google's sparsely populated NoSQL database, which can scale to billions of rows, thousands of columns, and petabytes of data. Cloud Bigtable has a data model similar to Apache HBase and provides an HBase-compatible client library. For information on the Cloud Bigtable data model, see Storage Model.

Before you begin

  • If you plan to use the bq command-line tool instead of the web UI, ensure that you have activated BigQuery and installed the Google Cloud SDK. For more information, see the Before you begin section of the command-line quickstart.
  • If you are using the command-line tool, set a default project so that you can omit the --project_id flag. For information on how to set a default project, see the Cloud SDK documentation for gcloud config set.

Permanent versus temporary tables

You can query an external data source in BigQuery using a permanent table or a temporary table. When you use a permanent table, you create a table in a BigQuery dataset that is linked to your external data source. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.

When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

Querying Cloud Bigtable data using permanent tables

To query an external data source using a permanent table, you create a table in a BigQuery dataset that is linked to your external data source using a table definition. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying external data source.

If you are using the CLI or API, you begin by creating a table definition file that is used to define the schema for the external data source. You can also create and use your own schema file or enter the schema inline. If you are using the web UI, you enter the table schema manually.

To query Cloud Bigtable data using a permanent table, you:

  • Create a table definition file (for the CLI or API)
  • Create a table in BigQuery linked to the external data source
  • Query the data using the permanent table

Creating a table definition file

A table definition file defines the schema for an external data source. The table definition also contains metadata about a table, such as the table's source format and related properties. You can use a table definition file with both permanent and temporary tables.

When you create a table definition file for Cloud Bigtable, you manually generate the file in JSON format. Using the mkdef command to create a table definition using schema auto-detect is not currently supported for Cloud Bigtable data sources. Alternatively, you can manually describe the table's column families in the web UI without first creating a JSON file.

A JSON table definition file for Cloud Bigtable looks like the following. Using this table definition file, BigQuery reads data from a single column family, interpreting the values as binary-encoded integers.

{
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME]"
    ],
    "bigtableOptions": {
        "columnFamilies" : [
            {
                "familyId": "family_int",
                "type": "INTEGER",
                "encoding": "BINARY"
            }
        ],
    }
}
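
You might save a definition like this to a local file, for example bigtable_table_def.json (a file name used here only for illustration), and pass its path to the bq mk and bq query commands shown later on this page.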

Creating and querying a permanent table

You can create a permanent table linked to your external Cloud Bigtable data source using the web UI, CLI, or API. If you use the CLI or API, you pass a table definition file to define the source data. If you use the web UI, you manually enter the Cloud Bigtable column families.

To create and query a permanent table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over an empty dataset, click the down arrow icon, and click Create new table. You can also hover over an existing table in a dataset.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Bigtable and, in the source field, enter the Cloud Bigtable URI in the following format: https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME], where [PROJECT_ID] is the project containing your Cloud Bigtable instance, [INSTANCE_ID] is the Cloud Bigtable instance ID, and [TABLE_NAME] is the name of the table you're querying.
    • For File format, verify Cloud Bigtable is selected.
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the permanent table you're creating in BigQuery.
    • Verify that Table type is set to External table.
  5. In the Column Families box, enter a JSON array of column families. This list restricts the column families that can be referenced in queries and specifies their value types. For example:

          [
            {
              "familyId": "family_int",
              "type": "INTEGER",
              "encoding": "BINARY",
              "onlyReadLatest": "true",
              "columns": [
                {
                  "qualifierString": "foo",
                  "onlyReadLatest": "true",
                }
              ]
            }
          ]
       
    You can use the column families list to do type conversions. If you leave the list empty, all column families are included in the table schema and their values are read as BYTES. For more information, see the Cloud Bigtable options in the externalDataConfiguration API documentation.

  6. Select applicable items in the Options section and then click Create Table.

Once the permanent table is created, you can run a query against the table as if it were a native BigQuery table, subject to the limits of external data sources. To query the table in the web UI, click Compose Query, enter your query in the New Query box, and then click Run Query.

After your query completes, you can export the results as CSV or JSON files, save the results as a table, or save the results to Google Sheets. See Download, save, and export data for more information.

CLI

  1. Use the bq mk command to create a permanent table.

     bq mk --project_id=[PROJECT_ID] --external_table_definition=[FILE_PATH] [DATASET].[TABLE_NAME]
    

    Where [FILE_PATH] is the path to the table definition file, [DATASET] is the name of the dataset where the table is created, and [TABLE_NAME] is the name of the table you're creating.

  2. Query the table by running the following command:

    bq query --project_id=[PROJECT_ID] 'select count(rowkey) from [DATASET].[TABLE_NAME];'


    Where [PROJECT_ID] is a valid project ID, [DATASET] is the dataset containing the table, and [TABLE_NAME] is the name of your table.

API

  • For externalDataConfiguration.sourceUris, exactly one URI can be specified, and it must be a fully specified, valid HTTPS URL for a Google Cloud Bigtable table. The URI format for Cloud Bigtable is: https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME].

  • Specify the data format by setting the externalDataConfiguration.sourceFormat property. For Google Cloud Bigtable, specify "BIGTABLE".
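
For illustration, here is a minimal sketch of a tables.insert request body that combines these properties with the table definition shown earlier; all bracketed values are placeholders:

{
    "tableReference": {
        "projectId": "[PROJECT_ID]",
        "datasetId": "[DATASET]",
        "tableId": "[TABLE_NAME]"
    },
    "externalDataConfiguration": {
        "sourceFormat": "BIGTABLE",
        "sourceUris": [
            "https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME]"
        ],
        "bigtableOptions": {
            "columnFamilies": [
                {
                    "familyId": "family_int",
                    "type": "INTEGER",
                    "encoding": "BINARY"
                }
            ]
        }
    }
}

Once the table resource is created, you can query it with the jobs.query or jobs.insert API methods as you would any other table.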

Querying Cloud Bigtable data using temporary tables

To query an external data source without creating a permanent table, you run a command that combines a table definition with a query. The table definition is used to create the temporary table, and the query runs against the temporary table. Querying an external data source using a temporary table is supported by the BigQuery CLI and API.

Creating and querying a temporary table

Creating and querying a temporary table is currently supported by the CLI and API. When you use temporary tables, you submit both a query and a table definition file.

CLI

  1. Run a query against a temporary table using the following command:

    bq query --project_id=[PROJECT_ID] --external_table_definition=[TABLE_NAME]::[FILE_PATH] 'select count(rowkey) from [TABLE_NAME];'

    Where [FILE_PATH] is the path to the table definition file, and [TABLE_NAME] is the name of the temporary table you're creating.

API
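
To query Cloud Bigtable data using a temporary table through the API, call the jobs.insert API method and map the table name referenced in your query to an external table definition using the tableDefinitions property of the query configuration. A minimal sketch of the request body, reusing the earlier table definition (bracketed values are placeholders):

{
    "configuration": {
        "query": {
            "query": "select count(rowkey) from [TABLE_NAME];",
            "tableDefinitions": {
                "[TABLE_NAME]": {
                    "sourceFormat": "BIGTABLE",
                    "sourceUris": [
                        "https://googleapis.com/bigtable/projects/[PROJECT_ID]/instances/[INSTANCE_ID]/tables/[TABLE_NAME]"
                    ],
                    "bigtableOptions": {
                        "columnFamilies": [
                            {
                                "familyId": "family_int",
                                "type": "INTEGER",
                                "encoding": "BINARY"
                            }
                        ]
                    }
                }
            }
        }
    }
}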

Performance considerations

The performance of queries against Cloud Bigtable external data sources depends on three factors:

  • The number of rows
  • The amount of data read
  • The extent of parallelization

BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Cloud Bigtable cluster and how many splits you have for your table.

Note that Cloud Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information on splitting a table by row key, see Managing tables.

Querying Cloud Bigtable from BigQuery consumes Cloud Bigtable CPU cycles. CPU consumption by BigQuery may affect latency and throughput for other concurrent requests, such as serving live user traffic. For example, high CPU usage on Cloud Bigtable affects long-tail queries and increases latency at the 99th percentile.

You should monitor Cloud Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Cloud Bigtable monitoring dashboard in the Cloud Platform Console. Increasing your cluster size will enable you to better handle both BigQuery traffic and traffic from other concurrent requests.

Generated schema

By default, BigQuery exposes the values in a column family as an array of columns and, within that, an array of values written at different timestamps. This schema preserves the natural layout of data in Cloud Bigtable, but SQL queries can be challenging. It is possible to promote columns to subfields within the parent column family and to read only the latest value from each cell. Doing so represents both of the arrays in the default schema as scalar values.
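
For a single profile column family, the default generated schema has roughly the following shape, shown here as a BigQuery schema sketch. The field names follow the query paths used in the example below; the leaf types are illustrative, since values are read as BYTES unless you specify a type:

[
    {"name": "rowkey", "type": "BYTES"},
    {"name": "profile", "type": "RECORD", "fields": [
        {"name": "column", "type": "RECORD", "mode": "REPEATED", "fields": [
            {"name": "name", "type": "BYTES"},
            {"name": "cell", "type": "RECORD", "mode": "REPEATED", "fields": [
                {"name": "timestamp", "type": "TIMESTAMP"},
                {"name": "value", "type": "BYTES"}
            ]}
        ]}
    ]}
]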

Example

You are storing user profiles for a fictional social network. One data model for this might be a profile column family with individual columns for gender, age and email:

rowkey | profile:gender | profile:age | profile:email
------ | -------------- | ----------- | ----------------
alice  | female         | 30          | alice@gmail.com

Using the default schema, a query to count the number of male users over 30 is:

select count(1) from user_table
OMIT RECORD IF not SOME(profile.column.name = "gender" and profile.column.cell.value = "male")
or not SOME(profile.column.name = "age" and INTEGER(profile.column.cell.value) > 30);

Querying the data is less challenging if gender and age are exposed as sub-fields. To expose them as sub-fields, list gender and age as named columns in the profile column family when defining the table. You can also instruct BigQuery to expose only the latest values from this column family, since typically the latest value (and possibly the only value) is what's of interest. After exposing the columns as sub-fields, the SQL query to count the number of male users over 30 is:

select count(1) from user_table
where profile.gender.cell.value="male" and profile.age.cell.value > 30;

Notice how gender and age are referenced directly as fields. The JSON configuration for this setup is:

      "bigtableOptions": {
        "readRowkeyAsString": "true",
        "columnFamilies": [
          {
              "familyId": "profile",
              "onlyReadLatest": "true",
              "columns": [
                  {
                      "qualifierString": "gender",
                      "type": "STRING"
                  },
                  {
                      "qualifierString": "age",
                      "type": "INTEGER"
                  }
              ]
          }
        ]
      }
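
Because this configuration also sets readRowkeyAsString to "true", the row key is exposed as a STRING rather than BYTES, which additionally enables the row key range filters described under Query filters below.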
    

Value encoding

Cloud Bigtable stores data as raw bytes, agnostic to data encoding. However, byte values are of limited use in SQL analysis, so Cloud Bigtable provides two basic types of scalar decoding: text and HBase-binary.

The text format assumes that all values are stored as alphanumeric text strings. For example, the integer 768 is stored as the string "768". The binary encoding assumes that HBase's Bytes.toBytes family of methods was used to encode the data, and applies the corresponding decoding method.
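
As a worked illustration (assuming the value was written with HBase's Bytes.toBytes(long)): stored as text, the integer 768 occupies the three ASCII bytes 0x37 0x36 0x38 ("7", "6", "8"), while the binary encoding is the eight-byte big-endian value 0x00 0x00 0x00 0x00 0x00 0x00 0x03 0x00.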

Query filters

Queries with a row equality filter only read that specific row. For example:

select count(follows.column.name) from user_table where rowkey = "alice";

Range filters such as rowkey > "1" and rowkey < "8" are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.
