Querying Cloud Bigtable data
This page describes how to use BigQuery to query data stored in Cloud Bigtable.
Cloud Bigtable is Google's sparsely populated NoSQL database, which can scale to billions of rows, thousands of columns, and petabytes of data. Cloud Bigtable has a data model similar to Apache HBase and provides an HBase-compatible client library. For information on the Cloud Bigtable data model, see Storage model.
Supported regions and zones
Querying data in Cloud Bigtable is available in all supported Cloud Bigtable zones. For the list of supported zones, see Cloud Bigtable locations. For multi-cluster instances, BigQuery routes traffic based on Cloud Bigtable app profile settings.
Retrieving the Cloud Bigtable URI
To create an external table for a Cloud Bigtable data source, you must provide the Cloud Bigtable URI. To retrieve the Cloud Bigtable URI:
Open the Cloud Bigtable console.
Retrieve the following details about your Cloud Bigtable data source:
- Your project ID
- Your Cloud Bigtable instance ID
- The name of your Cloud Bigtable table
- The ID of your Cloud Bigtable app profile
Compose the Cloud Bigtable URI using the following format:

https://googleapis.com/bigtable/projects/project_id/instances/instance_id/tables/table_name[/appProfiles/app_profile]

where:
- project_id is the project containing your Cloud Bigtable instance
- instance_id is the Cloud Bigtable instance ID
- table_name is the name of the table you're querying
- app_profile is the app profile ID that you want to use
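For example, a composed URI for a hypothetical project my-project, instance my-instance, table follows, and app profile bq-profile would look like this (all names are placeholders):

https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows/appProfiles/bq-profile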
Access controls and scopes
Access controls for permanent external tables
You can share access to a permanent external table linked to a Cloud Bigtable data source. You cannot share access to a temporary external table.
You can share access to a permanent external table with users (including service accounts) or groups. To query the external table, your users or groups need to be granted (at a minimum):
- The bigquery.dataViewer role at the dataset level or higher, to access the dataset that contains the external table
- The bigquery.user role at the project level or higher, to run query jobs
- The bigtable.reader role in Cloud Bigtable, which provides read-only access to metadata and tables
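As an illustration, the project-level roles can be granted with gcloud. The project and member below are hypothetical placeholders; dataset-level grants (such as bigquery.dataViewer on a single dataset) are managed separately through BigQuery dataset access controls.

# Illustrative only: grant project-level roles to a hypothetical user.
gcloud projects add-iam-policy-binding my-project \
    --member="user:analyst@example.com" \
    --role="roles/bigquery.user"
gcloud projects add-iam-policy-binding my-project \
    --member="user:analyst@example.com" \
    --role="roles/bigtable.reader"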
Scopes for Compute Engine instances
When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud products, including Cloud Bigtable. Applications running on the VM use the service account to call Google Cloud APIs.
If you set up a Compute Engine instance to run as a service account, and that service account accesses an external table linked to a Cloud Bigtable data source, you must add the Cloud Bigtable read-only data access scope (https://www.googleapis.com/auth/bigtable.data.readonly) to the instance. For more information, see Creating a Compute Engine instance for Cloud Bigtable.
For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.
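As a sketch, adding the scope to an existing VM might look like the following. The instance name, zone, and service account are hypothetical, and the VM must be stopped before its service account or scopes can be changed; a BigQuery scope is included on the assumption that the VM also calls BigQuery.

# Hypothetical instance and service account.
gcloud compute instances stop my-vm --zone=us-central1-b
gcloud compute instances set-service-account my-vm \
    --zone=us-central1-b \
    --service-account=my-sa@my-project.iam.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/bigtable.data.readonly,https://www.googleapis.com/auth/bigquery
gcloud compute instances start my-vm --zone=us-central1-b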
Permanent versus temporary external tables
You can query an external data source in BigQuery by using a permanent table or a temporary table. A permanent table is a table that is created in a dataset and is linked to your external data source. Because the table is permanent, you can use access controls to share the table with others who also have access to the underlying external data source, and you can query the table at any time.
When you query an external data source using a temporary table, you submit a command that includes a query and creates a non-permanent table linked to the external data source. When you use a temporary table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
Using Bigtable app profiles
A Bigtable application profile, or app profile, stores settings that tell your Bigtable instance how to handle incoming requests from an application. When BigQuery connects to a Bigtable instance, it can specify an app profile, and Bigtable uses that app profile for requests from BigQuery.
To learn how Bigtable app profiles work, see About app profiles.
Querying Cloud Bigtable data using permanent external tables
To query a Cloud Bigtable data source using a permanent table, you create a table in a BigQuery dataset that is linked to your Cloud Bigtable data source. The data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying Cloud Bigtable data source.
When you create a permanent external table in BigQuery that is linked to a Cloud Bigtable data source, there are two ways to specify the table's schema:
- If you are using the API or the bq command-line tool, you create a table definition file that defines the schema and metadata for the external data source (a sample definition file is shown after this list).
- If you are using the Google Cloud console, you enter the Cloud Bigtable column family and qualifiers manually.
To query Cloud Bigtable data using a permanent external table, you:
- Create a table definition file (for the API or bq command-line tool)
- Create a table in BigQuery linked to the external data source
- Query the data using the permanent table
Creating and querying a permanent external table
To create and query a permanent table:
Console
Currently, it is not possible to query data in Cloud Bigtable using the Google Cloud console.
bq
You create a table in the bq command-line tool using the bq mk command. When you use the bq command-line tool to create a table linked to an external data source, you identify the table's schema using a table definition file.

Use the bq mk command to create a permanent table:

bq mk \
--external_table_definition=definition_file \
dataset.table
Where:
- definition_file is the path to the table definition file on your local machine.
- dataset is the name of the dataset that contains the table.
- table is the name of the table you're creating.
You can then run a query against the table as if it were a native BigQuery table, subject to the limitations on external data sources.
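For example, the following hypothetical invocation links a table named follows in a dataset named mydataset to a definition file at /tmp/follows_def, and then queries it:

bq mk \
--external_table_definition=/tmp/follows_def \
mydataset.follows

bq query --use_legacy_sql=false \
'SELECT COUNT(rowkey) FROM mydataset.follows'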
API
For the sourceUris property in the table resource, you can specify exactly one Cloud Bigtable URI, and it must be a fully specified, valid HTTPS URL.

Specify the data format by setting the sourceFormat property. For Cloud Bigtable, specify "BIGTABLE".
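Sketched as a fragment of the table resource, the relevant fields would look something like this (the URI is a placeholder, and bigtableOptions can be added alongside these fields as shown later on this page):

"externalDataConfiguration": {
  "sourceFormat": "BIGTABLE",
  "sourceUris": [
    "https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows"
  ]
}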
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
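The official sample is not reproduced here; the following is a minimal sketch of the same flow using the google-cloud-bigquery Java client, assuming hypothetical project, instance, dataset, and column names, and assuming the builder methods shown here match the client library version you use.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.BigtableColumn;
import com.google.cloud.bigquery.BigtableColumnFamily;
import com.google.cloud.bigquery.BigtableOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TableResult;
import com.google.common.collect.ImmutableList;

public class QueryBigtableExternalTable {
  public static void main(String[] args) throws InterruptedException {
    // Placeholder URI; substitute your own project, instance, and table.
    String sourceUri =
        "https://googleapis.com/bigtable/projects/my-project"
            + "/instances/my-instance/tables/follows";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Expose the "username" qualifier ("dXNlcm5hbWU=" is its base64 encoding)
    // as a typed sub-field of the "follows" column family.
    BigtableColumn username =
        BigtableColumn.newBuilder()
            .setQualifierEncoded("dXNlcm5hbWU=")
            .setFieldName("username")
            .setType("STRING")
            .setEncoding("TEXT")
            .build();
    BigtableColumnFamily follows =
        BigtableColumnFamily.newBuilder()
            .setFamilyID("follows")
            .setOnlyReadLatest(true)
            .setColumns(ImmutableList.of(username))
            .build();
    BigtableOptions options =
        BigtableOptions.newBuilder()
            .setReadRowkeyAsString(true)
            .setColumnFamilies(ImmutableList.of(follows))
            .build();

    // Create the permanent external table in an existing dataset.
    ExternalTableDefinition tableDef =
        ExternalTableDefinition.newBuilder(sourceUri, options).build();
    bigquery.create(TableInfo.of(TableId.of("mydataset", "follows"), tableDef));

    // Query it like any other BigQuery table.
    TableResult result =
        bigquery.query(
            QueryJobConfiguration.of("SELECT COUNT(rowkey) FROM mydataset.follows"));
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}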
Querying Cloud Bigtable data using temporary external tables
To query an external data source without creating a permanent table, you run a command to combine:
- A table definition file with a query
- An inline schema definition with a query
- A JSON schema definition file with a query
The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table. Querying an external data source using a temporary table is supported by the bq command-line tool and the API.
When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others. Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
Creating and querying a temporary external table
To query Cloud Bigtable data using a temporary external table, you:
- Create a table definition file
- Submit both a query and a table definition file
Creating and querying a temporary external table is currently supported by the bq command-line tool and the API.
bq
To query a temporary table using a table definition file, enter the bq query command with the --external_table_definition flag.

(Optional) Supply the --location flag and set the value to your location.

bq --location=location query \
--use_legacy_sql=false \
--external_table_definition=table::definition_file \
'query'
Where:
- location is the name of your location. The --location flag is optional.
- table is the name of the temporary table you're creating.
- definition_file is the path to the table definition file on your local machine.
- query is the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named follows using a table definition file named follows_def.
bq query \
--use_legacy_sql=false \
--external_table_definition=follows::/tmp/follows_def \
'SELECT
COUNT(rowkey)
FROM
follows'
API
Create a query. See Querying data for information about creating a query job.
(Optional) Specify your location in the location property in the jobReference section of the job resource.

Specify the external data source properties by setting the ExternalDataConfiguration for the table resource, as sketched below.
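As a sketch, in a query job this takes the form of a tableDefinitions map in the query configuration, keyed by the name the query uses for the table. The URI and names below are placeholders.

{
  "configuration": {
    "query": {
      "query": "SELECT COUNT(rowkey) FROM follows",
      "useLegacySql": false,
      "tableDefinitions": {
        "follows": {
          "sourceFormat": "BIGTABLE",
          "sourceUris": [
            "https://googleapis.com/bigtable/projects/my-project/instances/my-instance/tables/follows"
          ]
        }
      }
    }
  }
}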
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
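Again, the official sample is not reproduced here; a minimal sketch with the Java client would pass the external definition directly to the query configuration, assuming hypothetical names and that addTableDefinition is available in your client library version:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.BigtableOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryBigtableTempTable {
  public static void main(String[] args) throws InterruptedException {
    // Placeholder URI; substitute your own project, instance, and table.
    String sourceUri =
        "https://googleapis.com/bigtable/projects/my-project"
            + "/instances/my-instance/tables/follows";

    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    ExternalTableDefinition tableDef =
        ExternalTableDefinition.newBuilder(
                sourceUri,
                BigtableOptions.newBuilder().setReadRowkeyAsString(true).build())
            .build();

    // Register the definition under the name the query refers to;
    // no table is created in any dataset.
    QueryJobConfiguration config =
        QueryJobConfiguration.newBuilder("SELECT COUNT(rowkey) FROM follows")
            .addTableDefinition("follows", tableDef)
            .build();

    TableResult result = bigquery.query(config);
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}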
Performance considerations
The performance of queries against Cloud Bigtable external data sources depends on three factors:
- The number of rows
- The amount of data read
- The extent of parallelization
BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Cloud Bigtable cluster and how many splits you have for your table.
Note that Cloud Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information on splitting a table by row key, see Managing tables.
Querying Cloud Bigtable from BigQuery consumes Cloud Bigtable CPU cycles. CPU consumption by BigQuery may affect latency and throughput for other concurrent requests, such as live user traffic serving. For example, high CPU usage on Cloud Bigtable affects long-tail queries and increases latency at the 99th percentile.
You should monitor Cloud Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Cloud Bigtable monitoring dashboard in the Google Cloud console. Increasing the number of nodes for your instance enables you to better handle both BigQuery traffic and traffic from other concurrent requests.
Generated schema
By default, BigQuery exposes the values in a column family as an array of columns and within that, an array of values written at different timestamps. This schema preserves the natural layout of data in Cloud Bigtable, but SQL queries can be challenging. It is possible to promote columns to subfields within the parent column family and to read only the latest value from each cell. This represents both of the arrays in the default schema as scalar values.
Example
You are storing user profiles for a fictional social network. One data model for this might be a profile column family with individual columns for gender, age, and email:
rowkey | profile:gender | profile:age | profile:email
-------|----------------|-------------|-----------------
alice  | female         | 30          | alice@gmail.com
Using the default schema, a legacy SQL query to count the number of male users over 30 is:

SELECT COUNT(1) FROM [dataset.table]
OMIT RECORD IF
  NOT SOME(profile.column.name = "gender" AND profile.column.cell.value = "male")
  OR NOT SOME(profile.column.name = "age" AND INTEGER(profile.column.cell.value) > 30)
Querying the data is less challenging if gender and age are exposed as sub-fields. To expose them as sub-fields, list gender and age as named columns in the profile column family when defining the table. You can also instruct BigQuery to expose the latest values from this column family, since typically only the latest value (and possibly the only value) is of interest.
After exposing the columns as sub-fields, the standard SQL query to count the number of male users over 30 is:
SELECT COUNT(1) FROM `dataset.table` WHERE profile.gender.cell.value="male" AND profile.age.cell.value > 30
Notice how gender and age are referenced directly as fields. The JSON configuration for this setup is:
"bigtableOptions": { "readRowkeyAsString": "true", "columnFamilies": [ { "familyId": "profile", "onlyReadLatest": "true", "columns": [ { "qualifierString": "gender", "type": "STRING" }, { "qualifierString": "age", "type": "INTEGER" } ] } ] }
Value encoding
Cloud Bigtable stores data as raw bytes, independent of any data encoding. However, byte values are of limited use in SQL query analysis, so BigQuery provides two basic types of scalar decoding: text and HBase-binary.
The text format assumes that all values are stored as alphanumeric text strings. For example, the integer 768 is stored as the string "768". The binary encoding assumes that HBase's Bytes.toBytes class of methods was used to encode the data, and applies an appropriate decoding method.
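For instance, with the HBase client utilities (this sketch assumes the hbase-common artifact is on the classpath), the two encodings of the integer 768 differ in both content and length:

import org.apache.hadoop.hbase.util.Bytes;

public class EncodingExample {
  public static void main(String[] args) {
    // Text encoding: the number is stored as the 3-byte string "768".
    byte[] text = Bytes.toBytes("768");
    // HBase binary encoding: the number is stored as an 8-byte big-endian long.
    byte[] binary = Bytes.toBytes(768L);
    System.out.println(text.length);   // 3
    System.out.println(binary.length); // 8
  }
}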
Query filters
Queries with a row equality filter only read that specific row. For example, in standard SQL syntax:
SELECT COUNT(follows.column.name) FROM `dataset.table` WHERE rowkey = "alice";
Range filters such as rowkey > '1' and rowkey < '8' are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.
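For example, assuming readRowkeyAsString is set in the table definition, a range scan over row keys looks like:

SELECT COUNT(follows.column.name) FROM `dataset.table`
WHERE rowkey > "1" AND rowkey < "8";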