Querying Cloud Bigtable data
This page describes how to use BigQuery to query data stored in Cloud Bigtable.
Bigtable is Google's sparsely populated NoSQL database that can scale to billions of rows, thousands of columns, and petabytes of data. Bigtable has a data model similar to Apache HBase and provides an HBase-compatible client library. For information about the Bigtable data model, see Storage model.
Supported regions and zones
Querying data in Bigtable is available in all supported Bigtable zones. For the list of supported zones, see Bigtable locations. For multi-cluster instances, BigQuery routes traffic based on Bigtable app profile settings.
Limitations
For information about limitations that apply to external tables, see External table limitations.
Retrieve the Bigtable URI
To create an external table for a Bigtable data source, you must provide the Bigtable URI. To retrieve the Bigtable URI:
Open the Bigtable console.
Retrieve the following details about your Bigtable data source:
- Your project ID
- Your Bigtable instance ID
- (Optional) The ID of your Bigtable app profile
- The name of your Bigtable table
Compose the Bigtable URI using the following format, where:
- project_id is the project containing your Bigtable instance
- instance_id is the Bigtable instance ID
- (Optional) app_profile is the app profile ID that you want to use
- table_name is the name of the table you're querying
https://googleapis.com/bigtable/projects/project_id/instances/instance_id[/appProfiles/app_profile]/tables/table_name
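For example, with the hypothetical IDs myproject, myBigtableInstance, and table1, and no app profile, the URI is:

https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/tables/table1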
Required permissions
To create and query an external table in BigQuery, you need the following IAM permissions:
- bigquery.tables.create
- bigquery.tables.getData
- bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to create and query an external table in BigQuery:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.admin (includes the bigquery.jobs.create permission)
- roles/bigquery.user (includes the bigquery.jobs.create permission)
- roles/bigquery.jobUser (includes the bigquery.jobs.create permission)
Additionally, if you have the bigquery.datasets.create permission, you can create and access external tables in the datasets that you create. However, you still need the bigquery.jobs.create permission to query the data.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
End user permissions
You can share access to a permanent external table with users (including service accounts) or groups. You cannot share access to a temporary external table. To query the external table, your users or groups need, at a minimum:

- The roles/bigquery.dataViewer role at the dataset level or higher, to access the dataset that contains the external table
- The roles/bigquery.user role at the project level or higher, to run query jobs (see the example following this list)
- The roles/bigtable.reader role in Bigtable, which provides read-only access to metadata and tables
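For example, a minimal sketch of granting the project-level role with the gcloud CLI; the project ID and user email here are hypothetical:

gcloud projects add-iam-policy-binding myproject \
    --member=user:analyst@example.com \
    --role=roles/bigquery.user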
Scopes for Compute Engine instances
When you create a Compute Engine instance, you can specify a list of scopes for the instance. The scopes control the instance's access to Google Cloud products, including Bigtable. Applications running on the VM use the service account to call Google Cloud APIs.
If you set up a Compute Engine instance to run as a service account, and that service account accesses an external table linked to a Bigtable data source, you must add the Bigtable read-only data access scope (https://www.googleapis.com/auth/bigtable.data.readonly) to the instance. For more information, see Creating a Compute Engine instance for Bigtable.
For information on applying scopes to a Compute Engine instance, see Changing the service account and access scopes for an instance. For more information on Compute Engine service accounts, see Service accounts.
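For example, a hedged sketch of applying the scope to an existing instance with the gcloud CLI; the instance name and zone are hypothetical, and the instance must be stopped before you change its scopes:

gcloud compute instances set-service-account my-bigquery-vm \
    --zone=us-central1-a \
    --scopes=https://www.googleapis.com/auth/bigtable.data.readonly,https://www.googleapis.com/auth/bigquery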
Using Bigtable app profiles
A Bigtable application profile, or app profile, stores settings that tell your Bigtable instance how to handle incoming requests from an application. When BigQuery connects to a Bigtable instance, it can specify an app profile, and Bigtable uses that app profile for requests from BigQuery.
To learn how Bigtable app profiles work, see About app profiles.
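For example, using the URI format from Retrieve the Bigtable URI, a hypothetical app profile named bq-profile would appear in the URI as follows:

https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/appProfiles/bq-profile/tables/table1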
Create a permanent external table
When you create a permanent external table in BigQuery that is linked to a Bigtable data source, there are three ways to specify the schema of the external table:

- If you are using the API or the bq command-line tool, you create a table definition file that defines the schema and metadata for the external table.
- If you are using SQL, you use the uri option of the CREATE EXTERNAL TABLE statement to specify the Bigtable table to pull data from, and the bigtable_options option to specify the table schema.
- If you are using the Google Cloud console, you enter the Bigtable column family and qualifiers manually.
The external table data is not stored in the BigQuery table. Because the table is permanent, you can use dataset-level access controls to share the table with others who also have access to the underlying Bigtable data source.
Before you create an external table, you must create a dataset to contain the table.
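For example, a minimal sketch of creating a dataset with a DDL statement; the dataset name is hypothetical:

CREATE SCHEMA mydataset;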
To create a permanent table:
Console
Currently it is not possible to create a Bigtable external table using the Google Cloud console.
SQL
You can create a permanent external table by running the CREATE EXTERNAL TABLE DDL statement. You must specify the table schema explicitly as part of the statement options.
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE EXTERNAL TABLE DATASET.NEW_TABLE
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['URI'],
  bigtable_options = BIGTABLE_OPTIONS
);
Replace the following:

- DATASET: the dataset in which to create the Bigtable external table.
- NEW_TABLE: the name for the Bigtable external table.
- URI: the URI for the Bigtable table you want to use as a data source. This URI must follow the format described in Retrieve the Bigtable URI.
- BIGTABLE_OPTIONS: the schema for the Bigtable table in JSON format. For a list of Bigtable table definition options, see BigtableOptions in the REST API reference.

Click Run.
For more information about how to run queries, see Running interactive queries.
A statement to create an external Bigtable table might look similar to the following:
CREATE EXTERNAL TABLE mydataset.BigtableTable
OPTIONS (
format = 'CLOUD_BIGTABLE',
uris = ['https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/tables/table1'],
bigtable_options =
"""
{
bigtableColumnFamilies: [
{
"familyId": "familyId1",
"type": "INTEGER",
"encoding": "BINARY"
}
],
readRowkeyAsString: true
}
"""
);
bq
You create a table in the bq command-line tool using the bq mk command. When you use the bq command-line tool to create a table linked to an external data source, you identify the table's schema using a table definition file.
Use the bq mk command to create a permanent table:

bq mk \
--external_table_definition=DEFINITION_FILE \
DATASET.TABLE
Replace the following:
- DEFINITION_FILE: the path to the table definition file on your local machine.
- DATASET: the name of the dataset that contains the table.
- TABLE: the name of the table you're creating.
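For example, a hedged sketch of a table definition file; the file contents mirror the ExternalDataConfiguration shape, and the IDs reuse the hypothetical values from the SQL example above:

{
  "sourceFormat": "BIGTABLE",
  "sourceUris": [
    "https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/tables/table1"
  ],
  "bigtableOptions": {
    "readRowkeyAsString": "true",
    "columnFamilies": [
      {
        "familyId": "familyId1",
        "type": "INTEGER",
        "encoding": "BINARY"
      }
    ]
  }
}

With that file saved as, say, /tmp/bigtable_def, the command would be:

bq mk \
--external_table_definition=/tmp/bigtable_def \
mydataset.BigtableTable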
API
Use the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

For the sourceUris property in the Table resource, specify only one Bigtable URI. It must be a valid HTTPS URL.

For the sourceFormat property, specify "BIGTABLE".
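For illustration, a minimal sketch of the Table resource you might pass to tables.insert; the project, dataset, and table IDs are hypothetical, and the schema reuses the earlier example:

{
  "tableReference": {
    "projectId": "myproject",
    "datasetId": "mydataset",
    "tableId": "BigtableTable"
  },
  "externalDataConfiguration": {
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
      "https://googleapis.com/bigtable/projects/myproject/instances/myBigtableInstance/tables/table1"
    ],
    "bigtableOptions": {
      "readRowkeyAsString": "true",
      "columnFamilies": [
        {
          "familyId": "familyId1",
          "type": "INTEGER",
          "encoding": "BINARY"
        }
      ]
    }
  }
}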
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
Query a permanent external table
Once you have created an external Bigtable table, you can run a query against it exactly as if it were a standard BigQuery table, subject to the limitations on external data sources.
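For example, assuming the mydataset.BigtableTable table created earlier, the following query counts the rows in the underlying Bigtable table:

SELECT COUNT(rowkey)
FROM mydataset.BigtableTable;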
Query Bigtable data using temporary external tables
Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
To query an external data source without creating a permanent table, you provide a table definition for the temporary table, and then use that table definition in a command or call to query the temporary table. You can provide the table definition in any of the following ways:
- A table definition file
- An inline schema definition
- A JSON schema file
The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table.
When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others.
Create and query a temporary external table
To query Bigtable data using a temporary external table, you:
- Create a table definition file
- Submit both a query and a table definition file
Creating and querying a temporary external table is currently supported by the bq command-line tool and the API.
bq
To query a temporary table using a table definition file, enter the bq query command with the --external_table_definition flag.

(Optional) Supply the --location flag and set the value to your location.

bq --location=LOCATION query \
--use_legacy_sql=false \
--external_table_definition=TABLE::DEFINITION_FILE \
'QUERY'
Replace the following:
- LOCATION: the name of your location. The --location flag is optional.
- TABLE: the name of the temporary table you're creating.
- DEFINITION_FILE: the path to the table definition file on your local machine.
- QUERY: the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table named follows using a table definition file named follows_def.
bq query \
--use_legacy_sql=false \
--external_table_definition=follows::/tmp/follows_def \
'SELECT
COUNT(rowkey)
FROM
follows'
API
Create a query. See Querying data for information about creating a query job.
(Optional) Specify your location in the location property in the jobReference section of the job resource.

Specify the external data source properties by setting the ExternalDataConfiguration for the table resource.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
Performance considerations
The performance of queries against Bigtable external data sources depends on three factors:
- The number of rows
- The amount of data read
- The extent of parallelization
BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Bigtable cluster and how many splits you have for your table.
Note that Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information on splitting a table by row key, see Managing tables.
Querying Bigtable from BigQuery consumes Bigtable CPU cycles. CPU consumption by BigQuery might affect latency and throughput for other concurrent requests, such as serving live user traffic. For example, high CPU usage on Bigtable affects long-tail queries and increases latency at the 99th percentile.
You should monitor Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Bigtable monitoring dashboard in the Google Cloud console. Increasing the number of nodes for your instance enables you to better handle both BigQuery traffic and traffic from other concurrent requests.
Generated schema
By default, BigQuery exposes the values in a column family as an array of columns and within that, an array of values written at different timestamps. This schema preserves the natural layout of data in Bigtable, but SQL queries can be challenging. It is possible to promote columns to subfields within the parent column family and to read only the latest value from each cell. This represents both of the arrays in the default schema as scalar values.
Example
You are storing user profiles for a fictional social network. One data model for this might be a profile column family with individual columns for gender, age, and email:

rowkey | profile:gender | profile:age | profile:email
-------|----------------|-------------|----------------
alice  | female         | 30          | alice@gmail.com
Using the default schema, a legacy SQL query to count the number of male users over 30 is:

SELECT COUNT(1)
FROM [dataset.table]
OMIT RECORD IF
  NOT SOME(profile.column.name = "gender" AND profile.column.cell.value = "male")
  OR NOT SOME(profile.column.name = "age" AND INTEGER(profile.column.cell.value) > 30)
Querying the data is less challenging if gender and age are exposed as sub-fields. To expose them as sub-fields, list gender and age as named columns in the profile column family when defining the table. You can also instruct BigQuery to expose only the latest values from this column family, since typically the latest value (and possibly the only value) is of interest.
After exposing the columns as sub-fields, the Google Standard SQL query to count the number of male users over 30 is:
SELECT COUNT(1)
FROM `dataset.table`
WHERE profile.gender.cell.value = "male"
  AND profile.age.cell.value > 30
Notice how gender and age are referenced directly as fields. The JSON configuration for this setup is:
"bigtableOptions": { "readRowkeyAsString": "true", "columnFamilies": [ { "familyId": "profile", "onlyReadLatest": "true", "columns": [ { "qualifierString": "gender", "type": "STRING" }, { "qualifierString": "age", "type": "INTEGER" } ] } ] }
Value encoding
Bigtable stores data as raw bytes, independent of data encoding. However, byte values are of limited use in SQL analysis. Two basic types of scalar decoding are provided: text and HBase-binary.
The text format assumes that all values are stored as alphanumeric text strings. For example, the integer 768 is stored as the string "768". The binary encoding assumes that HBase's Bytes.toBytes class of methods was used to encode the data, and it applies an appropriate decoding method.
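For example, a hedged sketch of mixing the two encodings in a table definition; the family and qualifier names are hypothetical:

"columnFamilies": [
  {
    "familyId": "stats",
    "columns": [
      {
        "qualifierString": "views",
        "type": "INTEGER",
        "encoding": "BINARY"
      },
      {
        "qualifierString": "country",
        "type": "STRING",
        "encoding": "TEXT"
      }
    ]
  }
]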
Query filters
Queries with a row equality filter only read that specific row. For example, in Google Standard SQL syntax:
SELECT COUNT(follows.column.name) FROM `dataset.table` WHERE rowkey = "alice";
Range filters such as rowkey > '1' and rowkey < '8' are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.
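For example, a minimal sketch that combines both bounds; the table name follows the earlier examples:

SELECT COUNT(rowkey)
FROM `dataset.table`
WHERE rowkey > "1" AND rowkey < "8";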