Query Bigtable data
This document describes how to query data stored in a Bigtable external table.
Bigtable is Google's sparsely populated NoSQL database that can scale to billions of rows, thousands of columns, and petabytes of data. For information on the Bigtable data model, see Storage model.
Query permanent external tables
Before you begin, you or someone in your organization must create an external table for you to use. For details and required permissions, see Create a BigQuery external table.
Required roles
To query Bigtable permanent external tables, ensure you have the following roles:
- BigQuery Data Viewer (
roles/bigquery.dataViewer
) - BigQuery User (
roles/bigquery.user
) - Bigtable Reader (
roles/bigtable.reader
)
Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:
Required permissions
bigquery.jobs.create
bigquery.readsessions.create
(Only required if you are streaming data with the BigQuery Storage Write API)bigquery.tables.get
bigquery.tables.getData
You might also be able to get these permissions with custom roles or other predefined roles.
Query the table
You can run a query against a permanent external Bigtable table exactly as if it were a standard BigQuery table, subject to the limitations on external data sources. For more information, see Run interactive and batch queries.
Query temporary external tables
Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.
To query an external data source without creating a permanent table, you provide a table definition for the temporary table, and then use that table definition in a command or call to query the temporary table. You can provide the table definition in any of the following ways:
- A table definition file
- An inline schema definition
- A JSON schema file
The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table.
When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others.
Using a temporary external table instead of a permanent external table has some limitations, including the following:
- You must have the Bigtable Admin (
roles/bigtable.admin
) role. - This approach does not let you use the Google Cloud console to infer the schema of the Bigtable table and automatically create the table definition. You must create the table definition yourself.
Required roles
To query Bigtable temporary external tables, ensure you have the following roles:
- BigQuery Data Viewer (
roles/bigquery.dataViewer
) - BigQuery User (
roles/bigquery.user
) - Bigtable Admin (
roles/bigtable.admin
)
Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.
To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:
Required permissions
bigquery.jobs.create
bigquery.readsessions.create
(Only required if you are streaming data with the BigQuery Storage Write API)bigquery.tables.get
bigquery.tables.getData
You might also be able to get these permissions with custom roles or other predefined roles.
Create and query the table
To query Bigtable data using a temporary external table, you:
- Create a table definition file
- Submit both a query and a table definition file
Creating and querying a temporary external table is supported by the bq command-line tool and the API.
bq
To query a temporary table using a table definition file, enter the
bq query
command with the --external_table_definition
flag.
(Optional) Supply the --location
flag and set the value to your
location.
bq --location=LOCATION query \ --use_legacy_sql=false \ --external_table_definition=TABLE::DEFINITION_FILE \ 'QUERY'
Replace the following:
LOCATION
: the name of your location. The--location
flag is optional.TABLE
: the name of the temporary table you're creating.DEFINITION_FILE
: the path to the table definition file on your local machine.QUERY
: the query you're submitting to the temporary table.
For example, the following command creates and queries a temporary table
named follows
using a table definition file named follows_def
.
bq query \
--use_legacy_sql=false \
--external_table_definition=follows::/tmp/follows_def \
'SELECT
COUNT(rowkey)
FROM
follows'
API
Create a query. See Querying data for information about creating a query job.
(Optional) Specify your location in the
location
property in thejobReference
section of the job resource.Specify the external data source properties by setting the
ExternalDataConfiguration
for the table resource.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Performance considerations
The performance of queries against Bigtable external data sources depends on three factors:
- The number of rows
- The amount of data read
- The extent of parallelization
BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Bigtable cluster and how many splits you have for your table.
Note that Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information on splitting a table by row key, see Managing tables.
Querying Bigtable from BigQuery consumes Bigtable CPU cycles. CPU consumption by BigQuery might affect latency and throughput for other concurrent requests such as live user traffic serving. For example, high CPU usage on Bigtable affects long-tail queries and increases latency at the 99th percentile.
You should monitor Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Bigtable monitoring dashboard in the Google Cloud console. Increasing the number of nodes for your instance lets you handle both BigQuery traffic and traffic from other concurrent requests.
Query filters
Queries with a row equality filter only read that specific row. For example, in GoogleSQL syntax:
SELECT COUNT(follows.column.name) FROM `dataset.table` WHERE rowkey = "alice";
Range filters such as rowkey > '1'
and rowkey < '8'
are also supported, but
only when rowkey is read as a string with the readRowkeyAsString
option.