Query and analyze Bigtable data with BigQuery

BigQuery is a managed data warehouse that can help you query and analyze your Bigtable data using SQL queries. BigQuery is useful for data analysts, data engineers, data scientists, or anyone who wants to use Bigtable data to answer business questions.

BigQuery lets you query your Bigtable data from BigQuery. This feature is helpful when you want to join your Bigtable data to BigQuery tables.

This document provides an overview of querying Bigtable data with BigQuery. Before you read this page you should be familiar with the Bigtable overview and BigQuery overview

Using BigQuery to query a Bigtable table is ideal for tables that have the same column families and column qualifiers in every row.

External table creation

Before you can query your Bigtable data, you — or an administrator in your organization — must create an external table, which is a BigQuery table containing metadata pointers to your Bigtable table that you send your queries to. For more on external tables, see Introduction to external data sources.

You must create the external table in the same region as the Bigtable table. This means, for example, that if the table is in an instance that has clusters in europe-central2-a (Warsaw), europe-west1-c (Belgium), and asia-east1-a (Tokyo), then you must create the external table in Warsaw, Belgium, or Tokyo.

For most cases, when you create an external table, set readRowkeyAsString and ignoreUnspecifiedColumnFamilies to true.

When ignoreUnspecifiedColumnFamilies is true, when you create a table definition that includes only some columns in a column family, only the selected columns are promoted as columns in the external table. Data in the unselected columns is grouped under a general column column.

To create your external table, follow the instructions at Create a Bigtable external table.

Query data in the external table

After you have an external table for your Bigtable table, you can send SQL queries to it using one of the following methods:

To learn how to compose and run a query, see Run a query. For Bigtable-specific instructions including required permissions and code samples, see Query Bigtable data.

Scheduled queries

Scheduled queries are useful when you want to import Bigtable data into BigQuery on a recurring basis. They are also helpful for use cases that might otherwise require you to build a data pipeline and stream the data to BigQuery. For instructions on managing scheduled queries, see Scheduling queries.

Avoid full table scans

Similar to when you send read requests directly to your Bigtable table, when you query the external table for the table, you generally want to avoid full table scans. Full table scans increase CPU utilization and take considerably longer than selective queries. They also require more BigQuery throughput.

If your query involves all rows, it triggers a full table scan. On the other hand, if you limit the query and request a range of rows or specified non-contiguous rows, then the entire table is not scanned. Examples in GoogleSQL syntax of limiting the query include the following:

  • WHERE rowkey = "abc123"
  • WHERE rowkey BETWEEN "abc123" PRECEDING AND "abc999" FOLLOWING
  • WHERE rowkey > 999999 (if you read the row key as a string)

Joins

If you plan to use a join to analyze your Bigtable table data in conjunction with data from another source, you should create a subquery that extracts the relevant fields from Bigtable for the planned join. For more best practices when joining tables, see Optimize query computation.

Costs

When you create an external table and query it, you are charged for BigQuery costs and for an increase in the number of Bigtable nodes that are required to handle the traffic. Because your external table is in the same region as your Bigtable table, no network costs are incurred.

If you tend to run your queries during regular business hours, consider enabling Bigtable autoscaling so that the number of nodes increases when needed and then decreases when the jobs are complete. Autoscaling is also an effective tactic if you run scheduled queries that don't have firm deadlines.

Another way to limit costs is to avoid a full table scan.

For more information on cost optimization for BigQuery, see Estimate and control costs.

Limitations

The following limitations apply:

  • You can't use BigQuery to query Bigtable data that is stored in JSON format.
  • Query results that contain serialized data that has nested types, such as protocol buffer (protobufs) and Avro formats, might render incorrectly or be difficult to read in the Google Cloud console.

What's next