Run federated queries with Data Boost

This page explains how to use Spanner Data Boost when you run federated queries from BigQuery to a Spanner database. With Data Boost, federated queries run with minimal impact to existing workloads on the provisioned Spanner instance. The Data Boost queries from BigQuery to a Spanner database can join BigQuery data with Spanner data.

Spanner federation lets BigQuery query data residing in Spanner in real time, without copying or moving data. To learn more about Spanner federated queries, see Spanner federated queries. To learn about Data Boost, see Data Boost overview.

Before you begin

Before you can run federated queries with Data Boost, you need to complete the following tasks:

Create a Spanner instance and database

If you don't have a Spanner instance and database, follow the steps in Create and query a database using the Google Cloud console to create them.

Enable the BigQuery connection API

The BigQuery connection API lets you manage BigQuery connections to external data sources such as a Spanner database.

  • Enable the BigQuery connection API.

    Enable the API

For more information, see BigQuery connection API in the BigQuery documentation.

Grant IAM permissions for Data Boost to principals

A principal must be granted the following permissions to run federated queries with Data Boost:

  • spanner.instances.get - lets you get the configuration of an instance.
  • spanner.databases.useDataBoost - lets you use the compute resources of Spanner Data Boost to process partitioned queries

For more information about Spanner permissions, see Identity and Access Management (IAM) permissions.

To grant these required permissions, we recommend that you use the Cloud Spanner Database Reader With DataBoost (roles/spanner.databaseReaderWithDataBoost) IAM role. You can add that role to any principal that needs to be able to run federated queries with Data Boost. To learn more about predefined roles in Spanner, see Predefined roles. To learn how to create a custom IAM role, see Create a custom role.

Run a federated Data Boost query

To run a Data Boost query from BigQuery to an external source, you need a BigQuery connection to the external source and the ID of the connection. When you run a federated Spanner query with Data Boost, the external source is a Spanner database. After you create your connection ID, it's used by BigQuery to run a Data Boost query of a Spanner database.

Use one of the following options to create a BigQuery connection ID, and then use the connection ID to run a Data Boost query from BigQuery:

  1. Start in Spanner - Create the BigQuery external connection ID in the Spanner console. After your connection ID is created in the Spanner console, you're redirected to the BigQuery console to run a federated Data Boost query to a Spanner database.

  2. Start in BigQuery - Create the Data Boost external connection ID in the BigQuery console or using the bq command-line tool. After you create the connection ID, you stay in the BigQuery console to run a federated Data Boost query to a Spanner database.

Start in Spanner to run a Data Boost query

To run a federated Data Boost query starting in the Spanner Studio, do the following:

  1. Go to the Spanner Instances page in the Google Cloud console.

    Go to the Instances page

    The console shows a list of your Spanner instances.

  2. Select a Spanner instance, and then select a database.

  3. On the Database overview page, in the navigation menu, click Spanner Studio.

  4. Click View in BiqQuery.

  5. In the View in BigQuery dialog, enter a connection ID.

    The connection ID is used to create a new BigQuery external connection to your Spanner database. You reference your external connection using the following pattern:

    PROJECT-ID.LOCATION.CONNECTION-ID
    

    An error occurs if the ID already exists.

  6. Fill in the rest of the dialog and do the following:

    • Select Read data in parallel.
    • Select Use Spanner Data Boost.
  7. Click View in BigQuery.

    BigQuery Studio opens with the following query:

    SELECT * FROM EXTERNAL_QUERY("PROJECT-ID.LOCATION.CONNECTION-ID", "SELECT * FROM INFORMATION_SCHEMA.TABLES;");
    

    You can replace this with your federated query. For example, you might make a query that's similar to the following example. This example makes a federated query from a table named orders in a Spanner database and joins the results with a BigQuery table named mydataset.customers.

    SELECT c.customer_id, c.name, rq.first_order_date
    FROM mydataset.customers AS c
    LEFT OUTER JOIN EXTERNAL_QUERY(
      'my-project.us.example-db',
      '''SELECT customer_id, MIN(order_date) AS first_order_date
      FROM orders
      GROUP BY customer_id''') AS rq
      ON rq.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, rq.first_order_date;

Start in BigQuery to run a Data Boost query

To create an external data connection from BigQuery to a Spanner database and use that connection to run a federated Data Boost query from BigQuery, select one of the following options:

Console

  1. Go to Create Spanner connections in the BigQuery documentation and follow the instructions on the Console tab.

  2. In the External data source pane, do the following:

    • Select Read data in parallel.
    • Select Use Spanner Data Boost.

bq

  1. Go to Create Spanner connections in the BigQuery documentation and follow instructions on the bq* tab.

  2. Set the following connection properties to true:

    • useParallelism
    • useDataBoost

The following example uses the bq mk command to create a new connection named my_connection with the two required properties for Data Boost:

bq mk --connection --connection_type='CLOUD_SPANNER' --location='us' \
--properties='{"database":"projects/my-project/instances/my-instance/databases/my-database", "useParallelism":true, "useDataBoost": true}' my_connection

What's next