Spanner federated queries

As a data analyst, you can query data in Spanner from BigQuery using federated queries.

BigQuery Spanner federation enables BigQuery to query data residing in Spanner in real-time, without copying or moving data.

Before you begin

  • Ensure that your BigQuery administrator has created a Spanner connection and shared it with you. See Choose the right connection.
  • To get the permissions that you need to query a Spanner instance, ask your administrator to grant you the BigQuery Connection User (roles/bigquery.connectionUser) Identity and Access Management (IAM) role. You also need to ask your administrator to grant you one of the following:
    • If you are a fine-grained access control user, you need access to a database role that has the SELECT privilege on all Spanner schema objects in your queries.
    • If you are not a fine-grained access control user, you need the Cloud Spanner Database Reader (roles/spanner.databaseReader) IAM role.

    For information about granting IAM roles, see Manage access to projects, folders, and organizations. For information about fine-grained access control, see About fine-grained access control.

Choose the right connection

If you are a Spanner fine-grained access control user, when you run a federated query, you must use a Spanner connection that specifies a database role. Then all queries that you run with this connection use that database role.

If you use a connection that doesn't specify a database role, you must have the IAM roles indicated in Before you begin.

Spanner Data Boost

Data Boost is a fully managed, serverless feature that provides independent compute resources for supported Spanner workloads. Data Boost lets you execute analytics queries and data exports with near-zero impact to existing workloads on the provisioned Spanner instance. Data Boost lets you run federated queries with independent compute capacity separate from your provisioned instances to avoid impacting existing workloads on Spanner. Data Boost is most impactful when you run complex ad hoc queries, or when you want to process large amounts of data without impacting the existing Spanner workload. Running federated queries with Data Boost can lead to significantly lower CPU consumption, and in some cases, lower query latency.

Before you begin

To get the permission that you need to enable access to Data Boost, ask your administrator to grant you the Cloud Spanner Database Admin (roles/spanner.databaseAdmin) IAM role on the Spanner database. For more information about granting roles, see Manage access.

This predefined role contains the spanner.databases.useDataBoost permission, which is required to enable access to Data Boost.

You might also be able to get this permission with custom roles or other predefined roles.

Enable Data Boost

To enable Data Boost on your federated queries to Spanner, you must first make a connection to Spanner. After enabling Data Boost within the connection, query data to send a federated query to Spanner.

Query data

To send a federated query to Spanner from a GoogleSQL query, use the EXTERNAL_QUERY function.

Formulate your Spanner query in either GoogleSQL or PostgreSQL, depending on the specified dialect of the database.

The following example makes a federated query to a Spanner database named orders and joins the results with a BigQuery table named mydataset.customers.

SELECT c.customer_id, c.name, rq.first_order_date
FROM mydataset.customers AS c
LEFT OUTER JOIN EXTERNAL_QUERY(
  'my-project.us.example-db',
  '''SELECT customer_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY customer_id''') AS rq
  ON rq.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, rq.first_order_date;

Read data in parallel

Spanner can divide certain queries into smaller pieces, or partitions, and fetch the partitions in parallel. For more information, see Read data in parallel in the Spanner documentation.

To enable parallel reads in federated queries, configure this setting when you create the connection resource. This option divides the SQL query into smaller partitions and fetches each partition in parallel. However, this option is restricted to queries that meet one of the following conditions:

Other queries return an error. To view the query execution plan for a Spanner query, see Understand how Spanner executes queries.

Manage query execution priority

You can assign priority (high, medium, or low) to individual queries, by specifying the query_execution_priority option, as shown below:

SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.example-db',
  '''SELECT customer_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY customer_id''',
  '{"query_execution_priority":"high"}');

The default priority is medium.

Queries with priority high will compete with transactional traffic. Queries with priority low are best-effort, and might get preempted by background load, for example scheduled backups.

View a Spanner table schema

You can use the EXTERNAL_QUERY function to query information_schema views to access database metadata, such as listing all tables in the database or showing a table schema. The following example returns information about the columns in the table MyTable:

Google SQL database

SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.example-db',
  '''SELECT t.column_name, t.spanner_type, t.is_nullable
    FROM information_schema.columns AS t
    WHERE
      t.table_catalog = ''
      AND t.table_schema = ''
     AND t.table_name = 'MyTable'
    ORDER BY t.ordinal_position
  ''');

PostgreSQL database

SELECT * from EXTERNAL_QUERY(
 'my-project.us.postgresql.example-db',
  '''SELECT t.column_name, t.data_type, t.is_nullable
    FROM information_schema.columns AS t
    WHERE
      t.table_schema = 'public' and t.table_name='MyTable'
    ORDER BY t.ordinal_position
  ''');

For more information, see the following information schema references in the Spanner documentation:

Troubleshooting

This section helps you troubleshoot issues you might encounter when sending a federated query to Spanner.

Issue: Query is not root partitionable.
Resolution: If you configure the connection to read data in parallel, either the first operator in the query execution plan must be a distributed union, or your execution plan must not have any distributed unions. To resolve this error, view the query execution plan and rewrite the query. For more information, see Understand how Spanner executes queries.
Issue: Deadline exceeded.
Resolution: Select the option to read data in parallel and rewrite the query to be root partitionable. For more information, see Understand how Spanner executes queries.

What's next