Cloud Spanner federated queries
This page describes how to query data in Cloud Spanner from BigQuery using federated queries.
Overview
BigQuery Spanner federation enables BigQuery to query data residing in Spanner in real-time, without copying or moving data.
After the initial one-time set up, you can write a query with the SQL
EXTERNAL_QUERY
function.
Permissions
To create the connection to the Spanner database, you must have the permissions described in Working with connections.
To query the Spanner database, you must have the following permissions:
spanner.databases.select
spanner.instances.get
spanner.sessions.create
To perform parallel reads, you must have the
spanner.databases.partitionQuery
permission.
Before you begin
Enable the BigQuery connection service
- Open the BigQuery Connection API page in the API library.
- From the drop-down menu, select the project that contains your external data source.
Click the ENABLE button.
Setting up Spanner database connections
Once the BigQuery Connection API is enabled, create a connection to the Spanner database.
Console
To create a connection resource, go to the BigQuery page in the console.
In the
Add data menu, select External data source.In the External data source pane, enter the following information:
- For Connection type, select Spanner.
- For Connection ID, enter an identifier for the connection resource. Letter, numbers, and underscores are allowed.
- For Connection location, select a BigQuery location (or region) that is compatible with your external data source region.
- Optional: For Friendly name, enter a user-friendly name for the
connection, such as
My connection resource
. The friendly name can be any value that helps you identify the connection resource if you need to modify it later. - Optional: For Description, enter a description for this connection resource.
- For Database name, enter the name of the Spanner
in the following format:
"projects/PROJECT_ID/instances/INSTANCE/databases/DATABASE"
- Optional: To perform parallel reads, select Read data in parallel. For more information, see Read data in parallel
Click Create connection.
bq
To create the connection, use the
bq mk
command
with the --connection
flag.
bq mk --connection \ --connection_type=CLOUD_SPANNER \ --properties='PROPERTIES' \ --location=LOCATION \ --display_name='FRIENDLY_NAME' \ --description 'DESCRIPTION' \ CONNECTION_ID
Replace the following:
PROPERTIES: A JSON object with the following fields:
"database"
: The Spanner database for the connection. Specify as a string with the following format:"projects/PROJECT_ID/instances/INSTANCE/databases/DATABASE"
."use_parallelism"
: Optional. Iftrue
, this connection performs parallel reads. The default value isfalse
. For more information, see Read data in parallel
LOCATION: A BigQuery location that is compatible with your external data source region.
FRIENDLY_NAME: Optional. A user-friendly name for the connection.
DESCRIPTION: Optional. A description for this connection.
CONNECTION_ID: Optional. An identifier for the connection resource. The connection ID can contain letters, numbers and underscores. If you don't provide a connection ID, BigQuery automatically generates a unique ID.
The following example creates a new connection resource named
my_connection_id
.
bq mk --connection \
--connection_type='CLOUD_SPANNER' \
--properties='{"database":"projects/my_project/instances/my_instance/databases/database1"}' \
--project_id=federation-test \
--location=us \
my_connection_id
API
Within the BigQuery Connection API, call the CreateConnection
method within
the ConnectionService
service to create a connection resource.
Query data in Spanner
To send a federated query to Spanner from a
BigQuery standard SQL query, use the
EXTERNAL_QUERY
function.
Formulate your Spanner query in either Google Standard SQL or PostgreSQL, depending on the specified dialect of the database.
The following example makes a federated query to a Spanner
database named orders
and joins the results with a BigQuery
table named mydataset.customers
.
SELECT c.customer_id, c.name, rq.first_order_date FROM mydataset.customers AS c LEFT OUTER JOIN EXTERNAL_QUERY( 'my-project.us.example-db', '''SELECT customer_id, MIN(order_date) AS first_order_date FROM orders GROUP BY customer_id''') AS rq ON rq.customer_id = c.customer_id GROUP BY c.customer_id, c.name, rq.first_order_date;
Read data in parallel
Spanner can divide certain queries into smaller pieces, or partitions, and fetch the partitions in parallel. For more information, see Read data in parallel in the Spanner documentation.
To enable parallel reads in federated queries, configure this setting when you create the connection resource. This option divides the SQL query into smaller partitions and fetches each partition in parallel. However, this option is restricted to queries whose first operator in the execution plan is a distributed union operator. Other queries return an error. To view the query execution plan for a Spanner query, see Understand how Cloud Spanner executes queries.
Manage query execution priority
You can assign priority (high
, medium
, or low
) to individual queries,
by specifying the query_execution_priority
option, as shown below:
SELECT * FROM EXTERNAL_QUERY( 'my-project.us.example-db', '''SELECT customer_id, MIN(order_date) AS first_order_date FROM orders GROUP BY customer_id''', '{"query_execution_priority":"high"}');
The default priority is medium
.
Queries with priority high
will compete with transactional traffic.
Queries with priority low
are best-effort, and might get preempted by
background load, for example scheduled backups.
View a Spanner table schema
You can use the EXTERNAL_QUERY
function to query information_schema views to
access database metadata, such as listing all tables in the database or showing
a table schema. The following example returns information about the columns in
the table MyTable
:
Google SQL database
SELECT * FROM EXTERNAL_QUERY( 'my-project.us.example-db', '''SELECT t.column_name, t.spanner_type, t.is_nullable FROM information_schema.columns AS t WHERE t.table_catalog = '' AND t.table_schema = '' AND t.table_name = 'MyTable' ORDER BY t.ordinal_position ''');
PostgreSQL database
SELECT * from EXTERNAL_QUERY(
'my-project.us.postgresql.example-db',
'''SELECT t.column_name, t.data_type, t.is_nullable
FROM information_schema.columns AS t
WHERE
t.table_schema = 'public' and t.table_name='MyTable'
ORDER BY t.ordinal_position
''');
For more information, see the following information schema references in the Spanner documentation:
Data type mappings
When you execute a Spanner federated query, the data from Spanner is converted to BigQuery standard SQL types.
Spanner Google Standard SQL type | Spanner PostgreSQL type | BigQuery type |
---|---|---|
ARRAY |
- | ARRAY |
BOOL |
bool |
BOOL |
BYTES |
bytea |
BYTES |
DATE |
date |
DATE |
FLOAT64 |
float8 |
FLOAT64 |
INT64 |
bigint |
INT64 |
NUMERIC |
numeric * |
NUMERIC |
STRING |
varchar |
STRING |
STRUCT
|
- | Not supported for Spanner federated queries |
TIMESTAMP |
timestamptz |
TIMESTAMP with nanoseconds truncated |
* PostgreSQL numeric values with a precision that is greater than
the precision that BigQuery supports are rounded. Values
that are larger than the maximum value generate an Invalid NUMERIC value
error.
If your external query contains a data type that is unsupported for federated queries, the query fails immediately. You can cast the unsupported data type to a supported data type.
Supported regions
Federated queries are only supported in locations that support both Spanner and BigQuery.
- For a list of supported Spanner locations, see Regional configurations.
- For a list of supported BigQuery locations, see Dataset locations.
There are two types of locations:
A region is a specific geographic place, such as London.
A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.
You can create a connection and execute a federated query across regions according to the following rules.
Multi-regions
A BigQuery multi-region can query any data source region in the same large geographic area (US, EU), for example:
- The BigQuery US multi-region can query any single
region in the US geographic area, such as
us-central1
,us-east4
,us-west2
, and so on. - The BigQuery EU multi-region can query any single
region in
member states of the
European Union, such as
europe-north1
,europe-west3
, and so on. - The connection used in the query must reside in the same location as the
query location. For example queries executed from the
US
multi-region must reference a connection located in theUS
multi-region.
The
query processing location
is the multi-region location, either US
or EU
.
Single regions
A BigQuery single region can only query a resource in the same region. For example:
- The BigQuery single region
us-east4
can only query Spanner inus-east4
.
In this example, the query processing location is the BigQuery single region.
Limitations
Spanner federated queries are subject to the following limitations:
Performance. A federated query is likely to not be as fast as querying only BigQuery storage. BigQuery needs to wait for the source database to execute the external query and temporarily move data from the external data source to BigQuery. Also, the source database might not be optimized for complex analytical queries.
Federated queries are read-only. The external query that is executed in the source database must be read-only. Therefore, DML or DDL statements are not supported.
Unsupported data types. If your external query contains a data type that is unsupported in BigQuery, the query fails immediately. You can cast the unsupported data type to a different supported data type.
Troubleshooting
This section helps you troubleshoot issues you might encounter when sending a federated query to Spanner.
- Issue: Query is not root partitionable.
- Resolution: If you configure the connection to read data in parallel, the first operator in the query execution plan must be Distributed Union. To resolve this error, view the query execution plan and rewrite the query. For more information, see Understand how Cloud Spanner executes queries.
- Issue: Deadline exceeded.
- Resolution: Select the option to read data in parallel and rewrite the query to be root partitionable. For more information, see Understand how Cloud Spanner executes queries.