The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.
Pipeline requirements
- The source Spanner table must exist prior to running the pipeline.
- The BigQuery dataset must exist prior to running the pipeline.
- A JSON file that describes your BigQuery schema.
The file must contain a top-level JSON array titled fields. The contents of the fields array must use the following pattern: {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.
The following JSON describes an example BigQuery schema:

{
  "fields": [
    { "name": "location", "type": "STRING" },
    { "name": "name", "type": "STRING" },
    { "name": "age", "type": "STRING" },
    { "name": "color", "type": "STRING" },
    { "name": "coffee", "type": "STRING" }
  ]
}
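A schema file can be sanity-checked locally before you upload it to Cloud Storage. The following Python sketch is not part of the template; it simply verifies the structure described above (a top-level fields array whose entries each have a name and a type). The file name schema.json is a placeholder.

```python
import json

def validate_bq_schema(path):
    """Check that a schema file has a top-level "fields" array whose
    entries each carry a "name" and a "type", as the template expects."""
    with open(path) as f:
        schema = json.load(f)
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        raise ValueError('schema must contain a non-empty top-level "fields" array')
    for i, field in enumerate(fields):
        if not isinstance(field, dict) or "name" not in field or "type" not in field:
            raise ValueError(f'fields[{i}] must match {{"name": ..., "type": ...}}')
    return fields

# Write the sample schema from above to disk, then validate it.
sample = {"fields": [{"name": "location", "type": "STRING"},
                     {"name": "name", "type": "STRING"}]}
with open("schema.json", "w") as f:
    json.dump(sample, f)

print([f["name"] for f in validate_bq_schema("schema.json")])
```

Running a check like this before launching the pipeline catches malformed schema files early, rather than at job run time.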
The Spanner to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.
Template parameters
| Parameter | Description |
|---|---|
| spannerInstanceId | The instance ID of the Spanner database to read from. |
| spannerDatabaseId | The database ID of the Spanner database to export. |
| spannerTableId | The table name of the Spanner database to export. |
| sqlQuery | The SQL query to use to read data from the Spanner database. |
| outputTableSpec | The BigQuery table to write to. For example, `<project>:<dataset>.<table_name>`. |
| spannerProjectId | Optional: The ID of the project that the Spanner database resides in. The default value is the project where the Dataflow pipeline is running. |
| spannerRpcPriority | Optional: The request priority for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is HIGH. |
| bigQuerySchemaPath | Optional: The Cloud Storage path (gs://) to the JSON file that defines your BigQuery schema. For example, gs://path/to/my/schema.json. |
| writeDisposition | Optional: The BigQuery WriteDisposition. Supported values are WRITE_APPEND, WRITE_EMPTY, and WRITE_TRUNCATE. The default value is WRITE_APPEND. |
| createDisposition | Optional: The BigQuery CreateDisposition. Supported values are CREATE_IF_NEEDED and CREATE_NEVER. The default value is CREATE_IF_NEEDED. |
| useStorageWriteApi | Optional: If true, the pipeline uses the BigQuery Storage Write API. The default value is false. For more information, see Using the Storage Write API. |
| useStorageWriteApiAtLeastOnce | Optional: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics, set this parameter to true. To use exactly-once semantics, set it to false. This parameter applies only when useStorageWriteApi is true. The default value is false. |
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
       spannerInstanceId=SPANNER_INSTANCE_ID,\
       spannerDatabaseId=SPANNER_DATABASE_ID,\
       spannerTableId=SPANNER_TABLE_ID,\
       sqlQuery=SQL_QUERY,\
       outputTableSpec=OUTPUT_TABLE_SPEC
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
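As a sketch of how the placeholders combine, the following Python snippet assembles the gcloud invocation shown above without running it. All values are hypothetical; note that the --parameters value is a single comma-separated string, so parameter values themselves must not contain unescaped commas.

```python
# Hypothetical placeholder values; substitute your own.
job = {
    "JOB_NAME": "spanner-to-bq-demo",
    "VERSION": "latest",
    "REGION_NAME": "us-central1",
    "PROJECT_ID": "my-project",
    "SPANNER_INSTANCE_ID": "my-instance",
    "SPANNER_DATABASE_ID": "my-database",
    "SPANNER_TABLE_ID": "my_table",
    "SQL_QUERY": "SELECT * FROM my_table",
    "OUTPUT_TABLE_SPEC": "my-project:my_dataset.my_table",
}

# The template parameters are passed as one comma-separated key=value string.
params = ",".join(f"{k}={v}" for k, v in [
    ("spannerInstanceId", job["SPANNER_INSTANCE_ID"]),
    ("spannerDatabaseId", job["SPANNER_DATABASE_ID"]),
    ("spannerTableId", job["SPANNER_TABLE_ID"]),
    ("sqlQuery", job["SQL_QUERY"]),
    ("outputTableSpec", job["OUTPUT_TABLE_SPEC"]),
])

# Full argument vector mirroring the gcloud command above.
argv = [
    "gcloud", "dataflow", "flex-template", "run", job["JOB_NAME"],
    "--template-file-gcs-location="
    f"gs://dataflow-templates-{job['REGION_NAME']}/{job['VERSION']}"
    "/flex/Cloud_Spanner_to_BigQuery_Flex",
    f"--project={job['PROJECT_ID']}",
    f"--region={job['REGION_NAME']}",
    f"--parameters={params}",
]
print(" ".join(argv))
```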
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.locations.flexTemplates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launchParameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "spannerInstanceId": "SPANNER_INSTANCE_ID",
      "spannerDatabaseId": "SPANNER_DATABASE_ID",
      "spannerTableId": "SPANNER_TABLE_ID",
      "sqlQuery": "SQL_QUERY",
      "outputTableSpec": "OUTPUT_TABLE_SPEC"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
    "environment": { "maxWorkers": "10" }
  }
}
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
- LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
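The launch request can also be assembled programmatically. This Python sketch builds the request URL and JSON body for the flexTemplates:launch call using hypothetical values; actually sending the request would additionally require an OAuth 2.0 access token in an Authorization header, which is out of scope here.

```python
import json

# Hypothetical placeholder values; substitute your own.
project_id = "my-project"
location = "us-central1"
version = "latest"

body = {
    "launchParameter": {
        "jobName": "spanner-to-bq-demo",
        "parameters": {
            "spannerInstanceId": "my-instance",
            "spannerDatabaseId": "my-database",
            "spannerTableId": "my_table",
            "sqlQuery": "SELECT * FROM my_table",
            "outputTableSpec": f"{project_id}:my_dataset.my_table",
        },
        "containerSpecGcsPath": (
            f"gs://dataflow-templates-{location}/{version}"
            "/flex/Cloud_Spanner_to_BigQuery_Flex"
        ),
        "environment": {"maxWorkers": "10"},
    }
}

url = (f"https://dataflow.googleapis.com/v1b3/projects/{project_id}"
       f"/locations/{location}/flexTemplates:launch")
payload = json.dumps(body, indent=2)  # serialized request body
print(url)
```

Building the body as a dict and serializing it with json.dumps avoids the trailing-comma mistakes that hand-written JSON is prone to.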
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.