The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.
Pipeline requirements
- The source Spanner table must exist prior to running the pipeline.
- The BigQuery dataset must exist prior to running the pipeline. One way to create it is sketched after this list.
- A JSON file that describes your BigQuery schema. The file must contain a top-level JSON array titled `fields`. The contents of the `fields` array must use the following pattern: `{"name": "COLUMN_NAME", "type": "DATA_TYPE"}`.

The following JSON describes an example BigQuery schema:

```json
{
  "fields": [
    { "name": "location", "type": "STRING" },
    { "name": "name", "type": "STRING" },
    { "name": "age", "type": "STRING" },
    { "name": "color", "type": "STRING" },
    { "name": "coffee", "type": "STRING" }
  ]
}
```
The Spanner to BigQuery batch template doesn't support importing data into `STRUCT` (Record) fields in the target BigQuery table.
Template parameters
Required parameters
- spannerInstanceId: The instance ID of the Spanner database to read from.
- spannerDatabaseId: The database ID of the Spanner database to export.
- outputTableSpec: The BigQuery output table location to write the output to. For example, `<PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>`. Depending on the `createDisposition` specified, the output table might be created automatically using the user-provided schema.
Optional parameters
- spannerProjectId: The ID of the project that the Spanner database resides in. The default value for this parameter is the project where the Dataflow pipeline is running.
- spannerTableId: The table name of the Spanner database to export. Ignored if sqlQuery is set.
- spannerRpcPriority: The request priority (https://cloud.google.com/spanner/docs/reference/rest/v1/RequestOptions) for Spanner calls. Possible values are `HIGH`, `MEDIUM`, and `LOW`. The default value is `HIGH`.
- sqlQuery: The SQL query to use to read data from the Spanner database. Required if spannerTableId is empty.
- bigQuerySchemaPath: The Cloud Storage path (gs://) to the JSON file that defines your BigQuery schema. For example, gs://your-bucket/your-schema.json. One way to stage this file is sketched after this list.
- writeDisposition: The BigQuery WriteDisposition (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload) value. For example, `WRITE_APPEND`, `WRITE_EMPTY`, or `WRITE_TRUNCATE`. Defaults to `WRITE_APPEND`.
- createDisposition: The BigQuery CreateDisposition (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload). For example, `CREATE_IF_NEEDED` and `CREATE_NEVER`. Defaults to `CREATE_IF_NEEDED`.
- useStorageWriteApi: If `true`, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
- useStorageWriteApiAtLeastOnce: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to `true`. To use exactly-once semantics, set the parameter to `false`. This parameter applies only when `useStorageWriteApi` is `true`. The default value is `false`.
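The schema file that bigQuerySchemaPath points to must be readable from Cloud Storage. As a minimal sketch, assuming a bucket you control (gs://your-bucket is a placeholder), you could upload the example schema like this:

```sh
# Stage the JSON schema file in Cloud Storage so the template can read it.
# gs://your-bucket is a placeholder bucket name.
gcloud storage cp your-schema.json gs://your-bucket/your-schema.json
```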
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is `us-central1`. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
```sh
gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
spannerInstanceId=SPANNER_INSTANCE_ID,\
spannerDatabaseId=SPANNER_DATABASE_ID,\
spannerTableId=SPANNER_TABLE_ID,\
sqlQuery=SQL_QUERY,\
outputTableSpec=OUTPUT_TABLE_SPEC
```
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
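With the placeholders filled in, a complete invocation might look like the following sketch. Every name in it is hypothetical; because it reads with sqlQuery instead of spannerTableId, it also sets the optional bigQuerySchemaPath and writeDisposition parameters:

```sh
# Hypothetical example run; every name below is a placeholder.
gcloud dataflow flex-template run spanner-to-bq-example \
    --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=my-project \
    --region=us-central1 \
    --parameters \
spannerInstanceId=my-instance,\
spannerDatabaseId=my-database,\
sqlQuery="SELECT * FROM users",\
bigQuerySchemaPath=gs://your-bucket/your-schema.json,\
writeDisposition=WRITE_TRUNCATE,\
outputTableSpec=my-project:my_dataset.my_table
```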
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch

```json
{
  "launchParameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "spannerInstanceId": "SPANNER_INSTANCE_ID",
      "spannerDatabaseId": "SPANNER_DATABASE_ID",
      "spannerTableId": "SPANNER_TABLE_ID",
      "sqlQuery": "SQL_QUERY",
      "outputTableSpec": "OUTPUT_TABLE_SPEC"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
    "environment": { "maxWorkers": "10" }
  }
}
```
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
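One way to send this request is with curl, saving the JSON body above to a local file (request.json is an illustrative name) and attaching an OAuth access token from gcloud:

```sh
# Launch the flex template through the REST API.
# request.json contains the JSON body shown above, with placeholders replaced.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch"
```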
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.