Spanner to Cloud Storage Text template

The Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Spanner table and writes it to Cloud Storage as CSV text files.

Pipeline requirements

The input Spanner table must exist before running the pipeline.
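
One way to confirm this requirement before launching is to query the database's information schema. The following is a minimal sketch, not part of the template: the function names table_exists_sql and check_table are hypothetical, and it assumes an authenticated gcloud CLI, a table in the default schema, and a table name that matches the stored case exactly.

```shell
# table_exists_sql TABLE: print a query that returns one row if TABLE exists.
# The table name is inlined into the SQL, so pass only a trusted identifier.
table_exists_sql() {
  printf "SELECT table_name FROM information_schema.tables WHERE table_name = '%s'\n" "$1"
}

# check_table PROJECT INSTANCE DATABASE TABLE: run the query with gcloud.
# Assumes an authenticated gcloud CLI; defined here but not called.
check_table() {
  gcloud spanner databases execute-sql "$3" \
    --project "$1" \
    --instance "$2" \
    --sql "$(table_exists_sql "$4")"
}
```

For example, check_table my-project my-instance my-database my-table should print a row only when the table exists; all four values are placeholders.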

Template parameters

Required parameters

spannerTable: The Spanner table to read the data from.
spannerProjectId: The ID of the Google Cloud project that contains the Spanner database to read data from.
spannerInstanceId: The ID of the Spanner instance that contains the table.
spannerDatabaseId: The ID of the Spanner database that contains the table.
textWritePrefix: The Cloud Storage path prefix that specifies where the data is written. For example, gs://mybucket/somefolder/.

Optional parameters

csvTempDirectory: The Cloud Storage path where temporary CSV files are written. For example, gs://your-bucket/your-path.
spannerPriority: The request priority (https://cloud.google.com/spanner/docs/reference/rest/v1/RequestOptions) for Spanner calls. Possible values are HIGH, MEDIUM, LOW. The default value is MEDIUM.
spannerHost: The Cloud Spanner endpoint to call in the template. Only used for testing. Defaults to: https://batch-spanner.googleapis.com.
spannerSnapshotTime: The timestamp that corresponds to the version of the Spanner database that you want to read from. The timestamp must be specified in the RFC 3339 (https://tools.ietf.org/html/rfc3339) UTC Zulu Time format. The timestamp must be in the past, and the maximum timestamp staleness (https://cloud.google.com/spanner/docs/timestamp-bounds#maximum_timestamp_staleness) limit applies. For example, 1990-12-31T23:59:60Z. Defaults to empty.
dataBoostEnabled: Set to true to use the compute resources of Spanner Data Boost to run the job with near-zero impact on Spanner OLTP workflows. When true, requires the spanner.databases.useDataBoost Identity and Access Management (IAM) permission. For more information, see Data Boost overview (https://cloud.google.com/spanner/docs/databoost/databoost-overview). Defaults to: false.
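
Because spannerPriority accepts only three values and spannerSnapshotTime must follow a fixed timestamp format, it can help to check both locally before submitting a job. The following is a minimal sketch with hypothetical helper names (is_valid_priority, is_zulu_timestamp). The timestamp check covers only the textual shape, so whether the timestamp is in the past and within the staleness limit is still decided by Spanner.

```shell
# is_valid_priority VALUE: succeed only for the values listed for spannerPriority.
is_valid_priority() {
  case "$1" in
    HIGH|MEDIUM|LOW) return 0 ;;
    *) return 1 ;;
  esac
}

# is_zulu_timestamp VALUE: succeed if VALUE looks like an RFC 3339 UTC
# timestamp ending in Z, with optional fractional seconds.
# Shape check only: it does not validate month, day, or time ranges.
is_zulu_timestamp() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?Z$'
}
```

A launch script could call these helpers and stop early, for instance with is_valid_priority "$PRIORITY" || exit 1, where PRIORITY is a variable of your own.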

Run the template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/Spanner_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
spannerProjectId=SPANNER_PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/

Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
REGION_NAME: the region where you want to deploy your Dataflow job, for example, us-central1
SPANNER_PROJECT_ID: the Google Cloud project ID of the Spanner database from which you want to read data
DATABASE_ID: the Spanner database ID
BUCKET_NAME: the name of your Cloud Storage bucket
INSTANCE_ID: the Spanner instance ID
TABLE_ID: the Spanner table ID
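
Optional parameters go in the same comma-separated --parameters list as the required ones. The following sketch assembles that list with a hypothetical join_params helper, pins a dated template version as the caution recommends, and prints the resulting command for review instead of running it. All project, instance, database, table, bucket, and job names are placeholder assumptions, and the sketch does not handle values that themselves contain commas.

```shell
# join_params KEY=VALUE...: join the arguments with commas, the separator
# that --parameters expects between key=value pairs.
join_params() {
  (IFS=,; printf '%s\n' "$*")
}

# Placeholder values (assumptions); replace them with your own.
REGION_NAME="us-central1"
VERSION="2023-09-12-00_RC00"   # a dated version rather than "latest"

PARAMS=$(join_params \
  "spannerProjectId=my-project" \
  "spannerDatabaseId=my-database" \
  "spannerInstanceId=my-instance" \
  "spannerTable=my-table" \
  "textWritePrefix=gs://my-bucket/output/" \
  "spannerPriority=LOW" \
  "dataBoostEnabled=true")

# Print the command instead of running it; remove "echo" to launch the job.
echo gcloud dataflow jobs run "spanner-export-job" \
  --gcs-location "gs://dataflow-templates-${REGION_NAME}/${VERSION}/Spanner_to_GCS_Text" \
  --region "${REGION_NAME}" \
  --parameters "${PARAMS}"
```

Setting dataBoostEnabled=true requires the spanner.databases.useDataBoost IAM permission described in the optional parameters.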

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/Spanner_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "spannerProjectId": "SPANNER_PROJECT_ID",
       "spannerDatabaseId": "DATABASE_ID",
       "spannerInstanceId": "INSTANCE_ID",
       "spannerTable": "TABLE_ID",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}
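
The request above can be sent from a shell with curl. The following is a minimal sketch under stated assumptions: the helper names launch_url and launch_job are hypothetical, curl and an authenticated gcloud CLI are installed, and the JSON body shown above has been saved to a local file with its placeholders filled in.

```shell
# launch_url PROJECT_ID LOCATION VERSION: print the templates:launch URL
# for the Spanner_to_GCS_Text template.
launch_url() {
  printf 'https://dataflow.googleapis.com/v1b3/projects/%s/locations/%s/templates:launch?gcsPath=gs://dataflow-templates-%s/%s/Spanner_to_GCS_Text\n' \
    "$1" "$2" "$2" "$3"
}

# launch_job PROJECT_ID LOCATION VERSION BODY_FILE: POST the JSON body.
# Uses the gcloud CLI to obtain an access token; defined here but not called.
launch_job() {
  curl --fail --silent --show-error -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    --data @"$4" \
    "$(launch_url "$1" "$2" "$3")"
}
```

For example, launch_job my-project us-central1 latest request.json would send the body stored in request.json; the project ID and file name are placeholders.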