Datastream to Spanner template

The Datastream to Spanner template is a streaming pipeline that reads Datastream events from a Cloud Storage bucket and writes them to a Spanner database. It is intended for data migration from Datastream sources to Spanner.

All tables required for the migration must exist in the destination Spanner database before the template runs. Therefore, schema migration from the source database to the destination Spanner database must be completed before data migration begins. Data can already exist in the tables before migration. This template does not propagate Datastream schema changes to the Spanner database.
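
Because the template does not create the destination tables, you must create the target schema before you run the pipeline. As a minimal sketch (the Singers table and its columns are illustrative), you can create a table with the gcloud CLI:

gcloud spanner databases ddl update CLOUDSPANNER_DATABASE \
    --instance=CLOUDSPANNER_INSTANCE \
    --ddl="CREATE TABLE Singers (SingerId INT64 NOT NULL, SingerName STRING(1024)) PRIMARY KEY (SingerId)"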

Data consistency is guaranteed only at the end of migration, after all data has been written to Spanner. To store ordering information for each record written to Spanner, the template creates an additional table, called a shadow table, for each table in the Spanner database. These tables are used to ensure consistency at the end of migration. Shadow tables are not deleted after migration and can be used for validation.
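
As a quick check after migration, assuming the default shadow_ prefix set by the shadowTablePrefix parameter, you can list the shadow tables by querying the database's information schema:

gcloud spanner databases execute-sql CLOUDSPANNER_DATABASE \
    --instance=CLOUDSPANNER_INSTANCE \
    --sql="SELECT table_name FROM information_schema.tables WHERE STARTS_WITH(table_name, 'shadow_')"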

Any errors that occur during operation, such as schema mismatches, malformed JSON files, or errors resulting from executing transforms, are recorded in an error queue. The error queue is a Cloud Storage folder that stores, in text format, all Datastream events that encountered errors along with the error reason. Errors can be transient or permanent and are stored in corresponding Cloud Storage folders in the error queue. Transient errors are retried automatically, while permanent errors are not. In the case of permanent errors, you can correct the change events and move them to the retriable folder while the template is running.
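
For example, the following sketch inspects the error queue and moves a corrected event file back for retry. The severe and retry subfolder names are assumptions based on a typical dead letter queue layout; check the folder structure under your deadLetterQueueDirectory before moving files.

# List permanently failed events in the error queue (folder names are assumed).
gcloud storage ls gs://BUCKET/dlq/severe/

# After correcting a change event file, move it so that the template retries it.
gcloud storage mv gs://BUCKET/dlq/severe/EVENT_FILE.json gs://BUCKET/dlq/retry/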

Pipeline requirements

  • A Datastream stream in the Running or Not started state. To confirm the stream state, see the check after this list.
  • A Cloud Storage bucket where Datastream events are replicated.
  • A Spanner database with existing tables. These tables can be empty or contain data.
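
To confirm the stream state before you launch the template, you can describe the stream with the gcloud CLI. This is a minimal sketch; STREAM_NAME and REGION_NAME are placeholders for your own stream and region:

gcloud datastream streams describe STREAM_NAME \
    --location=REGION_NAME \
    --format="value(state)"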

Template parameters

Required parameters

  • instanceId: The Spanner instance where the changes are replicated.
  • databaseId: The Spanner database where the changes are replicated.

Optional parameters

  • inputFilePattern: The Cloud Storage file location that contains the Datastream files to replicate. Typically, this is the root path for a stream. Support for this feature has been disabled.
  • inputFileFormat: The format of the output files produced by Datastream. For example, avro or json. Defaults to avro.
  • sessionFilePath: Session file path in Cloud Storage that contains mapping information from HarbourBridge.
  • projectId: The Spanner project ID.
  • spannerHost: The Cloud Spanner endpoint to call in the template. For example, https://batch-spanner.googleapis.com. Defaults to: https://batch-spanner.googleapis.com.
  • gcsPubSubSubscription: The Pub/Sub subscription being used in a Cloud Storage notification policy. For the name, use the format projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>. For a setup sketch, see the example after this list.
  • streamName: The name or template for the stream to poll for schema information and source type.
  • shadowTablePrefix: The prefix used to name shadow tables. Default: shadow_.
  • shouldCreateShadowTables: This flag indicates whether shadow tables must be created in the Cloud Spanner database. Defaults to: true.
  • rfcStartDateTime: The starting DateTime used to fetch from Cloud Storage (https://tools.ietf.org/html/rfc3339). Defaults to: 1970-01-01T00:00:00.00Z.
  • fileReadConcurrency: The number of concurrent DataStream files to read. Defaults to: 30.
  • deadLetterQueueDirectory: The file path used when storing the error queue output. The default file path is a directory under the Dataflow job's temp location.
  • dlqRetryMinutes: The number of minutes between dead letter queue retries. Defaults to 10.
  • dlqMaxRetryCount: The max number of times temporary errors can be retried through DLQ. Defaults to 500.
  • dataStreamRootUrl: Datastream API Root URL. Defaults to: https://datastream.googleapis.com/.
  • datastreamSourceType: The type of source database that Datastream connects to, for example, mysql or oracle. Needs to be set when testing without an actual running Datastream stream.
  • roundJsonDecimals: If set, this flag rounds the decimal values in JSON columns to a number that can be stored without loss of precision. Defaults to: false.
  • runMode: This is the run mode type, whether regular or with retryDLQ. Defaults to: regular.
  • transformationContextFilePath: Transformation context file path in Cloud Storage used to populate data used in transformations performed during migration. For example, the shard ID to database name mapping used to identify the database from which a row was migrated.
  • directoryWatchDurationInMinutes: The duration for which the pipeline should keep polling a directory in GCS. Datastream output files are arranged in a directory structure that depicts the timestamp of the event grouped by minutes. This parameter should be approximately equal to the maximum delay that could occur between an event occurring in the source database and the same event being written to GCS by Datastream. 99.9 percentile = 10 minutes. Defaults to: 10.
  • spannerPriority: The request priority for Cloud Spanner calls. The value must be one of: [HIGH,MEDIUM,LOW]. Defaults to HIGH.
  • dlqGcsPubSubSubscription: The Pub/Sub subscription being used in a Cloud Storage notification policy for DLQ retry directory when running in regular mode. For the name, use the format projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_NAME>. When set, the deadLetterQueueDirectory and dlqRetryMinutes are ignored.
  • transformationJarPath: Custom JAR file location in Cloud Storage for the file that contains the custom transformation logic for processing records in forward migration. Defaults to empty.
  • transformationClassName: Fully qualified class name having the custom transformation logic. It is a mandatory field in case transformationJarPath is specified. Defaults to empty.
  • transformationCustomParameters: String containing any custom parameters to be passed to the custom transformation class. Defaults to empty.
  • filteredEventsDirectory: The file path to store the events filtered via custom transformation. The default is a directory under the Dataflow job's temp location. The default value is sufficient under most conditions.
  • shardingContextFilePath: Sharding context file path in Cloud Storage, used to populate the shard ID in the Spanner database for each source shard. It is of the format Map<stream_name, Map<db_name, shard_id>>.
  • tableOverrides: These are the table name overrides from source to Spanner. They are written in the following format: [{SourceTableName1, SpannerTableName1}, {SourceTableName2, SpannerTableName2}]. For example, [{Singers, Vocalists}, {Albums, Records}] maps the Singers table to Vocalists and the Albums table to Records. Defaults to empty.
  • columnOverrides: These are the column name overrides from source to Spanner. They are written in the following format: [{SourceTableName1.SourceColumnName1, SourceTableName1.SpannerColumnName1}, {SourceTableName2.SourceColumnName1, SourceTableName2.SpannerColumnName1}]. Note that the SourceTableName should remain the same in both the source and Spanner pair. To override table names, use tableOverrides. For example, [{Singers.SingerName, Singers.TalentName}, {Albums.AlbumName, Albums.RecordName}] maps SingerName to TalentName and AlbumName to RecordName in the Singers and Albums tables respectively. Defaults to empty.
  • schemaOverridesFilePath: A file which specifies the table and the column name overrides from source to spanner. Defaults to empty.
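
For the gcsPubSubSubscription parameter, the template expects a Pub/Sub subscription that receives Cloud Storage notifications for the Datastream output bucket. A minimal setup sketch follows; the topic, subscription, and bucket names are placeholders:

# Create a notification topic and a subscription for it (names are illustrative).
gcloud pubsub topics create datastream-gcs-notifications
gcloud pubsub subscriptions create datastream-gcs-sub --topic=datastream-gcs-notifications

# Send JSON-formatted notifications for new objects in the Datastream output bucket to the topic.
gsutil notification create -t datastream-gcs-notifications -f json gs://BUCKET

# Pass the subscription to the template as:
# gcsPubSubSubscription=projects/PROJECT_ID/subscriptions/datastream-gcs-sub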

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Datastream to Spanner template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Datastream_to_Spanner \
    --parameters \
inputFilePattern=GCS_FILE_PATH,\
streamName=STREAM_NAME,\
instanceId=CLOUDSPANNER_INSTANCE,\
databaseId=CLOUDSPANNER_DATABASE,\
deadLetterQueueDirectory=DLQ
  

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use latest for the most recent version of the template, or a specific version name.

  • STREAM_NAME: the name of your Datastream stream
  • GCS_FILE_PATH: the Cloud Storage path that is used to store Datastream events. For example: gs://bucket/path/to/data/
  • CLOUDSPANNER_INSTANCE: your Spanner instance.
  • CLOUDSPANNER_DATABASE: your Spanner database.
  • DLQ: the Cloud Storage path for the error queue directory.
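
For reference, a fully substituted invocation might look like the following. All values are illustrative; substitute your own project, stream, bucket, and Spanner resources.

gcloud dataflow flex-template run datastream-to-spanner-job \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Cloud_Datastream_to_Spanner \
    --parameters \
inputFilePattern=gs://my-datastream-bucket/data/,\
streamName=my-datastream-stream,\
instanceId=my-spanner-instance,\
databaseId=my-spanner-database,\
deadLetterQueueDirectory=gs://my-datastream-bucket/dlq/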

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.locations.flexTemplates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "containerSpecGcsPath": "gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Datastream_to_Spanner",
      "parameters": {
          "inputFilePattern": "GCS_FILE_PATH",
          "streamName": "STREAM_NAME",
          "instanceId": "CLOUDSPANNER_INSTANCE",
          "databaseId": "CLOUDSPANNER_DATABASE",
          "deadLetterQueueDirectory": "DLQ"
      }
   }
}
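
As a sketch, you can send this request with curl, authenticating with your gcloud credentials. This assumes the request body shown above is saved in a local file named request.json:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch"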
  

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • LOCATION: the region where you want to deploy your Dataflow job—for example, us-central1
  • VERSION: the version of the template that you want to use

    You can use latest for the most recent version of the template, or a specific version name.

  • STREAM_NAME: the name of your Datastream stream
  • GCS_FILE_PATH: the Cloud Storage path that is used to store Datastream events. For example: gs://bucket/path/to/data/
  • CLOUDSPANNER_INSTANCE: your Spanner instance.
  • CLOUDSPANNER_DATABASE: your Spanner database.
  • DLQ: the Cloud Storage path for the error queue directory.

What's next