Cloud Storage Text to Spanner template

The Cloud Storage Text to Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Spanner database.

Pipeline requirements

The target Spanner database and table must exist.
You must have read permissions for the Cloud Storage bucket and write permissions for the target Spanner database.
The input Cloud Storage path containing the CSV files must exist.
You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
If the target Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.

The manifest file, encoded in ASCII or UTF-8, must match the following format:

Manifest format and example

The format of the manifest file corresponds to the following message type, shown here in protocol buffer format:

message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in the
      // destination table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;

  enum ProtoDialect {
    GOOGLE_STANDARD_SQL = 0;
    POSTGRESQL = 1;
  }
  // Optional. The dialect of the receiving database. Defaults to GOOGLE_STANDARD_SQL.
  ProtoDialect dialect = 2;
}

The following example shows a manifest file for importing tables called Albums and Singers into a GoogleSQL-dialect database. The Albums table uses the column schema that the job retrieves from the database, and the Singers table uses the schema that the manifest file specifies:

{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}

Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using byte order mark (BOM) in UTF-8 encoded files.

Data must match one of the following types:

GoogleSQL

    BOOL
    INT64
    FLOAT64
    NUMERIC
    STRING
    DATE
    TIMESTAMP
    BYTES
    JSON

PostgreSQL

    boolean
    bigint
    double precision
    numeric
    character varying, text
    date
    timestamp with time zone
    bytea

Template parameters

Required parameters

instanceId: The instance ID of the Spanner database.
databaseId: The database ID of the Spanner database.
importManifest: The path in Cloud Storage to use when importing manifest files. For example, gs://your-bucket/your-folder/your-manifest.json.

Optional parameters

spannerHost: The Cloud Spanner endpoint to call in the template. Only used for testing. For example, https://batch-spanner.googleapis.com. Defaults to: https://batch-spanner.googleapis.com.
columnDelimiter: The column delimiter that the source file uses. The default value is ,. For example, ,.
fieldQualifier: The character that must surround any value in the source file that contains the columnDelimiter. The default value is double quotes.
trailingDelimiter: Specifies whether the lines in the source files have trailing delimiters, that is, whether the columnDelimiter character appears at the end of each line, after the last column value. The default value is true.
escape: The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character.
nullString: The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string.
dateFormat: The format used to parse date columns. By default, the pipeline tries to parse the date columns as yyyy-M-d[' 00:00:00'], for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the java.time.format.DateTimeFormatter (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html) patterns.
timestampFormat: The format used to parse timestamp columns. If the timestamp is a long integer, then it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html#ISO_INSTANT) format. For other cases, specify your own pattern string, for example, using MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form of Jan 21 1998 01:02:03.456+08:00.
spannerProjectId: The ID of the Google Cloud project that contains the Spanner database. If not set, the project ID of the default Google Cloud project is used.
spannerPriority: The request priority for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is MEDIUM.
handleNewLine: If true, the input data can contain newline characters. Otherwise, newline characters cause an error. The default value is false. Enabling newline handling can reduce performance.
invalidOutputPath: The Cloud Storage path to use when writing rows that cannot be imported. For example, gs://your-bucket/your-path. Defaults to empty.

If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:

Type	Input value	Format	Remark
`DATE`	2011-3-31		By default, the template can parse this format. You don't need to specify the `dateFormat` parameter.
`DATE`	2011-3-31 00:00:00		By default, the template can parse this format. You don't need to specify the format. If you like, you can use `yyyy-M-d' 00:00:00'`.
`DATE`	01 Apr, 18	dd MMM, yy
`DATE`	Wednesday, April 3, 2019 AD	EEEE, LLLL d, yyyy G
`TIMESTAMP`	2019-01-02T11:22:33Z 2019-01-02T11:22:33.123Z 2019-01-02T11:22:33.12356789Z		The default format `ISO_INSTANT` can parse this type of timestamp. You don't need to provide the `timestampFormat` parameter.
`TIMESTAMP`	1568402363		By default, the template can parse this type of timestamp and treat it as Unix epoch time.
`TIMESTAMP`	Tue, 3 Jun 2008 11:05:30 GMT	EEE, d MMM yyyy HH:mm:ss VV
`TIMESTAMP`	2018/12/31 110530.123PST	yyyy/MM/dd HHmmss.SSSz
`TIMESTAMP`	2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z	yyyy-MM-dd'T'HH:mm:ss[.SSS]VV	If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp. You don't need to provide your own format parameter. You can use `yyyy-MM-dd'T'HH:mm:ss[.SSS]VV` to handle both cases. You cannot use `yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z'`, because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a `java.time.Instant`. Therefore, it must be specified in UTC or have time zone information associated with it. Local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid `java.time.Instant`.

Run the template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_Text_to_Cloud_Spanner \
    --region REGION_NAME \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
importManifest=GCS_PATH_TO_IMPORT_MANIFEST

Replace the following:

JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/latest/
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/
Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
INSTANCE_ID: your Spanner instance ID
DATABASE_ID: your Spanner database ID
GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Text_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

Replace the following:

PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/latest/
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/
Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
LOCATION: the region where you want to deploy your Dataflow job—for example, us-central1
INSTANCE_ID: your Spanner instance ID
DATABASE_ID: your Spanner database ID
GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file

Template source code

Java

This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.

What's next

Learn about Dataflow templates.
See the list of Google-provided templates.