Cloud Storage Text to Spanner template

The Cloud Storage Text to Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Spanner database.

Pipeline requirements

  • The target Spanner database and table must exist.
  • You must have read permissions for the Cloud Storage bucket and write permissions for the target Spanner database.
  • The input Cloud Storage path containing the CSV files must exist.
  • You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
  • If the target Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
  • The manifest file must be encoded in ASCII or UTF-8 and must follow the documented JSON import manifest format.
  • Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8-encoded files.
  • Data must match one of the following types:

    GoogleSQL

        BOOL
        INT64
        FLOAT64
        NUMERIC
        STRING
        DATE
        TIMESTAMP
        BYTES
        JSON

    PostgreSQL

        boolean
        bigint
        double precision
        numeric
        character varying, text
        date
        timestamp with time zone
        bytea
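The requirements above can be illustrated with a sketch of an import manifest. The field names (tables, table_name, file_patterns, columns, column_name, type_name) follow the documented import manifest format; the bucket, table, and column names below are hypothetical placeholders:

```python
import json

# Hypothetical manifest for a table named "Albums" fed by two CSV files.
# Field names follow the Spanner text import manifest format; the bucket,
# table, and column names are placeholders.
manifest = {
    "tables": [
        {
            "table_name": "Albums",
            "file_patterns": [
                "gs://your-bucket/albums-1.csv",
                "gs://your-bucket/albums-2.csv",
            ],
            "columns": [
                {"column_name": "AlbumId", "type_name": "INT64"},
                {"column_name": "Title", "type_name": "STRING"},
            ],
        }
    ]
}

# Write the manifest as UTF-8 JSON (no byte order mark), ready to upload
# to Cloud Storage, for example with:
#   gsutil cp manifest.json gs://your-bucket/your-folder/your-manifest.json
with open("manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)
```

Because the target database already defines the schema, the type_name of each column must match the corresponding Spanner column type, as noted in the requirements above.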

Template parameters

Required parameters

  • instanceId : The instance ID of the Spanner database.
  • databaseId : The database ID of the Spanner database.
  • importManifest : The Cloud Storage path to the import manifest file. (Example: gs://your-bucket/your-folder/your-manifest.json).

Optional parameters

  • spannerHost : The Cloud Spanner endpoint to call in the template. Only used for testing. (Example: https://batch-spanner.googleapis.com). Defaults to: https://batch-spanner.googleapis.com.
  • columnDelimiter : The column delimiter that the source file uses. The default value is ','. (Example: ,).
  • fieldQualifier : The character that must surround any value in the source file that contains the columnDelimiter. The default value is ".
  • trailingDelimiter : Specifies whether the lines in the source files have trailing delimiters, that is, whether the columnDelimiter character appears at the end of each line, after the last column value. The default value is true.
  • escape : The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character.
  • nullString : The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string.
  • dateFormat : The format used to parse date columns. By default, the pipeline tries to parse the date columns as yyyy-M-d[' 00:00:00'], for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the java.time.format.DateTimeFormatter (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html) patterns.
  • timestampFormat : The format used to parse timestamp columns. If the timestamp is a long integer, then it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html#ISO_INSTANT) format. For other cases, specify your own pattern string, for example, using MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form of "Jan 21 1998 01:02:03.456+08:00".
  • spannerProjectId : The ID of the Google Cloud project that contains the Spanner database. If not set, the project ID of the default Google Cloud project is used.
  • spannerPriority : The request priority for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is MEDIUM.
  • handleNewLine : If true, the input data can contain newline characters. Otherwise, newline characters cause an error. The default value is false. Enabling newline handling can reduce performance.
  • invalidOutputPath : The Cloud Storage path to use when writing rows that cannot be imported. (Example: gs://your-bucket/your-path). Defaults to empty.
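To see how several of these optional parameters combine in practice, the following sketch assembles (but does not execute) a gcloud invocation for pipe-delimited source files with a custom NULL string and an error output path. All IDs, bucket names, and the job name are hypothetical:

```python
# Assemble (but do not run) a gcloud command that sets several optional
# parameters. All IDs and bucket names below are placeholders.
params = {
    "instanceId": "my-instance",
    "databaseId": "my-database",
    "importManifest": "gs://your-bucket/your-folder/your-manifest.json",
    "columnDelimiter": "|",    # pipe-delimited source files
    "fieldQualifier": '"',
    "nullString": "\\N",       # "\N" in the CSV represents NULL
    "invalidOutputPath": "gs://your-bucket/errors",
}
# gcloud expects --parameters as comma-separated key=value pairs; using a
# pipe delimiter avoids having to escape a literal comma in columnDelimiter.
param_str = ",".join(f"{k}={v}" for k, v in params.items())

command = (
    "gcloud dataflow jobs run my-import-job "
    "--gcs-location gs://dataflow-templates-us-central1/latest/GCS_Text_to_Cloud_Spanner "
    "--region us-central1 "
    f"--parameters {param_str}"
)
print(command)
```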

If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:

  • DATE, input 2011-3-31
    Format: not needed. By default, the template can parse this format; you don't need to specify the dateFormat parameter.
  • DATE, input 2011-3-31 00:00:00
    Format: not needed, or yyyy-M-d' 00:00:00'. By default, the template can parse this format.
  • DATE, input 01 Apr, 18
    Format: dd MMM, yy
  • DATE, input Wednesday, April 3, 2019 AD
    Format: EEEE, LLLL d, yyyy G
  • TIMESTAMP, input 2019-01-02T11:22:33Z, 2019-01-02T11:22:33.123Z, or 2019-01-02T11:22:33.12356789Z
    Format: not needed. The default ISO_INSTANT format can parse this type of timestamp; you don't need to provide the timestampFormat parameter.
  • TIMESTAMP, input 1568402363
    Format: not needed. By default, the template parses this type of timestamp and treats it as Unix epoch time.
  • TIMESTAMP, input Tue, 3 Jun 2008 11:05:30 GMT
    Format: EEE, d MMM yyyy HH:mm:ss VV
  • TIMESTAMP, input 2018/12/31 110530.123PST
    Format: yyyy/MM/dd HHmmss.SSSz
  • TIMESTAMP, input a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z
    Format: yyyy-MM-dd'T'HH:mm:ss[.SSS]VV. The default format can parse this mix, so you don't need to provide your own format parameter, but you can use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both cases explicitly. You cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z', because the suffix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a java.time.Instant, so it must be specified in UTC or carry time-zone information. A local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid java.time.Instant.
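The default date and timestamp behavior described above can be approximated in Python. This is an illustration of the parsing rules only, not the template's actual Java implementation (which uses java.time.format.DateTimeFormatter):

```python
from datetime import datetime, timezone

def parse_default_date(value: str) -> datetime:
    """Approximate the template's default date parsing:
    yyyy-M-d with an optional literal ' 00:00:00' suffix."""
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {value}")

def parse_default_timestamp(value: str) -> datetime:
    """Approximate the default timestamp parsing: a long integer is
    treated as Unix epoch seconds; anything else is read as an
    ISO-8601 instant in UTC."""
    if value.lstrip("-").isdigit():
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    # fromisoformat() on older Python versions rejects a trailing 'Z'.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

print(parse_default_date("2011-3-31"))
print(parse_default_date("2019-1-1 00:00:00"))
print(parse_default_timestamp("1568402363"))
print(parse_default_timestamp("2019-01-02T11:22:33.123Z"))
```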

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_Text_to_Cloud_Spanner \
    --region REGION_NAME \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
importManifest=GCS_PATH_TO_IMPORT_MANIFEST

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

      ◦ latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
      ◦ the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/

  • REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
  • INSTANCE_ID: your Spanner instance ID
  • DATABASE_ID: your Spanner database ID
  • GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Text_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}
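As a sketch of how the request above might be constructed programmatically, the following builds the launch URL and JSON body without sending anything. The project, region, job name, and parameter values are placeholders, and actually sending the request additionally requires an OAuth access token (for example, from gcloud auth print-access-token) in an Authorization: Bearer header:

```python
import json

# Placeholder values; substitute your own project, region, and parameters.
project_id = "my-project"
location = "us-central1"
version = "latest"

# templates:launch endpoint, with the template path passed as gcsPath.
url = (
    f"https://dataflow.googleapis.com/v1b3/projects/{project_id}"
    f"/locations/{location}/templates:launch"
    f"?gcsPath=gs://dataflow-templates-{location}/{version}/GCS_Text_to_Cloud_Spanner"
)

body = {
    "jobName": "my-import-job",
    "parameters": {
        "instanceId": "my-instance",
        "databaseId": "my-database",
        "importManifest": "gs://your-bucket/your-folder/your-manifest.json",
    },
    "environment": {"machineType": "n1-standard-2"},
}

# The POST body is sent as JSON.
payload = json.dumps(body, indent=2)
print(url)
print(payload)
```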

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

      ◦ latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
      ◦ the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/

  • LOCATION: the region where you want to deploy your Dataflow job—for example, us-central1
  • INSTANCE_ID: your Spanner instance ID
  • DATABASE_ID: your Spanner database ID
  • GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file

What's next