The Cloud Storage Text to Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Spanner database.
Pipeline requirements
- The target Spanner database and table must exist.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Spanner database.
- The input Cloud Storage path containing the CSV files must exist.
- You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
- If the target Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
- The import manifest file must be encoded in ASCII or UTF-8.
- Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8 encoded files.
- Data must match one of the following types:
  - GoogleSQL: BOOL, INT64, FLOAT64, NUMERIC, STRING, DATE, TIMESTAMP, BYTES, JSON
  - PostgreSQL: boolean, bigint, double precision, numeric, character varying, text, date, timestamp with time zone, bytea
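For reference, the following is a minimal sketch of an import manifest. The field names follow the Spanner import manifest format, but the bucket, table, and column names are hypothetical placeholders; consult the template's import manifest documentation for the full schema:

```json
{
  "tables": [
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://your-bucket/your-folder/singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"}
      ]
    }
  ]
}
```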
Template parameters
Required parameters
- instanceId : The instance ID of the Spanner database.
- databaseId : The database ID of the Spanner database.
- importManifest : The path in Cloud Storage to use when importing manifest files. (Example: gs://your-bucket/your-folder/your-manifest.json).
Optional parameters
- spannerHost : The Cloud Spanner endpoint to call in the template. Only used for testing. (Example: https://batch-spanner.googleapis.com). Defaults to: https://batch-spanner.googleapis.com.
- columnDelimiter : The column delimiter that the source file uses. The default value is ','. (Example: ,).
- fieldQualifier : The character that must surround any value in the source file that contains the columnDelimiter. The default value is ".
- trailingDelimiter : Specifies whether the lines in the source files have trailing delimiters, that is, whether the columnDelimiter character appears at the end of each line, after the last column value. The default value is true.
- escape : The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character.
- nullString : The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string.
- dateFormat : The format used to parse date columns. By default, the pipeline tries to parse the date columns as yyyy-M-d[' 00:00:00'], for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the java.time.format.DateTimeFormatter (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html) patterns.
- timestampFormat : The format used to parse timestamp columns. If the timestamp is a long integer, it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html#ISO_INSTANT) format. For other cases, specify your own pattern string, for example, MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form "Jan 21 1998 01:02:03.456+08:00".
- spannerProjectId : The ID of the Google Cloud project that contains the Spanner database. If not set, the project ID of the default Google Cloud project is used.
- spannerPriority : The request priority for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is MEDIUM.
- handleNewLine : If true, the input data can contain newline characters. Otherwise, newline characters cause an error. The default value is false. Enabling newline handling can reduce performance.
- invalidOutputPath : The Cloud Storage path to use when writing rows that cannot be imported. (Example: gs://your-bucket/your-path). Defaults to empty.
If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:

| Type | Input value | Format | Remark |
| --- | --- | --- | --- |
| DATE | 2011-3-31 | | By default, the template can parse this format. You don't need to specify the dateFormat parameter. |
| DATE | 2011-3-31 00:00:00 | | By default, the template can parse this format. You don't need to specify the format. If you like, you can use yyyy-M-d' 00:00:00'. |
| DATE | 01 Apr, 18 | dd MMM, yy | |
| DATE | Wednesday, April 3, 2019 AD | EEEE, LLLL d, yyyy G | |
| TIMESTAMP | 2019-01-02T11:22:33Z, 2019-01-02T11:22:33.123Z, 2019-01-02T11:22:33.12356789Z | | The default format ISO_INSTANT can parse this type of timestamp. You don't need to provide the timestampFormat parameter. |
| TIMESTAMP | 1568402363 | | By default, the template can parse this type of timestamp and treat it as Unix epoch time. |
| TIMESTAMP | Tue, 3 Jun 2008 11:05:30 GMT | EEE, d MMM yyyy HH:mm:ss VV | |
| TIMESTAMP | 2018/12/31 110530.123PST | yyyy/MM/dd HHmmss.SSSz | |
| TIMESTAMP | 2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z | yyyy-MM-dd'T'HH:mm:ss[.SSS]VV | If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp, so you don't need to provide your own format parameter. You can also use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both cases. You cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z', because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a java.time.Instant, so it must be specified in UTC or have time zone information associated with it. A local datetime such as 2019-01-02 11:22:33 cannot be parsed as a valid java.time.Instant. |
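Before submitting a job, you can sanity-check a custom pattern against sample values with java.time directly, since the template accepts standard DateTimeFormatter patterns. A minimal sketch (the class and variable names here are illustrative, not part of the template):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class PatternCheck {
    public static void main(String[] args) {
        // Default date pattern used by the template: the literal " 00:00:00" is optional.
        DateTimeFormatter defaultDate = DateTimeFormatter.ofPattern("yyyy-M-d[' 00:00:00']");
        System.out.println(LocalDate.parse("2019-1-1 00:00:00", defaultDate)); // 2019-01-01

        // A custom dateFormat from the table above.
        DateTimeFormatter customDate = DateTimeFormatter.ofPattern("dd MMM, yy", Locale.ENGLISH);
        System.out.println(LocalDate.parse("01 Apr, 18", customDate)); // 2018-04-01

        // The default timestampFormat is ISO_INSTANT.
        System.out.println(Instant.parse("2019-01-02T11:22:33.123Z")); // 2019-01-02T11:22:33.123Z

        // A pattern that accepts values with or without fractional seconds;
        // VV parses the trailing "Z" as a time-zone ID, not a character literal.
        DateTimeFormatter mixed = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSS]VV");
        System.out.println(Instant.from(mixed.parse("2019-01-02T11:22:33Z")));     // 2019-01-02T11:22:33Z
        System.out.println(Instant.from(mixed.parse("2019-01-02T11:22:33.123Z"))); // 2019-01-02T11:22:33.123Z
    }
}
```

If a value fails to parse here, it will also fail in the pipeline, so this is a quick way to debug dateFormat and timestampFormat values locally.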
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
  --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_Text_to_Cloud_Spanner \
  --region REGION_NAME \
  --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
importManifest=GCS_PATH_TO_IMPORT_MANIFEST
Replace the following:
- JOB_NAME : a unique job name of your choice
- VERSION : the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME : the region where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID : your Spanner instance ID
- DATABASE_ID : your Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST : the Cloud Storage path to your import manifest file
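For example, a fully substituted command might look like the following. The instance, database, bucket, and job names below are hypothetical placeholders; replace them with your own values:

```shell
gcloud dataflow jobs run csv-to-spanner-import \
  --gcs-location gs://dataflow-templates-us-central1/latest/GCS_Text_to_Cloud_Spanner \
  --region us-central1 \
  --parameters \
instanceId=my-instance,\
databaseId=my-database,\
importManifest=gs://my-bucket/manifests/import-manifest.json
```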
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Text_to_Cloud_Spanner
{
  "jobName": "JOB_NAME",
  "parameters": {
    "instanceId": "INSTANCE_ID",
    "databaseId": "DATABASE_ID",
    "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
  },
  "environment": {
    "machineType": "n1-standard-2"
  }
}
Replace the following:
- PROJECT_ID : the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME : a unique job name of your choice
- VERSION : the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- LOCATION : the region where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID : your Spanner instance ID
- DATABASE_ID : your Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST : the Cloud Storage path to your import manifest file
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2024-11-01 UTC.