The Cloud Storage Text to Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Spanner database.
Pipeline requirements
- The target Spanner database and table must exist.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Spanner database.
- The input Cloud Storage path containing the CSV files must exist.
- You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
- If the target Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
- The manifest file, encoded in ASCII or UTF-8, must match the following format:
Manifest format and example
The format of the manifest file corresponds to the following message type, shown here in protocol buffer format:
```
message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath
    // or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in
      // the destination table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;

  enum ProtoDialect {
    GOOGLE_STANDARD_SQL = 0;
    POSTGRESQL = 1;
  }
  // Optional. The dialect of the receiving database. Defaults to
  // GOOGLE_STANDARD_SQL.
  ProtoDialect dialect = 2;
}
```
The following example shows a manifest file for importing tables called Albums and Singers into a GoogleSQL-dialect database. The Albums table uses the column schema that the job retrieves from the database, and the Singers table uses the schema that the manifest file specifies:

```
{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}
```
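For illustration, one of the Singers CSV files referenced by this manifest might contain rows like the following. The data is hypothetical, and the sketch assumes the files hold data rows only, with values in the order given by the manifest's columns list; a value that contains the column delimiter is wrapped in the default field qualifier (double quotes):

```
1,Marc,Richards
2,Catalina,Smith
3,Kena,"Nakamura, Jr."
```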
- Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend that you do not use a byte order mark (BOM) in UTF-8 encoded files.
- Data must match one of the following types:

| GoogleSQL | PostgreSQL |
|---|---|
| BOOL | boolean |
| INT64 | bigint |
| FLOAT64 | double precision |
| NUMERIC | numeric |
| STRING | character varying, text |
| DATE | date |
| TIMESTAMP | timestamp with time zone |
| BYTES | bytea |
| JSON | |
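For a PostgreSQL-dialect database, the manifest uses the PostgreSQL type names from the table above and sets the dialect field defined in the message format. The following is a minimal sketch; the bucket, table, and column names are hypothetical:

```
{
  "tables": [
    {
      "table_name": "singers",
      "file_patterns": [
        "gs://bucket1/singers*.csv"
      ],
      "columns": [
        {"column_name": "singer_id", "type_name": "bigint"},
        {"column_name": "first_name", "type_name": "character varying"},
        {"column_name": "last_name", "type_name": "character varying"}
      ]
    }
  ],
  "dialect": "POSTGRESQL"
}
```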
Template parameters
Required parameters
- instanceId: The instance ID of the Spanner database.
- databaseId: The database ID of the Spanner database.
- importManifest: The path in Cloud Storage to use when importing manifest files. For example, `gs://your-bucket/your-folder/your-manifest.json`.
Optional parameters
- spannerHost: The Cloud Spanner endpoint to call in the template. Only used for testing. For example, `https://batch-spanner.googleapis.com`. Defaults to: `https://batch-spanner.googleapis.com`.
- columnDelimiter: The column delimiter that the source file uses. The default value is `,`.
- fieldQualifier: The character that must surround any value in the source file that contains the columnDelimiter. The default value is double quotes.
- trailingDelimiter: Specifies whether the lines in the source files have trailing delimiters, that is, whether the columnDelimiter character appears at the end of each line, after the last column value. The default value is `true`.
- escape: The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character.
- nullString: The string that represents a `NULL` value. By default, this parameter is not set and the template does not use the null string.
- dateFormat: The format used to parse date columns. By default, the pipeline tries to parse the date columns as `yyyy-M-d[' 00:00:00']`, for example, as `2019-01-31` or `2019-1-1 00:00:00`. If your date format is different, specify the format using the java.time.format.DateTimeFormatter (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html) patterns.
- timestampFormat: The format used to parse timestamp columns. If the timestamp is a long integer, it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html#ISO_INSTANT) format. For other cases, specify your own pattern string, for example, `MMM dd yyyy HH:mm:ss.SSSVV` for timestamps in the form `Jan 21 1998 01:02:03.456+08:00`.
- spannerProjectId: The ID of the Google Cloud project that contains the Spanner database. If not set, the project ID of the default Google Cloud project is used.
- spannerPriority: The request priority for Spanner calls. Possible values are `HIGH`, `MEDIUM`, and `LOW`. The default value is `MEDIUM`.
- handleNewLine: If `true`, the input data can contain newline characters. Otherwise, newline characters cause an error. The default value is `false`. Enabling newline handling can reduce performance.
- invalidOutputPath: The Cloud Storage path to use when writing rows that cannot be imported. For example, `gs://your-bucket/your-path`. Defaults to empty.
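As a sketch of how several of these optional parameters combine with the required ones, the following gcloud invocation uses placeholder job, bucket, instance, and database names and imports semicolon-delimited files:

```
gcloud dataflow jobs run csv-import-job \
    --gcs-location gs://dataflow-templates-us-central1/latest/GCS_Text_to_Cloud_Spanner \
    --region us-central1 \
    --parameters \
instanceId=my-instance,\
databaseId=my-database,\
importManifest=gs://my-bucket/manifest.json,\
columnDelimiter=';',\
handleNewLine=true,\
spannerPriority=LOW,\
invalidOutputPath=gs://my-bucket/invalid-rows
```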
If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:

| Type | Input value | Format | Remark |
|---|---|---|---|
| DATE | 2011-3-31 | | By default, the template can parse this format. You don't need to specify the dateFormat parameter. |
| DATE | 2011-3-31 00:00:00 | | By default, the template can parse this format. You don't need to specify the format. If you like, you can use yyyy-M-d' 00:00:00'. |
| DATE | 01 Apr, 18 | dd MMM, yy | |
| DATE | Wednesday, April 3, 2019 AD | EEEE, LLLL d, yyyy G | |
| TIMESTAMP | 2019-01-02T11:22:33Z, 2019-01-02T11:22:33.123Z, 2019-01-02T11:22:33.12356789Z | | The default format ISO_INSTANT can parse this type of timestamp. You don't need to provide the timestampFormat parameter. |
| TIMESTAMP | 1568402363 | | By default, the template can parse this type of timestamp and treat it as Unix epoch time. |
| TIMESTAMP | Tue, 3 Jun 2008 11:05:30 GMT | EEE, d MMM yyyy HH:mm:ss VV | |
| TIMESTAMP | 2018/12/31 110530.123PST | yyyy/MM/dd HHmmss.SSSz | |
| TIMESTAMP | 2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z | yyyy-MM-dd'T'HH:mm:ss[.SSS]VV | If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp; you don't need to provide your own format parameter. You can use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both cases. You cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z', because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a java.time.Instant, so it must be specified in UTC or have time zone information associated with it. Local datetime values, such as 2019-01-02 11:22:33, cannot be parsed as valid java.time.Instant values. |
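You can check a custom pattern locally before launching the job. The following standalone Java sketch (the class name is arbitrary and not part of the template) parses one sample DATE value and one sample TIMESTAMP value from the table above with their corresponding patterns:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class FormatCheck {
  public static void main(String[] args) {
    // A custom dateFormat: the pattern must turn the sample value into a valid date.
    DateTimeFormatter dateFormat = DateTimeFormatter.ofPattern("dd MMM, yy", Locale.ENGLISH);
    LocalDate date = LocalDate.parse("01 Apr, 18", dateFormat);
    System.out.println(date); // 2018-04-01

    // A custom timestampFormat: the parsed value must carry a zone or offset
    // so that it can be converted to a java.time.Instant (that is, to UTC).
    DateTimeFormatter tsFormat =
        DateTimeFormatter.ofPattern("EEE, d MMM yyyy HH:mm:ss VV", Locale.ENGLISH);
    Instant ts = ZonedDateTime.parse("Tue, 3 Jun 2008 11:05:30 GMT", tsFormat).toInstant();
    System.out.println(ts); // 2008-06-03T11:05:30Z
  }
}
```

If either parse call throws a DateTimeParseException, the pattern does not match your data and the corresponding dateFormat or timestampFormat parameter needs to be adjusted.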
Run the template
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
In your shell or terminal, run the template:
```
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/GCS_Text_to_Cloud_Spanner \
    --region REGION_NAME \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
importManifest=GCS_PATH_TO_IMPORT_MANIFEST
```
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID: your Spanner instance ID
- DATABASE_ID: your Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
```
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Text_to_Cloud_Spanner
{
  "jobName": "JOB_NAME",
  "parameters": {
    "instanceId": "INSTANCE_ID",
    "databaseId": "DATABASE_ID",
    "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
  },
  "environment": {
    "machineType": "n1-standard-2"
  }
}
```
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- LOCATION: the region where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID: your Spanner instance ID
- DATABASE_ID: your Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
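One way to send this request is with curl, using gcloud to supply an OAuth access token. The following sketch mirrors the request body above; substitute your own values before running it:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "jobName": "JOB_NAME",
        "parameters": {
          "instanceId": "INSTANCE_ID",
          "databaseId": "DATABASE_ID",
          "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
        },
        "environment": { "machineType": "n1-standard-2" }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/GCS_Text_to_Cloud_Spanner"
```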
Template source code
This template's source code is in the GoogleCloudPlatform/DataflowTemplates repository on GitHub.
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.