Cloud Storage to Cloud Spanner template
Use the Dataproc Serverless Cloud Storage to Spanner template to extract data from Cloud Storage to Spanner.
Use the template
Run the template using the gcloud CLI or Dataproc API.
gcloud
Before using any of the command data below, make the following replacements:
- PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
- REGION: Required. Compute Engine region.
- SUBNET: Optional. If a subnet is not specified, the subnet
in the specified REGION in the
default
network is selected.Example:
projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
- TEMPLATE_VERSION: Required. Specify
latest
for the latest template version, or the date of a specific version, for example,2023-03-17_v0.1.0-beta
(visit gs://dataproc-templates-binaries or rungcloud storage ls gs://dataproc-templates-binaries
to list available template versions). - CLOUD_STORAGE_INPUT_PATH: Required. Cloud Storage path from where input data will be read.
Example:
gs://example-bucket/example-folder/
- FORMAT: Required. Input data format. Options:
avro
,parquet
, ororc
. Note: Ifavro
, you must add "file:///usr/lib/spark/connector/spark-avro.jar
" to thejars
gcloud CLI flag or API field.Example (the
file://
prefix references a Dataproc Serverless jar file):--jars=file:///usr/lib/spark/connector/spark-avro.jar,
[ ... other jars] - INSTANCE: Required. Spanner instance ID.
- DATABASE: Required. Spanner database ID.
- TABLE: Required. Spanner output table name.
- SPANNER_JDBC_DIALECT: Required. Spanner JDBC dialect.
Options:
googlesql
orpostgresql
. Defaults togooglesql
. - MODE: Optional. Write mode for Spanner output.
Options:
Append
,Overwrite
,Ignore
, orErrorifExists
. Defaults toErrorifExists
. - PRIMARY_KEY: Required. Comma separated Primary key columns needed when creating Spanner output table.
- BATCHSIZE: Optional. Number of records to insert in one round trip into Spanner Table. Defaults to 1000.
- SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
- PROPERTY and PROPERTY_VALUE:
Optional. Comma-separated list of
Spark property=
value
pairs. - LABEL and LABEL_VALUE:
Optional. Comma-separated list of
label
=value
pairs. - LOG_LEVEL: Optional. Level of logging. Can be one of
ALL
,DEBUG
,ERROR
,FATAL
,INFO
,OFF
,TRACE
, orWARN
. Default:INFO
. -
KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.
Example:
projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
Execute the following command:
Linux, macOS, or Cloud Shell
gcloud dataproc batches submit spark \ --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \ --version="1.2" \ --project="PROJECT_ID" \ --region="REGION" \ --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" \ --subnet="SUBNET" \ --kms-key="KMS_KEY" \ --service-account="SERVICE_ACCOUNT" \ --properties="PROPERTY=PROPERTY_VALUE" \ --labels="LABEL=LABEL_VALUE" \ -- --template GCSTOSPANNER \ --templateProperty log.level="LOG_LEVEL" \ --templateProperty project.id="PROJECT_ID" \ --templateProperty gcs.spanner.input.format="FORMAT" \ --templateProperty gcs.spanner.input.location="CLOUD_STORAGE_INPUT_PATH" \ --templateProperty gcs.spanner.output.instance="INSTANCE" \ --templateProperty gcs.spanner.output.database="DATABASE" \ --templateProperty gcs.spanner.output.table="TABLE" \ --templateProperty gcs.spanner.output.saveMode="MODE" \ --templateProperty gcs.spanner.output.primaryKey="PRIMARY_KEY" \ --templateProperty gcs.spanner.output.batchInsertSize="BATCHSIZE" \ --templateProperty spanner.jdbc.dialect="SPANNER_JDBC_DIALECT"
Windows (PowerShell)
gcloud dataproc batches submit spark ` --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ` --version="1.2" ` --project="PROJECT_ID" ` --region="REGION" ` --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ` --subnet="SUBNET" ` --kms-key="KMS_KEY" ` --service-account="SERVICE_ACCOUNT" ` --properties="PROPERTY=PROPERTY_VALUE" ` --labels="LABEL=LABEL_VALUE" ` -- --template GCSTOSPANNER ` --templateProperty log.level="LOG_LEVEL" ` --templateProperty project.id="PROJECT_ID" ` --templateProperty gcs.spanner.input.format="FORMAT" ` --templateProperty gcs.spanner.input.location="CLOUD_STORAGE_INPUT_PATH" ` --templateProperty gcs.spanner.output.instance="INSTANCE" ` --templateProperty gcs.spanner.output.database="DATABASE" ` --templateProperty gcs.spanner.output.table="TABLE" ` --templateProperty gcs.spanner.output.saveMode="MODE" ` --templateProperty gcs.spanner.output.primaryKey="PRIMARY_KEY" ` --templateProperty gcs.spanner.output.batchInsertSize="BATCHSIZE" ` --templateProperty spanner.jdbc.dialect="SPANNER_JDBC_DIALECT"
Windows (cmd.exe)
gcloud dataproc batches submit spark ^ --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^ --version="1.2" ^ --project="PROJECT_ID" ^ --region="REGION" ^ --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ^ --subnet="SUBNET" ^ --kms-key="KMS_KEY" ^ --service-account="SERVICE_ACCOUNT" ^ --properties="PROPERTY=PROPERTY_VALUE" ^ --labels="LABEL=LABEL_VALUE" ^ -- --template GCSTOSPANNER ^ --templateProperty log.level="LOG_LEVEL" ^ --templateProperty project.id="PROJECT_ID" ^ --templateProperty gcs.spanner.input.format="FORMAT" ^ --templateProperty gcs.spanner.input.location="CLOUD_STORAGE_INPUT_PATH" ^ --templateProperty gcs.spanner.output.instance="INSTANCE" ^ --templateProperty gcs.spanner.output.database="DATABASE" ^ --templateProperty gcs.spanner.output.table="TABLE" ^ --templateProperty gcs.spanner.output.saveMode="MODE" ^ --templateProperty gcs.spanner.output.primaryKey="PRIMARY_KEY" ^ --templateProperty gcs.spanner.output.batchInsertSize="BATCHSIZE" ^ --templateProperty spanner.jdbc.dialect="SPANNER_JDBC_DIALECT"
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
- REGION: Required. Compute Engine region.
- SUBNET: Optional. If a subnet is not specified, the subnet
in the specified REGION in the
default
network is selected.Example:
projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
- TEMPLATE_VERSION: Required. Specify
latest
for the latest template version, or the date of a specific version, for example,2023-03-17_v0.1.0-beta
(visit gs://dataproc-templates-binaries or rungcloud storage ls gs://dataproc-templates-binaries
to list available template versions). - CLOUD_STORAGE_INPUT_PATH: Required. Cloud Storage path from where input data will be read.
Example:
gs://example-bucket/example-folder/
- FORMAT: Required. Input data format. Options:
avro
,parquet
, ororc
. Note: Ifavro
, you must add "file:///usr/lib/spark/connector/spark-avro.jar
" to thejars
gcloud CLI flag or API field.Example (the
file://
prefix references a Dataproc Serverless jar file):--jars=file:///usr/lib/spark/connector/spark-avro.jar,
[ ... other jars] - INSTANCE: Required. Spanner instance ID.
- DATABASE: Required. Spanner database ID.
- TABLE: Required. Spanner output table name.
- SPANNER_JDBC_DIALECT: Required. Spanner JDBC dialect.
Options:
googlesql
orpostgresql
. Defaults togooglesql
. - MODE: Optional. Write mode for Spanner output.
Options:
Append
,Overwrite
,Ignore
, orErrorifExists
. Defaults toErrorifExists
. - PRIMARY_KEY: Required. Comma separated Primary key columns needed when creating Spanner output table.
- BATCHSIZE: Optional. Number of records to insert in one round trip into Spanner Table. Defaults to 1000.
- SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
- PROPERTY and PROPERTY_VALUE:
Optional. Comma-separated list of
Spark property=
value
pairs. - LABEL and LABEL_VALUE:
Optional. Comma-separated list of
label
=value
pairs. - LOG_LEVEL: Optional. Level of logging. Can be one of
ALL
,DEBUG
,ERROR
,FATAL
,INFO
,OFF
,TRACE
, orWARN
. Default:INFO
. -
KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed key.
Example:
projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches
Request JSON body:
{ "environmentConfig":{ "executionConfig":{ "subnetworkUri":"SUBNET", "kmsKey": "KMS_KEY", "serviceAccount": "SERVICE_ACCOUNT" } }, "labels": { "LABEL": "LABEL_VALUE" }, "runtimeConfig": { "version": "1.2", "properties": { "PROPERTY": "PROPERTY_VALUE" } }, "sparkBatch": { "mainClass": "com.google.cloud.dataproc.templates.main.DataProcTemplate", "args": [ "--template","GCSTOSPANNER", "--templateProperty","project.id=PROJECT_ID", "--templateProperty","log.level=LOG_LEVEL", "--templateProperty","gcs.spanner.input.format=FORMAT", "--templateProperty","gcs.spanner.input.location=CLOUD_STORAGE_INPUT_PATH", "--templateProperty","gcs.spanner.output.instance=INSTANCE", "--templateProperty","gcs.spanner.output.database=DATABASE", "--templateProperty","gcs.spanner.output.table=TABLE", "--templateProperty","gcs.spanner.output.saveMode=MODE", "--templateProperty","gcs.spanner.output.primaryKey=PRIMARY_KEY", "--templateProperty","gcs.spanner.output.batchInsertSize=BATCHSIZE", "--templateProperty spanner.jdbc.dialect=SPANNER_JDBC_DIALECT" ], "jarFileUris":[ "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ] } }
To send your request, expand one of these options:
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata", "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID", "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583", "createTime": "2023-02-24T03:31:03.440329Z", "operationType": "BATCH", "description": "Batch" } }