Google provides a set of open source Dataflow templates. For general information about templates, see Dataflow templates. For a list of all Google-provided templates, see Get started with Google-provided templates.
This guide documents batch templates.
BigQuery to Cloud Storage TFRecords
The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the training, testing, and validation percentage splits. By default, the split is 1, or 100%, for the training set and 0, or 0%, for the testing and validation sets. When setting the dataset split, the sum of the training, testing, and validation percentages must add up to 1, or 100% (for example, 0.6+0.2+0.2). Dataflow automatically determines the optimal number of shards for each output dataset.
Requirements for this pipeline:
- The BigQuery dataset and table must exist.
- The output Cloud Storage bucket must exist before pipeline execution. Training, testing, and validation subdirectories do not need to preexist and are autogenerated.
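You can check both requirements from a shell before launching the job. The following is a minimal sketch that uses the bq and gsutil command-line tools; the dataset, table, bucket, and region names are hypothetical placeholders:

# Confirm that the source table exists (hypothetical dataset and table).
bq show my_dataset.sample_table

# Confirm that the output bucket exists, and create it if it does not (hypothetical bucket and region).
gsutil ls -b gs://my-tfrecord-bucket || gsutil mb -l us-central1 gs://my-tfrecord-bucket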
Template parameters
Parameter | Description |
---|---|
readQuery | A BigQuery SQL query that extracts data from the source. For example, select * from dataset1.sample_table. |
outputDirectory | The top-level Cloud Storage path prefix at which to write the training, testing, and validation TFRecord files. For example, gs://mybucket/output. Subdirectories for resulting training, testing, and validation TFRecord files are automatically generated from outputDirectory. For example, gs://mybucket/output/train. |
trainingPercentage | (Optional) The percentage of query data allocated to training TFRecord files. The default value is 1, or 100%. |
testingPercentage | (Optional) The percentage of query data allocated to testing TFRecord files. The default value is 0, or 0%. |
validationPercentage | (Optional) The percentage of query data allocated to validation TFRecord files. The default value is 0, or 0%. |
outputSuffix | (Optional) The file suffix for the training, testing, and validation TFRecord files that are written. The default value is .tfrecord. |
Running the BigQuery to Cloud Storage TFRecord files template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the BigQuery to TFRecords template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --region REGION_NAME \
    --parameters \
readQuery=READ_QUERY,\
outputDirectory=OUTPUT_DIRECTORY,\
trainingPercentage=TRAINING_PERCENTAGE,\
testingPercentage=TESTING_PERCENTAGE,\
validationPercentage=VALIDATION_PERCENTAGE,\
outputSuffix=OUTPUT_FILENAME_SUFFIX
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- READ_QUERY: the BigQuery query to run
- OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
- TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
- TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
- VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
- OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
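As a concrete illustration, the following hypothetical invocation writes a 70/15/15 split (the percentages sum to 1, as required) to gs://my-tfrecord-bucket/output; the query, bucket, and region are placeholders:

gcloud dataflow jobs run bq-to-tfrecords-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --region us-central1 \
    --parameters \
readQuery="select * from my_dataset.sample_table",\
outputDirectory=gs://my-tfrecord-bucket/output,\
trainingPercentage=0.7,\
testingPercentage=0.15,\
validationPercentage=0.15,\
outputSuffix=.tfrecord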
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
{
  "jobName": "JOB_NAME",
  "parameters": {
    "readQuery": "READ_QUERY",
    "outputDirectory": "OUTPUT_DIRECTORY",
    "trainingPercentage": "TRAINING_PERCENTAGE",
    "testingPercentage": "TESTING_PERCENTAGE",
    "validationPercentage": "VALIDATION_PERCENTAGE",
    "outputSuffix": "OUTPUT_FILENAME_SUFFIX"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- READ_QUERY: the BigQuery query to run
- OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
- TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
- TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
- VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
- OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
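One way to send this request from a shell, as a sketch that assumes the JSON body above is saved as request.json and that you have gcloud credentials available, is with curl:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records"

The same pattern applies to the other REST API examples in this guide; only the URL and request body change.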
BigQuery export to Parquet (via Storage API)
The BigQuery export to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template utilizes the BigQuery Storage API to export the data.
Requirements for this pipeline:
- The input BigQuery table must exist before running the pipeline.
- The output Cloud Storage bucket must exist before running the pipeline.
Template parameters
Parameter | Description |
---|---|
tableRef | The BigQuery input table location. For example, <my-project>:<my-dataset>.<my-table>. |
bucket | The Cloud Storage folder in which to write the Parquet files. For example, gs://mybucket/exports. |
numShards | (Optional) The number of output file shards. The default value is 1. |
fields | (Optional) A comma-separated list of fields to select from the input BigQuery table. |
Running the BigQuery to Cloud Storage Parquet template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the BigQuery export to Parquet (via Storage API) template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet \
    --region=REGION_NAME \
    --parameters \
tableRef=BIGQUERY_TABLE,\
bucket=OUTPUT_DIRECTORY,\
numShards=NUM_SHARDS,\
fields=FIELDS
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGQUERY_TABLE: your BigQuery table name
- OUTPUT_DIRECTORY: your Cloud Storage folder for output files
- NUM_SHARDS: the desired number of output file shards
- FIELDS: the comma-separated list of fields to select from the input BigQuery table
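For illustration, a hypothetical run that exports a single column into four shards might look like the following; the project, table, bucket, and field names are placeholders. Note that the --parameters flag is itself comma-delimited, so passing several field names may require the escaping described in gcloud topic escaping:

gcloud beta dataflow flex-template run bq-to-parquet-example \
    --project=my-project \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_to_Parquet \
    --region=us-central1 \
    --parameters \
tableRef=my-project:my_dataset.my_table,\
bucket=gs://my-export-bucket/exports,\
numShards=4,\
fields=customer_id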
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "tableRef": "BIGQUERY_TABLE",
      "bucket": "OUTPUT_DIRECTORY",
      "numShards": "NUM_SHARDS",
      "fields": "FIELDS"
    },
    "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet"
  }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGQUERY_TABLE: your BigQuery table name
- OUTPUT_DIRECTORY: your Cloud Storage folder for output files
- NUM_SHARDS: the desired number of output file shards
- FIELDS: the comma-separated list of fields to select from the input BigQuery table
BigQuery to Elasticsearch
The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. The template can either read the entire table or read specific records using a supplied query.
Requirements for this pipeline:
- The source BigQuery table must exist.
- An Elasticsearch host on a Google Cloud instance or on Elastic Cloud with Elasticsearch version 7.0 or above. It must be accessible from the Dataflow worker machines.
Template parameters
Parameter | Description |
---|---|
connectionUrl | Elasticsearch URL in the format https://hostname:[port], or specify the CloudID if using Elastic Cloud. |
apiKey | Base64-encoded API key used for authentication. |
index | The Elasticsearch index toward which the requests are issued. For example, my-index. |
inputTableSpec | (Optional) BigQuery table to read from to insert into Elasticsearch. Either table or query must be provided. For example, projectId:datasetId.tablename. |
query | (Optional) SQL query to pull data from BigQuery. Either table or query must be provided. |
useLegacySql | (Optional) Set to true to use legacy SQL (only applicable if supplying query). Default: false. |
batchSize | (Optional) Batch size in number of documents. Default: 1000. |
batchSizeBytes | (Optional) Batch size in number of bytes. Default: 5242880 (5 MB). |
maxRetryAttempts | (Optional) Max retry attempts; must be > 0. Default: no retries. |
maxRetryDuration | (Optional) Max retry duration in milliseconds; must be > 0. Default: no retries. |
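The apiKey value must be base64 encoded. As a minimal sketch, assuming you already have an Elasticsearch API key ID and secret, one common way to produce an encoded id:api_key pair in a shell is shown below; if you created the key through the Elasticsearch API, the response may already include an encoded value you can use directly:

# Hypothetical key ID and secret; replace with the values returned when the API key was created.
printf 'myKeyId:myKeySecret' | base64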
Running the BigQuery to Elasticsearch template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the BigQuery to Elasticsearch template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch \
    --parameters \
inputTableSpec=INPUT_TABLE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- INPUT_TABLE_SPEC: your BigQuery table name
- CONNECTION_URL: your Elasticsearch URL
- APIKEY: your base64-encoded API key for authentication
- INDEX: your Elasticsearch index
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "inputTableSpec": "INPUT_TABLE_SPEC",
      "connectionUrl": "CONNECTION_URL",
      "apiKey": "APIKEY",
      "index": "INDEX"
    },
    "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/BigQuery_to_Elasticsearch"
  }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- INPUT_TABLE_SPEC: your BigQuery table name
- CONNECTION_URL: your Elasticsearch URL
- APIKEY: your base64-encoded API key for authentication
- INDEX: your Elasticsearch index
Bigtable to Cloud Storage Avro
The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Bigtable table must exist.
- The output Cloud Storage bucket must exist before running the pipeline.
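If you have the cbt command-line tool installed, you can quickly confirm both requirements before launching the job. The project, instance, and bucket names below are hypothetical:

# List tables in the instance to confirm that the source table exists.
cbt -project my-project -instance my-instance ls

# Errors if the output bucket does not exist.
gsutil ls -b gs://my-avro-export-bucket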
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Bigtable instance that contains the table. |
bigtableTableId | The ID of the Bigtable table to export. |
outputDirectory | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the Avro filename. For example, output-. |
Running the Bigtable to Cloud Storage Avro file template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Bigtable to Avro Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Avro filename, for example, output-
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
{
  "jobName": "JOB_NAME",
  "parameters": {
    "bigtableProjectId": "BIGTABLE_PROJECT_ID",
    "bigtableInstanceId": "INSTANCE_ID",
    "bigtableTableId": "TABLE_ID",
    "outputDirectory": "OUTPUT_DIRECTORY",
    "filenamePrefix": "FILENAME_PREFIX"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Avro filename, for example, output-
Bigtable to Cloud Storage Parquet
The Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Bigtable table must exist.
- The output Cloud Storage bucket must exist before running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Bigtable instance that contains the table. |
bigtableTableId | The ID of the Bigtable table to export. |
outputDirectory | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the Parquet filename. For example, output-. |
numShards | The number of output file shards. For example, 2. |
Running the Bigtable to Cloud Storage Parquet file template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Bigtable to Parquet Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
outputDirectory=OUTPUT_DIRECTORY,\
filenamePrefix=FILENAME_PREFIX,\
numShards=NUM_SHARDS
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
- NUM_SHARDS: the number of Parquet files to output, for example, 1
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
{
  "jobName": "JOB_NAME",
  "parameters": {
    "bigtableProjectId": "BIGTABLE_PROJECT_ID",
    "bigtableInstanceId": "INSTANCE_ID",
    "bigtableTableId": "TABLE_ID",
    "outputDirectory": "OUTPUT_DIRECTORY",
    "filenamePrefix": "FILENAME_PREFIX",
    "numShards": "NUM_SHARDS"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
- NUM_SHARDS: the number of Parquet files to output, for example, 1
Bigtable to Cloud Storage SequenceFile
The Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Bigtable table and writes the data to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Bigtable table must exist.
- The output Cloud Storage bucket must exist before running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProject | The ID of the Google Cloud project of the Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Bigtable instance that contains the table. |
bigtableTableId | The ID of the Bigtable table to export. |
bigtableAppProfileId | The ID of the Bigtable application profile to be used for the export. If you do not specify an app profile, Bigtable uses the instance's default app profile. |
destinationPath | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the SequenceFile filename. For example, output-. |
Running the Bigtable to Cloud Storage SequenceFile template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Bigtable to SequenceFile Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile \
    --region REGION_NAME \
    --parameters \
bigtableProject=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
bigtableAppProfileId=APPLICATION_PROFILE_ID,\
destinationPath=DESTINATION_PATH,\
filenamePrefix=FILENAME_PREFIX
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the export
- DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the SequenceFile filename, for example, output-
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
{
  "jobName": "JOB_NAME",
  "parameters": {
    "bigtableProject": "BIGTABLE_PROJECT_ID",
    "bigtableInstanceId": "INSTANCE_ID",
    "bigtableTableId": "TABLE_ID",
    "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
    "destinationPath": "DESTINATION_PATH",
    "filenamePrefix": "FILENAME_PREFIX"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to export
- APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the export
- DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the SequenceFile filename, for example, output-
Datastore to Cloud Storage Text [Deprecated]
This template is deprecated and will be removed in Q1 2022. Please migrate to the Firestore to Cloud Storage Text template.
The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.
Requirements for this pipeline:
Datastore must be set up in the project before running the pipeline.
Template parameters
Parameter | Description |
---|---|
datastoreReadGqlQuery | A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind. |
datastoreReadProjectId | The Google Cloud project ID of the Datastore instance that you want to read data from. |
datastoreReadNamespace | The namespace of the requested entities. To use the default namespace, leave this parameter blank. |
javascriptTextTransformGcsPath | (Optional) The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js. |
javascriptTextTransformFunctionName | (Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples. |
textWritePrefix | The Cloud Storage path prefix to specify where the data is written. For example, gs://mybucket/somefolder/. |
Running the Datastore to Cloud Storage Text template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Datastore to Text Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Datastore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM DATASTORE_KIND",\
datastoreReadProjectId=DATASTORE_PROJECT_ID,\
datastoreReadNamespace=DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/
Replace the following:
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- BUCKET_NAME: the name of your Cloud Storage bucket
- DATASTORE_PROJECT_ID: the Cloud project ID where the Datastore instance exists
- DATASTORE_KIND: the type of your Datastore entities
- DATASTORE_NAMESPACE: the namespace of your Datastore entities
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js
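If you want to transform entities before they are written, a minimal sketch for creating and staging a UDF is shown below. The bucket path and the function body are hypothetical; the function must accept and return a JSON string, as described in the parameter table:

# Write a trivial UDF that adds a field to each JSON-serialized entity (hypothetical logic).
cat > my_file.js <<'EOF'
function myTransform(inJson) {
  var obj = JSON.parse(inJson);
  obj.exportedBy = 'datastore-to-gcs-text';
  return JSON.stringify(obj);
}
EOF

# Stage the file where the template can read it (hypothetical bucket).
gsutil cp my_file.js gs://my-bucket/my-udfs/my_file.js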
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
{
  "jobName": "JOB_NAME",
  "parameters": {
    "datastoreReadGqlQuery": "SELECT * FROM DATASTORE_KIND",
    "datastoreReadProjectId": "DATASTORE_PROJECT_ID",
    "datastoreReadNamespace": "DATASTORE_NAMESPACE",
    "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
    "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
    "textWritePrefix": "gs://BUCKET_NAME/output/"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- BUCKET_NAME: the name of your Cloud Storage bucket
- DATASTORE_PROJECT_ID: the Cloud project ID where the Datastore instance exists
- DATASTORE_KIND: the type of your Datastore entities
- DATASTORE_NAMESPACE: the namespace of your Datastore entities
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js
Firestore to Cloud Storage Text
The Firestore to Cloud Storage Text template is a batch pipeline that reads Firestore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.
Requirements for this pipeline:
Firestore must be set up in the project before running the pipeline.
Template parameters
Parameter | Description |
---|---|
firestoreReadGqlQuery | A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind. |
firestoreReadProjectId | The Google Cloud project ID of the Firestore instance that you want to read data from. |
firestoreReadNamespace | The namespace of the requested entities. To use the default namespace, leave this parameter blank. |
javascriptTextTransformGcsPath | (Optional) The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js. |
javascriptTextTransformFunctionName | (Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples. |
textWritePrefix | The Cloud Storage path prefix to specify where the data is written. For example, gs://mybucket/somefolder/. |
Running the Firestore to Cloud Storage Text template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Firestore to Text Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Firestore_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
firestoreReadGqlQuery="SELECT * FROM FIRESTORE_KIND",\
firestoreReadProjectId=FIRESTORE_PROJECT_ID,\
firestoreReadNamespace=FIRESTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/
Replace the following:
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- BUCKET_NAME: the name of your Cloud Storage bucket
- FIRESTORE_PROJECT_ID: the Cloud project ID where the Firestore instance exists
- FIRESTORE_KIND: the type of your Firestore entities
- FIRESTORE_NAMESPACE: the namespace of your Firestore entities
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js
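Because the UDF parameters are optional, they can be omitted entirely when no transformation is needed. A hypothetical minimal invocation (the project, kind, namespace, and bucket names are placeholders):

gcloud dataflow jobs run firestore-to-text-example \
    --gcs-location gs://dataflow-templates/latest/Firestore_to_GCS_Text \
    --region us-central1 \
    --parameters \
firestoreReadGqlQuery="SELECT * FROM MyKind",\
firestoreReadProjectId=my-project,\
firestoreReadNamespace=my-namespace,\
textWritePrefix=gs://my-bucket/output/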
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Firestore_to_GCS_Text
{
  "jobName": "JOB_NAME",
  "parameters": {
    "firestoreReadGqlQuery": "SELECT * FROM FIRESTORE_KIND",
    "firestoreReadProjectId": "FIRESTORE_PROJECT_ID",
    "firestoreReadNamespace": "FIRESTORE_NAMESPACE",
    "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
    "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
    "textWritePrefix": "gs://BUCKET_NAME/output/"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- BUCKET_NAME: the name of your Cloud Storage bucket
- FIRESTORE_PROJECT_ID: the Cloud project ID where the Firestore instance exists
- FIRESTORE_KIND: the type of your Firestore entities
- FIRESTORE_NAMESPACE: the namespace of your Firestore entities
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js
Cloud Spanner to Cloud Storage Avro
The Cloud Spanner to Avro Files on Cloud Storage template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:
- A spanner-export.json file.
- A TableName-manifest.json file for each table in the database you exported.
- One or more TableName.avro-#####-of-##### files.
For example, exporting a database with two tables, Singers and Albums, creates the following file set:
Albums-manifest.json
Albums.avro-00000-of-00002
Albums.avro-00001-of-00002
Singers-manifest.json
Singers.avro-00000-of-00003
Singers.avro-00001-of-00003
Singers.avro-00002-of-00003
spanner-export.json
Requirements for this pipeline:
- The Cloud Spanner database must exist.
- The output Cloud Storage bucket must exist.
- In addition to the IAM roles necessary to run Dataflow jobs, you must also have the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.
Template parameters
Parameter | Description |
---|---|
instanceId | The instance ID of the Cloud Spanner database that you want to export. |
databaseId | The database ID of the Cloud Spanner database that you want to export. |
outputDir | The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files. |
snapshotTime | (Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in the RFC 3339 UTC "Zulu" format. For example, 1990-12-31T23:59:60Z. The timestamp must be in the past, and maximum timestamp staleness applies. |
tableNames | (Optional) A comma-separated list of tables specifying the subset of the Cloud Spanner database to export. The list must include all related tables (parent tables and foreign key referenced tables). If they are not explicitly listed, the shouldExportRelatedTables flag must be set for a successful export. |
shouldExportRelatedTables | (Optional) A flag used in conjunction with the tableNames parameter to include all related tables in the export. |
spannerProjectId | (Optional) The Google Cloud Project ID of the Cloud Spanner database that you want to read data from. |
Running the Cloud Spanner to Avro Files on Cloud Storage template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name. For the job to show up in the Spanner Instances page of the console, the job name must match the following format: cloud-spanner-export-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME. Replace the following:
  - SPANNER_INSTANCE_ID: your Spanner instance's ID
  - SPANNER_DATABASE_NAME: your Spanner database's name
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Spanner to Avro Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro \
    --region REGION_NAME \
    --staging-location GCS_STAGING_LOCATION \
    --parameters \
instanceId=INSTANCE_ID,\
databaseId=DATABASE_ID,\
outputDir=GCS_DIRECTORY
Replace the following:
- JOB_NAME: a unique job name of your choice. For the job to show in the Cloud Spanner portion of the console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- GCS_STAGING_LOCATION: the path for writing temporary files; for example, gs://mybucket/temp
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
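As a hypothetical illustration, the following invocation exports an instance named test-instance and a database named example-db, and names the job so that it also appears on the Spanner Instances page (the bucket names are placeholders):

gcloud dataflow jobs run cloud-spanner-export-test-instance-example-db \
    --gcs-location gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro \
    --region us-central1 \
    --staging-location gs://my-staging-bucket/temp \
    --parameters \
instanceId=test-instance,\
databaseId=example-db,\
outputDir=gs://my-export-bucket/spanner-export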
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
{
  "jobName": "JOB_NAME",
  "parameters": {
    "instanceId": "INSTANCE_ID",
    "databaseId": "DATABASE_ID",
    "outputDir": "gs://GCS_DIRECTORY"
  }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice. For the job to show in the Cloud Spanner portion of the console, the job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID.
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- GCS_STAGING_LOCATION: the path for writing temporary files; for example, gs://mybucket/temp
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
Cloud Spanner to Cloud Storage Text
The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Cloud Spanner table and writes it to Cloud Storage as CSV text files.
Requirements for this pipeline:
- The input Spanner table must exist before running the pipeline.
Template parameters
Parameter | Description |
---|---|
spannerProjectId | The Google Cloud Project ID of the Cloud Spanner database that you want to read data from. |
spannerDatabaseId | The database ID of the requested table. |
spannerInstanceId | The instance ID of the requested table. |
spannerTable | The table to read the data from. |
textWritePrefix | The directory where output text files are written. Add / at the end. For example, gs://mybucket/somefolder/. |
spannerSnapshotTime | (Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in the RFC 3339 UTC "Zulu" format. For example, 1990-12-31T23:59:60Z. The timestamp must be in the past, and maximum timestamp staleness applies. |
Running the Cloud Spanner to Cloud Storage Text template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/Spanner_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
spannerProjectId=SPANNER_PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database from which you want to read data
- DATABASE_ID: the Spanner database ID
- BUCKET_NAME: the name of your Cloud Storage bucket
- INSTANCE_ID: the Spanner instance ID
- TABLE_ID: the Spanner table ID
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
{
  "jobName": "JOB_NAME",
  "parameters": {
    "spannerProjectId": "SPANNER_PROJECT_ID",
    "spannerDatabaseId": "DATABASE_ID",
    "spannerInstanceId": "INSTANCE_ID",
    "spannerTable": "TABLE_ID",
    "textWritePrefix": "gs://BUCKET_NAME/output/"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- SPANNER_PROJECT_ID: the Cloud project ID of the Spanner database from which you want to read data
- DATABASE_ID: the Spanner database ID
- BUCKET_NAME: the name of your Cloud Storage bucket
- INSTANCE_ID: the Spanner instance ID
- TABLE_ID: the Spanner table ID
Cloud Storage Avro to Bigtable
The Cloud Storage Avro to Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.
Requirements for this pipeline:
- The Bigtable table must exist and have the same column families as exported in the Avro files.
- The input Avro files must exist in a Cloud Storage bucket before running the pipeline.
- Bigtable expects a specific schema from the input Avro files.
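To satisfy the first requirement, the destination table and its column families must already exist. A minimal sketch using the cbt command-line tool, with hypothetical project, instance, table, and column family names (the families must match those in your exported Avro files):

# Create the destination table and a column family (placeholders throughout).
cbt -project my-project -instance my-instance createtable my-table
cbt -project my-project -instance my-instance createfamily my-table cf1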
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Bigtable instance that contains the table. |
bigtableTableId | The ID of the Bigtable table to import. |
inputFilePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage Avro file to Bigtable template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Bigtable template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable \
    --region REGION_NAME \
    --parameters \
bigtableProjectId=BIGTABLE_PROJECT_ID,\
bigtableInstanceId=INSTANCE_ID,\
bigtableTableId=TABLE_ID,\
inputFilePattern=INPUT_FILE_PATTERN
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to import
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
{
  "jobName": "JOB_NAME",
  "parameters": {
    "bigtableProjectId": "BIGTABLE_PROJECT_ID",
    "bigtableInstanceId": "INSTANCE_ID",
    "bigtableTableId": "TABLE_ID",
    "inputFilePattern": "INPUT_FILE_PATTERN"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job. For example, us-central1.
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to import
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
Cloud Storage Avro to Cloud Spanner
The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files exported from Cloud Spanner stored in Cloud Storage and imports them to a Cloud Spanner database.
Requirements for this pipeline:
- The target Cloud Spanner database must exist and must be empty.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
- The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import (see the check following this list).
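A quick way to verify the last requirement before launching the job, using a hypothetical bucket and folder:

gsutil ls gs://my-import-bucket/my-spanner-export/spanner-export.json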
Template parameters
Parameter | Description |
---|---|
instanceId |
The instance ID of the Cloud Spanner database. |
databaseId |
The database ID of the Cloud Spanner database. |
inputDir |
The Cloud Storage path where the Avro files are imported from. |
Running the Cloud Storage Avro to Cloud Spanner template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
For the job to show up in the Spanner Instances page of the console, the job name must match the following format:
cloud-spanner-import-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME
Replace the following:
- SPANNER_INSTANCE_ID: your Spanner instance's ID
- SPANNER_DATABASE_NAME: your Spanner database's name
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Spanner template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner \ --region REGION_NAME \ --staging-location GCS_STAGING_LOCATION \ --parameters \ instanceId=INSTANCE_ID,\ databaseId=DATABASE_ID,\ inputDir=GCS_DIRECTORY
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID: the ID of the Spanner instance that contains the database
- DATABASE_ID: the ID of the Spanner database to import to
- GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner { "jobName": "JOB_NAME", "parameters": { "instanceId": "INSTANCE_ID", "databaseId": "DATABASE_ID", "inputDir": "gs://GCS_DIRECTORY" }, "environment": { "machineType": "n1-standard-2" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID: the ID of the Spanner instance that contains the database
- DATABASE_ID: the ID of the Spanner database to import to
- GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder
Cloud Storage Parquet to Bigtable
The Cloud Storage Parquet to Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.
Requirements for this pipeline:
- The Bigtable table must exist and have the same column families as exported in the Parquet files.
- The input Parquet files must exist in a Cloud Storage bucket before running the pipeline.
- Bigtable expects a specific schema from the input Parquet files.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId |
The ID of the Google Cloud project of the Bigtable instance that you want to write data to. |
bigtableInstanceId |
The ID of the Bigtable instance that contains the table. |
bigtableTableId |
The ID of the Bigtable table to import. |
inputFilePattern |
The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix* . |
Running the Cloud Storage Parquet file to Bigtable template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Parquet Files on Cloud Storage to Cloud Bigtable template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable \ --region REGION_NAME \ --parameters \ bigtableProjectId=BIGTABLE_PROJECT_ID,\ bigtableInstanceId=INSTANCE_ID,\ bigtableTableId=TABLE_ID,\ inputFilePattern=INPUT_FILE_PATTERN
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to import into
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable { "jobName": "JOB_NAME", "parameters": { "bigtableProjectId": "BIGTABLE_PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "inputFilePattern": "INPUT_FILE_PATTERN" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to import into
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
Cloud Storage SequenceFile to Bigtable
The Cloud Storage SequenceFile to Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.
Requirements for this pipeline:
- The Bigtable table must exist.
- The input SequenceFiles must exist in a Cloud Storage bucket before running the pipeline.
- The input SequenceFiles must have been exported from Bigtable or HBase.
Template parameters
Parameter | Description |
---|---|
bigtableProject |
The ID of the Google Cloud project of the Bigtable instance that you want to write data to. |
bigtableInstanceId |
The ID of the Bigtable instance that contains the table. |
bigtableTableId |
The ID of the Bigtable table to import. |
bigtableAppProfileId |
The ID of the Bigtable application profile to be used for the import. If you do not specify an app profile, Bigtable uses the instance's default app profile. |
sourcePattern |
The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix* . |
Running the Cloud Storage SequenceFile to Bigtable template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the SequenceFile Files on Cloud Storage to Cloud Bigtable template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable \ --region REGION_NAME \ --parameters \ bigtableProject=BIGTABLE_PROJECT_ID,\ bigtableInstanceId=INSTANCE_ID,\ bigtableTableId=TABLE_ID,\ bigtableAppProfileId=APPLICATION_PROFILE_ID,\ sourcePattern=SOURCE_PATTERN
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to import into
- APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the import
- SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable { "jobName": "JOB_NAME", "parameters": { "bigtableProject": "BIGTABLE_PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "bigtableAppProfileId": "APPLICATION_PROFILE_ID", "sourcePattern": "SOURCE_PATTERN" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- BIGTABLE_PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Bigtable instance that contains the table
- TABLE_ID: the ID of the Bigtable table to import into
- APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the import
- SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
Cloud Storage Text to BigQuery
The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and append the result to a BigQuery table.
Requirements for this pipeline:
- Create a JSON file that describes your BigQuery schema.
  Ensure that there is a top-level JSON array titled BigQuery Schema and that its contents follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.
  The Cloud Storage Text to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.
  The following JSON describes an example BigQuery schema:
{ "BigQuery Schema": [ { "name": "location", "type": "STRING" }, { "name": "name", "type": "STRING" }, { "name": "age", "type": "STRING" }, { "name": "color", "type": "STRING" }, { "name": "coffee", "type": "STRING" } ] }
- Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Your function must return a JSON string.
  For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.
function transform(line) { var values = line.split(','); var obj = new Object(); obj.location = values[0]; obj.name = values[1]; obj.age = values[2]; obj.color = values[3]; obj.coffee = values[4]; var jsonString = JSON.stringify(obj); return jsonString; }
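As an illustration of how this UDF is applied, a single CSV input line and the JSON string it returns are shown below; the sample values are hypothetical.
Input line read from Cloud Storage:
Seattle,Alice,30,blue,latte
JSON string returned by transform(line):
{"location":"Seattle","name":"Alice","age":"30","color":"blue","coffee":"latte"}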
Template parameters
Parameter | Description |
---|---|
javascriptTextTransformFunctionName |
The name of the JavaScript user-defined function (UDF) that you want to use.
For example, if your JavaScript function code is
myTransform(inJson) { /*...do stuff...*/ } , then the function name is
myTransform . For sample JavaScript UDFs, see
UDF Examples.
|
JSONPath |
The gs:// path to the JSON file that defines your BigQuery schema, stored in
Cloud Storage. For example, gs://path/to/my/schema.json . |
javascriptTextTransformGcsPath |
The Cloud Storage URI of the .js file that defines the JavaScript user-defined
function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js .
|
inputFilePattern |
The gs:// path to the text in Cloud Storage you'd like to process. For
example, gs://path/to/my/text/data.txt . |
outputTable |
The BigQuery table name you want to create to store your processed data in.
If you reuse an existing BigQuery table, the data is appended to the destination table.
For example, my-project-name:my-dataset.my-table . |
bigQueryLoadingTemporaryDirectory |
The temporary directory for the BigQuery loading process.
For example, gs://my-bucket/my-files/temp_dir . |
Running the Cloud Storage Text to BigQuery template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to BigQuery (Batch) template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery \ --region REGION_NAME \ --parameters \ javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\ JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ inputFilePattern=PATH_TO_TEXT_DATA,\ outputTable=BIGQUERY_TABLE,\ bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example, gs://my-bucket/my-udfs/my_file.js
- PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
- BIGQUERY_TABLE: your BigQuery table name
- PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery { "jobName": "JOB_NAME", "parameters": { "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION", "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "inputFilePattern":"PATH_TO_TEXT_DATA", "outputTable":"BIGQUERY_TABLE", "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example, gs://my-bucket/my-udfs/my_file.js
- PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
- BIGQUERY_TABLE: your BigQuery table name
- PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
Cloud Storage Text to Datastore [Deprecated]
This template is deprecated and will be removed in Q1 2022. Please migrate to the Cloud Storage Text to Firestore template.
The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON encoded Entities to Datastore. Each line in the input text files must be in the specified JSON format.
Requirements for this pipeline:
- Datastore must be enabled in the destination project.
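Each input line must be a single JSON-encoded entity. The following line is a minimal sketch that follows the Datastore v1 REST representation of an Entity (a key plus typed property values); the exact shape the template accepts is an assumption here, and the project, kind, and property values are placeholders, so validate against a small test file before running a large import.
{"key": {"partitionId": {"projectId": "my-project"}, "path": [{"kind": "Task", "name": "task-001"}]}, "properties": {"description": {"stringValue": "Buy milk"}, "done": {"booleanValue": false}}}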
Template parameters
Parameter | Description |
---|---|
textReadPattern |
A Cloud Storage path pattern that specifies the location of your text data files.
For example, gs://mybucket/somepath/*.json . |
javascriptTextTransformGcsPath |
(Optional)
The Cloud Storage URI of the .js file that defines the JavaScript user-defined
function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js .
|
javascriptTextTransformFunctionName |
(Optional)
The name of the JavaScript user-defined function (UDF) that you want to use.
For example, if your JavaScript function code is
myTransform(inJson) { /*...do stuff...*/ } , then the function name is
myTransform . For sample JavaScript UDFs, see
UDF Examples.
|
datastoreWriteProjectId |
The ID of the Google Cloud project where the Datastore entities are written. |
datastoreHintNumWorkers |
(Optional) Hint for the expected number of workers in the Datastore ramp-up throttling step. Default is 500 . |
errorWritePath |
The error log output file to use for write failures that occur during processing. For
example, gs://bucket-name/errors.txt . |
Running the Cloud Storage Text to Datastore template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Datastore template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Datastore \ --region REGION_NAME \ --parameters \ textReadPattern=PATH_TO_INPUT_TEXT_FILES,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\ datastoreWriteProjectId=PROJECT_ID,\ errorWritePath=ERROR_FILE_WRITE_PATH
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example, gs://my-bucket/my-udfs/my_file.js
- ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Datastore { "jobName": "JOB_NAME", "parameters": { "textReadPattern": "PATH_TO_INPUT_TEXT_FILES", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION", "datastoreWriteProjectId": "PROJECT_ID", "errorWritePath": "ERROR_FILE_WRITE_PATH" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example, gs://my-bucket/my-udfs/my_file.js
- ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage
Cloud Storage Text to Firestore
The Cloud Storage Text to Firestore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON encoded Entities to Firestore. Each line in the input text files must be in the specified JSON format.
Requirements for this pipeline:
- Firestore must be enabled in the destination project.
Template parameters
Parameter | Description |
---|---|
textReadPattern |
A Cloud Storage path pattern that specifies the location of your text data files.
For example, gs://mybucket/somepath/*.json . |
javascriptTextTransformGcsPath |
(Optional)
The Cloud Storage URI of the .js file that defines the JavaScript user-defined
function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js .
|
javascriptTextTransformFunctionName |
(Optional)
The name of the JavaScript user-defined function (UDF) that you want to use.
For example, if your JavaScript function code is
myTransform(inJson) { /*...do stuff...*/ } , then the function name is
myTransform . For sample JavaScript UDFs, see
UDF Examples.
|
firestoreWriteProjectId |
The ID of the Google Cloud project where the Firestore entities are written. |
firestoreHintNumWorkers |
(Optional) Hint for the expected number of workers in the Firestore ramp-up throttling step. Default is 500 . |
errorWritePath |
The error log output file to use for write failures that occur during processing. For
example, gs://bucket-name/errors.txt . |
Running the Cloud Storage Text to Firestore template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Firestore template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Firestore \ --region REGION_NAME \ --parameters \ textReadPattern=PATH_TO_INPUT_TEXT_FILES,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\ firestoreWriteProjectId=PROJECT_ID,\ errorWritePath=ERROR_FILE_WRITE_PATH
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example, gs://my-bucket/my-udfs/my_file.js
- ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Firestore { "jobName": "JOB_NAME", "parameters": { "textReadPattern": "PATH_TO_INPUT_TEXT_FILES", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION", "firestoreWriteProjectId": "PROJECT_ID", "errorWritePath": "ERROR_FILE_WRITE_PATH" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- PATH_TO_INPUT_TEXT_FILES: the input file pattern on Cloud Storage
- JAVASCRIPT_FUNCTION: the name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples.
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use, for example, gs://my-bucket/my-udfs/my_file.js
- ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage
Cloud Storage Text to Pub/Sub (Batch)
This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic. The template can be used to publish records from a newline-delimited file containing JSON records or from a CSV file to a Pub/Sub topic for real-time processing. You can use this template to replay data to Pub/Sub.
This template does not set any timestamp on the individual records. The event time is equal to the publishing time during execution. If your pipeline relies on an accurate event time for processing, you must not use this pipeline.
Requirements for this pipeline:
- The files to read need to be in newline-delimited JSON or CSV format. Records spanning multiple lines in the source files might cause issues downstream because each line within the files will be published as a message to Pub/Sub.
- The Pub/Sub topic must exist before running the pipeline.
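For example, a newline-delimited JSON input file could look like the following; each line is published as a separate Pub/Sub message (the records shown are placeholders):
{"id": 1, "name": "Alice", "ts": "2021-09-20T10:00:00Z"}
{"id": 2, "name": "Bob", "ts": "2021-09-20T10:00:05Z"}
{"id": 3, "name": "Carol", "ts": "2021-09-20T10:00:09Z"}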
Template parameters
Parameter | Description |
---|---|
inputFilePattern |
The input file pattern to read from. For example, gs://bucket-name/files/*.json . |
outputTopic |
The Pub/Sub topic to write to. The name must be in the format of
projects/<project-id>/topics/<topic-name> . |
Running the Cloud Storage Text to Pub/Sub (Batch) template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Pub/Sub (Batch) template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub \ --region REGION_NAME \ --parameters \ inputFilePattern=gs://BUCKET_NAME/files/*.json,\ outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- TOPIC_NAME: your Pub/Sub topic name
- BUCKET_NAME: the name of your Cloud Storage bucket
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub { "jobName": "JOB_NAME", "parameters": { "inputFilePattern": "gs://BUCKET_NAME/files/*.json", "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- TOPIC_NAME: your Pub/Sub topic name
- BUCKET_NAME: the name of your Cloud Storage bucket
Cloud Storage Text to Cloud Spanner
The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Cloud Spanner database.
Requirements for this pipeline:
- The target Cloud Spanner database and table must exist.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
- The input Cloud Storage path containing the CSV files must exist.
- You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
- If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
- The manifest file, encoded in ASCII or UTF-8, must match the format shown in the sketch after this list.
- Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8 encoded files.
- Data must match one of the following types:
  Google Standard SQL:
  - BOOL
  - INT64
  - FLOAT64
  - NUMERIC
  - STRING
  - DATE
  - TIMESTAMP
  - BYTES
  - JSON
  PostgreSQL:
  - boolean
  - bigint
  - double precision
  - numeric
  - character varying, text
  - date
  - timestamp with time zone
  - bytea
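The following is a minimal sketch of an import manifest, assuming the commonly documented shape (a tables array with table_name, file_patterns, and optional columns entries). The table, pattern, and column values are placeholders; check the manifest format reference for your template version before relying on it.
{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": ["gs://mybucket/csv/Albums*.csv"],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "AlbumTitle", "type_name": "STRING"}
      ]
    }
  ]
}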
Template parameters
Parameter | Description |
---|---|
instanceId |
The instance ID of the Cloud Spanner database. |
databaseId |
The database ID of the Cloud Spanner database. |
importManifest |
The path in Cloud Storage to the import manifest file. |
columnDelimiter |
The column delimiter that the source file uses. The default value is , . |
fieldQualifier |
The character that must surround any value in the source file that contains the columnDelimiter . The default value is " . |
trailingDelimiter |
Specifies whether the lines in the source files have trailing delimiters (that is, if the columnDelimiter character appears at the end of each line, after the last column value). The default value is true . |
escape |
The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character. |
nullString |
The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string. |
dateFormat |
The format used to parse date columns. By default, the pipeline tries to parse the date columns as yyyy-M-d[' 00:00:00'] , for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the java.time.format.DateTimeFormatter patterns. |
timestampFormat |
The format used to parse timestamp columns. If the timestamp is a long integer, then it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT format. For other cases, specify your own pattern string, for example, using MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form of "Jan 21 1998 01:02:03.456+08:00" . |
If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:
Type | Input value | Format | Remark |
---|---|---|---|
DATE | 2011-3-31 | | By default, the template can parse this format. You don't need to specify the dateFormat parameter. |
DATE | 2011-3-31 00:00:00 | | By default, the template can parse this format. You don't need to specify the format. If you like, you can use yyyy-M-d' 00:00:00' . |
DATE | 01 Apr, 18 | dd MMM, yy | |
DATE | Wednesday, April 3, 2019 AD | EEEE, LLLL d, yyyy G | |
TIMESTAMP | 2019-01-02T11:22:33Z, 2019-01-02T11:22:33.123Z, or 2019-01-02T11:22:33.12356789Z | | The default format ISO_INSTANT can parse this type of timestamp. You don't need to provide the timestampFormat parameter. |
TIMESTAMP | 1568402363 | | By default, the template can parse this type of timestamp and treat it as Unix epoch time. |
TIMESTAMP | Tue, 3 Jun 2008 11:05:30 GMT | EEE, d MMM yyyy HH:mm:ss VV | |
TIMESTAMP | 2018/12/31 110530.123PST | yyyy/MM/dd HHmmss.SSSz | |
TIMESTAMP | 2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z | yyyy-MM-dd'T'HH:mm:ss[.SSS]VV | If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp. You don't need to provide your own format parameter. You can use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both cases. You cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z' , because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a java.time.Instant . Therefore, it must be specified in UTC or have time zone information associated with it. Local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid java.time.Instant . |
Running the Text Files on Cloud Storage to Cloud Spanner template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner \ --region REGION_NAME \ --parameters \ instanceId=INSTANCE_ID,\ databaseId=DATABASE_ID,\ importManifest=GCS_PATH_TO_IMPORT_MANIFEST
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner { "jobName": "JOB_NAME", "parameters": { "instanceId": "INSTANCE_ID", "databaseId": "DATABASE_ID", "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST" }, "environment": { "machineType": "n1-standard-2" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
Cloud Storage to Elasticsearch
The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data to Elasticsearch as JSON documents.
Requirements for this pipeline:
- The Cloud Storage bucket must exist.
- An Elasticsearch host on a Google Cloud instance or on Elastic Cloud that is accessible from Dataflow must exist.
- A BigQuery table for error output must exist.
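As a rough illustration only (assuming containsHeaders=true, no UDF, and no JSON schema), a CSV file like the following would typically be indexed as one JSON document per data row. The field names and values are placeholders, and exact field typing depends on the optional jsonSchemaPath, csvFormat, and UDF parameters.
id,name,price
1,apple,0.50
2,banana,0.25
might be written to the index as documents similar to:
{"id": "1", "name": "apple", "price": "0.50"}
{"id": "2", "name": "banana", "price": "0.25"}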
Template parameters
Parameter | Description |
---|---|
inputFileSpec |
The Cloud Storage file pattern to search for CSV files. Example: gs://mybucket/test-*.csv . |
connectionUrl |
Elasticsearch URL in the format https://hostname:[port] , or specify CloudID if using Elastic Cloud. |
apiKey |
Base64-encoded API key used for authentication. |
index |
The Elasticsearch index toward which the requests will be issued, such as my-index . |
deadletterTable |
The BigQuery deadletter table to send failed inserts to. Example: <your-project>:<your-dataset>.<your-table-name> . |
containsHeaders |
(Optional) Boolean to denote whether headers are included in the CSV. Default: true . |
delimiter |
(Optional) The delimiter that the CSV uses. Example: , |
csvFormat |
(Optional) The CSV format according to Apache Commons CSV format. Default: Default . |
jsonSchemaPath |
(Optional) The path to the JSON schema. Default: null . |
largeNumFiles |
(Optional) Set to true if the number of files is in the tens of thousands. Default: false . |
javascriptTextTransformGcsPath |
(Optional) The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) you want to use. For example, gs://my-bucket/my-udfs/my_file.js . |
javascriptTextTransformFunctionName |
(Optional) The name of the JavaScript user-defined function (UDF) that you want to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ } , then the function name is myTransform . For sample JavaScript UDFs, see UDF Examples. |
batchSize |
(Optional) Batch size in number of documents. Default: 1000 . |
batchSizeBytes |
(Optional) Batch size in number of bytes. Default: 5242880 (5 MB). |
maxRetryAttempts |
(Optional) Max retry attempts; must be > 0. Default: no retries. |
maxRetryDuration |
(Optional) Max retry duration in milliseconds; must be > 0. Default: no retries. |
csvFileEncoding |
(Optional) CSV file encoding. |
Running the Cloud Storage to Elasticsearch template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Storage to Elasticsearch template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud beta dataflow flex-template run JOB_NAME \ --project=PROJECT_ID \ --region=REGION_NAME \ --template-file-gcs-location=gs://dataflow-templates/VERSION/flex/GCS_To_Elasticsearch \ --parameters \ inputFileSpec=INPUT_FILE_SPEC,\ connectionUrl=CONNECTION_URL,\ apiKey=APIKEY,\ index=INDEX,\ deadletterTable=DEADLETTER_TABLE
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- REGION_NAME: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- INPUT_FILE_SPEC: your Cloud Storage file pattern
- CONNECTION_URL: your Elasticsearch URL
- APIKEY: your Base64-encoded API key for authentication
- INDEX: your Elasticsearch index
- DEADLETTER_TABLE: your BigQuery table
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch { "launch_parameter": { "jobName": "JOB_NAME", "parameters": { "inputFileSpec": "INPUT_FILE_SPEC", "connectionUrl": "CONNECTION_URL", "apiKey": "APIKEY", "index": "INDEX", "deadletterTable": "DEADLETTER_TABLE" }, "containerSpecGcsPath": "gs://dataflow-templates/VERSION/flex/GCS_To_Elasticsearch" } }
Replace the following:
- PROJECT_ID: the Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/
- LOCATION: the regional endpoint where you want to deploy your Dataflow job, for example, us-central1
- INPUT_FILE_SPEC: your Cloud Storage file pattern
- CONNECTION_URL: your Elasticsearch URL
- APIKEY: your Base64-encoded API key for authentication
- INDEX: your Elasticsearch index
- DEADLETTER_TABLE: your BigQuery table
Java Database Connectivity (JDBC) to BigQuery
The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. This pipeline uses JDBC to connect to the relational database. You can use this template to copy data from any relational database with available JDBC drivers into BigQuery. For an extra layer of protection, you can also pass in a Cloud KMS key along with Base64-encoded username, password, and connection string parameters encrypted with the Cloud KMS key. See the Cloud KMS API encryption endpoint for additional details on encrypting your username, password, and connection string parameters.
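The following is a minimal sketch of encrypting a connection string with the Cloud KMS encrypt endpoint; the key ring and key names are placeholders, and you should confirm the request shape against the Cloud KMS API reference. The ciphertext field of the response is the Base64-encoded value to pass to the template, with KMSEncryptionKey set to the same key.
curl -s -X POST "https://cloudkms.googleapis.com/v1/projects/PROJECT_ID/locations/global/keyRings/KEYRING/cryptoKeys/KEY:encrypt" \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{\"plaintext\": \"$(echo -n 'jdbc:mysql://some-host:3306/sampledb' | base64 -w 0)\"}"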
Requirements for this pipeline:
- The JDBC drivers for the relational database must be available.
- The BigQuery table must exist before pipeline execution.
- The BigQuery table must have a compatible schema.
- The relational database must be accessible from the subnet where Dataflow runs.
Template parameters
Parameter | Description |
---|---|
driverJars |
The comma-separated list of driver JAR files. For example, gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar . |
driverClassName |
The JDBC driver class name. For example, com.mysql.jdbc.Driver . |
connectionURL |
The JDBC connection URL string. For example, jdbc:mysql://some-host:3306/sampledb . Can be passed in as a string that's Base64-encoded and then encrypted with a Cloud KMS key. |
query |
The query to be run on the source to extract the data. For example, select * from sampledb.sample_table . |
outputTable |
The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table> . |
bigQueryLoadingTemporaryDirectory |
The temporary directory for the BigQuery loading process. For example, gs://<my-bucket>/my-files/temp_dir . |
connectionProperties |
(Optional) Properties string to use for the JDBC connection. Format of the string must be [propertyName=property;]* . For example, unicode=true;characterEncoding=UTF-8 . |
username |
(Optional) The username to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
password |
(Optional) The password to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
KMSEncryptionKey |
(Optional) The Cloud KMS encryption key to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted. |
disabledAlgorithms |
(Optional) Comma-separated algorithms to disable. If this value is set to none , then no algorithm is disabled. Use with care, as the algorithms disabled by default are known to have either vulnerabilities or performance issues. For example: SSLv3, RC4. |
extraFilesToStage |
Comma-separated Cloud Storage paths or Secret Manager secrets for files to stage in the worker. These files are saved under the /extra_files directory in each worker. For example, gs://<my-bucket>/file.txt,projects/<project-id>/secrets/<secret-id>/versions/<version-id> . |
Running the JDBC to BigQuery template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the JDBC to BigQuery template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/VERSION/Jdbc_to_BigQuery \ --region REGION_NAME \ --parameters \ driverJars=DRIVER_PATHS,\ driverClassName=DRIVER_CLASS_NAME,\ connectionURL=JDBC_CONNECTION_URL,\ query=SOURCE_SQL_QUERY,\ outputTable=PROJECT_ID:DATASET.TABLE_NAME,\ bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\ connectionProperties=CONNECTION_PROPERTIES,\ username=CONNECTION_USERNAME,\ password=CONNECTION_PASSWORD,\ KMSEncryptionKey=KMS_ENCRYPTION_KEY
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates/latest/
  - the version name, like 2021-09-20-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates/