Google provides a set of open-source Dataflow templates. For general information about templates, see the Overview page. For a list of all Google-provided templates, see the Get started with Google-provided templates page.
This page documents batch templates:
- BigQuery to Cloud Storage TFRecords
- BigQuery to Cloud Storage Parquet (Storage API)
- Cloud Bigtable to Cloud Storage Avro
- Cloud Bigtable to Cloud Storage Parquet
- Cloud Bigtable to Cloud Storage SequenceFile
- Datastore to Cloud Storage Text
- Cloud Spanner to Cloud Storage Avro
- Cloud Spanner to Cloud Storage Text
- Cloud Storage Avro to Cloud Bigtable
- Cloud Storage Avro to Cloud Spanner
- Cloud Storage Parquet to Cloud Bigtable
- Cloud Storage SequenceFile to Cloud Bigtable
- Cloud Storage Text to BigQuery
- Cloud Storage Text to Datastore
- Cloud Storage Text to Pub/Sub (Batch)
- Cloud Storage Text to Cloud Spanner
- Java Database Connectivity (JDBC) to BigQuery
- Apache Cassandra to Cloud Bigtable
- Apache Hive to BigQuery
- File Format Conversion
BigQuery to Cloud Storage TFRecords
The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the training, testing, and validation percentage splits. By default, the split is 1 or 100% for the training set and 0 or 0% for testing and validation sets. Note that when setting the dataset split, the sum of training, testing, and validation needs to add up to 1 or 100% (for example, 0.6+0.2+0.2). Dataflow automatically determines the optimal number of shards for each output dataset.
Requirements for this pipeline:
- The BigQuery dataset and table must exist.
- The output Cloud Storage bucket must exist prior to pipeline execution. Note that training, testing, and validation subdirectories do not need to preexist and will be autogenerated.
Template parameters
Parameter | Description |
---|---|
readQuery | A BigQuery SQL query that extracts data from the source. For example, select * from dataset1.sample_table. |
outputDirectory | The top-level Cloud Storage path prefix at which to write the training, testing, and validation TFRecord files. For example, gs://mybucket/output. Subdirectories for the resulting training, testing, and validation TFRecord files are automatically generated based on outputDirectory. For example, gs://mybucket/output/train. |
trainingPercentage | (Optional) The percentage of query data allocated to training TFRecord files. The default value is 1, or 100%. |
testingPercentage | (Optional) The percentage of query data allocated to testing TFRecord files. The default value is 0, or 0%. |
validationPercentage | (Optional) The percentage of query data allocated to validation TFRecord files. The default value is 0, or 0%. |
outputSuffix | (Optional) The file suffix for the training, testing, and validation TFRecord files that are written. The default value is .tfrecord. |
Executing the BigQuery to Cloud Storage TFRecord files template
Console
Execute from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the BigQuery to Cloud Storage TFRecords template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Execute from the gcloud command-line tool
Note: To run templates with the gcloud command-line tool, you must have Cloud SDK version 138.0.0 or later.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \ --parameters readQuery=READ_QUERY,outputDirectory=OUTPUT_DIRECTORY,trainingPercentage=TRAINING_PERCENTAGE,testingPercentage=TESTING_PERCENTAGE,validationPercentage=VALIDATION_PERCENTAGE,outputSuffix=OUTPUT_FILENAME_SUFFIX
Replace the following values:
- JOB_NAME: a job name of your choice
- READ_QUERY: the BigQuery query to be executed
- OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
- TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
- TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
- VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
- OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
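For example, a filled-in command might look like the following sketch; the job name, query, and bucket are hypothetical values, not defaults.
gcloud dataflow jobs run my-tfrecords-export \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --parameters 'readQuery=select * from my_dataset.sample_table,outputDirectory=gs://my-bucket/output,trainingPercentage=0.7,testingPercentage=0.15,validationPercentage=0.15,outputSuffix=.tfrecord'
Here the splits 0.7 + 0.15 + 0.15 sum to 1, satisfying the constraint that training, testing, and validation percentages add up to 100%.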
API
Execute from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
To execute the template with the REST API , send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records { "jobName": "JOB_NAME", "parameters": { "readQuery":"READ_QUERY", "outputDirectory":"OUTPUT_DIRECTORY", "trainingPercentage":"TRAINING_PERCENTAGE", "testingPercentage":"TESTING_PERCENTAGE", "validationPercentage":"VALIDATION_PERCENTAGE", "outputSuffix":"OUTPUT_FILENAME_SUFFIX" }, "environment": { "zone": "us-central1-f" } }
Replace the following values:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- READ_QUERY: the BigQuery query to be executed
- OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
- TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
- TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
- VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
- OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
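One way to send this request is with curl, using an access token from the gcloud tool. This is only a sketch: it assumes the gcloud CLI is installed and authorized, and every value shown is hypothetical.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "jobName": "my-tfrecords-export",
        "parameters": {
          "readQuery": "select * from my_dataset.sample_table",
          "outputDirectory": "gs://my-bucket/output",
          "trainingPercentage": "0.7",
          "testingPercentage": "0.15",
          "validationPercentage": "0.15"
        },
        "environment": { "zone": "us-central1-f" }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records"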
BigQuery to Cloud Storage Parquet (Storage API)
The BigQuery to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template utilizes the BigQuery Storage API to export the data.
Requirements for this pipeline:
- The input BigQuery table must exist prior to running the pipeline.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
tableRef | The BigQuery input table location. For example, <my-project>:<my-dataset>.<my-table>. |
bucket | The Cloud Storage folder in which to write the Parquet files. For example, gs://mybucket/exports. |
numShards | (Optional) The number of output file shards. The default value is 1. |
fields | (Optional) A comma-separated list of fields to select from the input BigQuery table. |
Running the BigQuery to Cloud Storage Parquet template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the BigQuery to Parquet template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run Flex templates, you must have Cloud SDK version 284.0.0 or higher.
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/BigQuery_To_Parquet
gcloud beta dataflow flex-template run JOB_NAME \ --project=PROJECT_ID \ --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_To_Parquet \ --parameters \ tableRef=BIGQUERY_TABLE,\ bucket=OUTPUT_DIRECTORY,\ numShards=NUM_SHARDS,\ fields=FIELDS
Replace the following values:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- BIGQUERY_TABLE: your BigQuery table name
- OUTPUT_DIRECTORY: your Cloud Storage folder for output files
- NUM_SHARDS: the desired number of output file shards
- FIELDS: the comma-separated list of fields to select from the input BigQuery table
- LOCATION: the execution region, for example, us-central1
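As an illustration only, a filled-in command with hypothetical project, dataset, table, and bucket names might look like this; the single field value keeps the --parameters flag free of embedded commas.
gcloud beta dataflow flex-template run parquet-export-job \
    --project=my-project \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_To_Parquet \
    --parameters tableRef=my-project:my_dataset.my_table,bucket=gs://my-bucket/exports,numShards=5,fields=name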
API
Run from the REST API
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/BigQuery_To_Parquet
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch { "launch_parameter": { "jobName": "JOB_NAME", "parameters": { "tableRef": "BIGQUERY_TABLE", "bucket": "OUTPUT_DIRECTORY", "numShards": "NUM_SHARDS", "fields": "FIELDS" }, "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/BigQuery_To_Parquet" } }
Replace the following values:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- BIGQUERY_TABLE: your BigQuery table name
- OUTPUT_DIRECTORY: your Cloud Storage folder for output files
- NUM_SHARDS: the desired number of output file shards
- FIELDS: the comma-separated list of fields to select from the input BigQuery table
- LOCATION: the execution region, for example, us-central1
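For reference, a hedged curl sketch of the same launch request, again with hypothetical names and assuming an authorized gcloud CLI for the access token:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "launch_parameter": {
          "jobName": "parquet-export-job",
          "parameters": {
            "tableRef": "my-project:my_dataset.my_table",
            "bucket": "gs://my-bucket/exports"
          },
          "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/BigQuery_To_Parquet"
        }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/flexTemplates:launch"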
Cloud Bigtable to Cloud Storage Avro
The Cloud Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Cloud Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Cloud Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to export. |
outputDirectory | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the Avro file name. For example, output-. |
Running the Cloud Bigtable to Cloud Storage Avro file template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Bigtable to Avro template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro \ --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,outputDirectory=OUTPUT_DIRECTORY,filenamePrefix=FILENAME_PREFIX
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Avro file name, for example, output-
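A minimal filled-in sketch, using hypothetical project, instance, table, and bucket names:
gcloud dataflow jobs run bigtable-avro-export \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro \
    --parameters bigtableProjectId=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-table,outputDirectory=gs://my-bucket/somefolder,filenamePrefix=output-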
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro { "jobName": "JOB_NAME", "parameters": { "bigtableProjectId": "PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "outputDirectory": "OUTPUT_DIRECTORY", "filenamePrefix": "FILENAME_PREFIX" }, "environment": { "zone": "us-central1-f" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Avro file name, for example, output-
Cloud Bigtable to Cloud Storage Parquet
The Cloud Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Cloud Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Cloud Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to export. |
outputDirectory | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the Parquet file name. For example, output-. |
numShards | The number of output file shards. For example, 2. |
Running the Cloud Bigtable to Cloud Storage Parquet file template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Bigtable to Parquet template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Parquet \ --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,outputDirectory=OUTPUT_DIRECTORY,filenamePrefix=FILENAME_PREFIX,numShards=NUM_SHARDS
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Parquet file name, for example, output-
- NUM_SHARDS: the number of Parquet files to output, for example, 1
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Parquet { "jobName": "JOB_NAME", "parameters": { "bigtableProjectId": "PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "outputDirectory": "OUTPUT_DIRECTORY", "filenamePrefix": "FILENAME_PREFIX", "numShards": "NUM_SHARDS" }, "environment": { "zone": "us-central1-f" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to export
- OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the Parquet file name, for example, output-
- NUM_SHARDS: the number of Parquet files to output, for example, 1
Cloud Bigtable to Cloud Storage SequenceFile
The Cloud Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Cloud Bigtable table and writes the data to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Cloud Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProject | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to export. |
bigtableAppProfileId | The ID of the Cloud Bigtable application profile to be used for the export. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile. |
destinationPath | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the SequenceFile file name. For example, output-. |
Running the Cloud Bigtable to Cloud Storage SequenceFile template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Bigtable to SequenceFile template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \ --parameters bigtableProject=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,bigtableAppProfileId=APPLICATION_PROFILE_ID,destinationPath=DESTINATION_PATH,filenamePrefix=FILENAME_PREFIX
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to export
- APPLICATION_PROFILE_ID: the ID of the Cloud Bigtable application profile to be used for the export
- DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the SequenceFile file name, for example, output-
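A minimal filled-in sketch, with hypothetical project, instance, table, and bucket names; bigtableAppProfileId is set here to the instance's default app profile.
gcloud dataflow jobs run bigtable-sequencefile-export \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \
    --parameters bigtableProject=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-table,bigtableAppProfileId=default,destinationPath=gs://my-bucket/somefolder,filenamePrefix=output-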
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile { "jobName": "JOB_NAME", "parameters": { "bigtableProject": "PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "bigtableAppProfileId": "APPLICATION_PROFILE_ID", "destinationPath": "DESTINATION_PATH", "filenamePrefix": "FILENAME_PREFIX" }, "environment": { "zone": "us-central1-f" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to export
- APPLICATION_PROFILE_ID: the ID of the Cloud Bigtable application profile to be used for the export
- DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
- FILENAME_PREFIX: the prefix of the SequenceFile file name, for example, output-
Datastore to Cloud Storage Text
The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.
Requirements for this pipeline:
Datastore must be set up in the project prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
datastoreReadGqlQuery | A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind. |
datastoreReadProjectId | The Google Cloud project ID of the Datastore instance that you want to read data from. |
datastoreReadNamespace | The namespace of the requested entities. To use the default namespace, leave this parameter blank. |
javascriptTextTransformGcsPath | A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank. |
javascriptTextTransformFunctionName | The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...}, the function name is myTransform. If you don't want to provide a function, leave this parameter blank. |
textWritePrefix | The Cloud Storage path prefix to specify where the data is written. For example, gs://mybucket/somefolder/. |
Running the Datastore to Cloud Storage Text template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Datastore to Cloud Storage Text template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \ --parameters \ datastoreReadGqlQuery="SELECT * FROM DATASTORE_KIND",\ datastoreReadProjectId=PROJECT_ID,\ datastoreReadNamespace=DATASTORE_NAMESPACE,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\ textWritePrefix=gs://BUCKET_NAME/output/
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- BUCKET_NAME: the name of your Cloud Storage bucket
- DATASTORE_KIND: the kind of your Datastore entities
- DATASTORE_NAMESPACE: the namespace of your Datastore entities
- JAVASCRIPT_FUNCTION: your JavaScript function name
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
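For instance, to export every entity of a hypothetical Users kind as JSON-serialized lines without supplying a UDF, a sketch of the command could leave the optional transform parameters unset; the project, kind, and bucket names here are illustrative only.
gcloud dataflow jobs run datastore-text-export \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters datastoreReadGqlQuery="SELECT * FROM Users",datastoreReadProjectId=my-project,textWritePrefix=gs://my-bucket/output/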
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text { "jobName": "JOB_NAME", "parameters": { "datastoreReadGqlQuery": "SELECT * FROM DATASTORE_KIND", "datastoreReadProjectId": "PROJECT_ID", "datastoreReadNamespace": "DATASTORE_NAMESPACE", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION", "textWritePrefix": "gs://BUCKET_NAME/output/" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- BUCKET_NAME: the name of your Cloud Storage bucket
- DATASTORE_KIND: the kind of your Datastore entities
- DATASTORE_NAMESPACE: the namespace of your Datastore entities
- JAVASCRIPT_FUNCTION: your JavaScript function name
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
Cloud Spanner to Cloud Storage Avro
The Cloud Spanner to Cloud Storage Avro template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:
- A spanner-export.json file.
- A TableName-manifest.json file for each table in the database you exported.
- One or more TableName.avro-#####-of-##### files.
For example, exporting a database with two tables, Singers and Albums, creates the following file set:
Albums-manifest.json
Albums.avro-00000-of-00002
Albums.avro-00001-of-00002
Singers-manifest.json
Singers.avro-00000-of-00003
Singers.avro-00001-of-00003
Singers.avro-00002-of-00003
spanner-export.json
Requirements for this pipeline:
- The Cloud Spanner database must exist.
- The output Cloud Storage bucket must exist.
- In addition to the IAM roles necessary to run Dataflow jobs, you must also have the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.
Template parameters
Parameter | Description |
---|---|
instanceId | The instance ID of the Cloud Spanner database that you want to export. |
databaseId | The database ID of the Cloud Spanner database that you want to export. |
outputDir | The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files. |
snapshotTime | (Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in the RFC 3339 UTC "Zulu" format. For example, 1990-12-31T23:59:60Z. The timestamp must be in the past, and Maximum timestamp staleness applies. |
spannerProjectId | (Optional) The Google Cloud Project ID of the Cloud Spanner database that you want to read data from. |
Running the Cloud Spanner to Avro Files on Cloud Storage template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Spanner to Cloud Storage Avro template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the Cloud Console.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
gcloud dataflow jobs run JOB_NAME \ --gcs-location='gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro' \ --region=DATAFLOW_REGION \ --parameters='instanceId=INSTANCE_ID,databaseId=DATABASE_ID,outputDir=GCS_DIRECTORY'
Replace the following:
- JOB_NAME: a job name of your choice
- DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
- GCS_STAGING_LOCATION: the path for writing temporary files, for example, gs://mybucket/temp
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
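As a sketch with hypothetical instance, database, and bucket names, the following command also uses a job name in the cloud-spanner-export-INSTANCE_ID-DATABASE_ID format so the export appears in the Cloud Spanner portion of the Cloud Console:
gcloud dataflow jobs run cloud-spanner-export-my-instance-my-database \
    --gcs-location='gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro' \
    --region=us-central1 \
    --parameters='instanceId=my-instance,databaseId=my-database,outputDir=gs://my-bucket/spanner-export'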
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/DATAFLOW_REGION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro { "jobName": "JOB_NAME", "parameters": { "instanceId": "INSTANCE_ID", "databaseId": "DATABASE_ID", "outputDir": "gs://GCS_DIRECTORY" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
- JOB_NAME: a job name of your choice. The job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID to show up in the Cloud Spanner portion of the Cloud Console.
Cloud Spanner to Cloud Storage Text
The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads in data from a Cloud Spanner table, optionally transforms the data via a JavaScript User Defined Function (UDF) that you provide, and writes it to Cloud Storage as CSV text files.
Requirements for this pipeline:
- The input Spanner table must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
spannerProjectId | The Google Cloud Project ID of the Cloud Spanner database that you want to read data from. |
spannerDatabaseId | The database ID of the requested table. |
spannerInstanceId | The instance ID of the requested table. |
spannerTable | The table to read the data from. |
textWritePrefix | The directory where output text files are written. Add / at the end. For example, gs://mybucket/somefolder/. |
javascriptTextTransformGcsPath | (Optional) A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank. |
javascriptTextTransformFunctionName | (Optional) The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...}, the function name is myTransform. If you don't want to provide a function, leave this parameter blank. |
Running the Cloud Spanner to Cloud Storage Text template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Spanner to Cloud Storage Text template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Spanner_to_GCS_Text \ --parameters \ spannerProjectId=PROJECT_ID,\ spannerDatabaseId=DATABASE_ID,\ spannerInstanceId=INSTANCE_ID,\ spannerTable=TABLE_ID,\ textWritePrefix=gs://BUCKET_NAME/output/,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- DATABASE_ID: the Spanner database ID
- BUCKET_NAME: the name of your Cloud Storage bucket
- INSTANCE_ID: the Spanner instance ID
- TABLE_ID: the Spanner table ID
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
- JAVASCRIPT_FUNCTION: your JavaScript function name
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Spanner_to_GCS_Text { "jobName": "JOB_NAME", "parameters": { "spannerProjectId": "PROJECT_ID", "spannerDatabaseId": "DATABASE_ID", "spannerInstanceId": "INSTANCE_ID", "spannerTable": "TABLE_ID", "textWritePrefix": "gs://BUCKET_NAME/output/", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- DATABASE_ID: the Spanner database ID
- BUCKET_NAME: the name of your Cloud Storage bucket
- INSTANCE_ID: the Spanner instance ID
- TABLE_ID: the Spanner table ID
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
- JAVASCRIPT_FUNCTION: your JavaScript function name
Cloud Storage Avro to Cloud Bigtable
The Cloud Storage Avro to Cloud Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.
Requirements for this pipeline:
- The Cloud Bigtable table must exist and have the same column families as exported in the Avro files.
- The input Avro files must exist in a Cloud Storage bucket prior to running the pipeline.
- Cloud Bigtable expects a specific schema from the input Avro files.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to import. |
inputFilePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage Avro file to Cloud Bigtable template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Avro to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable \ --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,inputFilePattern=INPUT_FILE_PATTERN
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to import
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable { "jobName": "JOB_NAME", "parameters": { "bigtableProjectId": "PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "inputFilePattern": "INPUT_FILE_PATTERN" }, "environment": { "zone": "us-central1-f" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to import
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
Cloud Storage Avro to Cloud Spanner
The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files exported from Cloud Spanner stored in Cloud Storage and imports them to a Cloud Spanner database.
Requirements for this pipeline:
- The target Cloud Spanner database must exist and must be empty.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
- The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import.
Template parameters
Parameter | Description |
---|---|
instanceId | The instance ID of the Cloud Spanner database. |
databaseId | The database ID of the Cloud Spanner database. |
inputDir | The Cloud Storage path where the Avro files are imported from. |
Running the Cloud Storage Avro to Cloud Spanner template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Avro to Spanner template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- The job name must match the format cloud-spanner-import-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the Cloud Console.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
gcloud dataflow jobs run JOB_NAME \ --gcs-location='gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner' \ --region=DATAFLOW_REGION \ --staging-location=GCS_STAGING_LOCATION \ --parameters='instanceId=INSTANCE_ID,databaseId=DATABASE_ID,inputDir=GCS_DIRECTORY'
Replace the following:
- (API only) PROJECT_ID: your project ID
- DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
- JOB_NAME: a job name of your choice
- INSTANCE_ID: the ID of the Spanner instance that contains the database
- DATABASE_ID: the ID of the Spanner database to import to
- (gcloud only) GCS_STAGING_LOCATION: the path for writing temporary files, for example, gs://mybucket/temp
- GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder
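A filled-in sketch with hypothetical names; the job name follows the cloud-spanner-import-INSTANCE_ID-DATABASE_ID format so the import appears in the Cloud Spanner portion of the Cloud Console.
gcloud dataflow jobs run cloud-spanner-import-my-instance-my-database \
    --gcs-location='gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner' \
    --region=us-central1 \
    --staging-location=gs://my-bucket/temp \
    --parameters='instanceId=my-instance,databaseId=my-database,inputDir=gs://my-bucket/spanner-export'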
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/DATAFLOW_REGION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner { "jobName": "JOB_NAME", "parameters": { "instanceId": "INSTANCE_ID", "databaseId": "DATABASE_ID", "inputDir": "gs://GCS_DIRECTORY" }, "environment": { "machineType": "n1-standard-2" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- (API only) PROJECT_ID: your project ID
- DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
- JOB_NAME: a job name of your choice
- INSTANCE_ID: the ID of the Spanner instance that contains the database
- DATABASE_ID: the ID of the Spanner database to import to
- (gcloud only) GCS_STAGING_LOCATION: the path for writing temporary files, for example, gs://mybucket/temp
- GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder
Cloud Storage Parquet to Cloud Bigtable
The Cloud Storage Parquet to Cloud Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.
Requirements for this pipeline:
- The Cloud Bigtable table must exist and have the same column families as exported in the Parquet files.
- The input Parquet files must exist in a Cloud Storage bucket prior to running the pipeline.
- Cloud Bigtable expects a specific schema from the input Parquet files.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to import. |
inputFilePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage Parquet file to Cloud Bigtable template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Parquet to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable \ --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,inputFilePattern=INPUT_FILE_PATTERN
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to import
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable { "jobName": "JOB_NAME", "parameters": { "bigtableProjectId": "PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "inputFilePattern": "INPUT_FILE_PATTERN" }, "environment": { "zone": "us-central1-f" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to import
- INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
Cloud Storage SequenceFile to Cloud Bigtable
The Cloud Storage SequenceFile to Cloud Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The input SequenceFiles must exist in a Cloud Storage bucket prior to running the pipeline.
- The input SequenceFiles must have been exported from Cloud Bigtable or HBase.
Template parameters
Parameter | Description |
---|---|
bigtableProject | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to import. |
bigtableAppProfileId | The ID of the Cloud Bigtable application profile to be used for the import. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile. |
sourcePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage SequenceFile to Cloud Bigtable template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the SequenceFile Files on Cloud Storage to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \ --parameters bigtableProject=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,bigtableAppProfileId=APPLICATION_PROFILE_ID,sourcePattern=SOURCE_PATTERN
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to import
- APPLICATION_PROFILE_ID: the ID of the Cloud Bigtable application profile to be used for the import
- SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable { "jobName": "JOB_NAME", "parameters": { "bigtableProject": "PROJECT_ID", "bigtableInstanceId": "INSTANCE_ID", "bigtableTableId": "TABLE_ID", "bigtableAppProfileId": "APPLICATION_PROFILE_ID", "sourcePattern": "SOURCE_PATTERN" }, "environment": { "zone": "us-central1-f" } }
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PROJECT_ID: the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to
- INSTANCE_ID: the ID of the Cloud Bigtable instance that contains the table
- TABLE_ID: the ID of the Cloud Bigtable table to import
- APPLICATION_PROFILE_ID: the ID of the Cloud Bigtable application profile to be used for the import
- SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*
Cloud Storage Text to BigQuery
The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and output the result to BigQuery.
IMPORTANT: If you reuse an existing BigQuery table, the data is appended to the destination table.
Requirements for this pipeline:
- Create a JSON file that describes your BigQuery schema. Ensure that there is a top-level JSON array titled BigQuery Schema and that its contents follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}. For example:
{ "BigQuery Schema": [ { "name": "location", "type": "STRING" }, { "name": "name", "type": "STRING" }, { "name": "age", "type": "STRING" }, { "name": "color", "type": "STRING" }, { "name": "coffee", "type": "STRING" } ] }
- Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string. For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.
function transform(line) { var values = line.split(','); var obj = new Object(); obj.location = values[0]; obj.name = values[1]; obj.age = values[2]; obj.color = values[3]; obj.coffee = values[4]; var jsonString = JSON.stringify(obj); return jsonString; }
Template parameters
Parameter | Description |
---|---|
javascriptTextTransformFunctionName | The name of the function you want to call from your .js file. |
JSONPath | The gs:// path to the JSON file that defines your BigQuery schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json. |
javascriptTextTransformGcsPath | The gs:// path to the JavaScript file that defines your UDF. For example, gs://path/to/my/javascript_function.js. |
inputFilePattern | The gs:// path to the text in Cloud Storage you'd like to process. For example, gs://path/to/my/text/data.txt. |
outputTable | The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the data is appended to the destination table. For example, my-project-name:my-dataset.my-table. |
bigQueryLoadingTemporaryDirectory | The temporary directory for the BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir. |
Running the Cloud Storage Text to BigQuery template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to BigQuery template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \ --parameters \ javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\ JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ inputFilePattern=PATH_TO_TEXT_DATA,\ outputTable=BIGQUERY_TABLE,\ bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- JAVASCRIPT_FUNCTION: the name of your UDF
- PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
- PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
- BIGQUERY_TABLE: your BigQuery table name
- PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
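Tying this back to the requirements above, a sketch of a filled-in command might reference the example transform function along with a hypothetical schema file, UDF file, input file, and table; all of these names are illustrative.
gcloud dataflow jobs run text-to-bq-job \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://my-bucket/schemas/my_schema.json,\
javascriptTextTransformGcsPath=gs://my-bucket/udfs/my_transform.js,\
inputFilePattern=gs://my-bucket/input/data.csv,\
outputTable=my-project:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/temp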
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_BigQuery { "jobName": "JOB_NAME", "parameters": { "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION", "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "inputFilePattern":"PATH_TO_TEXT_DATA", "outputTable":"BIGQUERY_TABLE", "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS" }, "environment": { "zone": "us-central1-f" } }
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- JAVASCRIPT_FUNCTION: the name of your UDF
- PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
- PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
- BIGQUERY_TABLE: your BigQuery table name
- PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
Cloud Storage Text to Datastore
The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON encoded Entities to Datastore. Each line in the input text files must be in the specified JSON format.
Requirements for this pipeline:
- Datastore must be enabled in the destination project.
Template parameters
Parameter | Description |
---|---|
textReadPattern |
A Cloud Storage file path pattern that specifies the location of your text data files.
For example, gs://mybucket/somepath/*.json . |
javascriptTextTransformGcsPath |
A Cloud Storage path pattern that contains all your JavaScript code. For example,
gs://mybucket/mytransforms/*.js . If you don't want to provide a function, leave
this parameter blank. |
javascriptTextTransformFunctionName |
Name of the JavaScript function to be called. For example, if your JavaScript function is
function myTransform(inJson) { ...dostuff...} then the function name is
myTransform . If you don't want to provide a function, leave this parameter blank.
|
datastoreWriteProjectId |
The ID of the Google Cloud project where the Datastore entities are written. |
errorWritePath |
The error log output file to use for write failures that occur during processing. For
example, gs://bucket-name/errors.txt . |
Running the Cloud Storage Text to Datastore template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to Datastore template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Datastore \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
datastoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
- JAVASCRIPT_FUNCTION: your JavaScript function name
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
- ERROR_FILE_WRITE_PATH: your desired path to error file on Cloud Storage
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Datastore
{
  "jobName": "JOB_NAME",
  "parameters": {
    "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
    "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
    "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
    "datastoreWriteProjectId": "PROJECT_ID",
    "errorWritePath": "ERROR_FILE_WRITE_PATH"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
- JAVASCRIPT_FUNCTION: your JavaScript function name
- PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
- ERROR_FILE_WRITE_PATH: your desired path to error file on Cloud Storage
Cloud Storage Text to Pub/Sub (Batch)
This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic. You can use it to publish records from a newline-delimited file of JSON records or from a CSV file to a Pub/Sub topic for real-time processing, or to replay data to Pub/Sub.
Note that this template does not set any timestamp on the individual records, so the event time is equal to the publishing time. If your pipeline relies on an accurate event time for processing, do not use this template.
Requirements for this pipeline:
- The files to read need to be in newline-delimited JSON or CSV format. Records spanning multiple lines in the source files might cause issues downstream because each line within the files will be published as a message to Pub/Sub.
- The Pub/Sub topic must exist prior to running the pipeline.
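As a purely illustrative sketch, a newline-delimited JSON input file could contain lines like the following, with each line published as one Pub/Sub message (the field names are hypothetical):
{"orderId": 1, "status": "shipped"}
{"orderId": 2, "status": "pending"}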
Template parameters
Parameter | Description |
---|---|
inputFilePattern |
The input file pattern to read from. For example, gs://bucket-name/files/*.json . |
outputTopic |
The Pub/Sub topic to write to. The name must be in the format of
projects/<project-id>/topics/<topic-name> . |
Running the Cloud Storage Text to Pub/Sub (Batch) template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to Pub/Sub (Batch) template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/files/*.json,\
outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- TOPIC_NAME: your Pub/Sub topic name
- BUCKET_NAME: the name of your Cloud Storage bucket
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub
{
  "jobName": "JOB_NAME",
  "parameters": {
    "inputFilePattern": "gs://BUCKET_NAME/files/*.json",
    "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- TOPIC_NAME: your Pub/Sub topic name
- BUCKET_NAME: the name of your Cloud Storage bucket
Cloud Storage Text to Cloud Spanner
The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Cloud Spanner database.
Requirements for this pipeline:
- The target Cloud Spanner database and table must exist.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
- The input Cloud Storage path containing the CSV files must exist.
- You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
- If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
- The manifest file must be encoded in ASCII or UTF-8 and must describe the CSV files to import and their target table columns (see the sketch after this list).
- Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8-encoded files.
- Data must match one of the following types:
INT64
FLOAT64
BOOL
STRING
DATE
TIMESTAMP
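The following is a minimal sketch of what an import manifest might look like, using a hypothetical table and columns. The field names shown here (tables, table_name, file_patterns, columns, column_name, type_name) are based on the Cloud Spanner import manifest format as commonly documented; verify them against the current Cloud Spanner documentation before use.
{
  "tables": [
    {
      "table_name": "Singers",
      "file_patterns": ["gs://my-bucket/csv/singers*.csv"],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}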
Template parameters
Parameter | Description |
---|---|
instanceId |
The instance ID of the Cloud Spanner database. |
databaseId |
The database ID of the Cloud Spanner database. |
importManifest |
The path in Cloud Storage to the import manifest file. |
columnDelimiter |
The column delimiter that the source file uses. The default value is , . |
fieldQualifier |
The character that must surround any value in the source file that contains the
columnDelimiter . The default value is " .
|
trailingDelimiter |
Specifies whether the lines in the source files have trailing delimiters (that is, if the
columnDelimiter character appears at the end of each line, after the last column
value). The default value is true . |
escape |
The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character. |
nullString |
The string that represents a NULL value. By default, this parameter is not set
and the template does not use the null string. |
dateFormat |
The format used to parse date columns. By default, the pipeline tries to parse the date
columns as yyyy-M-d[' 00:00:00'] , for example, as 2019-01-31 or 2019-1-1 00:00:00.
If your date format is different, specify the format using the
java.time.format.DateTimeFormatter
patterns. |
timestampFormat |
The format used to parse timestamp columns. If the timestamp is a long integer, then it is
parsed as UNIX epoch time. Otherwise, it is parsed as a string using the
java.time.format.DateTimeFormatter.ISO_INSTANT
format. For other cases, specify your own pattern string, for example,
using MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form of
"Jan 21 1998 01:02:03.456+08:00". |
If you need to use customized date or timestamp formats, make sure they're valid
java.time.format.DateTimeFormatter
patterns. The following table shows additional examples of customized formats for date and
timestamp columns:
Type | Input value | Format | Remark |
---|---|---|---|
DATE |
2011-3-31 | By default, the template can parse this format.
You don't need to specify the dateFormat parameter. |
|
DATE |
2011-3-31 00:00:00 | By default, the template can parse this format.
You don't need to specify the format. If you like, you can use
yyyy-M-d' 00:00:00' . |
|
DATE |
01 Apr, 18 | dd MMM, yy | |
DATE |
Wednesday, April 3, 2019 AD | EEEE, LLLL d, yyyy G | |
TIMESTAMP |
2019-01-02T11:22:33Z 2019-01-02T11:22:33.123Z 2019-01-02T11:22:33.12356789Z |
The default format ISO_INSTANT can parse this type of timestamp.
You don't need to provide the timestampFormat parameter. |
|
TIMESTAMP |
1568402363 | By default, the template can parse this type of timestamp and treat it as UNIX epoch time. | |
TIMESTAMP |
Tue, 3 Jun 2008 11:05:30 GMT | EEE, d MMM yyyy HH:mm:ss VV | |
TIMESTAMP |
2018/12/31 110530.123PST | yyyy/MM/dd HHmmss.SSSz | |
TIMESTAMP |
2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z | yyyy-MM-dd'T'HH:mm:ss[.SSS]VV | If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the
default format can parse this type of timestamp. You don't need to provide your own format
parameter.
However, you can use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both
cases. Note that you cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z' , because the postfix 'Z' must
be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is
converted to a
java.time.Instant .
Therefore, it must be specified in UTC or have time zone information associated with it.
Local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid java.time.Instant .
|
Running the Cloud Storage Text to Cloud Spanner template
Console
Run in the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to Cloud Spanner template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run with the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
gcloud dataflow jobs run JOB_NAME \
    --gcs-location='gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner' \
    --region=DATAFLOW_REGION \
    --parameters='instanceId=INSTANCE_ID,databaseId=DATABASE_ID,importManifest=GCS_PATH_TO_IMPORT_MANIFEST'
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
- JOB_NAME: a job name of your choice
API
Run with the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/DATAFLOW_REGION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
{
  "jobName": "JOB_NAME",
  "parameters": {
    "instanceId": "INSTANCE_ID",
    "databaseId": "DATABASE_ID",
    "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
  },
  "environment": { "machineType": "n1-standard-2" }
}
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:
- PROJECT_ID: your project ID
- DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
- INSTANCE_ID: your Cloud Spanner instance ID
- DATABASE_ID: your Cloud Spanner database ID
- GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
- JOB_NAME: a job name of your choice
Java Database Connectivity (JDBC) to BigQuery
The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. This pipeline uses JDBC to connect to the relational database. You can use this template to copy data from any relational database with available JDBC drivers into BigQuery. For an extra layer of protection, you can also pass in a Cloud KMS key, along with Base64-encoded username, password, and connection string parameters encrypted with that key. See the Cloud KMS API encryption endpoint for additional details on encrypting your username, password, and connection string parameters.
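As a rough sketch of that flow (the key path and plaintext value below are hypothetical), you Base64-encode each value and send it to your key's encrypt endpoint, then pass the returned ciphertext to the template along with the KMSEncryptionKey parameter. Check the Cloud KMS API reference for the authoritative request and response fields.
POST https://cloudkms.googleapis.com/v1/projects/PROJECT_ID/locations/global/keyRings/KEY_RING/cryptoKeys/KEY_NAME:encrypt
{
  "plaintext": "BASE64_ENCODED_CONNECTION_URL"
}
The response contains a ciphertext field; that Base64-encoded value is what you supply as the encrypted connectionURL, username, or password parameter.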
Requirements for this pipeline:
- The JDBC drivers for the relational database must be available.
- The BigQuery table must exist prior to pipeline execution.
- The BigQuery table must have a compatible schema.
- The relational database must be accessible from the subnet where Dataflow runs.
Template parameters
Parameter | Description |
---|---|
driverJars |
The comma-separated list of driver JAR files. For example, gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar . |
driverClassName |
The JDBC driver class name. For example, com.mysql.jdbc.Driver . |
connectionURL |
The JDBC connection URL string. For example, jdbc:mysql://some-host:3306/sampledb . Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
query |
The query to be run on the source to extract the data. For example, select * from sampledb.sample_table . |
outputTable |
The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table> . |
bigQueryLoadingTemporaryDirectory |
The temporary directory for the BigQuery loading process.
For example, gs://<my-bucket>/my-files/temp_dir . |
connectionProperties |
(Optional) Properties string to use for the JDBC connection. For example, unicode=true&characterEncoding=UTF-8 . |
username |
(Optional) The username to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
password |
(Optional) The password to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
KMSEncryptionKey |
(Optional) Cloud KMS Encryption Key to decrypt the username, password, and connection string. If Cloud KMS key is passed in, the username, password and connection string must all be passed in encrypted. |
Running the JDBC to BigQuery template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the JDBC to BigQuery template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Jdbc_to_BigQuery \
    --parameters \
driverJars=DRIVER_PATHS,\
driverClassName=DRIVER_CLASS_NAME,\
connectionURL=JDBC_CONNECTION_URL,\
query=SOURCE_SQL_QUERY,\
outputTable=PROJECT_ID:DATASET.TABLE_NAME,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\
connectionProperties=CONNECTION_PROPERTIES,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- DRIVER_PATHS: the comma-separated Cloud Storage path(s) of the JDBC driver(s)
- DRIVER_CLASS_NAME: the driver class name
- JDBC_CONNECTION_URL: the JDBC connection URL
- SOURCE_SQL_QUERY: the SQL query to be run on the source database
- DATASET: your BigQuery dataset
- TABLE_NAME: your BigQuery table name
- PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
- CONNECTION_PROPERTIES: the JDBC connection properties, if required
- CONNECTION_USERNAME: the JDBC connection username
- CONNECTION_PASSWORD: the JDBC connection password
- KMS_ENCRYPTION_KEY: the Cloud KMS encryption key
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Jdbc_to_BigQuery
{
  "jobName": "JOB_NAME",
  "parameters": {
    "driverJars": "DRIVER_PATHS",
    "driverClassName": "DRIVER_CLASS_NAME",
    "connectionURL": "JDBC_CONNECTION_URL",
    "query": "SOURCE_SQL_QUERY",
    "outputTable": "PROJECT_ID:DATASET.TABLE_NAME",
    "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS",
    "connectionProperties": "CONNECTION_PROPERTIES",
    "username": "CONNECTION_USERNAME",
    "password": "CONNECTION_PASSWORD",
    "KMSEncryptionKey": "KMS_ENCRYPTION_KEY"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- DRIVER_PATHS: the comma-separated Cloud Storage path(s) of the JDBC driver(s)
- DRIVER_CLASS_NAME: the driver class name
- JDBC_CONNECTION_URL: the JDBC connection URL
- SOURCE_SQL_QUERY: the SQL query to be run on the source database
- DATASET: your BigQuery dataset
- TABLE_NAME: your BigQuery table name
- PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
- CONNECTION_PROPERTIES: the JDBC connection properties, if required
- CONNECTION_USERNAME: the JDBC connection username
- CONNECTION_PASSWORD: the JDBC connection password
- KMS_ENCRYPTION_KEY: the Cloud KMS encryption key
Apache Cassandra to Cloud Bigtable
The Apache Cassandra to Cloud Bigtable template copies a table from Apache Cassandra to Cloud Bigtable. This template requires minimal configuration and replicates the table structure in Cassandra as closely as possible in Cloud Bigtable.
The Apache Cassandra to Cloud Bigtable template is useful for the following:
- Migrating an Apache Cassandra database when short downtime is acceptable.
- Periodically replicating Cassandra tables to Cloud Bigtable for global serving.
Requirements for this pipeline:
- The target Cloud Bigtable table must exist prior to running the pipeline.
- A network connection must exist between the Dataflow workers and the Apache Cassandra nodes.
Type conversion
The Apache Cassandra to Cloud Bigtable template automatically converts Apache Cassandra data types to Cloud Bigtable's data types.
Most primitives are represented the same way in Cloud Bigtable and Apache Cassandra; however, the following primitives are represented differently:
- Date and Timestamp are converted to DateTime objects
- UUID is converted to String
- Varint is converted to BigDecimal
Apache Cassandra also natively supports more complex types, such as Tuple, List, Set, and Map. Tuples are not supported by this pipeline because there is no corresponding type in Apache Beam.
For example, in Apache Cassandra you can have a column of type List called "mylist" with values like those in the following table:
row | mylist |
---|---|
1 | (a,b,c) |
The pipeline expands the list column into three different columns (known in Cloud Bigtable as column qualifiers). Each column keeps the name "mylist", with the index of the item in the list appended, such as "mylist[0]".
row | mylist[0] | mylist[1] | mylist[2] |
---|---|---|---|
1 | a | b | c |
The pipeline handles sets and maps the same way as lists, but for maps it adds an additional suffix to denote whether the cell holds a key or a value. For example, given the following map column:
row | mymap |
---|---|
1 | {"first_key":"first_value","another_key":"different_value"} |
After the transformation, the table appears as follows:
row | mymap[0].key | mymap[0].value | mymap[1].key | mymap[1].value |
---|---|---|---|---|
1 | first_key | first_value | another_key | different_value |
Primary key conversion
In Apache Cassandra, a primary key is defined using the data definition language. The primary key is either simple, composite, or compound with clustering columns. Cloud Bigtable supports manual row-key construction, ordered lexicographically on a byte array. The pipeline automatically collects information about the type of key and constructs a row-key based on best practices for building row-keys from multiple values. For example, a compound primary key such as (user_id, event_time) might be joined into a row-key like user_id#event_time, using the rowKeySeparator parameter (default #) to separate the values.
Template parameters
Parameter | Description |
---|---|
cassandraHosts |
The hosts of the Apache Cassandra nodes in a comma-separated list. |
cassandraPort |
(Optional) The TCP port to reach Apache Cassandra on the nodes (defaults to 9042 ). |
cassandraKeyspace |
The Apache Cassandra keyspace where the table is located. |
cassandraTable |
The Apache Cassandra table to be copied. |
bigtableProjectId |
The Google Cloud Project ID of the Cloud Bigtable instance where the Apache Cassandra table is copied. |
bigtableInstanceId |
The Cloud Bigtable instance ID in which to copy the Apache Cassandra table. |
bigtableTableId |
The name of the Cloud Bigtable table in which to copy the Apache Cassandra table. |
defaultColumnFamily |
(Optional) The name of the Cloud Bigtable table's column family (defaults to default ). |
rowKeySeparator |
(Optional) The separator used to build row-key (defaults to # ). |
Running the Apache Cassandra to Cloud Bigtable template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Apache Cassandra to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cassandra_To_Cloud_Bigtable \
    --parameters \
bigtableProjectId=PROJECT_ID,\
bigtableInstanceId=BIGTABLE_INSTANCE_ID,\
bigtableTableId=BIGTABLE_TABLE_ID,\
cassandraHosts=CASSANDRA_HOSTS,\
cassandraKeyspace=CASSANDRA_KEYSPACE,\
cassandraTable=CASSANDRA_TABLE
Replace the following:
- PROJECT_ID: your project ID where Cloud Bigtable is located
- JOB_NAME: a job name of your choice
- BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
- BIGTABLE_TABLE_ID: the name of your Cloud Bigtable table
- CASSANDRA_HOSTS: the Apache Cassandra host list; if multiple hosts are provided, follow the instructions on how to escape commas
- CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
- CASSANDRA_TABLE: the Apache Cassandra table that needs to be migrated
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cassandra_To_Cloud_Bigtable
{
  "jobName": "JOB_NAME",
  "parameters": {
    "bigtableProjectId": "PROJECT_ID",
    "bigtableInstanceId": "BIGTABLE_INSTANCE_ID",
    "bigtableTableId": "BIGTABLE_TABLE_ID",
    "cassandraHosts": "CASSANDRA_HOSTS",
    "cassandraKeyspace": "CASSANDRA_KEYSPACE",
    "cassandraTable": "CASSANDRA_TABLE"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: your project ID where Cloud Bigtable is located
- JOB_NAME: a job name of your choice
- BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
- BIGTABLE_TABLE_ID: the name of your Cloud Bigtable table
- CASSANDRA_HOSTS: the Apache Cassandra host list; if multiple hosts are provided, follow the instructions on how to escape commas
- CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
- CASSANDRA_TABLE: the Apache Cassandra table that needs to be migrated
Apache Hive to BigQuery
The Apache Hive to BigQuery template is a batch pipeline that reads from an Apache Hive table and writes it to a BigQuery table.
Requirements for this pipeline:
- The target BigQuery table must exist prior to running the pipeline.
- A network connection must exist between the Dataflow workers and the Apache Hive nodes.
- A network connection must exist between the Dataflow workers and the Apache Thrift server node.
- The BigQuery dataset must exist prior to pipeline execution.
Template parameters
Parameter | Description |
---|---|
metastoreUri |
The Apache Thrift server URI such as thrift://thrift-server-host:port . |
hiveDatabaseName |
The Apache Hive database name that contains the table you want to export. |
hiveTableName |
The Apache Hive table name that you want to export. |
outputTableSpec |
The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table> |
hivePartitionCols |
(Optional) The comma-separated list of the Apache Hive partition columns. |
filterString |
(Optional) The filter string for the input Apache Hive table. |
partitionType |
(Optional) The partition type in BigQuery. Currently, only Time is supported. |
partitionCol |
(Optional) The partition column name in the output BigQuery table. |
Running the Apache Hive to BigQuery template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Apache Hive to BigQuery template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Hive_To_BigQuery
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Hive_To_BigQuery \
    --parameters \
metastoreUri=METASTORE_URI,\
hiveDatabaseName=HIVE_DATABASE_NAME,\
hiveTableName=HIVE_TABLE_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
hivePartitionCols=HIVE_PARTITION_COLS,\
filterString=FILTER_STRING,\
partitionType=PARTITION_TYPE,\
partitionCol=PARTITION_COL
Replace the following:
- PROJECT_ID: your project ID where BigQuery is located
- JOB_NAME: the job name of your choice
- DATASET: your BigQuery dataset
- TABLE_NAME: your BigQuery table name
- METASTORE_URI: the Apache Thrift server URI
- HIVE_DATABASE_NAME: the Apache Hive database name that contains the table you want to export
- HIVE_TABLE_NAME: the Apache Hive table name that you want to export
- HIVE_PARTITION_COLS: the comma-separated list of your Apache Hive partition columns
- FILTER_STRING: the filter string for the Apache Hive input table
- PARTITION_TYPE: the partition type in BigQuery
- PARTITION_COL: the name of the BigQuery partition column
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Hive_To_BigQuery
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Hive_To_BigQuery
{
  "jobName": "JOB_NAME",
  "parameters": {
    "metastoreUri": "METASTORE_URI",
    "hiveDatabaseName": "HIVE_DATABASE_NAME",
    "hiveTableName": "HIVE_TABLE_NAME",
    "outputTableSpec": "PROJECT_ID:DATASET.TABLE_NAME",
    "hivePartitionCols": "HIVE_PARTITION_COLS",
    "filterString": "FILTER_STRING",
    "partitionType": "PARTITION_TYPE",
    "partitionCol": "PARTITION_COL"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: your project ID where BigQuery is located
- JOB_NAME: the job name of your choice
- DATASET: your BigQuery dataset
- TABLE_NAME: your BigQuery table name
- METASTORE_URI: the Apache Thrift server URI
- HIVE_DATABASE_NAME: the Apache Hive database name that contains the table you want to export
- HIVE_TABLE_NAME: the Apache Hive table name that you want to export
- HIVE_PARTITION_COLS: the comma-separated list of your Apache Hive partition columns
- FILTER_STRING: the filter string for the Apache Hive input table
- PARTITION_TYPE: the partition type in BigQuery
- PARTITION_COL: the name of the BigQuery partition column
File Format Conversion (Avro, Parquet, CSV)
The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.
The following format conversions are supported:
- CSV to Avro.
- CSV to Parquet.
- Avro to Parquet.
- Parquet to Avro.
Requirements for this pipeline:
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
inputFileFormat |
The input file format. Must be one of [csv, avro, parquet] . |
outputFileFormat |
The output file format. Must be one of [avro, parquet] . |
inputFileSpec |
The Cloud Storage path pattern for input files. For example, gs://bucket-name/path/*.csv |
outputBucket |
The Cloud Storage folder to write output files. This path must end with a slash.
For example, gs://bucket-name/output/ |
schema |
The Cloud Storage path to the Avro schema file. For example, gs://bucket-name/schema/my-schema.avsc |
containsHeaders |
(Optional) Whether the input CSV files contain a header record (true/false). The default value is false . Only required when reading CSV files. |
csvFormat |
(Optional) The CSV format specification to use for parsing records. The default value is Default .
See Apache Commons CSV Format
for more details. |
delimiter |
(Optional) The field delimiter used by the input CSV files. |
outputFilePrefix |
(Optional) The output file prefix. The default value is output . |
numShards |
(Optional) The number of output file shards. |
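The schema parameter points to a standard Avro schema (.avsc) file, which is a JSON document. A minimal sketch with a hypothetical record and fields might look like the following:
{
  "type": "record",
  "name": "Purchase",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "item", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}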
Running the File Format Conversion template
Console
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the File Format Conversion template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run Flex templates, you must have Cloud SDK version 284.0.0 or higher.
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/File_Format_Conversion
gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/File_Format_Conversion \
    --parameters \
inputFileFormat=INPUT_FORMAT,\
outputFileFormat=OUTPUT_FORMAT,\
inputFileSpec=INPUT_FILES,\
schema=SCHEMA,\
outputBucket=OUTPUT_FOLDER
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- INPUT_FORMAT: the file format of the input file; must be one of [csv, avro, parquet]
- OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
- INPUT_FILES: the path pattern for input files
- OUTPUT_FOLDER: your Cloud Storage folder for output files
- SCHEMA: the path to the Avro schema file
- LOCATION: the execution region, for example, us-central1
API
Run from the REST API
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/File_Format_Conversion
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "inputFileFormat": "INPUT_FORMAT",
      "outputFileFormat": "OUTPUT_FORMAT",
      "inputFileSpec": "INPUT_FILES",
      "schema": "SCHEMA",
      "outputBucket": "OUTPUT_FOLDER"
    },
    "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/File_Format_Conversion"
  }
}
Replace the following:
- PROJECT_ID: your project ID
- JOB_NAME: a job name of your choice
- INPUT_FORMAT: the file format of the input file; must be one of [csv, avro, parquet]
- OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
- INPUT_FILES: the path pattern for input files
- OUTPUT_FOLDER: your Cloud Storage folder for output files
- SCHEMA: the path to the Avro schema file
- LOCATION: the execution region, for example, us-central1