Google provides a set of open-source Dataflow templates. For general information about templates, see the Overview page. For a list of all Google-provided templates, see the Get started with Google-provided templates page.
This page documents batch templates:
- BigQuery to Cloud Storage TFRecords
- BigQuery to Cloud Storage Parquet (Storage API)
- Cloud Bigtable to Cloud Storage Avro
- Cloud Bigtable to Cloud Storage Parquet
- Cloud Bigtable to Cloud Storage SequenceFile
- Datastore to Cloud Storage Text
- Cloud Spanner to Cloud Storage Avro
- Cloud Spanner to Cloud Storage Text
- Cloud Storage Avro to Cloud Bigtable
- Cloud Storage Avro to Cloud Spanner
- Cloud Storage Parquet to Cloud Bigtable
- Cloud Storage SequenceFile to Cloud Bigtable
- Cloud Storage Text to BigQuery
- Cloud Storage Text to Datastore
- Cloud Storage Text to Pub/Sub (Batch)
- Cloud Storage Text to Cloud Spanner
- Java Database Connectivity (JDBC) to BigQuery
- Apache Cassandra to Cloud Bigtable
- Apache Hive to BigQuery
- File Format Conversion
BigQuery to Cloud Storage TFRecords
The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the training, testing, and validation percentage splits. By default, the split is 1 (100%) for the training set and 0 (0%) for the testing and validation sets. When you set the dataset split, the training, testing, and validation percentages must add up to 1 (100%), for example 0.6 + 0.2 + 0.2. Dataflow automatically determines the optimal number of shards for each output dataset.
Requirements for this pipeline:
- The BigQuery dataset and table must exist.
- The output Cloud Storage bucket must exist prior to pipeline execution. The training, testing, and validation subdirectories do not need to exist beforehand; they are generated automatically.
Template parameters
Parameter | Description |
---|---|
readQuery | A BigQuery SQL query that extracts data from the source. For example, select * from dataset1.sample_table. |
outputDirectory | The top-level Cloud Storage path prefix at which to write the training, testing, and validation TFRecord files. For example, gs://mybucket/output. Subdirectories for the resulting training, testing, and validation TFRecord files are generated automatically from outputDirectory, for example gs://mybucket/output/train. |
trainingPercentage | [Optional] The percentage of query data allocated to training TFRecord files. The default value is 1, or 100%. |
testingPercentage | [Optional] The percentage of query data allocated to testing TFRecord files. The default value is 0, or 0%. |
validationPercentage | [Optional] The percentage of query data allocated to validation TFRecord files. The default value is 0, or 0%. |
outputSuffix | [Optional] The file suffix for the training, testing, and validation TFRecord files that are written. The default value is .tfrecord. |
Executing the BigQuery to Cloud Storage TFRecord files template
CONSOLE
Execute from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the BigQuery to Cloud Storage TFRecords template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Execute from the gcloud command-line tool
Note: To run templates with the gcloud command-line tool, you must have Cloud SDK version 138.0.0 or later.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [READ_QUERY] with the BigQuery query to be executed.
- Replace [OUTPUT_DIRECTORY] with the Cloud Storage path prefix for the output datasets.
- Replace [TRAINING_PERCENTAGE] with the decimal percentage split for the training dataset.
- Replace [TESTING_PERCENTAGE] with the decimal percentage split for the testing dataset.
- Replace [VALIDATION_PERCENTAGE] with the decimal percentage split for the validation dataset.
- Replace [OUTPUT_SUFFIX] with the preferred output TensorFlow Record file suffix.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --parameters readQuery=[READ_QUERY],outputDirectory=[OUTPUT_DIRECTORY],trainingPercentage=[TRAINING_PERCENTAGE],testingPercentage=[TESTING_PERCENTAGE],validationPercentage=[VALIDATION_PERCENTAGE],outputSuffix=[OUTPUT_SUFFIX]
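As a purely illustrative sketch, a run that splits the query results 70/20/10 across training, testing, and validation might look like the following. The job, dataset, and bucket names (my_dataset, gs://my-bucket) are hypothetical placeholders, not values from this guide; note that the three percentages sum to 1.
# Hypothetical values; the three split percentages add up to 1 (100%).
gcloud dataflow jobs run tfrecords-export-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --parameters \
readQuery="select * from my_dataset.sample_table",\
outputDirectory=gs://my-bucket/tfrecords/output,\
trainingPercentage=0.7,\
testingPercentage=0.2,\
validationPercentage=0.1,\
outputSuffix=.tfrecord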
API
Execute from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
To execute the template with the REST API, send an HTTP POST request with your project ID. This request requires authorization.
To make the HTTP request, replace the following values:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [READ_QUERY] with the BigQuery query to be executed.
- Replace [OUTPUT_DIRECTORY] with the Cloud Storage path prefix for the output datasets.
- Replace [TRAINING_PERCENTAGE] with the decimal percentage split for the training dataset.
- Replace [TESTING_PERCENTAGE] with the decimal percentage split for the testing dataset.
- Replace [VALIDATION_PERCENTAGE] with the decimal percentage split for the validation dataset.
- Replace [OUTPUT_SUFFIX] with the preferred output TensorFlow Record file suffix.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "readQuery": "[READ_QUERY]",
        "outputDirectory": "[OUTPUT_DIRECTORY]",
        "trainingPercentage": "[TRAINING_PERCENTAGE]",
        "testingPercentage": "[TESTING_PERCENTAGE]",
        "validationPercentage": "[VALIDATION_PERCENTAGE]",
        "outputSuffix": "[OUTPUT_SUFFIX]"
    },
    "environment": { "zone": "us-central1-f" }
}
BigQuery to Cloud Storage Parquet (Storage API)
The BigQuery to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template utilizes the BigQuery Storage API to export the data.
Requirements for this pipeline:
- The input BigQuery table must exist prior to running the pipeline.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
tableRef | The BigQuery input table location. For example, <my-project>:<my-dataset>.<my-table>. |
bucket | The Cloud Storage folder in which to write the Parquet files. For example, gs://mybucket/exports. |
numShards | [Optional] The number of output file shards. The default value is 1. |
fields | [Optional] A comma-separated list of fields to select from the input BigQuery table. |
Running the BigQuery to Cloud Storage Parquet template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the BigQuery to Parquet template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run Flex templates, you must have Cloud SDK version 284.0.0 or higher.
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/BigQuery_To_Parquet
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace BIGQUERY_TABLE with your BigQuery table name.
- Replace OUTPUT_DIRECTORY with your Cloud Storage folder for output files.
- Replace NUM_SHARDS with the desired number of output file shards.
- Replace FIELDS with the comma-separated list of fields to select from the input BigQuery table.
- Replace LOCATION with the execution region. For example, us-central1.
gcloud beta dataflow flex-template run JOB_NAME \
    --project=YOUR_PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_To_Parquet \
    --parameters \
tableRef=BIGQUERY_TABLE,\
bucket=OUTPUT_DIRECTORY,\
numShards=NUM_SHARDS,\
fields=FIELDS
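For illustration only, a launch with hypothetical values might look like the following (my-project, my_dataset.my_table, and gs://my-bucket/exports are placeholders, not values from this guide). A single field is selected here so that the fields value contains no commas, since commas also separate the entries of --parameters.
# Hypothetical values for an example export of one column to five Parquet shards.
gcloud beta dataflow flex-template run parquet-export-example \
    --project=my-project \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_To_Parquet \
    --parameters \
tableRef=my-project:my_dataset.my_table,\
bucket=gs://my-bucket/exports,\
numShards=5,\
fields=name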
API
Run from the REST API
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/BigQuery_To_Parquet
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace BIGQUERY_TABLE with your BigQuery table name.
- Replace OUTPUT_DIRECTORY with your Cloud Storage folder for output files.
- Replace NUM_SHARDS with the desired number of output file shards.
- Replace FIELDS with the comma-separated list of fields to select from the input BigQuery table.
- Replace LOCATION with the execution region. For example, us-central1.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
    "launch_parameter": {
        "jobName": "JOB_NAME",
        "parameters": {
            "tableRef": "BIGQUERY_TABLE",
            "bucket": "OUTPUT_DIRECTORY",
            "numShards": "NUM_SHARDS",
            "fields": "FIELDS"
        },
        "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/BigQuery_To_Parquet"
    }
}
Cloud Bigtable to Cloud Storage Avro
The Cloud Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Cloud Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Cloud Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to export. |
outputDirectory | The Cloud Storage path where data should be written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the Avro file name. For example, output-. |
Running the Cloud Bigtable to Cloud Storage Avro file template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Bigtable to Avro template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
- Replace [OUTPUT_DIRECTORY] with the Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
- Replace [FILENAME_PREFIX] with the prefix of the Avro file name. For example, output-.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro \
    --parameters bigtableProjectId=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],outputDirectory=[OUTPUT_DIRECTORY],filenamePrefix=[FILENAME_PREFIX]
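A filled-in version of the same command, using hypothetical project, instance, table, and bucket names (not values from this guide), might look like this:
# Hypothetical values; adjust to your own Cloud Bigtable instance and bucket.
gcloud dataflow jobs run bigtable-avro-export-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro \
    --parameters bigtableProjectId=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-table,outputDirectory=gs://my-bucket/bigtable-export,filenamePrefix=output-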
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
To run this template with a REST API request, send an HTTP POST request with your project ID, as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
- Replace [OUTPUT_DIRECTORY] with the Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
- Replace [FILENAME_PREFIX] with the prefix of the Avro file name. For example, output-.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "bigtableProjectId": "[PROJECT_ID]",
        "bigtableInstanceId": "[INSTANCE_ID]",
        "bigtableTableId": "[TABLE_ID]",
        "outputDirectory": "[OUTPUT_DIRECTORY]",
        "filenamePrefix": "[FILENAME_PREFIX]"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Bigtable to Cloud Storage Parquet
The Cloud Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Cloud Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Cloud Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to export. |
outputDirectory | The Cloud Storage path where data should be written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the Parquet file name. For example, output-. |
numShards | The number of Parquet files to output. For example, 2. |
Running the Cloud Bigtable to Cloud Storage Parquet file template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Bigtable to Parquet template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
- Replace [OUTPUT_DIRECTORY] with the Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
- Replace [FILENAME_PREFIX] with the prefix of the Parquet file name. For example, output-.
- Replace [NUM_SHARDS] with the number of Parquet files to output. For example, 1.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Parquet \
    --parameters bigtableProjectId=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],outputDirectory=[OUTPUT_DIRECTORY],filenamePrefix=[FILENAME_PREFIX],numShards=[NUM_SHARDS]
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
To run this template with a REST API request, send an HTTP POST request with your project ID, as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
- Replace [OUTPUT_DIRECTORY] with the Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
- Replace [FILENAME_PREFIX] with the prefix of the Parquet file name. For example, output-.
- Replace [NUM_SHARDS] with the number of Parquet files to output. For example, 1.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Parquet
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "bigtableProjectId": "[PROJECT_ID]",
        "bigtableInstanceId": "[INSTANCE_ID]",
        "bigtableTableId": "[TABLE_ID]",
        "outputDirectory": "[OUTPUT_DIRECTORY]",
        "filenamePrefix": "[FILENAME_PREFIX]",
        "numShards": "[NUM_SHARDS]"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Bigtable to Cloud Storage SequenceFile
The Cloud Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Cloud Bigtable table and writes the data to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Cloud Bigtable to Cloud Storage.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
bigtableProject | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to export. |
bigtableAppProfileId | The ID of the Cloud Bigtable application profile to be used for the export. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile. |
destinationPath | The Cloud Storage path where data should be written. For example, gs://mybucket/somefolder. |
filenamePrefix | The prefix of the SequenceFile file name. For example, output-. |
Running the Cloud Bigtable to Cloud Storage SequenceFile template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Bigtable to SequenceFile template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
- Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the export.
- Replace [DESTINATION_PATH] with the Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
- Replace [FILENAME_PREFIX] with the prefix of the SequenceFile file name. For example, output-.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],bigtableAppProfileId=[APPLICATION_PROFILE_ID],destinationPath=[DESTINATION_PATH],filenamePrefix=[FILENAME_PREFIX]
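Because bigtableAppProfileId is described above as falling back to the instance's default app profile when it isn't specified, a run can presumably leave that parameter out. The sketch below does exactly that and uses hypothetical project, instance, table, and bucket names; treat both the names and the omission as illustrative assumptions.
# Hypothetical values; bigtableAppProfileId is omitted here on the assumption
# that the instance's default app profile should be used (see the parameter table above).
gcloud dataflow jobs run bigtable-sequencefile-export-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \
    --parameters bigtableProject=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-table,destinationPath=gs://my-bucket/sequencefile-export,filenamePrefix=output-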
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
To run this template with a REST API request, send an HTTP POST request with your project ID, as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to read data from.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to export.
- Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the export.
- Replace [DESTINATION_PATH] with the Cloud Storage path where data should be written. For example, gs://mybucket/somefolder.
- Replace [FILENAME_PREFIX] with the prefix of the SequenceFile file name. For example, output-.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "bigtableProject": "[PROJECT_ID]",
        "bigtableInstanceId": "[INSTANCE_ID]",
        "bigtableTableId": "[TABLE_ID]",
        "bigtableAppProfileId": "[APPLICATION_PROFILE_ID]",
        "destinationPath": "[DESTINATION_PATH]",
        "filenamePrefix": "[FILENAME_PREFIX]"
    },
    "environment": { "zone": "us-central1-f" }
}
Datastore to Cloud Storage Text
The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.
Requirements for this pipeline:
- Datastore must be set up in the project prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
datastoreReadGqlQuery | A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind. |
datastoreReadProjectId | The Google Cloud project ID of the Datastore instance that you want to read data from. |
datastoreReadNamespace | The namespace of the requested entities. To use the default namespace, leave this parameter blank. |
javascriptTextTransformGcsPath | A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank. |
javascriptTextTransformFunctionName | The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff... }, then the function name is myTransform. If you don't want to provide a function, leave this parameter blank. |
textWritePrefix | The Cloud Storage path prefix that specifies where the data should be written. For example, gs://mybucket/somefolder/. |
Running the Datastore to Cloud Storage Text template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Datastore to Cloud Storage Text template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
- Replace YOUR_DATASTORE_KIND with the kind of your Datastore entities.
- Replace YOUR_DATASTORE_NAMESPACE with the namespace of your Datastore entities.
- Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM YOUR_DATASTORE_KIND",\
datastoreReadProjectId=YOUR_PROJECT_ID,\
datastoreReadNamespace=YOUR_DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://YOUR_BUCKET_NAME/output/
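If no transform is needed, the parameter table above suggests the UDF parameters can be left unset. A sketch of such a run with hypothetical values (my-project, a Task kind, gs://my-bucket), where the optional namespace and UDF parameters are simply omitted (an assumption, since the table only says to leave them blank), could look like this:
# Hypothetical values; optional namespace and UDF parameters are omitted,
# so each output line is the JSON-serialized entity.
gcloud dataflow jobs run datastore-text-export-example \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM Task",\
datastoreReadProjectId=my-project,\
textWritePrefix=gs://my-bucket/datastore-export/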
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
- Replace YOUR_DATASTORE_KIND with the kind of your Datastore entities.
- Replace YOUR_DATASTORE_NAMESPACE with the namespace of your Datastore entities.
- Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text
{
    "jobName": "JOB_NAME",
    "parameters": {
        "datastoreReadGqlQuery": "SELECT * FROM YOUR_DATASTORE_KIND",
        "datastoreReadProjectId": "YOUR_PROJECT_ID",
        "datastoreReadNamespace": "YOUR_DATASTORE_NAMESPACE",
        "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
        "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION",
        "textWritePrefix": "gs://YOUR_BUCKET_NAME/output/"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Spanner to Cloud Storage Avro
The Cloud Spanner to Cloud Storage Avro template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:
- A spanner-export.json file.
- A TableName-manifest.json file for each table in the database you exported.
- One or more TableName.avro-#####-of-##### files.
For example, exporting a database with two tables, Singers and Albums, creates the following file set:
Albums-manifest.json
Albums.avro-00000-of-00002
Albums.avro-00001-of-00002
Singers-manifest.json
Singers.avro-00000-of-00003
Singers.avro-00001-of-00003
Singers.avro-00002-of-00003
spanner-export.json
Requirements for this pipeline:
- The Cloud Spanner database must exist.
- The output Cloud Storage bucket must exist.
- In addition to the IAM roles necessary to run Dataflow jobs, you must also have the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.
Template parameters
Parameter | Description |
---|---|
instanceId | The instance ID of the Cloud Spanner database that you want to export. |
databaseId | The database ID of the Cloud Spanner database that you want to export. |
outputDir | The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files. |
Running the template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Spanner to Cloud Storage Avro template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
  - The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the Cloud Console.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [DATAFLOW_REGION] with the region where you want the Dataflow job to run (such as us-central1).
- Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
- Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
- Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be exported to.
- Replace [JOB_NAME] with a job name of your choice.
  - The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the Cloud Console.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/Cloud_Spanner_to_GCS_Avro' \
    --region=[DATAFLOW_REGION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],outputDir=[YOUR_GCS_DIRECTORY]'
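As a concrete sketch with hypothetical names (test-instance, example-db, and gs://my-bucket are placeholders), note how the job name below follows the cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] convention so that the export shows up in the Cloud Spanner portion of the Cloud Console:
# Hypothetical values; the job name encodes the instance and database IDs.
gcloud dataflow jobs run cloud-spanner-export-test-instance-example-db \
    --gcs-location='gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro' \
    --region=us-central1 \
    --parameters='instanceId=test-instance,databaseId=example-db,outputDir=gs://my-bucket/spanner-export'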
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [DATAFLOW_REGION] with the region where you want the Dataflow job to run (such as us-central1).
- Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
- Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
- Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path that the Avro files should be exported to.
- Replace [JOB_NAME] with a job name of your choice.
  - The job name must match the format cloud-spanner-export-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the Cloud Console.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/Cloud_Spanner_to_GCS_Avro
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "instanceId": "[YOUR_INSTANCE_ID]",
        "databaseId": "[YOUR_DATABASE_ID]",
        "outputDir": "gs://[YOUR_GCS_DIRECTORY]"
    }
}
Cloud Spanner to Cloud Storage Text
The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Cloud Spanner table, optionally transforms the data using a JavaScript user-defined function (UDF) that you provide, and writes it to Cloud Storage as CSV text files.
Requirements for this pipeline:
- The input Spanner table must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
spannerProjectId | The Google Cloud project ID of the Cloud Spanner database that you want to read data from. |
spannerDatabaseId | The database ID of the requested table. |
spannerInstanceId | The instance ID of the requested table. |
spannerTable | The table to export. |
textWritePrefix | The output directory where output text files are written. Add a / at the end. For example, gs://mybucket/somefolder/. |
javascriptTextTransformGcsPath | [Optional] A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank. |
javascriptTextTransformFunctionName | [Optional] The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff... }, then the function name is myTransform. If you don't want to provide a function, leave this parameter blank. |
Running the Cloud Spanner to Cloud Storage Text template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Spanner to Cloud Storage Text template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_DATABASE_ID with the Spanner database id.
- Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
- Replace YOUR_INSTANCE_ID with the Spanner instance id.
- Replace YOUR_TABLE_ID with the Spanner table id.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
- Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Spanner_to_GCS_Text \
    --parameters \
spannerProjectId=YOUR_PROJECT_ID,\
spannerDatabaseId=YOUR_DATABASE_ID,\
spannerInstanceId=YOUR_INSTANCE_ID,\
spannerTable=YOUR_TABLE_ID,\
textWritePrefix=gs://YOUR_BUCKET_NAME/output/,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_DATABASE_ID with the Spanner database id.
- Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
- Replace YOUR_INSTANCE_ID with the Spanner instance id.
- Replace YOUR_TABLE_ID with the Spanner table id.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the .js file containing your JavaScript code.
- Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Spanner_to_GCS_Text
{
    "jobName": "JOB_NAME",
    "parameters": {
        "spannerProjectId": "YOUR_PROJECT_ID",
        "spannerDatabaseId": "YOUR_DATABASE_ID",
        "spannerInstanceId": "YOUR_INSTANCE_ID",
        "spannerTable": "YOUR_TABLE_ID",
        "textWritePrefix": "gs://YOUR_BUCKET_NAME/output/",
        "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
        "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Storage Avro to Cloud Bigtable
The Cloud Storage Avro to Cloud Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.
Requirements for this pipeline:
- The Cloud Bigtable table must exist and have the same column families as exported in the Avro files.
- The input Avro files must exist in a Cloud Storage bucket prior to running the pipeline.
- Cloud Bigtable expects a specific schema from the input Avro files.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to import. |
inputFilePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage Avro file to Cloud Bigtable template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Avro to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
- Replace [INPUT_FILE_PATTERN] with the Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable \
    --parameters bigtableProjectId=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],inputFilePattern=[INPUT_FILE_PATTERN]
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID, as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
- Replace [INPUT_FILE_PATTERN] with the Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "bigtableProjectId": "[PROJECT_ID]",
        "bigtableInstanceId": "[INSTANCE_ID]",
        "bigtableTableId": "[TABLE_ID]",
        "inputFilePattern": "[INPUT_FILE_PATTERN]"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Storage Avro to Cloud Spanner
The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files exported from Cloud Spanner stored in Cloud Storage and imports them to a Cloud Spanner database.
Requirements for this pipeline:
- The target Cloud Spanner database must exist and must be empty.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
- The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import. A quick way to check this is shown below.
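One simple way to confirm that the input path looks like a valid Cloud Spanner export before launching the import is to list it with gsutil. This is only an illustrative sketch; gs://my-bucket/spanner-export is a hypothetical path, not one from this guide.
# The directory should contain spanner-export.json plus the per-table
# TableName-manifest.json and TableName.avro-#####-of-##### files.
gsutil ls gs://my-bucket/spanner-export/

# Optionally inspect the manifest that describes the files to import.
gsutil cat gs://my-bucket/spanner-export/spanner-export.json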
Template parameters
Parameter | Description |
---|---|
instanceId | The instance ID of the Cloud Spanner database. |
databaseId | The database ID of the Cloud Spanner database. |
inputDir | The Cloud Storage path where the Avro files should be imported from. |
Running the Cloud Storage Avro to Cloud Spanner template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Avro to Spanner template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
  - The job name must match the format cloud-spanner-import-[YOUR_INSTANCE_ID]-[YOUR_DATABASE_ID] to show up in the Cloud Spanner portion of the Cloud Console.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- (API only) Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [DATAFLOW_REGION] with the region where you want the Dataflow job to run (such as us-central1).
- Replace [JOB_NAME] with a job name of your choice.
- Replace [YOUR_INSTANCE_ID] with the ID of the Spanner instance that contains the database.
- Replace [YOUR_DATABASE_ID] with the ID of the Spanner database to import to.
- (gcloud only) Replace [YOUR_GCS_STAGING_LOCATION] with the path for writing temporary files. For example, gs://mybucket/temp.
- Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path where the Avro files should be imported from. For example, gs://mybucket/somefolder.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location='gs://dataflow-templates/[VERSION]/GCS_Avro_to_Cloud_Spanner' \
    --region=[DATAFLOW_REGION] \
    --staging-location=[YOUR_GCS_STAGING_LOCATION] \
    --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],inputDir=[YOUR_GCS_DIRECTORY]'
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- (API only) Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [DATAFLOW_REGION] with the region where you want the Dataflow job to run (such as us-central1).
- Replace [JOB_NAME] with a job name of your choice.
- Replace [YOUR_INSTANCE_ID] with the ID of the Spanner instance that contains the database.
- Replace [YOUR_DATABASE_ID] with the ID of the Spanner database to import to.
- (gcloud only) Replace [YOUR_GCS_STAGING_LOCATION] with the path for writing temporary files. For example, gs://mybucket/temp.
- Replace [YOUR_GCS_DIRECTORY] with the Cloud Storage path where the Avro files should be imported from. For example, gs://mybucket/somefolder.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/GCS_Avro_to_Cloud_Spanner
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "instanceId": "[YOUR_INSTANCE_ID]",
        "databaseId": "[YOUR_DATABASE_ID]",
        "inputDir": "gs://[YOUR_GCS_DIRECTORY]"
    },
    "environment": {
        "machineType": "n1-standard-2"
    }
}
Cloud Storage Parquet to Cloud Bigtable
The Cloud Storage Parquet to Cloud Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.
Requirements for this pipeline:
- The Cloud Bigtable table must exist and have the same column families as exported in the Parquet files.
- The input Parquet files must exist in a Cloud Storage bucket prior to running the pipeline.
- Cloud Bigtable expects a specific schema from the input Parquet files.
Template parameters
Parameter | Description |
---|---|
bigtableProjectId | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to import. |
inputFilePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage Parquet file to Cloud Bigtable template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Parquet to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
- Replace [INPUT_FILE_PATTERN] with the Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable \
    --parameters bigtableProjectId=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],inputFilePattern=[INPUT_FILE_PATTERN]
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID, as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
- Replace [INPUT_FILE_PATTERN] with the Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "bigtableProjectId": "[PROJECT_ID]",
        "bigtableInstanceId": "[INSTANCE_ID]",
        "bigtableTableId": "[TABLE_ID]",
        "inputFilePattern": "[INPUT_FILE_PATTERN]"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Storage SequenceFile to Cloud Bigtable
The Cloud Storage SequenceFile to Cloud Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Cloud Bigtable table. You can use the template to copy data from Cloud Storage to Cloud Bigtable.
Requirements for this pipeline:
- The Cloud Bigtable table must exist.
- The input SequenceFiles must exist in a Cloud Storage bucket prior to running the pipeline.
- The input SequenceFiles must have been exported from Cloud Bigtable or HBase.
Template parameters
Parameter | Description |
---|---|
bigtableProject | The ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to. |
bigtableInstanceId | The ID of the Cloud Bigtable instance that contains the table. |
bigtableTableId | The ID of the Cloud Bigtable table to import. |
bigtableAppProfileId | The ID of the Cloud Bigtable application profile to be used for the import. If you do not specify an app profile, Cloud Bigtable uses the instance's default app profile. |
sourcePattern | The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*. |
Running the Cloud Storage SequenceFile to Cloud Bigtable template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the SequenceFile Files on Cloud Storage to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the gcloud command-line tool
Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
Running this template requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
- Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the import.
- Replace [SOURCE_PATTERN] with the Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --parameters bigtableProject=[PROJECT_ID],bigtableInstanceId=[INSTANCE_ID],bigtableTableId=[TABLE_ID],bigtableAppProfileId=[APPLICATION_PROFILE_ID],sourcePattern=[SOURCE_PATTERN]
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID, as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [JOB_NAME] with a job name of your choice.
- Replace [PROJECT_ID] with the ID of the Google Cloud project of the Cloud Bigtable instance that you want to write data to.
- Replace [INSTANCE_ID] with the ID of the Cloud Bigtable instance that contains the table.
- Replace [TABLE_ID] with the ID of the Cloud Bigtable table to import.
- Replace [APPLICATION_PROFILE_ID] with the ID of the Cloud Bigtable application profile to be used for the import.
- Replace [SOURCE_PATTERN] with the Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable
{
    "jobName": "[JOB_NAME]",
    "parameters": {
        "bigtableProject": "[PROJECT_ID]",
        "bigtableInstanceId": "[INSTANCE_ID]",
        "bigtableTableId": "[TABLE_ID]",
        "bigtableAppProfileId": "[APPLICATION_PROFILE_ID]",
        "sourcePattern": "[SOURCE_PATTERN]"
    },
    "environment": { "zone": "us-central1-f" }
}
Cloud Storage Text to BigQuery
The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and output the result to BigQuery.
IMPORTANT: If you reuse an existing BigQuery table, the table will be overwritten.
Requirements for this pipeline:
- Create a JSON file that describes your BigQuery schema. Ensure that there is a top-level JSON array titled BigQuery Schema and that its contents follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}. For example:
{
  "BigQuery Schema": [
    { "name": "location", "type": "STRING" },
    { "name": "name", "type": "STRING" },
    { "name": "age", "type": "STRING" },
    { "name": "color", "type": "STRING" },
    { "name": "coffee", "type": "STRING" }
  ]
}
- Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Note that your function must return a JSON string. For example, this function splits each line of a CSV file and returns a JSON string after transforming the values:
function transform(line) {
  var values = line.split(',');
  var obj = new Object();
  obj.location = values[0];
  obj.name = values[1];
  obj.age = values[2];
  obj.color = values[3];
  obj.coffee = values[4];
  var jsonString = JSON.stringify(obj);
  return jsonString;
}
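Both the schema file and the UDF file must be readable from Cloud Storage when the job runs. As a minimal sketch with hypothetical bucket and file names, you might stage them with gsutil before launching the template:
# Stage the BigQuery schema definition and the JavaScript UDF in Cloud Storage.
gsutil cp schema.json gs://my-bucket/text-to-bq/schema.json
gsutil cp transform.js gs://my-bucket/text-to-bq/transform.js

# Confirm the input text files are where inputFilePattern will point.
gsutil ls "gs://my-bucket/text-to-bq/input/*.csv"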
Template parameters
Parameter | Description |
---|---|
javascriptTextTransformFunctionName | The name of the function you want to call from your .js file. |
JSONPath | The gs:// path to the JSON file that defines your BigQuery schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json. |
javascriptTextTransformGcsPath | The gs:// path to the JavaScript file that defines your UDF. For example, gs://path/to/my/javascript_function.js. |
inputFilePattern | The gs:// path to the text in Cloud Storage you'd like to process. For example, gs://path/to/my/text/data.txt. |
outputTable | The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the table is overwritten. For example, my-project-name:my-dataset.my-table. |
bigQueryLoadingTemporaryDirectory | The temporary directory for the BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir. |
Running the Cloud Storage Text to BigQuery template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to BigQuery template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_JAVASCRIPT_FUNCTION with the name of your UDF.
- Replace PATH_TO_BIGQUERY_SCHEMA_JSON with the Cloud Storage path to the JSON file containing the schema definition.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the `.js` file containing your JavaScript code.
- Replace PATH_TO_YOUR_TEXT_DATA with your Cloud Storage path to your text dataset.
- Replace BIGQUERY_TABLE with your BigQuery table name.
- Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \ --parameters \ javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\ JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\ outputTable=BIGQUERY_TABLE,\ bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_JAVASCRIPT_FUNCTION with the name of your UDF.
- Replace PATH_TO_BIGQUERY_SCHEMA_JSON with the Cloud Storage path to the JSON file containing the schema definition.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the `.js` file containing your JavaScript code.
- Replace PATH_TO_YOUR_TEXT_DATA with your Cloud Storage path to your text dataset.
- Replace BIGQUERY_TABLE with your BigQuery table name.
- Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_BigQuery { "jobName": "JOB_NAME", "parameters": { "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION", "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "inputFilePattern":"PATH_TO_YOUR_TEXT_DATA", "outputTable":"BIGQUERY_TABLE", "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS" }, "environment": { "zone": "us-central1-f" } }
Cloud Storage Text to Datastore
The Cloud Storage Text to Datastore template is a batch pipeline that reads text files stored in Cloud Storage and writes JSON-encoded entities to Datastore. Each line in the input text files must be in the JSON format specified at https://cloud.google.com/datastore/docs/reference/rest/v1/Entity.
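For illustration only, a single input line might look like the following JSON-encoded entity (shown pretty-printed here for readability; in the input file each entity occupies one line). The kind, key name, and properties are hypothetical:

```json
{
  "key": {
    "path": [{"kind": "Person", "name": "alice"}]
  },
  "properties": {
    "name": {"stringValue": "Alice"},
    "age": {"integerValue": "30"}
  }
}
```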
Requirements for this pipeline:
- Datastore must be enabled in the destination project.
Template parameters
Parameter | Description |
---|---|
`textReadPattern` | A Cloud Storage file path pattern that specifies the location of your text data files. For example, `gs://mybucket/somepath/*.json`. |
`javascriptTextTransformGcsPath` | A Cloud Storage path pattern that contains all your JavaScript code. For example, `gs://mybucket/mytransforms/*.js`. If you don't want to provide a function, leave this parameter blank. |
`javascriptTextTransformFunctionName` | The name of the JavaScript function to be called. For example, if your JavaScript function is `function myTransform(inJson) { ...dostuff... }`, then the function name is `myTransform`. If you don't want to provide a function, leave this parameter blank. |
`datastoreWriteProjectId` | The ID of the Google Cloud project to write the Datastore entities to. |
`errorWritePath` | The error log output file to use for write failures that occur during processing. For example, `gs://bucket-name/errors.txt`. |
Running the Cloud Storage Text to Datastore template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to Datastore template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace PATH_TO_INPUT_TEXT_FILES with the input files pattern on Cloud Storage.
- Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the `.js` file containing your JavaScript code.
- Replace ERROR_FILE_WRITE_PATH with your desired path to the error file on Cloud Storage.
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Datastore \ --parameters \ textReadPattern=PATH_TO_INPUT_TEXT_FILES,\ javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\ javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\ datastoreWriteProjectId=YOUR_PROJECT_ID,\ errorWritePath=ERROR_FILE_WRITE_PATH
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace PATH_TO_INPUT_TEXT_FILES with the input files pattern on Cloud Storage.
- Replace YOUR_JAVASCRIPT_FUNCTION with your JavaScript function name.
- Replace PATH_TO_JAVASCRIPT_UDF_FILE with the Cloud Storage path to the `.js` file containing your JavaScript code.
- Replace ERROR_FILE_WRITE_PATH with your desired path to the error file on Cloud Storage.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Datastore { "jobName": "JOB_NAME", "parameters": { "textReadPattern": "PATH_TO_INPUT_TEXT_FILES", "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE", "javascriptTextTransformFunctionName": "YOUR_JAVASCRIPT_FUNCTION", "datastoreWriteProjectId": "YOUR_PROJECT_ID", "errorWritePath": "ERROR_FILE_WRITE_PATH" }, "environment": { "zone": "us-central1-f" } }
Cloud Storage Text to Pub/Sub (Batch)
This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic. You can use the template to publish records from a newline-delimited file containing JSON records, or from a CSV file, to a Pub/Sub topic for real-time processing. You can also use this template to replay data to Pub/Sub.
Note that this template does not set any timestamp on the individual records, so the event time is equal to the publishing time during execution. If your pipeline relies on an accurate event time for processing, do not use this pipeline.
Requirements for this pipeline:
- The files to read need to be in newline-delimited JSON or CSV format. Records spanning multiple lines in the source files may cause issues downstream as each line within the files will be published as a message to Pub/Sub.
- The Pub/Sub topic must exist prior to running the pipeline.
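For illustration, each line of the input file is published as one Pub/Sub message, so a hypothetical newline-delimited JSON file like the following would produce three messages:

```json
{"sensor_id": "s-1", "temperature": 21.4}
{"sensor_id": "s-2", "temperature": 19.8}
{"sensor_id": "s-3", "temperature": 22.1}
```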
Template parameters
Parameter | Description |
---|---|
`inputFilePattern` | The input file pattern to read from. For example, `gs://bucket-name/files/*.json`. |
`outputTopic` | The Pub/Sub topic to write to. The name must be in the format `projects/<project-id>/topics/<topic-name>`. |
Running the Cloud Storage Text to Pub/Sub (Batch) template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to Pub/Sub (Batch) template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_TOPIC_NAME with your Pub/Sub topic name.
- Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub \ --parameters \ inputFilePattern=gs://YOUR_BUCKET_NAME/files/*.json,\ outputTopic=projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_TOPIC_NAME with your Pub/Sub topic name.
- Replace YOUR_BUCKET_NAME with the name of your Cloud Storage bucket.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub { "jobName": "JOB_NAME", "parameters": { "inputFilePattern": "gs://YOUR_BUCKET_NAME/files/*.json", "outputTopic": "projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME" }, "environment": { "zone": "us-central1-f" } }
Cloud Storage Text to Cloud Spanner
The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Cloud Spanner database.
Requirements for this pipeline:
- The target Cloud Spanner database and table must exist.
- You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
- The input Cloud Storage path containing the CSV files must exist.
- You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest in Cloud Storage.
- If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
- The manifest file must be encoded in ASCII or UTF-8 and must follow the import manifest format (see the illustrative sketch after this list).
- Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using a byte order mark (BOM) in UTF-8 encoded files.
- Data must match one of the following types:
INT64
FLOAT64
BOOL
STRING
DATE
TIMESTAMP
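The manifest format itself is not reproduced in this section. As an illustrative sketch only, an import manifest is a JSON file that lists the tables to import, the Cloud Storage patterns of their CSV files, and optionally the column names and Cloud Spanner types; the table, file, and column names below are hypothetical:

```json
{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": ["gs://mybucket/csv/Albums*.csv"],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "AlbumTitle", "type_name": "STRING"}
      ]
    }
  ]
}
```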
Template parameters
Parameter | Description |
---|---|
`instanceId` | The instance ID of the Cloud Spanner database. |
`databaseId` | The database ID of the Cloud Spanner database. |
`importManifest` | The path in Cloud Storage to the import manifest file. |
`columnDelimiter` | The column delimiter that the source file uses. The default value is `,`. |
`fieldQualifier` | The character that should surround any value in the source file that contains the `columnDelimiter`. The default value is `"`. |
`trailingDelimiter` | Specifies whether the lines in the source files have trailing delimiters (that is, if the `columnDelimiter` character appears at the end of each line, after the last column value). The default value is `true`. |
`escape` | The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character. |
`nullString` | The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string. |
`dateFormat` | The format used to parse date columns. By default, the pipeline tries to parse the date columns as `yyyy-M-d[' 00:00:00']`, for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the `java.time.format.DateTimeFormatter` patterns. |
`timestampFormat` | The format used to parse timestamp columns. If the timestamp is a long integer, then it is parsed as Unix epoch. Otherwise, it is parsed as a string using the `java.time.format.DateTimeFormatter.ISO_INSTANT` format. For other cases, specify your own pattern string, for example, using `MMM dd yyyy HH:mm:ss.SSSVV` for timestamps in the form of "Jan 21 1998 01:02:03.456+08:00". |
If you need to use customized date or timestamp formats, make sure they're valid `java.time.format.DateTimeFormatter` patterns. The following table shows additional examples of customized formats for date and timestamp columns:
Type | Input value | Format | Remark |
---|---|---|---|
`DATE` | 2011-3-31 | By default, the template can parse this format. You don't need to specify the `dateFormat` parameter. | |
`DATE` | 2011-3-31 00:00:00 | By default, the template can parse this format. You don't need to specify the format. If you like, you can use `yyyy-M-d' 00:00:00'`. | |
`DATE` | 01 Apr, 18 | `dd MMM, yy` | |
`DATE` | Wednesday, April 3, 2019 AD | `EEEE, LLLL d, yyyy G` | |
`TIMESTAMP` | 2019-01-02T11:22:33Z 2019-01-02T11:22:33.123Z 2019-01-02T11:22:33.12356789Z | The default format `ISO_INSTANT` can parse this type of timestamp. You don't need to provide the `timestampFormat` parameter. | |
`TIMESTAMP` | 1568402363 | By default, the template can parse this type of timestamp and treat it as the Unix epoch time. | |
`TIMESTAMP` | Tue, 3 Jun 2008 11:05:30 GMT | `EEE, d MMM yyyy HH:mm:ss VV` | |
`TIMESTAMP` | 2018/12/31 110530.123PST | `yyyy/MM/dd HHmmss.SSSz` | |
`TIMESTAMP` | 2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z | `yyyy-MM-dd'T'HH:mm:ss[.SSS]VV` | If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp, and you don't need to provide your own format parameter. However, you can use `yyyy-MM-dd'T'HH:mm:ss[.SSS]VV` to handle both cases. Note that you cannot use `yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z'`, because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a `java.time.Instant`. Therefore, it must be specified in UTC or have time zone information associated with it. Local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid `java.time.Instant`. |
Running the template
Console
Run in the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Cloud Storage Text to Cloud Spanner template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

gcloud
Run with the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a `tempLocation` where you have write permissions. You must replace the following values in this example:
- Replace [DATAFLOW_REGION] with the region where you want the Dataflow job to run (such as `us-central1`).
- Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
- Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
- Replace [GCS_PATH_TO_IMPORT_MANIFEST] with the Cloud Storage path to your import manifest file.
- Replace [JOB_NAME] with a job name of your choice.
gcloud dataflow jobs run [JOB_NAME] \ --gcs-location='gs://dataflow-templates/[VERSION]/GCS_Text_to_Cloud_Spanner' \ --region=[DATAFLOW_REGION] \ --parameters='instanceId=[YOUR_INSTANCE_ID],databaseId=[YOUR_DATABASE_ID],importManifest=[GCS_PATH_TO_IMPORT_MANIFEST]'
API
Run with the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a `tempLocation` where you have write permissions. You must replace the following values in this example:
- Replace [YOUR_PROJECT_ID] with your project ID.
- Replace [DATAFLOW_REGION] with the region where you want the Dataflow job to run (such as `us-central1`).
- Replace [YOUR_INSTANCE_ID] with your Cloud Spanner instance ID.
- Replace [YOUR_DATABASE_ID] with your Cloud Spanner database ID.
- Replace [GCS_PATH_TO_IMPORT_MANIFEST] with the Cloud Storage path to your import manifest file.
- Replace [JOB_NAME] with a job name of your choice.
POST https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/locations/[DATAFLOW_REGION]/templates:launch?gcsPath=gs://dataflow-templates/[VERSION]/GCS_Text_to_Cloud_Spanner { "jobName": "[JOB_NAME]", "parameters": { "instanceId": "[YOUR_INSTANCE_ID]", "databaseId": "[YOUR_DATABASE_ID]", "importManifest": "[GCS_PATH_TO_IMPORT_MANIFEST]" }, "environment": { "machineType": "n1-standard-2" } }
Java Database Connectivity (JDBC) to BigQuery
The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. This pipeline uses JDBC to connect to the relational database. You can use this template to copy data from any relational database with available JDBC drivers into BigQuery. For an extra layer of protection, you can also pass in a Cloud KMS key along with a Base64-encoded username, password, and connection string parameters encrypted with the Cloud KMS key. See the Cloud KMS API encryption endpoint for additional details on encrypting your username, password, and connection string parameters.
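For reference, a request to the Cloud KMS encryption endpoint has roughly the following shape (the project, key ring, and key names below are placeholders, and the request body's plaintext must itself be Base64-encoded). The `ciphertext` field returned in the response is the Base64-encoded value to pass as the encrypted `connectionURL`, `username`, or `password`, and the `KMSEncryptionKey` parameter then identifies the same key, typically by its full resource name:

```
POST https://cloudkms.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/global/keyRings/YOUR_KEY_RING/cryptoKeys/YOUR_KEY:encrypt
{
  "plaintext": "BASE64_ENCODED_CONNECTION_STRING"
}
```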
Requirements for this pipeline:
- The JDBC drivers for the relational database must be available.
- The BigQuery table must exist prior to pipeline execution.
- The BigQuery table must have a compatible schema.
- The relational database must be accessible from the subnet where Dataflow runs.
Template parameters
Parameter | Description |
---|---|
`driverJars` | Comma-separated list of driver JARs. For example, `gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar`. |
`driverClassName` | The JDBC driver class name. For example, `com.mysql.jdbc.Driver`. |
`connectionURL` | The JDBC connection URL string. For example, `jdbc:mysql://some-host:3306/sampledb`. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
`query` | The query to be run on the source to extract the data. For example, `select * from sampledb.sample_table`. |
`outputTable` | The BigQuery output table location, in the format of `<my-project>:<my-dataset>.<my-table>`. |
`bigQueryLoadingTemporaryDirectory` | Temporary directory for the BigQuery loading process. For example, `gs://<my-bucket>/my-files/temp_dir`. |
`connectionProperties` | [Optional] Properties string to use for the JDBC connection. For example, `unicode=true&characterEncoding=UTF-8`. |
`username` | [Optional] Username to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
`password` | [Optional] Password to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key. |
`KMSEncryptionKey` | [Optional] Cloud KMS encryption key to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted. |
Running the JDBC to BigQuery template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the JDBC to BigQuery template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace DRIVER_PATHS with the comma separated Cloud Storage path(s) of the JDBC driver(s).
- Replace DRIVER_CLASS_NAME with the driver class name.
- Replace JDBC_CONNECTION_URL with the JDBC connection URL.
- Replace SOURCE_SQL_QUERY with the SQL query to be run on the source database.
- Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
- Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
- Replace CONNECTION_PROPERTIES with the JDBC connection properties if required.
- Replace CONNECTION_USERNAME with the JDBC connection username.
- Replace CONNECTION_PASSWORD with the JDBC connection password.
- Replace KMS_ENCRYPTION_KEY with the Cloud KMS Encryption Key.
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Jdbc_to_BigQuery \ --parameters \ driverJars=DRIVER_PATHS,\ driverClassName=DRIVER_CLASS_NAME,\ connectionURL=JDBC_CONNECTION_URL,\ query=SOURCE_SQL_QUERY,\ outputTable=YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME,\ bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\ connectionProperties=CONNECTION_PROPERTIES,\ username=CONNECTION_USERNAME,\ password=CONNECTION_PASSWORD,\ KMSEncryptionKey=KMS_ENCRYPTION_KEY
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace DRIVER_PATHS with the comma separated Cloud Storage path(s) of the JDBC driver(s).
- Replace DRIVER_CLASS_NAME with the driver class name.
- Replace JDBC_CONNECTION_URL with the JDBC connection URL.
- Replace SOURCE_SQL_QUERY with the SQL query to be run on the source database.
- Replace YOUR_DATASET with your BigQuery dataset, and replace YOUR_TABLE_NAME with your BigQuery table name.
- Replace PATH_TO_TEMP_DIR_ON_GCS with your Cloud Storage path to the temp directory.
- Replace CONNECTION_PROPERTIES with the JDBC connection properties if required.
- Replace CONNECTION_USERNAME with the JDBC connection username.
- Replace CONNECTION_PASSWORD with the JDBC connection password.
- Replace KMS_ENCRYPTION_KEY with the Cloud KMS Encryption Key.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Jdbc_to_BigQuery { "jobName": "JOB_NAME", "parameters": { "driverJars": "DRIVER_PATHS", "driverClassName": "DRIVER_CLASS_NAME", "connectionURL": "JDBC_CONNECTION_URL", "query": "SOURCE_SQL_QUERY", "outputTable": "YOUR_PROJECT_ID:YOUR_DATASET.YOUR_TABLE_NAME", "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS", "connectionProperties": "CONNECTION_PROPERTIES", "username": "CONNECTION_USERNAME", "password": "CONNECTION_PASSWORD", "KMSEncryptionKey":"KMS_ENCRYPTION_KEY" }, "environment": { "zone": "us-central1-f" } }
Apache Cassandra to Cloud Bigtable
The Apache Cassandra to Cloud Bigtable template copies a table from Apache Cassandra to Cloud Bigtable. This template requires minimal configuration and replicates the table structure in Cassandra as closely as possible in Cloud Bigtable.
The Apache Cassandra to Cloud Bigtable template is useful for the following:
- Migrating an Apache Cassandra database when short downtime is acceptable.
- Periodically replicating Cassandra tables to Cloud Bigtable for global serving.
Requirements for this pipeline:
- The target Cloud Bigtable table must exist prior to running the pipeline.
- A network connection must exist between the Dataflow workers and the Apache Cassandra nodes.
Type Conversion
The Apache Cassandra to Cloud Bigtable template automatically converts Apache Cassandra data types to Cloud Bigtable's data types.
Most primitives are represented the same way in Cloud Bigtable and Apache Cassandra; however, the following primitives are represented differently:
- `Date` and `Timestamp` are converted to `DateTime` objects
- `UUID` is converted to `String`
- `Varint` is converted to `BigDecimal`
Apache Cassandra also natively supports more complex types, such as `Tuple`, `List`, `Set`, and `Map`.
Tuples are not supported by this pipeline because there is no corresponding type in Apache Beam.
For example, in Apache Cassandra you can have a column of type `List` called "mylist" and values like those in the following table:
row | mylist |
---|---|
1 | (a,b,c) |
The pipeline expands the list column into three different columns (known in Cloud Bigtable as column qualifiers). Each column is named "mylist", with the index of the item in the list appended, such as "mylist[0]".
row | mylist[0] | mylist[1] | mylist[2] |
---|---|---|---|
1 | a | b | c |
The pipeline handles sets the same way as lists. Maps are expanded similarly, but the pipeline adds an additional suffix to denote whether the cell is a key or a value. For example, consider a column of type `Map` called "mymap" with the following value:
row | mymap |
---|---|
1 | {"first_key":"first_value","another_key":"different_value"} |
After the transformation, the table looks as follows:
row | mymap[0].key | mymap[0].value | mymap[1].key | mymap[1].value |
---|---|---|---|---|
1 | first_key | first_value | another_key | different_value |
Primary key conversion
In Apache Cassandra, a primary key is defined using the data definition language. The primary key can be simple, composite, or compound with clustering columns. Cloud Bigtable supports manual row-key construction, ordered lexicographically on a byte array. The pipeline automatically collects information about the type of key and constructs a row key based on best practices for building row keys from multiple values.
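As an illustration only (the exact encoding is determined by the template's row-key construction logic), a Cassandra table with a compound primary key such as `PRIMARY KEY ((customer_id), order_date)` would be expected to produce row keys that join both values with the `rowKeySeparator` value, for example `customer123#2020-01-15`.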
Template parameters
Parameter | Description |
---|---|
`cassandraHosts` | The hosts of the Apache Cassandra nodes in a comma-separated list. |
`cassandraPort` | (Optional) The TCP port to reach Apache Cassandra on the nodes (defaults to `9042`). |
`cassandraKeyspace` | The Apache Cassandra keyspace where the table is located. |
`cassandraTable` | The Apache Cassandra table to be copied. |
`bigtableProjectId` | The Google Cloud project ID of the Cloud Bigtable instance where the Apache Cassandra table should be copied. |
`bigtableInstanceId` | The Cloud Bigtable instance ID in which to copy the Apache Cassandra table. |
`bigtableTableId` | The name of the Cloud Bigtable table in which to copy the Apache Cassandra table. |
`defaultColumnFamily` | (Optional) The name of the Cloud Bigtable table's column family (defaults to `default`). |
`rowKeySeparator` | (Optional) The separator used to build the row key (defaults to `#`). |
Running the Apache Cassandra to Cloud Bigtable template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Apache Cassandra to Cloud Bigtable template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID where Cloud Bigtable is located.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_BIGTABLE_INSTANCE_ID with the Cloud Bigtable instance ID.
- Replace YOUR_BIGTABLE_TABLE_ID with the name of your Cloud Bigtable table.
- Replace YOUR_CASSANDRA_HOSTS with the Apache Cassandra host list. If multiple hosts are provided, follow the instructions on how to escape commas.
- Replace YOUR_CASSANDRA_KEYSPACE with the Apache Cassandra keyspace where the table is located.
- Replace YOUR_CASSANDRA_TABLE with the Apache Cassandra table that needs to be migrated.
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Cassandra_To_Cloud_Bigtable \ --parameters\ bigtableProjectId=YOUR_PROJECT_ID,\ bigtableInstanceId=YOUR_BIGTABLE_INSTANCE_ID,\ bigtableTableId=YOUR_BIGTABLE_TABLE_ID,\ cassandraHosts=YOUR_CASSANDRA_HOSTS,\ cassandraKeyspace=YOUR_CASSANDRA_KEYSPACE,\ cassandraTable=YOUR_CASSANDRA_TABLE
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID where Cloud Bigtable is located.
- Replace JOB_NAME with a job name of your choice.
- Replace YOUR_BIGTABLE_INSTANCE_ID with the Cloud Bigtable instance ID.
- Replace YOUR_BIGTABLE_TABLE_ID with the name of your Cloud Bigtable table.
- Replace YOUR_CASSANDRA_HOSTS with the Apache Cassandra host list. If multiple hosts are provided, follow the instructions on how to escape commas.
- Replace YOUR_CASSANDRA_KEYSPACE with the Apache Cassandra keyspace where the table is located.
- Replace YOUR_CASSANDRA_TABLE with the Apache Cassandra table that needs to be migrated.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cassandra_To_Cloud_Bigtable { "jobName": "JOB_NAME", "parameters": { "bigtableProjectId": "YOUR_PROJECT_ID", "bigtableInstanceId": "YOUR_BIGTABLE_INSTANCE_ID", "bigtableTableId": "YOUR_BIGTABLE_TABLE_ID", "cassandraHosts": "YOUR_CASSANDRA_HOSTS", "cassandraKeyspace": "YOUR_CASSANDRA_KEYSPACE", "cassandraTable": "YOUR_CASSANDRA_TABLE" }, "environment": { "zone": "us-central1-f" } }
Apache Hive to BigQuery
The Apache Hive to BigQuery template is a batch pipeline that reads from an Apache Hive table and writes the data to a BigQuery table.
Requirements for this pipeline:
- The target BigQuery table must exist prior to running the pipeline.
- A network connection must exist between the Dataflow workers and the Apache Hive nodes.
- A network connection must exist between the Dataflow workers and the Apache Thrift server node.
- The BigQuery dataset must exist prior to pipeline execution.
Template parameters
Parameter | Description |
---|---|
`metastoreUri` | The Apache Thrift server URI, such as `thrift://thrift-server-host:port`. |
`hiveDatabaseName` | The Apache Hive database name that contains the table you want to export. |
`hiveTableName` | The Apache Hive table name that you want to export. |
`outputTableSpec` | The BigQuery output table location, in the format of `<my-project>:<my-dataset>.<my-table>`. |
`hivePartitionCols` | (Optional) The comma-separated list of the Apache Hive partition columns. |
`filterString` | (Optional) The filter string for the input Apache Hive table. |
`partitionType` | (Optional) The partition type in BigQuery. Currently, only Time is supported. |
`partitionCol` | (Optional) The partition column name in the output BigQuery table. |
Running the Apache Hive to BigQuery template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the Apache Hive to BigQuery template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Hive_To_BigQuery
Replace the following:
- `PROJECT_ID`: your project ID where BigQuery is located.
- `JOB_NAME`: job name of your choice.
- `DATASET`: your BigQuery dataset.
- `TABLE_NAME`: your BigQuery table name.
- `METASTORE_URI`: the Apache Thrift server URI.
- `HIVE_DATABASE_NAME`: the Apache Hive database name that contains the table you want to export.
- `HIVE_TABLE_NAME`: the Apache Hive table name that you want to export.
- `HIVE_PARTITION_COLS`: the comma-separated list of your Apache Hive partition columns.
- `FILTER_STRING`: the filter string for the Apache Hive input table.
- `PARTITION_TYPE`: the partition type in BigQuery.
- `PARTITION_COL`: the name of the BigQuery partition column.
gcloud dataflow jobs run JOB_NAME \ --gcs-location gs://dataflow-templates/latest/Hive_To_BigQuery \ --parameters\ metastoreUri=METASTORE_URI,\ hiveDatabaseName=HIVE_DATABASE_NAME,\ hiveTableName=HIVE_TABLE_NAME,\ outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\ hivePartitionCols=HIVE_PARTITION_COLS,\ filterString=FILTER_STRING,\ partitionType=PARTITION_TYPE,\ partitionCol=PARTITION_COL
API
Run from the REST API
When running this template, you'll need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/Hive_To_BigQuery
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
Replace the following:
- `PROJECT_ID`: your project ID where BigQuery is located.
- `JOB_NAME`: job name of your choice.
- `DATASET`: your BigQuery dataset.
- `TABLE_NAME`: your BigQuery table name.
- `METASTORE_URI`: the Apache Thrift server URI.
- `HIVE_DATABASE_NAME`: the Apache Hive database name that contains the table you want to export.
- `HIVE_TABLE_NAME`: the Apache Hive table name that you want to export.
- `HIVE_PARTITION_COLS`: the comma-separated list of your Apache Hive partition columns.
- `FILTER_STRING`: the filter string for the Apache Hive input table.
- `PARTITION_TYPE`: the partition type in BigQuery.
- `PARTITION_COL`: the name of the BigQuery partition column.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Hive_To_BigQuery { "jobName": "JOB_NAME", "parameters": { "metastoreUri": "METASTORE_URI", "hiveDatabaseName": "HIVE_DATABASE_NAME", "hiveTableName": "HIVE_TABLE_NAME", "outputTableSpec": "PROJECT_ID:DATASET.TABLE_NAME", "hivePartitionCols": "HIVE_PARTITION_COLS", "filterString": "FILTER_STRING", "partitionType": "PARTITION_TYPE", "partitionCol": "PARTITION_COL" }, "environment": { "zone": "us-central1-f" } }
File Format Conversion (Avro, Parquet, CSV)
The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.
The following format conversions are supported:
- CSV to Avro.
- CSV to Parquet.
- Avro to Parquet.
- Parquet to Avro.
Requirements for this pipeline:
- The output Cloud Storage bucket must exist prior to running the pipeline.
Template parameters
Parameter | Description |
---|---|
`inputFileFormat` | Input file format. Must be one of `[csv, avro, parquet]`. |
`outputFileFormat` | Output file format. Must be one of `[avro, parquet]`. |
`inputFileSpec` | Cloud Storage path pattern for input files. For example, `gs://bucket-name/path/*.csv`. |
`outputBucket` | Cloud Storage folder to write output files. This path should end with a slash. For example, `gs://bucket-name/output/`. |
`schema` | Cloud Storage path to the Avro schema file. For example, `gs://bucket-name/schema/my-schema.avsc`. |
`containsHeaders` | [Optional] Input CSV files contain a header record (true/false). Default: `false`. Only required when reading CSV files. |
`csvFormat` | [Optional] CSV format specification to use for parsing records. Default: `Default`. See Apache Commons CSV Format for more details. |
`delimiter` | [Optional] Field delimiter used by the input CSV files. |
`outputFilePrefix` | [Optional] Output file prefix. Default: `output`. |
`numShards` | [Optional] The number of output file shards. |
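For reference, the `schema` parameter points to a standard Avro schema (`.avsc`) file. A minimal, hypothetical schema for an input with three columns might look like the following (the record and field names are illustrative only):

```json
{
  "type": "record",
  "name": "UserRecord",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```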
Running the File Format Conversion template
CONSOLE
Run from the Google Cloud Console
- Go to the Dataflow page in the Cloud Console.
- Click Create job from template.
- Select the File Format Conversion template from the Dataflow template drop-down menu.
- Enter a job name in the Job Name field.
- Enter your parameter values in the provided parameter fields.
- Click Run Job.

GCLOUD
Run from the `gcloud` command-line tool
Note: To use the `gcloud` command-line tool to run Flex templates, you must have Cloud SDK version 284.0.0 or higher.
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/File_Format_Conversion
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace INPUT_FORMAT with the file format of the input files. Must be one of `[csv, avro, parquet]`.
- Replace OUTPUT_FORMAT with the file format of the output files. Must be one of `[avro, parquet]`.
- Replace INPUT_FILES with the path pattern for input files.
- Replace OUTPUT_FOLDER with your Cloud Storage folder for output files.
- Replace SCHEMA with the path to the Avro schema file.
- Replace LOCATION with the execution region. For example, `us-central1`.
gcloud beta dataflow flex-template run JOB_NAME \ --project=YOUR_PROJECT_ID \ --region=LOCATION \ --template-file-gcs-location=gs://dataflow-templates/latest/flex/File_Format_Conversion \ --parameters \ inputFileFormat=INPUT_FORMAT,\ outputFileFormat=OUTPUT_FORMAT,\ inputFileSpec=INPUT_FILES,\ schema=SCHEMA,\ outputBucket=OUTPUT_FOLDER
API
Run from the REST API
When running this template, you need the Cloud Storage path to the template:
gs://dataflow-templates/VERSION/flex/File_Format_Conversion
To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.
You must replace the following values in this example:
- Replace YOUR_PROJECT_ID with your project ID.
- Replace JOB_NAME with a job name of your choice.
- Replace INPUT_FORMAT with the file format of the input files. Must be one of `[csv, avro, parquet]`.
- Replace OUTPUT_FORMAT with the file format of the output files. Must be one of `[avro, parquet]`.
- Replace INPUT_FILES with the path pattern for input files.
- Replace OUTPUT_FOLDER with your Cloud Storage folder for output files.
- Replace SCHEMA with the path to the Avro schema file.
- Replace LOCATION with the execution region. For example, `us-central1`.
POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/locations/LOCATION/flexTemplates:launch { "launch_parameter": { "jobName": "JOB_NAME", "parameters": { "inputFileFormat": "INPUT_FORMAT", "outputFileFormat": "OUTPUT_FORMAT", "inputFileSpec": "INPUT_FILES", "schema": "SCHEMA", "outputBucket": "OUTPUT_FOLDER" }, "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/File_Format_Conversion" } }