Google-provided batch templates

Google provides a set of open source Dataflow templates. For general information about templates, see the Overview page. For a list of all Google-provided templates, see the Get started with Google-provided templates page.

This page documents batch templates.

BigQuery to Cloud Storage TFRecords

The BigQuery to Cloud Storage TFRecords template is a pipeline that reads data from a BigQuery query and writes it to a Cloud Storage bucket in TFRecord format. You can specify the training, testing, and validation percentage splits. By default, the split is 1 (100%) for the training set and 0 (0%) for the testing and validation sets. When you set the dataset split, the sum of the training, testing, and validation percentages must add up to 1 (100%), for example, 0.6+0.2+0.2. Dataflow automatically determines the optimal number of shards for each output dataset.

Requirements for this pipeline:

  • The BigQuery dataset and table must exist.
  • The output Cloud Storage bucket must exist before pipeline execution. Training, testing, and validation subdirectories do not need to preexist and are autogenerated.

Template parameters

Parameter Description
readQuery A BigQuery SQL query that extracts data from the source. For example, select * from dataset1.sample_table.
outputDirectory The top-level Cloud Storage path prefix at which to write the training, testing, and validation TFRecord files. For example, gs://mybucket/output. Subdirectories for resulting training, testing, and validation TFRecord files are automatically generated from outputDirectory. For example, gs://mybucket/output/train
trainingPercentage (Optional) The percentage of query data allocated to training TFRecord files. The default value is 1, or 100%.
testingPercentage (Optional) The percentage of query data allocated to testing TFRecord files. The default value is 0, or 0%.
validationPercentage (Optional) The percentage of query data allocated to validation TFRecord files. The default value is 0, or 0%.
outputSuffix (Optional) The file suffix for the training, testing, and validation TFRecord files that are written. The default value is .tfrecord.

Running the BigQuery to Cloud Storage TFRecord files template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the BigQuery to TFRecords template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To run templates with the gcloud command-line tool, you must have Cloud SDK version 138.0.0 or later.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --parameters readQuery=READ_QUERY,outputDirectory=OUTPUT_DIRECTORY,trainingPercentage=TRAINING_PERCENTAGE,testingPercentage=TESTING_PERCENTAGE,validationPercentage=VALIDATION_PERCENTAGE,outputSuffix=OUTPUT_FILENAME_SUFFIX

Replace the following values:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • READ_QUERY: the BigQuery query to be executed
  • OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
  • TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
  • TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
  • VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
  • OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
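
For example, a hypothetical invocation that keeps the 0.6/0.2/0.2 split described earlier might look like the following. The job name and bucket are placeholders chosen for illustration; the query reuses the example from the parameter table:

gcloud dataflow jobs run bigquery-to-tfrecords-example \
    --gcs-location gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records \
    --parameters 'readQuery=select * from dataset1.sample_table,outputDirectory=gs://my-bucket/output,trainingPercentage=0.6,testingPercentage=0.2,validationPercentage=0.2,outputSuffix=.tfrecord'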

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_BigQuery_to_GCS_TensorFlow_Records

To execute the template with the REST API, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records
{
   "jobName": "JOB_NAME",
   "parameters": {
       "readQuery":"READ_QUERY",
       "outputDirectory":"OUTPUT_DIRECTORY",
       "trainingPercentage":"TRAINING_PERCENTAGE",
       "testingPercentage":"TESTING_PERCENTAGE",
       "validationPercentage":"VALIDATION_PERCENTAGE",
       "outputSuffix":"OUTPUT_FILENAME_SUFFIX"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following values:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • READ_QUERY: the BigQuery query to be executed
  • OUTPUT_DIRECTORY: the Cloud Storage path prefix for output datasets
  • TRAINING_PERCENTAGE: the decimal percentage split for the training dataset
  • TESTING_PERCENTAGE: the decimal percentage split for the testing dataset
  • VALIDATION_PERCENTAGE: the decimal percentage split for the validation dataset
  • OUTPUT_FILENAME_SUFFIX: the preferred output TensorFlow Record file suffix
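
The POST request above can be sent with any authorized HTTP client. As one hedged sketch, assuming the JSON body is saved to a local file named request.json (a name chosen here for illustration), you could use curl with an access token from the gcloud tool; the same pattern applies to the other templates:launch examples on this page:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_BigQuery_to_GCS_TensorFlow_Records"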

BigQuery export to Parquet (via Storage API)

The BigQuery export to Parquet template is a batch pipeline that reads data from a BigQuery table and writes it to a Cloud Storage bucket in Parquet format. This template utilizes the BigQuery Storage API to export the data.

Requirements for this pipeline:

  • The input BigQuery table must exist before running the pipeline.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
tableRef The BigQuery input table location. For example, <my-project>:<my-dataset>.<my-table>.
bucket The Cloud Storage folder in which to write the Parquet files. For example, gs://mybucket/exports.
numShards (Optional) The number of output file shards. The default value is 1.
fields (Optional) A comma-separated list of fields to select from the input BigQuery table.

Running the BigQuery to Cloud Storage Parquet template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the BigQuery export to Parquet (via Storage API) template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run Flex templates, you must have Cloud SDK version 284.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet
gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=LOCATION \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/BigQuery_to_Parquet \
    --parameters \
tableRef=BIGQUERY_TABLE,\
bucket=OUTPUT_DIRECTORY,\
numShards=NUM_SHARDS,\
fields=FIELDS

Replace the following values:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • BIGQUERY_TABLE: your BigQuery table name
  • OUTPUT_DIRECTORY: your Cloud Storage folder for output files
  • NUM_SHARDS: the desired number of output file shards
  • FIELDS: the comma-separated list of fields to select from the input BigQuery table
  • LOCATION: the execution region, for example, us-central1

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/flex/BigQuery_to_Parquet

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "tableRef": "BIGQUERY_TABLE",
          "bucket": "OUTPUT_DIRECTORY",
          "numShards": "NUM_SHARDS",
          "fields": "FIELDS"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/BigQuery_to_Parquet",
   }
}

Replace the following values:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • BIGQUERY_TABLE: your BigQuery table name
  • OUTPUT_DIRECTORY: your Cloud Storage folder for output files
  • NUM_SHARDS: the desired number of output file shards
  • FIELDS: the comma-separated list of fields to select from the input BigQuery table
  • LOCATION: the execution region, for example, us-central1
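
Because Flex Templates use the flexTemplates:launch endpoint rather than templates:launch, the request URL has no gcsPath query parameter; the template path goes in the containerSpecGcsPath field instead. A minimal curl sketch, again assuming the JSON body above is saved as a local request.json file (an illustrative name):

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch"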

Bigtable to Cloud Storage Avro

The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to export.
outputDirectory The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the Avro filename. For example, output-.

Running the Bigtable to Cloud Storage Avro file template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Bigtable to Avro Files on Cloud Storage template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro \
    --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,outputDirectory=OUTPUT_DIRECTORY,filenamePrefix=FILENAME_PREFIX

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Avro filename, for example, output-

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Avro

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Avro filename, for example, output-

Bigtable to Cloud Storage Parquet

The Bigtable to Cloud Storage Parquet template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Parquet format. You can use the template to move data from Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to export.
outputDirectory The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the Parquet filename. For example, output-.
numShards The number of output file shards. For example, 2.

Running the Bigtable to Cloud Storage Parquet file template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Bigtable to Parquet Files on Cloud Storage template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Parquet \
    --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,outputDirectory=OUTPUT_DIRECTORY,filenamePrefix=FILENAME_PREFIX,numShards=NUM_SHARDS

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
  • NUM_SHARDS: the number of Parquet files to output, for example, 1

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_Parquet

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Parquet
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "outputDirectory": "OUTPUT_DIRECTORY",
       "filenamePrefix": "FILENAME_PREFIX",
       "numShards": "NUM_SHARDS"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • OUTPUT_DIRECTORY: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the Parquet filename, for example, output-
  • NUM_SHARDS: the number of Parquet files to output, for example, 1

Bigtable to Cloud Storage SequenceFile

The Bigtable to Cloud Storage SequenceFile template is a pipeline that reads data from a Bigtable table and writes the data to a Cloud Storage bucket in SequenceFile format. You can use the template to copy data from Bigtable to Cloud Storage.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
bigtableProject The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to export.
bigtableAppProfileId The ID of the Bigtable application profile to be used for the export. If you do not specify an app profile, Bigtable uses the instance's default app profile.
destinationPath The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix The prefix of the SequenceFile filename. For example, output-.

Running the Bigtable to Cloud Storage SequenceFile template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Bigtable to SequenceFile Files on Cloud Storage template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile \
    --parameters bigtableProject=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,bigtableAppProfileId=APPLICATION_PROFILE_ID,destinationPath=DESTINATION_PATH,filenamePrefix=FILENAME_PREFIX

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the export
  • DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the SequenceFile filename, for example, output-

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Bigtable_to_GCS_SequenceFile

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_SequenceFile
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "destinationPath": "DESTINATION_PATH",
       "filenamePrefix": "FILENAME_PREFIX",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to read data from
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to export
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the export
  • DESTINATION_PATH: the Cloud Storage path where data is written, for example, gs://mybucket/somefolder
  • FILENAME_PREFIX: the prefix of the SequenceFile filename, for example, output-

Datastore to Cloud Storage Text

The Datastore to Cloud Storage Text template is a batch pipeline that reads Datastore entities and writes them to Cloud Storage as text files. You can provide a function to process each entity as a JSON string. If you don't provide such a function, every line in the output file will be a JSON-serialized entity.

Requirements for this pipeline:

  • Datastore must be set up in the project before running the pipeline.

Template parameters

Parameter Description
datastoreReadGqlQuery A GQL query that specifies which entities to grab. For example, SELECT * FROM MyKind.
datastoreReadProjectId The Google Cloud project ID of the Datastore instance that you want to read data from.
datastoreReadNamespace The namespace of the requested entities. To use the default namespace, leave this parameter blank.
javascriptTextTransformGcsPath A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...}, the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
textWritePrefix The Cloud Storage path prefix to specify where the data is written. For example, gs://mybucket/somefolder/.

Running the Datastore to Cloud Storage Text template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Datastore to Text Files on Cloud Storage template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_GCS_Text
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters \
datastoreReadGqlQuery="SELECT * FROM DATASTORE_KIND",\
datastoreReadProjectId=PROJECT_ID,\
datastoreReadNamespace=DATASTORE_NAMESPACE,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
textWritePrefix=gs://BUCKET_NAME/output/

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • DATASTORE_KIND: the type of your Datastore entities
  • DATASTORE_NAMESPACE: the namespace of your Datastore entities
  • JAVASCRIPT_FUNCTION: your JavaScript function name
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
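
For example, you might stage a small UDF in Cloud Storage and reference it when running the template. The sketch below is illustrative only: the bucket, project, kind, function name, and the assumption that the entity JSON exposes a properties map are placeholders or assumptions, not values from this guide. The namespace is left blank to use the default namespace.

cat > redact_entity.js <<'EOF'
/**
 * Illustrative UDF (not part of the template): parses the JSON-serialized
 * entity, removes a hypothetical userEmail property if present, and
 * returns the entity as a JSON string again.
 */
function redactEntity(inJson) {
  var entity = JSON.parse(inJson);
  if (entity.properties && entity.properties.userEmail) {
    delete entity.properties.userEmail;
  }
  return JSON.stringify(entity);
}
EOF

gsutil cp redact_entity.js gs://my-bucket/transforms/

gcloud dataflow jobs run datastore-to-text-example \
    --gcs-location gs://dataflow-templates/latest/Datastore_to_GCS_Text \
    --parameters 'datastoreReadGqlQuery=SELECT * FROM MyKind,datastoreReadProjectId=my-project,datastoreReadNamespace=,javascriptTextTransformGcsPath=gs://my-bucket/transforms/redact_entity.js,javascriptTextTransformFunctionName=redactEntity,textWritePrefix=gs://my-bucket/output/'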

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Datastore_to_GCS_Text

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Datastore_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "datastoreReadGqlQuery": "SELECT * FROM DATASTORE_KIND"
       "datastoreReadProjectId": "PROJECT_ID",
       "datastoreReadNamespace": "DATASTORE_NAMESPACE",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "textWritePrefix": "gs://BUCKET_NAME/output/"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • DATASTORE_KIND: the type of your Datastore entities
  • DATASTORE_NAMESPACE: the namespace of your Datastore entities
  • JAVASCRIPT_FUNCTION: your JavaScript function name
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code

Cloud Spanner to Cloud Storage Avro

The Cloud Spanner to Avro Files on Cloud Storage template is a batch pipeline that exports a whole Cloud Spanner database to Cloud Storage in Avro format. Exporting a Cloud Spanner database creates a folder in the bucket you select. The folder contains:

  • A spanner-export.json file.
  • A TableName-manifest.json file for each table in the database you exported.
  • One or more TableName.avro-#####-of-##### files.

For example, exporting a database with two tables, Singers and Albums, creates the following file set:

  • Albums-manifest.json
  • Albums.avro-00000-of-00002
  • Albums.avro-00001-of-00002
  • Singers-manifest.json
  • Singers.avro-00000-of-00003
  • Singers.avro-00001-of-00003
  • Singers.avro-00002-of-00003
  • spanner-export.json

Requirements for this pipeline:

  • The Cloud Spanner database must exist.
  • The output Cloud Storage bucket must exist.
  • In addition to the IAM roles necessary to run Dataflow jobs, you must also have the appropriate IAM roles for reading your Cloud Spanner data and writing to your Cloud Storage bucket.

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database that you want to export.
databaseId The database ID of the Cloud Spanner database that you want to export.
outputDir The Cloud Storage path you want to export Avro files to. The export job creates a new directory under this path that contains the exported files.
snapshotTime (Optional) The timestamp that corresponds to the version of the Cloud Spanner database that you want to read. The timestamp must be specified in RFC 3339 UTC "Zulu" format. For example, 1990-12-31T23:59:60Z. The timestamp must be in the past, and the maximum timestamp staleness limit applies.
spannerProjectId (Optional) The Google Cloud Project ID of the Cloud Spanner database that you want to read data from.

Running the Cloud Spanner to Avro Files on Cloud Storage template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.

    For the job to show up in the Spanner Instances page of the Cloud Console, the job name must match the following format:

    cloud-spanner-export-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME

    Replace the following:

    • SPANNER_INSTANCE_ID: your Spanner instance's ID
    • SPANNER_DATABASE_NAME: your Spanner database's name
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Spanner to Avro Files on Cloud Storage template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
gcloud dataflow jobs run JOB_NAME \
    --gcs-location='gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro' \
    --region=DATAFLOW_REGION \
    --parameters='instanceId=INSTANCE_ID,databaseId=DATABASE_ID,outputDir=GCS_DIRECTORY'

Replace the following:

  • JOB_NAME: a job name of your choice
  • DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
  • GCS_STAGING_LOCATION: the path for writing temporary files. For example, gs://mybucket/temp.
  • INSTANCE_ID: your Cloud Spanner instance ID
  • DATABASE_ID: your Cloud Spanner database ID
  • GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
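
For example, a hypothetical export of a database named orders-db from an instance named test-instance (both placeholders) could use a job name that follows the cloud-spanner-export convention described in the Console steps, so that the job also appears on the instance's page in the Cloud Console:

gcloud dataflow jobs run cloud-spanner-export-test-instance-orders-db \
    --gcs-location='gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro' \
    --region=us-central1 \
    --parameters='instanceId=test-instance,databaseId=orders-db,outputDir=gs://my-bucket/spanner-export'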

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/DATAFLOW_REGION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/Cloud_Spanner_to_GCS_Avro
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "outputDir": "gs://GCS_DIRECTORY"
   }
}

Use this example request as documented in Using the REST API. This request requires authorization, and you must specify a tempLocation where you have write permissions. Replace the following:

  • PROJECT_ID: your project ID
  • DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
  • INSTANCE_ID: your Cloud Spanner instance ID
  • DATABASE_ID: your Cloud Spanner database ID
  • GCS_DIRECTORY: the Cloud Storage path that the Avro files are exported to
  • JOB_NAME: a job name of your choice
    • The job name must match the format cloud-spanner-export-INSTANCE_ID-DATABASE_ID to show up in the Cloud Spanner portion of the Cloud Console.

Cloud Spanner to Cloud Storage Text

The Cloud Spanner to Cloud Storage Text template is a batch pipeline that reads in data from a Cloud Spanner table, optionally transforms the data via a JavaScript User Defined Function (UDF) that you provide, and writes it to Cloud Storage as CSV text files.

Requirements for this pipeline:

  • The input Spanner table must exist before running the pipeline.

Template parameters

Parameter Description
spannerProjectId The Google Cloud Project ID of the Cloud Spanner database that you want to read data from.
spannerDatabaseId The database ID of the requested table.
spannerInstanceId The instance ID of the requested table.
spannerTable The table to read the data from.
textWritePrefix The directory where output text files are written. Add / at the end. For example, gs://mybucket/somefolder/.
javascriptTextTransformGcsPath (Optional) A Cloud Storage path that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName (Optional) The name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...}, the function name is myTransform. If you don't want to provide a function, leave this parameter blank.

Running the Cloud Spanner to Cloud Storage Text template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Spanner_to_GCS_Text
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Spanner_to_GCS_Text \
    --parameters \
spannerProjectId=PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • DATABASE_ID: the Spanner database ID
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • INSTANCE_ID: the Spanner instance ID
  • TABLE_ID: the Spanner table ID
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
  • JAVASCRIPT_FUNCTION: your JavaScript function name

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Spanner_to_GCS_Text

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Spanner_to_GCS_Text
{
   "jobName": "JOB_NAME",
   "parameters": {
       "spannerProjectId": "PROJECT_ID",
       "spannerDatabaseId": "DATABASE_ID",
       "spannerInstanceId": "INSTANCE_ID",
       "spannerTable": "TABLE_ID",
       "textWritePrefix": "gs://BUCKET_NAME/output/",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • DATABASE_ID: the Spanner database ID
  • BUCKET_NAME: the name of your Cloud Storage bucket
  • INSTANCE_ID: the Spanner instance ID
  • TABLE_ID: the Spanner table ID
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
  • JAVASCRIPT_FUNCTION: your JavaScript function name

Cloud Storage Avro to Bigtable

The Cloud Storage Avro to Bigtable template is a pipeline that reads data from Avro files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

  • The Bigtable table must exist and have the same column families as exported in the Avro files.
  • The input Avro files must exist in a Cloud Storage bucket before running the pipeline.
  • Bigtable expects a specific schema from the input Avro files.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to import.
inputFilePattern The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage Avro file to Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Bigtable template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable \
    --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,inputFilePattern=INPUT_FILE_PATTERN

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Bigtable

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

Cloud Storage Avro to Cloud Spanner

The Cloud Storage Avro files to Cloud Spanner template is a batch pipeline that reads Avro files exported from Cloud Spanner stored in Cloud Storage and imports them to a Cloud Spanner database.

Requirements for this pipeline:

  • The target Cloud Spanner database must exist and must be empty.
  • You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
  • The input Cloud Storage path must exist, and it must include a spanner-export.json file that contains a JSON description of files to import.

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database.
databaseId The database ID of the Cloud Spanner database.
inputDir The Cloud Storage path where the Avro files are imported from.

Running the Cloud Storage Avro to Cloud Spanner template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.

    For the job to show up in the Spanner Instances page of the Cloud Console, the job name must match the following format:

    cloud-spanner-import-SPANNER_INSTANCE_ID-SPANNER_DATABASE_NAME

    Replace the following:

    • SPANNER_INSTANCE_ID: your Spanner instance's ID
    • SPANNER_DATABASE_NAME: your Spanner database's name
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Avro Files on Cloud Storage to Cloud Spanner template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
gcloud dataflow jobs run JOB_NAME \
    --gcs-location='gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner' \
    --region=DATAFLOW_REGION \
    --staging-location=GCS_STAGING_LOCATION \
    --parameters='instanceId=INSTANCE_ID,databaseId=DATABASE_ID,inputDir=GCS_DIRECTORY'

Replace the following:

  • (API only) PROJECT_ID: your project ID
  • DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
  • JOB_NAME: a job name of your choice
  • INSTANCE_ID: the ID of the Spanner instance that contains the database
  • DATABASE_ID: the ID of the Spanner database to import to
  • (gcloud only) GCS_STAGING_LOCATION: the path for writing temporary files, for example, gs://mybucket/temp
  • GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder
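
As a hypothetical example, importing previously exported files into an empty database named orders-copy on instance test-instance (all names are placeholders) might look like the following. The job name follows the cloud-spanner-import convention so that the job appears on the instance's Cloud Console page, and inputDir must point at a folder that contains a spanner-export.json file:

gcloud dataflow jobs run cloud-spanner-import-test-instance-orders-copy \
    --gcs-location='gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner' \
    --region=us-central1 \
    --staging-location=gs://my-bucket/temp \
    --parameters='instanceId=test-instance,databaseId=orders-copy,inputDir=gs://my-bucket/spanner-export-dir'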

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner

This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/DATAFLOW_REGION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Avro_to_Cloud_Spanner
{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "inputDir": "gs://GCS_DIRECTORY"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

Replace the following:

  • (API only) PROJECT_ID: your project ID
  • DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
  • JOB_NAME: a job name of your choice
  • INSTANCE_ID: the ID of the Spanner instance that contains the database
  • DATABASE_ID: the ID of the Spanner database to import to
  • (gcloud only) GCS_STAGING_LOCATION: the path for writing temporary files, for example, gs://mybucket/temp
  • GCS_DIRECTORY: the Cloud Storage path where the Avro files are imported from, for example, gs://mybucket/somefolder

Cloud Storage Parquet to Bigtable

The Cloud Storage Parquet to Bigtable template is a pipeline that reads data from Parquet files in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

  • The Bigtable table must exist and have the same column families as exported in the Parquet files.
  • The input Parquet files must exist in a Cloud Storage bucket before running the pipeline.
  • Bigtable expects a specific schema from the input Parquet files.

Template parameters

Parameter Description
bigtableProjectId The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to import.
inputFilePattern The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage Parquet file to Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Parquet Files on Cloud Storage to Cloud Bigtable template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable \
    --parameters bigtableProjectId=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,inputFilePattern=INPUT_FILE_PATTERN

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Parquet_to_Cloud_Bigtable

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Parquet_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "inputFilePattern": "INPUT_FILE_PATTERN",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • INPUT_FILE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

Cloud Storage SequenceFile to Bigtable

The Cloud Storage SequenceFile to Bigtable template is a pipeline that reads data from SequenceFiles in a Cloud Storage bucket and writes the data to a Bigtable table. You can use the template to copy data from Cloud Storage to Bigtable.

Requirements for this pipeline:

  • The Bigtable table must exist.
  • The input SequenceFiles must exist in a Cloud Storage bucket before running the pipeline.
  • The input SequenceFiles must have been exported from Bigtable or HBase.

Template parameters

Parameter Description
bigtableProject The ID of the Google Cloud project of the Bigtable instance that you want to write data to.
bigtableInstanceId The ID of the Bigtable instance that contains the table.
bigtableTableId The ID of the Bigtable table to import.
bigtableAppProfileId The ID of the Bigtable application profile to be used for the import. If you do not specify an app profile, Bigtable uses the instance's default app profile.
sourcePattern The Cloud Storage path pattern where data is located. For example, gs://mybucket/somefolder/prefix*.

Running the Cloud Storage SequenceFile to Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the SequenceFile Files on Cloud Storage to Cloud Bigtable template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable \
    --parameters bigtableProject=PROJECT_ID,bigtableInstanceId=INSTANCE_ID,bigtableTableId=TABLE_ID,bigtableAppProfileId=APPLICATION_PROFILE_ID,sourcePattern=SOURCE_PATTERN

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the import
  • SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_SequenceFile_to_Cloud_Bigtable

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProject": "PROJECT_ID",
       "bigtableInstanceId": "INSTANCE_ID",
       "bigtableTableId": "TABLE_ID",
       "bigtableAppProfileId": "APPLICATION_PROFILE_ID",
       "sourcePattern": "SOURCE_PATTERN",
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PROJECT_ID: the ID of the Google Cloud project of the Bigtable instance that you want to write data to
  • INSTANCE_ID: the ID of the Bigtable instance that contains the table
  • TABLE_ID: the ID of the Bigtable table to import
  • APPLICATION_PROFILE_ID: the ID of the Bigtable application profile to be used for the import
  • SOURCE_PATTERN: the Cloud Storage path pattern where data is located, for example, gs://mybucket/somefolder/prefix*

Cloud Storage Text to BigQuery

The Cloud Storage Text to BigQuery pipeline is a batch pipeline that allows you to read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and append the result to a BigQuery table.

Requirements for this pipeline:

  • Create a JSON file that describes your BigQuery schema.

    Ensure that there is a top-level JSON array titled BigQuery Schema and that its contents follow the pattern {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

    The Cloud Storage Text to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.

    The following JSON describes an example BigQuery schema:

    {
      "BigQuery Schema": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    
  • Create a JavaScript (.js) file with your UDF function that supplies the logic to transform the lines of text. Your function must return a JSON string.

    For example, this function splits each line of a CSV file and returns a JSON string after transforming the values.

    function transform(line) {
      var values = line.split(',');

      var obj = new Object();
      obj.location = values[0];
      obj.name = values[1];
      obj.age = values[2];
      obj.color = values[3];
      obj.coffee = values[4];
      var jsonString = JSON.stringify(obj);

      return jsonString;
    }
    

Template parameters

Parameter Description
javascriptTextTransformFunctionName The name of the function you want to call from your .js file.
JSONPath The gs:// path to the JSON file that defines your BigQuery schema, stored in Cloud Storage. For example, gs://path/to/my/schema.json.
javascriptTextTransformGcsPath The gs:// path to the JavaScript file that defines your UDF. For example, gs://path/to/my/javascript_function.js.
inputFilePattern The gs:// path to the text in Cloud Storage you'd like to process. For example, gs://path/to/my/text/data.txt.
outputTable The BigQuery table name you want to create to store your processed data in. If you reuse an existing BigQuery table, the data is appended to the destination table. For example, my-project-name:my-dataset.my-table.
bigQueryLoadingTemporaryDirectory The temporary directory for the BigQuery loading process. For example, gs://my-bucket/my-files/temp_dir.

Running the Cloud Storage Text to BigQuery template

Console

  1. Go to the Dataflow Create job from template page.
  2. In the Job name field, enter a unique job name.
  3. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  4. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to BigQuery (Batch) template.
  5. In the provided parameter fields, enter your parameter values.
  6. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • JAVASCRIPT_FUNCTION: the name of your UDF
  • PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
  • PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
  • BIGQUERY_TABLE: your BigQuery table name
  • PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
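
To tie the pieces together, the following hypothetical run assumes you have saved the example schema above as schema.json and the example UDF as transform.js locally; the bucket, dataset, and file names are placeholders chosen for illustration:

gsutil cp schema.json gs://my-bucket/schemas/schema.json
gsutil cp transform.js gs://my-bucket/transforms/transform.js

gcloud dataflow jobs run text-to-bigquery-example \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://my-bucket/schemas/schema.json,\
javascriptTextTransformGcsPath=gs://my-bucket/transforms/transform.js,\
inputFilePattern=gs://my-bucket/input/people.csv,\
outputTable=my-project:my_dataset.people,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/temp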

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_BigQuery

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "JSONPath": "PATH_TO_BIGQUERY_SCHEMA_JSON",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "inputFilePattern":"PATH_TO_TEXT_DATA",
       "outputTable":"BIGQUERY_TABLE",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • JAVASCRIPT_FUNCTION: the name of your UDF
  • PATH_TO_BIGQUERY_SCHEMA_JSON: the Cloud Storage path to the JSON file containing the schema definition
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
  • PATH_TO_TEXT_DATA: your Cloud Storage path to your text dataset
  • BIGQUERY_TABLE: your BigQuery table name
  • PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory

Cloud Storage Text to Datastore

The Cloud Storage Text to Datastore template is a batch pipeline that reads from text files stored in Cloud Storage and writes JSON-encoded entities to Datastore. Each line in the input text files must be in the specified JSON format.
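
For illustration only, a single input line might look like the following sketch, which assumes the Datastore v1 Entity JSON representation with placeholder kind, key, and property names:

{"key":{"partitionId":{"projectId":"my-project"},"path":[{"kind":"Person","name":"alice"}]},"properties":{"name":{"stringValue":"Alice"},"age":{"integerValue":"32"}}}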

Requirements for this pipeline:

  • Datastore must be enabled in the destination project.

Template parameters

Parameter Description
textReadPattern A Cloud Storage path pattern that specifies the location of your text data files. For example, gs://mybucket/somepath/*.json.
javascriptTextTransformGcsPath A Cloud Storage path pattern that contains all your JavaScript code. For example, gs://mybucket/mytransforms/*.js. If you don't want to provide a function, leave this parameter blank.
javascriptTextTransformFunctionName Name of the JavaScript function to be called. For example, if your JavaScript function is function myTransform(inJson) { ...dostuff...} then the function name is myTransform. If you don't want to provide a function, leave this parameter blank.
datastoreWriteProjectId The ID of the Google Cloud project to write the Datastore entities to.
errorWritePath The error log output file to use for write failures that occur during processing. For example, gs://bucket-name/errors.txt.

Running the Cloud Storage Text to Datastore template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Datastore template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Datastore
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Datastore \
    --parameters \
textReadPattern=PATH_TO_INPUT_TEXT_FILES,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
javascriptTextTransformFunctionName=JAVASCRIPT_FUNCTION,\
datastoreWriteProjectId=PROJECT_ID,\
errorWritePath=ERROR_FILE_WRITE_PATH

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
  • JAVASCRIPT_FUNCTION: your JavaScript function name
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
  • ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Datastore

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Datastore
{
   "jobName": "JOB_NAME",
   "parameters": {
       "textReadPattern": "PATH_TO_INPUT_TEXT_FILES",
       "javascriptTextTransformGcsPath": "PATH_TO_JAVASCRIPT_UDF_FILE",
       "javascriptTextTransformFunctionName": "JAVASCRIPT_FUNCTION",
       "datastoreWriteProjectId": "PROJECT_ID",
       "errorWritePath": "ERROR_FILE_WRITE_PATH"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • PATH_TO_INPUT_TEXT_FILES: the input files pattern on Cloud Storage
  • JAVASCRIPT_FUNCTION: your JavaScript function name
  • PATH_TO_JAVASCRIPT_UDF_FILE: the Cloud Storage path to the .js file containing your JavaScript code
  • ERROR_FILE_WRITE_PATH: your desired path to the error file on Cloud Storage

Cloud Storage Text to Pub/Sub (Batch)

This template creates a batch pipeline that reads records from text files stored in Cloud Storage and publishes them to a Pub/Sub topic. The template can publish records from a newline-delimited file containing JSON records, or from a CSV file, to a Pub/Sub topic for real-time processing. You can use this template to replay data to Pub/Sub.

This template does not set any timestamp on the individual records. The event time is equal to the publishing time during execution. If your pipeline relies on an accurate event time for processing, you must not use this pipeline.

Requirements for this pipeline:

  • The files to read need to be in newline-delimited JSON or CSV format. Records spanning multiple lines in the source files might cause issues downstream because each line within the files will be published as a message to Pub/Sub.
  • The Pub/Sub topic must exist before running the pipeline.
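
For example, a newline-delimited JSON input file might contain lines such as the following (the field names are placeholders); each line is published as a separate Pub/Sub message:

{"id": 1, "name": "Alice", "event": "signup"}
{"id": 2, "name": "Bob", "event": "purchase"}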

Template parameters

Parameter Description
inputFilePattern The input file pattern to read from. For example, gs://bucket-name/files/*.json.
outputTopic The Pub/Sub output topic to write to. The name must be in the format of projects/<project-id>/topics/<topic-name>.

Running the Cloud Storage Text to Pub/Sub (Batch) template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Pub/Sub (Batch) template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub \
    --parameters \
inputFilePattern=gs://BUCKET_NAME/files/*.json,\
outputTopic=projects/PROJECT_ID/topics/TOPIC_NAME

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • TOPIC_NAME: your Pub/Sub topic name
  • BUCKET_NAME: the name of your Cloud Storage bucket

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_PubSub

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/GCS_Text_to_Cloud_PubSub
{
   "jobName": "JOB_NAME",
   "parameters": {
       "inputFilePattern": "gs://BUCKET_NAME/files/*.json",
       "outputTopic": "projects/PROJECT_ID/topics/TOPIC_NAME"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • TOPIC_NAME: your Pub/Sub topic name
  • BUCKET_NAME: the name of your Cloud Storage bucket

Cloud Storage Text to Cloud Spanner

The Cloud Storage Text to Cloud Spanner template is a batch pipeline that reads CSV text files from Cloud Storage and imports them to a Cloud Spanner database.

Requirements for this pipeline:

  • The target Cloud Spanner database and table must exist.
  • You must have read permissions for the Cloud Storage bucket and write permissions for the target Cloud Spanner database.
  • The input Cloud Storage path containing the CSV files must exist.
  • You must create an import manifest file containing a JSON description of the CSV files, and you must store that manifest file in Cloud Storage.
  • If the target Cloud Spanner database already has a schema, any columns specified in the manifest file must have the same data types as their corresponding columns in the target database's schema.
  • The manifest file, encoded in ASCII or UTF-8, must match the format shown in the example after this list.

  • Text files to be imported must be in CSV format, with ASCII or UTF-8 encoding. We recommend not using byte order mark (BOM) in UTF-8 encoded files.
  • Data must match one of the following types:
    • INT64
    • FLOAT64
    • BOOL
    • STRING
    • DATE
    • TIMESTAMP
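
The following is an illustrative sketch of an import manifest; the table name, column names, and file patterns are placeholders:

{
  "tables": [
    {
      "table_name": "TABLE1",
      "file_patterns": [
        "gs://my-bucket/table1-file1",
        "gs://my-bucket/table1-file2"
      ],
      "columns": [
        {"column_name": "INT_COL", "type_name": "INT64"},
        {"column_name": "STR_COL", "type_name": "STRING"}
      ]
    }
  ]
}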

Template parameters

Parameter Description
instanceId The instance ID of the Cloud Spanner database.
databaseId The database ID of the Cloud Spanner database.
importManifest The path in Cloud Storage to the import manifest file.
columnDelimiter The column delimiter that the source file uses. The default value is ,.
fieldQualifier The character that must surround any value in the source file that contains the columnDelimiter. The default value is ".
trailingDelimiter Specifies whether the lines in the source files have trailing delimiters (that is, if the columnDelimiter character appears at the end of each line, after the last column value). The default value is true.
escape The escape character the source file uses. By default, this parameter is not set and the template does not use the escape character.
nullString The string that represents a NULL value. By default, this parameter is not set and the template does not use the null string.
dateFormat The format used to parse date columns. By default, the pipeline tries to parse the date columns as yyyy-M-d[' 00:00:00'], for example, as 2019-01-31 or 2019-1-1 00:00:00. If your date format is different, specify the format using the java.time.format.DateTimeFormatter patterns.
timestampFormat The format used to parse timestamp columns. If the timestamp is a long integer, then it is parsed as Unix epoch time. Otherwise, it is parsed as a string using the java.time.format.DateTimeFormatter.ISO_INSTANT format. For other cases, specify your own pattern string, for example, using MMM dd yyyy HH:mm:ss.SSSVV for timestamps in the form of "Jan 21 1998 01:02:03.456+08:00".

If you need to use customized date or timestamp formats, make sure they're valid java.time.format.DateTimeFormatter patterns. The following table shows additional examples of customized formats for date and timestamp columns:

Type Input value Format Remark
DATE 2011-3-31 By default, the template can parse this format. You don't need to specify the dateFormat parameter.
DATE 2011-3-31 00:00:00 By default, the template can parse this format. You don't need to specify the format. If you like, you can use yyyy-M-d' 00:00:00'.
DATE 01 Apr, 18 dd MMM, yy
DATE Wednesday, April 3, 2019 AD EEEE, LLLL d, yyyy G
TIMESTAMP 2019-01-02T11:22:33Z, 2019-01-02T11:22:33.123Z, or 2019-01-02T11:22:33.12356789Z The default format ISO_INSTANT can parse this type of timestamp. You don't need to provide the timestampFormat parameter.
TIMESTAMP 1568402363 By default, the template can parse this type of timestamp and treat it as Unix epoch time.
TIMESTAMP Tue, 3 Jun 2008 11:05:30 GMT EEE, d MMM yyyy HH:mm:ss VV
TIMESTAMP 2018/12/31 110530.123PST yyyy/MM/dd HHmmss.SSSz
TIMESTAMP 2019-01-02T11:22:33Z or 2019-01-02T11:22:33.123Z yyyy-MM-dd'T'HH:mm:ss[.SSS]VV If the input column is a mix of 2019-01-02T11:22:33Z and 2019-01-02T11:22:33.123Z, the default format can parse this type of timestamp. You don't need to provide your own format parameter. You can use yyyy-MM-dd'T'HH:mm:ss[.SSS]VV to handle both cases. You cannot use yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z', because the postfix 'Z' must be parsed as a time-zone ID, not a character literal. Internally, the timestamp column is converted to a java.time.Instant. Therefore, it must be specified in UTC or have time zone information associated with it. Local datetime, such as 2019-01-02 11:22:33, cannot be parsed as a valid java.time.Instant.

Running the Text Files on Cloud Storage to Cloud Spanner template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Text Files on Cloud Storage to Cloud Spanner template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run with the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner
gcloud dataflow jobs run JOB_NAME \
    --gcs-location='gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner' \
    --region=DATAFLOW_REGION \
    --parameters='instanceId=INSTANCE_ID,databaseId=DATABASE_ID,importManifest=GCS_PATH_TO_IMPORT_MANIFEST'

Replace the following:

  • DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
  • INSTANCE_ID: your Cloud Spanner instance ID
  • DATABASE_ID: your Cloud Spanner database ID
  • GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
  • JOB_NAME: a job name of your choice

API

Run with the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner

This request requires authorization, and you must specify a tempLocation where you have write permissions. Use this example request as documented in Using the REST API.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/DATAFLOW_REGION/templates:launch?gcsPath=gs://dataflow-templates/VERSION/GCS_Text_to_Cloud_Spanner

{
   "jobName": "JOB_NAME",
   "parameters": {
       "instanceId": "INSTANCE_ID",
       "databaseId": "DATABASE_ID",
       "importManifest": "GCS_PATH_TO_IMPORT_MANIFEST"
   },
   "environment": {
       "machineType": "n1-standard-2"
   }
}

Replace the following:

  • PROJECT_ID: your project ID
  • DATAFLOW_REGION: the region where you want the Dataflow job to run (such as us-central1)
  • INSTANCE_ID: your Cloud Spanner instance ID
  • DATABASE_ID: your Cloud Spanner database ID
  • GCS_PATH_TO_IMPORT_MANIFEST: the Cloud Storage path to your import manifest file
  • JOB_NAME: a job name of your choice

Java Database Connectivity (JDBC) to BigQuery

The JDBC to BigQuery template is a batch pipeline that copies data from a relational database table into an existing BigQuery table. This pipeline uses JDBC to connect to the relational database. You can use this template to copy data from any relational database with available JDBC drivers into BigQuery. For an extra layer of protection, you can also pass in a Cloud KMS key, along with username, password, and connection string parameters that are Base64-encoded and encrypted with the Cloud KMS key. See the Cloud KMS API encryption endpoint for additional details on encrypting your username, password, and connection string parameters.
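
For illustration only, encrypting a value is a call to the Cloud KMS encrypt method for your key; the key path below is a placeholder, and the plaintext must be Base64-encoded before it is sent. The Base64 ciphertext in the response is what you pass to the template:

POST https://cloudkms.googleapis.com/v1/projects/PROJECT_ID/locations/KEY_LOCATION/keyRings/KEY_RING/cryptoKeys/KEY_NAME:encrypt
{
   "plaintext": "BASE64_ENCODED_CONNECTION_URL"
}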

Requirements for this pipeline:

  • The JDBC drivers for the relational database must be available.
  • The BigQuery table must exist before pipeline execution.
  • The BigQuery table must have a compatible schema.
  • The relational database must be accessible from the subnet where Dataflow runs.

Template parameters

Parameter Description
driverJars The comma-separated list of driver JAR files. For example, gs://<my-bucket>/driver_jar1.jar,gs://<my-bucket>/driver_jar2.jar.
driverClassName The JDBC driver class name. For example, com.mysql.jdbc.Driver.
connectionURL The JDBC connection URL string. For example, jdbc:mysql://some-host:3306/sampledb. Can be passed in as a string that's Base64-encoded and then encrypted with a Cloud KMS key.
query The query to be run on the source to extract the data. For example, select * from sampledb.sample_table.
outputTable The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>.
bigQueryLoadingTemporaryDirectory The temporary directory for the BigQuery loading process. For example, gs://<my-bucket>/my-files/temp_dir.
connectionProperties (Optional) Properties string to use for the JDBC connection. For example, unicode=true&characterEncoding=UTF-8.
username (Optional) The username to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
password (Optional) The password to be used for the JDBC connection. Can be passed in as a Base64-encoded string encrypted with a Cloud KMS key.
KMSEncryptionKey (Optional) The Cloud KMS encryption key to decrypt the username, password, and connection string. If a Cloud KMS key is passed in, the username, password, and connection string must all be passed in encrypted.

Running the JDBC to BigQuery template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the JDBC to BigQuery template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Jdbc_to_BigQuery
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Jdbc_to_BigQuery \
    --parameters \
driverJars=DRIVER_PATHS,\
driverClassName=DRIVER_CLASS_NAME,\
connectionURL=JDBC_CONNECTION_URL,\
query=SOURCE_SQL_QUERY,\
outputTable=PROJECT_ID:DATASET.TABLE_NAME,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS,\
connectionProperties=CONNECTION_PROPERTIES,\
username=CONNECTION_USERNAME,\
password=CONNECTION_PASSWORD,\
KMSEncryptionKey=KMS_ENCRYPTION_KEY

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • DRIVER_PATHS: the comma-separated Cloud Storage path(s) of the JDBC driver(s)
  • DRIVER_CLASS_NAME: the JDBC driver class name
  • JDBC_CONNECTION_URL: the JDBC connection URL
  • SOURCE_SQL_QUERY: the SQL query to be run on the source database
  • DATASET: your BigQuery dataset
  • TABLE_NAME: your BigQuery table name
  • PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
  • CONNECTION_PROPERTIES: the JDBC connection properties, if necessary
  • CONNECTION_USERNAME: the JDBC connection username
  • CONNECTION_PASSWORD: the JDBC connection password
  • KMS_ENCRYPTION_KEY: the Cloud KMS Encryption Key

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Jdbc_to_BigQuery

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Jdbc_to_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "driverJars": "DRIVER_PATHS",
       "driverClassName": "DRIVER_CLASS_NAME",
       "connectionURL": "JDBC_CONNECTION_URL",
       "query": "SOURCE_SQL_QUERY",
       "outputTable": "PROJECT_ID:DATASET.TABLE_NAME",
       "bigQueryLoadingTemporaryDirectory": "PATH_TO_TEMP_DIR_ON_GCS",
       "connectionProperties": "CONNECTION_PROPERTIES",
       "username": "CONNECTION_USERNAME",
       "password": "CONNECTION_PASSWORD",
       "KMSEncryptionKey":"KMS_ENCRYPTION_KEY"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • DRIVER_PATHS: the comma-separated Cloud Storage path(s) of the JDBC driver(s)
  • DRIVER_CLASS_NAME: the JDBC driver class name
  • JDBC_CONNECTION_URL: the JDBC connection URL
  • SOURCE_SQL_QUERY: the SQL query to be run on the source database
  • DATASET: your BigQuery dataset
  • TABLE_NAME: your BigQuery table name
  • PATH_TO_TEMP_DIR_ON_GCS: your Cloud Storage path to the temp directory
  • CONNECTION_PROPERTIES: the JDBC connection properties, if necessary
  • CONNECTION_USERNAME: the JDBC connection username
  • CONNECTION_PASSWORD: the JDBC connection password
  • KMS_ENCRYPTION_KEY: the Cloud KMS Encryption Key

Apache Cassandra to Cloud Bigtable

The Apache Cassandra to Cloud Bigtable template copies a table from Apache Cassandra to Cloud Bigtable. This template requires minimal configuration and replicates the table structure in Cassandra as closely as possible in Cloud Bigtable.

The Apache Cassandra to Cloud Bigtable template is useful for the following:

  • Migrating an Apache Cassandra database when short downtime is acceptable.
  • Periodically replicating Cassandra tables to Cloud Bigtable for global serving.

Requirements for this pipeline:

  • The target Bigtable table must exist before running the pipeline.
  • Network connection between Dataflow workers and Apache Cassandra nodes.

Type conversion

The Apache Cassandra to Cloud Bigtable template automatically converts Apache Cassandra data types to Cloud Bigtable's data types.

Most primitives are represented the same way in Cloud Bigtable and Apache Cassandra; however, the following primitives are represented differently:

  • Date and Timestamp are converted to DateTime objects
  • UUID is converted to String
  • Varint is converted to BigDecimal

Apache Cassandra also natively supports more complex types such as Tuple, List, Set, and Map. Tuples are not supported by this pipeline because there is no corresponding type in Apache Beam.

For example, in Apache Cassandra you can have a column of type List called "mylist" and values like those in the following table:

row mylist
1 (a,b,c)

The pipeline expands the list column into three different columns (known in Cloud Bigtable as column qualifiers). Each column is named "mylist", with the index of the item in the list appended, such as "mylist[0]".

row mylist[0] mylist[1] mylist[2]
1 a b c

The pipeline handles sets the same way as lists. Maps are expanded similarly, but with a suffix added to denote whether the cell is a key or a value. For example, consider the following map column:

row mymap
1 {"first_key":"first_value","another_key":"different_value"}

After the transformation, the table appears as follows:

row mymap[0].key mymap[0].value mymap[1].key mymap[1].value
1 first_key first_value another_key different_value

Primary key conversion

In Apache Cassandra, a primary key is defined using data definition language. The primary key is either simple, composite, or compound with clustering columns. Cloud Bigtable supports manual row-key construction, ordered lexicographically on a byte array. The pipeline automatically collects information about the type of key and constructs a row-key based on best practices for building row-keys from multiple values. For example, with the default rowKeySeparator of #, a compound key with parts key1 and key2 produces a row-key such as key1#key2.

Template parameters

Parameter Description
cassandraHosts The hosts of the Apache Cassandra nodes in a comma-separated list.
cassandraPort (Optional) The TCP port to reach Apache Cassandra on the nodes (defaults to 9042).
cassandraKeyspace The Apache Cassandra keyspace where the table is located.
cassandraTable The Apache Cassandra table to be copied.
bigtableProjectId The Google Cloud Project ID of the Bigtable instance where the Apache Cassandra table is copied.
bigtableInstanceId The Bigtable instance ID in which to copy the Apache Cassandra table.
bigtableTableId The name of the Bigtable table in which to copy the Apache Cassandra table.
defaultColumnFamily (Optional) The name of the Bigtable table's column family (defaults to default).
rowKeySeparator (Optional) The separator used to build row-key (defaults to #).

Running the Apache Cassandra to Cloud Bigtable template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Cassandra to Cloud Bigtable template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Cassandra_To_Cloud_Bigtable \
    --parameters\
bigtableProjectId=PROJECT_ID,\
bigtableInstanceId=BIGTABLE_INSTANCE_ID,\
bigtableTableId=BIGTABLE_TABLE_ID,\
cassandraHosts=CASSANDRA_HOSTS,\
cassandraKeyspace=CASSANDRA_KEYSPACE,\
cassandraTable=CASSANDRA_TABLE

Replace the following:

  • PROJECT_ID: your project ID where Cloud Bigtable is located
  • JOB_NAME: a job name of your choice
  • BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
  • BIGTABLE_TABLE_ID: the name of your Cloud Bigtable table
  • CASSANDRA_HOSTS: the Apache Cassandra host list; if multiple hosts are provided, follow the instructions for escaping commas
  • CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
  • CASSANDRA_TABLE: the Apache Cassandra table that needs to be migrated

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Cassandra_To_Cloud_Bigtable

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cassandra_To_Cloud_Bigtable
{
   "jobName": "JOB_NAME",
   "parameters": {
       "bigtableProjectId": "PROJECT_ID",
       "bigtableInstanceId": "BIGTABLE_INSTANCE_ID",
       "bigtableTableId": "BIGTABLE_TABLE_ID",
       "cassandraHosts": "CASSANDRA_HOSTS",
       "cassandraKeyspace": "CASSANDRA_KEYSPACE",
       "cassandraTable": "CASSANDRA_TABLE"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID where Cloud Bigtable is located
  • JOB_NAME: a job name of your choice
  • BIGTABLE_INSTANCE_ID: the Cloud Bigtable instance ID
  • BIGTABLE_TABLE_ID: the name of your Cloud Bigtable table
  • CASSANDRA_HOSTS: the Apache Cassandra host list; if multiple hosts are provided, follow the instructions for escaping commas
  • CASSANDRA_KEYSPACE: the Apache Cassandra keyspace where the table is located
  • CASSANDRA_TABLE: the Apache Cassandra table that needs to be migrated

Apache Hive to BigQuery

The Apache Hive to BigQuery template is a batch pipeline that reads data from an Apache Hive table and writes it to a BigQuery table.

Requirements for this pipeline:

  • The target BigQuery table must exist before running the pipeline.
  • Network connection must exist between Dataflow workers and Apache Hive nodes.
  • Network connection must exist between Dataflow workers and the Apache Thrift server node.
  • The BigQuery dataset must exist before pipeline execution.

Template parameters

Parameter Description
metastoreUri The Apache Thrift server URI such as thrift://thrift-server-host:port.
hiveDatabaseName The Apache Hive database name that contains the table you want to export.
hiveTableName The Apache Hive table name that you want to export.
outputTableSpec The BigQuery output table location, in the format of <my-project>:<my-dataset>.<my-table>
hivePartitionCols (Optional) The comma-separated list of the Apache Hive partition columns.
filterString (Optional) The filter string for the input Apache Hive table.
partitionType (Optional) The partition type in BigQuery. Currently, only Time is supported.
partitionCol (Optional) The partition column name in the output BigQuery table.

Running the Apache Hive to BigQuery template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Hive to BigQuery template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run templates, you must have Cloud SDK version 138.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Hive_To_BigQuery
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Hive_To_BigQuery \
    --parameters\
metastoreUri=METASTORE_URI,\
hiveDatabaseName=HIVE_DATABASE_NAME,\
hiveTableName=HIVE_TABLE_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
hivePartitionCols=HIVE_PARTITION_COLS,\
filterString=FILTER_STRING,\
partitionType=PARTITION_TYPE,\
partitionCol=PARTITION_COL

Replace the following:

  • PROJECT_ID: your project ID where BigQuery is located
  • JOB_NAME: the job name of your choice
  • DATASET: your BigQuery dataset
  • TABLE_NAME: your BigQuery table name
  • METASTORE_URI: the Apache Thrift server URI
  • HIVE_DATABASE_NAME: the Apache Hive database name that contains the table you want to export
  • HIVE_TABLE_NAME: the Apache Hive table name that you want to export
  • HIVE_PARTITION_COLS: the comma-separated list of your Apache Hive partition columns
  • FILTER_STRING: the filter string for the Apache Hive input table
  • PARTITION_TYPE: the partition type in BigQuery
  • PARTITION_COL: the name of the BigQuery partition column

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/Hive_To_BigQuery

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Hive_To_BigQuery
{
   "jobName": "JOB_NAME",
   "parameters": {
       "metastoreUri": "METASTORE_URI",
       "hiveDatabaseName": "HIVE_DATABASE_NAME",
       "hiveTableName": "HIVE_TABLE_NAME",
       "outputTableSpec": "PROJECT_ID:DATASET.TABLE_NAME",
       "hivePartitionCols": "HIVE_PARTITION_COLS",
       "filterString": "FILTER_STRING",
       "partitionType": "PARTITION_TYPE",
       "partitionCol": "PARTITION_COL"
   },
   "environment": { "zone": "us-central1-f" }
}

Replace the following:

  • PROJECT_ID: your project ID where BigQuery is located
  • JOB_NAME: the job name of your choice
  • DATASET: your BigQuery dataset
  • TABLE_NAME: your BigQuery table name
  • METASTORE_URI: the Apache Thrift server URI
  • HIVE_DATABASE_NAME: the Apache Hive database name that contains the table you want to export
  • HIVE_TABLE_NAME: the Apache Hive table name that you want to export
  • HIVE_PARTITION_COLS: the comma-separated list of your Apache Hive partition columns
  • FILTER_STRING: the filter string for the Apache Hive input table
  • PARTITION_TYPE: the partition type in BigQuery
  • PARTITION_COL: the name of the BigQuery partition column

File Format Conversion (Avro, Parquet, CSV)

The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.

The following format conversions are supported:

  • CSV to Avro.
  • CSV to Parquet.
  • Avro to Parquet.
  • Parquet to Avro.

Requirements for this pipeline:

  • The output Cloud Storage bucket must exist before running the pipeline.

Template parameters

Parameter Description
inputFileFormat The input file format. Must be one of [csv, avro, parquet].
outputFileFormat The output file format. Must be one of [avro, parquet].
inputFileSpec The Cloud Storage path pattern for input files. For example, gs://bucket-name/path/*.csv
outputBucket The Cloud Storage folder to write output files. This path must end with a slash. For example, gs://bucket-name/output/
schema The Cloud Storage path to the Avro schema file. For example, gs://bucket-name/schema/my-schema.avsc
containsHeaders (Optional) Whether the input CSV files contain a header record (true/false). The default value is false. Only required when reading CSV files.
csvFormat (Optional) The CSV format specification to use for parsing records. The default value is Default. See Apache Commons CSV Format for more details.
delimiter (Optional) The field delimiter used by the input CSV files.
outputFilePrefix (Optional) The output file prefix. The default value is output.
numShards (Optional) The number of output file shards.
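
The schema parameter points to an Avro schema (.avsc) file that describes the records being converted. As an illustrative sketch only (the record and field names are placeholders), a minimal schema might look like this:

{
  "type": "record",
  "name": "MyRecord",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "price", "type": "double"}
  ]
}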

Running the File Format Conversion template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default regional endpoint is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Convert file formats template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

Run from the gcloud command-line tool

Note: To use the gcloud command-line tool to run Flex templates, you must have Cloud SDK version 284.0.0 or higher.

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/flex/File_Format_Conversion
gcloud beta dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/File_Format_Conversion \
    --region=LOCATION \
    --parameters \
inputFileFormat=INPUT_FORMAT,\
outputFileFormat=OUTPUT_FORMAT,\
inputFileSpec=INPUT_FILES,\
schema=SCHEMA,\
outputBucket=OUTPUT_FOLDER

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • INPUT_FORMAT: the file format of the input file; must be one of [csv, avro, parquet]
  • OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
  • INPUT_FILES: the path pattern for input files
  • OUTPUT_FOLDER: your Cloud Storage folder for output files
  • SCHEMA: the path to the Avro schema file
  • LOCATION: the execution region, for example, us-central1

API

Run from the REST API

When running this template, you need the Cloud Storage path to the template:

gs://dataflow-templates/VERSION/flex/File_Format_Conversion

To run this template with a REST API request, send an HTTP POST request with your project ID. This request requires authorization.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileFormat": "INPUT_FORMAT",
          "outputFileFormat": "OUTPUT_FORMAT",
          "inputFileSpec": "INPUT_FILES",
          "schema": "SCHEMA",
          "outputBucket": "OUTPUT_FOLDER"
      },
      "containerSpecGcsPath": "gs://dataflow-templates/latest/flex/File_Format_Conversion",
   }
}

Replace the following:

  • PROJECT_ID: your project ID
  • JOB_NAME: a job name of your choice
  • INPUT_FORMAT: the file format of the input file; must be one of [csv, avro, parquet]
  • OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
  • INPUT_FILES: the path pattern for input files
  • OUTPUT_FOLDER: your Cloud Storage folder for output files
  • SCHEMA: the path to the Avro schema file
  • LOCATION: the execution region, for example, us-central1