Dataflow data pipeline templates


Dataflow templates allow you to package a Dataflow pipeline for deployment. Anyone with the correct permissions can then use the template to deploy the packaged pipeline. You can create your own custom Dataflow templates, and Google provides pre-built templates for common scenarios.

Advantages of pipeline templates

Templates have several advantages over directly deploying a pipeline to Dataflow:

  • Templates separate pipeline design from deployment. For example, a developer can create a template, and a data scientist can deploy the template at a later time.
  • Templates can have parameters that let you customize the pipeline when you deploy the template.
  • You can deploy a template by using the Google Cloud console, the Google Cloud CLI, or REST API calls. You don't need a development environment or any pipeline dependencies installed on your local machine.
  • A template is a code artifact that can be stored in a source control repository and used in continuous integration (CI/CD) pipelines.

Dataflow supports two types of templates: Flex templates, which are newer, and classic templates. If you are creating a new Dataflow template, we recommend creating it as a Flex template.

Template workflow

Using Dataflow templates involves the following high-level steps:

  1. Developers set up a development environment and develop their pipeline. The environment includes the Apache Beam SDK and other dependencies.
  2. Depending on the template type (Flex or classic):
    • For Flex templates, the developers package the pipeline into a Docker image, push the image to Container Registry or Artifact Registry, and upload a template specification file to Cloud Storage.
    • For classic templates, developers run the pipeline, create a template file, and stage the template to Cloud Storage.
  3. Other users submit a request to the Dataflow service to run the template.
  4. Dataflow creates a pipeline from the template.
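For example, a user can run a staged Flex template with a single `gcloud` command, with no local development environment required. The job name, template location, and parameter names below are hypothetical:

```shell
# Launch a Dataflow job from a Flex template spec stored in Cloud Storage.
# Parameters are passed as key=value pairs defined by the template author.
gcloud dataflow flex-template run my-job \
    --template-file-gcs-location gs://my-bucket/templates/my-template.json \
    --region us-central1 \
    --parameters input=gs://my-bucket/input.txt,output=gs://my-bucket/output
```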

Comparing Flex templates and classic templates

With a Flex template, the pipeline is packaged as a Docker image in Container Registry or Artifact Registry, along with a template specification file in Cloud Storage. The template specification contains a pointer to the Docker image. When you run the template, the Dataflow service starts a launcher VM, pulls the Docker image, and runs the pipeline. The execution graph is dynamically built based on runtime parameters provided by the user.
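As a sketch of the packaging step, the `gcloud dataflow flex-template build` command creates both the container image and the template specification file. The bucket, image path, and pipeline file names below are placeholders:

```shell
# Build a Python Flex template: packages the pipeline into a container
# image and writes the template spec file to Cloud Storage.
gcloud dataflow flex-template build gs://my-bucket/templates/my-template.json \
    --image-gcr-path "us-central1-docker.pkg.dev/my-project/my-repo/my-image:latest" \
    --sdk-language "PYTHON" \
    --flex-template-base-image "PYTHON3" \
    --py-path "." \
    --env "FLEX_TEMPLATE_PYTHON_PY_FILE=my_pipeline.py"
```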

A classic template contains the JSON serialization of a Dataflow job graph. The code for the pipeline must wrap any runtime parameters in the ValueProvider interface. This interface allows users to specify parameter values when they deploy the template.

Flex templates have the following advantages over classic templates:

  • Unlike classic templates, Flex templates don't require the ValueProvider interface for input parameters. Not all Dataflow sources and sinks support ValueProvider.
  • While classic templates have a static job graph, Flex templates can dynamically construct the job graph. For example, the template might select a different I/O connector based on input parameters.
  • A Flex template can perform preprocessing on a virtual machine (VM) during pipeline construction. For example, it might validate input parameter values.
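The preprocessing advantage can be sketched with plain argument validation. Because a Flex template constructs the pipeline at launch time, checks like the following run before any workers start; the parameter name and rule are hypothetical:

```python
import argparse


def parse_and_validate(argv):
    """Parse launch parameters and fail fast on invalid values."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--output', required=True)
    args, _ = parser.parse_known_args(argv)
    # With a Flex template, an invalid parameter can be rejected here,
    # during pipeline construction, instead of failing mid-job.
    if not args.output.startswith('gs://'):
        raise ValueError('--output must be a Cloud Storage path (gs://...)')
    return args


args = parse_and_validate(['--output', 'gs://my-bucket/results'])
print(args.output)
```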

Apache Beam SDK version requirements

To create your own templates, make sure your Apache Beam SDK version supports template creation.

Java

To create templates with the Apache Beam SDK 2.x for Java, you must have version 2.0.0-beta3 or higher.

Python

To create templates with the Apache Beam SDK 2.x for Python, you must have version 2.0.0 or higher.

To run templates with the Google Cloud CLI, you must have Google Cloud CLI version 138.0.0 or higher.

What's next