Design pipelines

This document highlights some things to consider when designing your pipeline.

Questions to consider

In designing your pipeline, think about the following questions:

  • Where is your pipeline's input data stored? How many sets of input data do you have?
  • What does your data look like?
  • What do you want to do with your data?
  • Where should your pipeline's output data go?
  • Does your Dataflow job use Assured Workloads?

Consider using Dataflow templates

As an alternative to building a pipeline by writing Apache Beam code, you can use a template. Templates make pipelines reusable: you customize each job by changing specific pipeline parameters, and anyone you grant permissions to can then use the template to deploy the pipeline.

For example, a developer can create a template, and a data scientist elsewhere in the organization can deploy that template at a later time.

There are many Google-provided templates you can use, or you can create your own.
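For example, you can launch a Google-provided template from the command line. The sketch below runs the Word Count template with the gcloud CLI; the job name, region, and output bucket are placeholders you would replace with your own values:

```shell
# Launch the Google-provided Word Count template.
# wordcount-example, us-central1, and YOUR_BUCKET are placeholders.
gcloud dataflow jobs run wordcount-example \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region us-central1 \
    --parameters \
inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,\
output=gs://YOUR_BUCKET/output/wordcount
```

Because the template is already staged in Cloud Storage, no Apache Beam development environment is needed to run it.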

Structure your Apache Beam user code

When you create pipelines, you often use ParDo, the generic parallel-processing transform in Apache Beam. When you apply a ParDo transform, you provide user code in the form of a DoFn object. DoFn is an Apache Beam SDK class that defines a distributed processing function.

You can think of your DoFn code as small, independent entities: there can potentially be many instances running on different machines, each with no knowledge of the others. As such, we recommend creating pure functions (functions that do not depend on hidden or external state, that have no observable side effects, and are deterministic), because they are ideal for the parallel and distributed nature of DoFns.

The pure function model is not strictly rigid, however; state information or external initialization data can be valid for DoFn and other function objects, as long as your code does not depend on things that the Dataflow service does not guarantee. When structuring your ParDo transforms and creating your DoFns, keep the following guidelines in mind:

  • The Dataflow service guarantees that every element in your input PCollection is processed by a DoFn instance exactly once.
  • The Dataflow service does not guarantee how many times a DoFn will be invoked.
  • The Dataflow service does not guarantee exactly how the distributed elements are grouped—that is, it does not guarantee which (if any) elements are processed together.
  • The Dataflow service does not guarantee the exact number of DoFn instances that will be created over the course of a pipeline.
  • The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, which can cause problems if your code has manual side effects, such as relying on or creating temporary files with non-unique names.
  • The Dataflow service serializes element processing per DoFn instance. Your code does not need to be strictly thread-safe; however, any state shared between multiple DoFn instances must be thread-safe.
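The pure-function style these guidelines encourage can be sketched in plain Python. The class below is a stand-in for a Beam DoFn (with the apache_beam SDK installed, you would subclass beam.DoFn and apply it with beam.ParDo); its process method depends only on its input, has no side effects, and is deterministic, so the service could safely replicate or retry it:

```python
class ExtractWordsFn:
    """A DoFn-style pure function: splits a line into lowercase words.

    Safe to retry or run in many parallel instances, because it reads
    no external state and writes nothing outside its own outputs.
    """

    def process(self, element):
        # Yield zero or more outputs per input element, as a DoFn does.
        for word in element.split():
            yield word.lower()

# Simulate applying the transform to a small input collection.
lines = ["The quick brown Fox", "jumps over the lazy dog"]
words = [w for line in lines for w in ExtractWordsFn().process(line)]
```

Because the function is deterministic, reprocessing an element after a worker failure produces the same output, which is exactly the property the Dataflow retry behavior relies on.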

Assured Workloads

Assured Workloads helps enforce security and compliance requirements for Google Cloud customers. For example, EU Regions and Support with Sovereignty Controls helps enforce data residency and data sovereignty guarantees for EU-based customers. To provide these features, some Dataflow features are restricted or limited. If you use Assured Workloads with Dataflow, all of the resources that your pipeline accesses must be located in your organization's Assured Workloads project or folder. These resources include:

  • Cloud Storage buckets
  • BigQuery datasets
  • Pub/Sub topics and subscriptions
  • Firestore databases
  • I/O connectors

In Dataflow, user-specified data keys used in key-based operations are not protected by CMEK encryption. If these keys contain personally identifiable information (PII), you need to hash or otherwise transform the keys before they enter the Dataflow pipeline.
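For example, if your keys contain PII such as email addresses, you might replace them with a keyed hash before they enter the pipeline. The following is a minimal sketch using Python's standard library; the salt handling shown is an illustrative assumption, not something Dataflow prescribes:

```python
import hashlib
import hmac

# Hypothetical secret salt; in practice, load this from a secret manager
# rather than hard-coding it.
SALT = b"replace-with-a-secret-salt"

def hash_key(key: str) -> str:
    """Return a deterministic, non-reversible stand-in for a PII key.

    HMAC-SHA256 keeps the mapping stable (the same input always maps to
    the same output, so key-based grouping still works) while hiding
    the original value from anything downstream.
    """
    return hmac.new(SALT, key.encode("utf-8"), hashlib.sha256).hexdigest()

# Transform keys before they are used in key-based operations.
records = [("alice@example.com", 1), ("bob@example.com", 2), ("alice@example.com", 3)]
safe_records = [(hash_key(k), v) for k, v in records]
```

Note that the hash must be deterministic: if two records with the same original key hashed to different values, grouping and aggregation results would change.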

Share data across pipelines

There is no Dataflow-specific cross-pipeline communication mechanism for sharing data or processing context between pipelines. You can use durable storage like Cloud Storage or an in-memory cache like Memorystore to share data between pipeline instances.
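For example, one pipeline can write its results or processing context to durable storage, and a later pipeline can read it back. The sketch below uses a local file to stand in for a Cloud Storage object (a hypothetical path like gs://YOUR_BUCKET/shared/state.json); with the google-cloud-storage client, the write and read would target a bucket instead:

```python
import json
import os
import tempfile

# A local file stands in for a durable Cloud Storage object.
shared_path = os.path.join(tempfile.mkdtemp(), "shared_state.json")

# Pipeline A: persist context for downstream pipelines before exiting.
with open(shared_path, "w") as f:
    json.dump({"watermark": "2024-01-01T00:00:00Z", "rows_processed": 1000}, f)

# Pipeline B (possibly hours later): pick up where pipeline A left off.
with open(shared_path) as f:
    state = json.load(f)
```

Durable storage decouples the two pipelines in time: pipeline B does not need pipeline A to be running, only the object it wrote.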

Schedule jobs

You can automate pipeline execution in the following ways:

  • Use Cloud Scheduler to schedule jobs.
  • Use Apache Airflow DAGs in Cloud Composer to schedule jobs.

Learn more

The Apache Beam documentation describes how to structure your pipeline, how to choose which transforms to apply to your data, and what to consider when choosing your pipeline's input and output methods.

For more information about building your user code, see the requirements for user-provided functions.