This document highlights some things to consider when designing your pipeline.
Questions to consider
In designing your pipeline, think about the following questions:
- Where is your pipeline's input data stored? How many sets of input data do you have?
- What does your data look like?
- What do you want to do with your data?
- Where should your pipeline's output data go?
Consider using Dataflow templates
As an alternative to building a pipeline by writing Apache Beam code, you can use a Dataflow template. Templates are reusable: they let you customize each job by changing specific pipeline parameters, and anyone you grant permissions to can then use the template to deploy the pipeline.
For example, a developer can create a template, and a data scientist in the organization can deploy that template at a later time.
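For example, with a classic template in the Apache Beam Python SDK, parameters that should be settable at job-creation time are declared as ValueProvider arguments. The following is a minimal sketch, not a complete template; the option names (input, output) and the pipeline steps are hypothetical.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider arguments are resolved at job-creation time,
        # so each job launched from the template can supply its own values.
        parser.add_value_provider_argument(
            '--input', type=str, help='Input file pattern to read from')
        parser.add_value_provider_argument(
            '--output', type=str, help='Output location to write to')

def run():
    options = MyTemplateOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(options.input)
         | 'Write' >> beam.io.WriteToText(options.output))
```

Anyone with permission to use the template can then launch a job and supply values for input and output without touching the pipeline code.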
Structure your Apache Beam user code
Often in creating pipelines, you'll use the generic parallel processing Apache Beam transform, ParDo. When you apply a ParDo transform, you provide user code in the form of a DoFn object. DoFn is an Apache Beam SDK class that defines a distributed processing function.
You can think of your DoFn code as small, independent entities: there can potentially be many instances running on different machines, each with no knowledge of the others. As such, we recommend creating pure functions (functions that do not depend on hidden or external state, that have no observable side effects, and are deterministic), because they are ideal for the parallel and distributed nature of DoFns.
The pure function model is not strictly rigid, however; state information or external initialization data can be valid for DoFn and other function objects, as long as your code does not depend on things that the Dataflow service does not guarantee. When structuring your ParDo transforms and DoFns, keep the following guidelines in mind (a minimal DoFn sketch follows the list):
- The Dataflow service guarantees that every element in your PCollection is processed by a DoFn instance exactly once.
- The Dataflow service does not guarantee how many times a DoFn will be invoked.
- The Dataflow service does not guarantee exactly how the distributed elements are grouped—that is, it does not guarantee which (if any) elements are processed together.
- The Dataflow service does not guarantee the exact number of DoFn instances that will be created over the course of a pipeline.
- The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
- The Dataflow service serializes element processing per DoFn instance. Your code does not need to be strictly thread-safe; however, any state shared between multiple DoFn instances must be thread-safe.
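To illustrate, here is a minimal sketch of a pure-function DoFn using the Apache Beam Python SDK. The ExtractDomainFn name and the sample data are hypothetical; the point is that the function is deterministic and depends only on its input element.

```python
import apache_beam as beam

class ExtractDomainFn(beam.DoFn):
    # A pure function: deterministic, no hidden or external state,
    # no observable side effects. Instances of this DoFn may run on
    # many workers, each with no knowledge of the others.
    def process(self, element):
        user, sep, domain = element.partition('@')
        if sep:
            yield domain.lower()

with beam.Pipeline() as p:
    (p
     | beam.Create(['a@Example.com', 'b@test.org', 'not-an-email'])
     | beam.ParDo(ExtractDomainFn())
     | beam.Map(print))
```

Because the function writes no files, touches no shared variables, and produces the same output for the same input, Dataflow can safely retry it, run many copies in parallel, or discard duplicate backup executions.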
Share data across pipelines
There is no Dataflow-specific cross-pipeline communication mechanism for sharing data or processing context between pipelines. You can use durable storage like Cloud Storage or an in-memory cache like App Engine Memcache to share data between pipeline instances.
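For example, one pipeline can write its results to Cloud Storage, and a second pipeline, run later, can read them back. The sketch below shows this pattern with the Beam Python SDK; the bucket and object paths are hypothetical.

```python
import apache_beam as beam

# Pipeline A: write intermediate results to durable storage.
with beam.Pipeline() as p:
    (p
     | beam.Create(['record-1', 'record-2'])
     | beam.io.WriteToText('gs://my-bucket/shared/results'))

# Pipeline B: run later, read the shared data back from Cloud Storage.
with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://my-bucket/shared/results*')
     | beam.Map(print))
```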
You can automate pipeline execution by:
- Using Cloud Scheduler.
- Using Apache Airflow's Dataflow Operator, one of several Google Cloud Operators in a Cloud Composer workflow (a minimal sketch follows this list).
- Running custom (cron) job processes on Compute Engine.
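As one illustration of the Cloud Composer option, the sketch below uses the DataflowTemplatedJobStartOperator from the apache-airflow-providers-google package to launch a job from a template on a daily schedule. The DAG id, bucket paths, parameters, and project id are all hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id='launch_dataflow_template',   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',          # run the pipeline once per day
    catchup=False,
) as dag:
    start_job = DataflowTemplatedJobStartOperator(
        task_id='start_dataflow_job',
        template='gs://my-bucket/templates/my_template',  # hypothetical path
        parameters={'input': 'gs://my-bucket/input/*.txt'},
        location='us-central1',
        project_id='my-project',  # hypothetical project
    )
```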
The Apache Beam documentation describes how to structure your pipeline, how to choose which transforms to apply to your data, and what to consider in choosing your pipeline's input and output methods.
For more information about building your user code, see the requirements for user-provided functions.