Overview
Dataflow Prime is a serverless platform that uses horizontal and vertical scaling to allocate workers and worker resources. You don't specify the number or size and shape of workers used in your pipeline. To customize worker resources, you can use Apache Beam resource hints to specify resource requirements for an entire pipeline or for specific pipeline steps. The Dataflow Prime Right Fitting feature uses resource hints to customize worker resources for the pipeline.
Limitations and requirements
- Resource hints are supported with the Apache Beam Java and Python SDKs, versions 2.30.0 and later.
- Right Fitting is supported only with batch pipelines.
Available resource hints
The following resource hints are available:
min_ram
="numberGB": The minimum amount of RAM in gigabytes to allocate to workers. Dataflow Prime uses this value as a lower limit when allocating memory to new workers (horizontal scaling) or to existing workers (vertical scaling).- Set it with the minimum value of worker memory that your pipeline or pipeline step requires.
min_ram
is an aggregate, per-worker specification. It isn't a per-vCPU specification. For example, if you setmin_ram=15GB
, Dataflow sets the aggregate memory available across all vCPUs in the worker to at least 15 GB.
accelerator
="type:type;count:number;configuration-options": The GPU type, number of GPUs, and GPU configuration options to use. To use NVIDIA GPUs with Dataflow, set theinstall-nvidia-driver
configuration option.
Resource hint nesting
Resource hints are applied to the pipeline transform hierarchy as follows:
min_ram
: The value on a transform is evaluated as the largestmin_ram
hint value among the values that are set on the transform itself and all of its parents in the transform's hierarchy.- Example: If an inner transform hint sets
min_ram
to 16 GB, and the outer transform hint in the hierarchy setsmin_ram
to 32 GB, a hint of 32 GB is used for all steps in the entire transform. - Example: If an inner transform hint sets
min_ram
to 16 GB, and the outer transform hint in the hierarchy setsmin_ram
to 8 GB, a hint of 8 GB is used for all steps in the outer transform that are not in the inner transform, and a 16 GB hint is used for all steps in the inner transform.
- Example: If an inner transform hint sets
accelerator
: The innermost value in the transform's hierarchy takes precedence.- Example: If an inner transform
accelerator
hint is different from an outer transformaccelerator
hint in a hierarchy, the inner transformaccelerator
hint is used for the inner transform.
- Example: If an inner transform
Use resource hints
You can set resource hints on the entire pipeline or on pipeline steps.
Pipeline resource hints
You can set resource hints on the entire pipeline when you run the pipeline from the command line.
Example:
python my_pipeline.py \ --runner=DataflowRunner \ --resource_hints=min_ram=numberGB \ --resource_hints=accelerator="type:type;count:number;install-nvidia-driver" \ ...
Pipeline step resource hints
You can set resource hints on pipeline steps (transforms) programmatically.
Java
You can set resource hints programmatically on pipeline transforms using
ResourceHints
.
Example:
pcoll.apply(MyCompositeTransform.of(...) .setResourceHints( ResourceHints.create() .withMinRam("15GB") .withAccelerator( "type:nvidia-tesla-k80;count:1;install-nvidia-driver"))) pcoll.apply(ParDo.of(new BigMemFn()) .setResourceHints( ResourceHints.create().withMinRam("30GB")))
Python
You can set resource hints programmatically on pipeline transforms using
PTransforms.with_resource_hints
(also see ResourceHint
).
Example:
pcoll | MyPTransform().with_resource_hints( min_ram="4GB", accelerator="type:nvidia-tesla-k80;count:1;install-nvidia-driver") pcoll | beam.ParDo(BigMemFn()).with_resource_hints( min_ram="30GB")
Go
Resource hints aren't supported in Go.
Right Fitting and fusion
In some cases, transforms set with different resource hints can be executed on workers in the same worker pool, as part of the process of fusion optimization. When transforms are fused, Dataflow executes them in an environment that satisfies the union of resource hints set on the transforms.
When resource hints can't be merged, fusion doesn't occur. For example, resource hints for different GPUs aren't mergeable, so those transforms aren't fused.
You can also prevent fusion by adding an operation to your pipeline that forces
Dataflow to materialize an intermediate PCollection
. To learn
more, see
Prevent fusion.