Dataflow Prime Right Fitting

Overview

Dataflow Prime is a serverless platform that uses horizontal and vertical scaling to allocate workers and worker resources. You don't specify the number or size and shape of workers used in your pipeline. To customize worker resources, you can use Apache Beam resource hints to specify resource requirements for an entire pipeline or for specific pipeline steps. The Dataflow Prime Right Fitting feature uses resource hints to customize worker resources for the pipeline.

Limitations and requirements

  • Resource hints are supported with the Apache Beam Java and Python SDKs, versions 2.30.0 and later.
  • Right Fitting is supported only with batch pipelines.

Available resource hints

The following resource hints are available:

  1. min_ram="numberGB": The minimum amount of RAM in gigabytes to allocate to workers. Dataflow Prime uses this value as a lower limit when allocating memory to new workers (horizontal scaling) or to existing workers (vertical scaling).

    • Set it with the minimum value of worker memory that your pipeline or pipeline step requires.
    • min_ram is an aggregate, per-worker specification. It isn't a per-vCPU specification. For example, if you set min_ram=15GB, Dataflow sets the aggregate memory available across all vCPUs in the worker to at least 15 GB.
  2. accelerator="type:type;count:number;configuration-options": The GPU type, number of GPUs, and GPU configuration options to use. To use NVIDIA GPUs with Dataflow, set the install-nvidia-driver configuration option.

Resource hint nesting

Resource hints are applied to the pipeline transform hierarchy as follows:

  • min_ram: The value on a transform is evaluated as the largest min_ram hint value among the values that are set on the transform itself and all of its parents in the transform's hierarchy.
    • Example: If an inner transform hint sets min_ram to 16 GB, and the outer transform hint in the hierarchy sets min_ram to 32 GB, a hint of 32 GB is used for all steps in the entire transform.
    • Example: If an inner transform hint sets min_ram to 16 GB, and the outer transform hint in the hierarchy sets min_ram to 8 GB, a hint of 8 GB is used for all steps in the outer transform that are not in the inner transform, and a 16 GB hint is used for all steps in the inner transform.
  • accelerator: The innermost value in the transform's hierarchy takes precedence.
    • Example: If an inner transform accelerator hint is different from an outer transform accelerator hint in a hierarchy, the inner transform accelerator hint is used for the inner transform.

Use resource hints

You can set resource hints on the entire pipeline or on pipeline steps.

Pipeline resource hints

You can set resource hints on the entire pipeline when you run the pipeline from the command line.

Example:

    python my_pipeline.py \
        --runner=DataflowRunner \
        --resource_hints=min_ram=numberGB \
        --resource_hints=accelerator="type:type;count:number;install-nvidia-driver" \
        ...

Pipeline step resource hints

You can set resource hints on pipeline steps (transforms) programmatically.

Java

You can set resource hints programmatically on pipeline transforms using ResourceHints.

Example:

pcoll.apply(MyCompositeTransform.of(...)
    .setResourceHints(
        ResourceHints.create()
            .withMinRam("15GB")
            .withAccelerator(
     "type:nvidia-tesla-k80;count:1;install-nvidia-driver")))

pcoll.apply(ParDo.of(new BigMemFn())
    .setResourceHints(
        ResourceHints.create().withMinRam("30GB")))

Python

You can set resource hints programmatically on pipeline transforms using PTransforms.with_resource_hints (also see ResourceHint).

Example:

pcoll | MyPTransform().with_resource_hints(
    min_ram="4GB",
    accelerator="type:nvidia-tesla-k80;count:1;install-nvidia-driver")

pcoll | beam.ParDo(BigMemFn()).with_resource_hints(
    min_ram="30GB")

Go

Resource hints aren't supported in Go.

Right Fitting and fusion

In some cases, transforms set with different resource hints can be executed on workers in the same worker pool, as part of the process of fusion optimization. When transforms are fused, Dataflow executes them in an environment that satisfies the union of resource hints set on the transforms.

When resource hints can't be merged, fusion doesn't occur. For example, resource hints for different GPUs aren't mergeable, so those transforms aren't fused.

You can also prevent fusion by adding an operation to your pipeline that forces Dataflow to materialize an intermediate PCollection. To learn more, see Prevent fusion.