Right fitting

The right fitting feature uses Apache Beam resource hints to customize worker resources for a pipeline. Targeting resources to specific pipeline steps adds flexibility and can reduce cost: you can apply more costly resources only to the pipeline steps that require them, and less costly resources to the other steps. Use right fitting to specify resource requirements for an entire pipeline or for specific pipeline steps.

Support and limitations

  • Resource hints are supported with the Apache Beam Java and Python SDKs, versions 2.31.0 and later.
  • Right fitting is supported with batch pipelines. Streaming pipelines aren't supported.
  • Right fitting supports Dataflow Prime.
  • Right fitting doesn't support FlexRS.
  • When you use right fitting, don't use the worker_accelerator service option.

Enable right fitting

To turn on right fitting, use one or more of the available resource hints in your pipeline. When you use a resource hint in your pipeline, right fitting is automatically enabled. For more information, see the Use resource hints section of this document.

Available resource hints

The following resource hints are available.

Resource hint Description
min_ram

The minimum amount of RAM in gigabytes to allocate to workers. Dataflow uses this value as a lower limit when allocating memory to new workers (horizontal scaling) or to existing workers (vertical scaling).

For example:

min_ram=NUMBERGB
  • Replace NUMBER with the minimum value of worker memory that your pipeline or pipeline step requires.
  • min_ram is an aggregate, per-worker specification. It isn't a per-vCPU specification. For example, if you set min_ram=15GB, Dataflow sets the aggregate memory available across all vCPUs in the worker to at least 15 GB.
accelerator

A user-supplied allocation of GPUs that lets you control the use and cost of GPUs in your pipeline and its steps. Specify the type and number of GPUs to attach to Dataflow workers as parameters to the hint.

For example:

accelerator="type:GPU_TYPE;count:GPU_COUNT;machine_type:MACHINE_TYPE;CONFIGURATION_OPTIONS"
  • Replace GPU_TYPE with the type of GPU to use. For a list of GPU types that are supported with Dataflow, see Dataflow support for GPUs.
  • Replace GPU_COUNT with the number of GPUs to use.
  • Optional: Replace MACHINE_TYPE with the type of machine to use with your GPUs.
    • The machine type must be compatible with the GPU type selected. For details about GPU types and their compatible machine types, see GPU platforms.
    • If you specify a machine type both in the accelerator resource hint and in the worker machine type pipeline option, then the pipeline option is ignored during right fitting.
  • To use NVIDIA GPUs with Dataflow, set the install-nvidia-driver configuration option in place of CONFIGURATION_OPTIONS.

For more information about using GPUs, see GPUs with Dataflow.
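The hint formats described above can be assembled with a small helper. The following is a minimal sketch in plain Python, not part of the Apache Beam SDK: the function names min_ram_hint and accelerator_hint are hypothetical, and the GPU type and machine type in the usage lines are illustrative values only. Check GPU platforms for valid combinations.

```python
from typing import Optional


def min_ram_hint(gigabytes: int) -> str:
    """Format a min_ram value in the NUMBERGB form (hypothetical helper)."""
    if gigabytes <= 0:
        raise ValueError("min_ram must be a positive number of gigabytes")
    return f"{gigabytes}GB"


def accelerator_hint(
    gpu_type: str,
    gpu_count: int,
    machine_type: Optional[str] = None,
    install_nvidia_driver: bool = True,
) -> str:
    """Assemble the semicolon-separated accelerator hint (hypothetical helper)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    parts = [f"type:{gpu_type}", f"count:{gpu_count}"]
    if machine_type:
        # Optional: must be compatible with the selected GPU type.
        parts.append(f"machine_type:{machine_type}")
    if install_nvidia_driver:
        parts.append("install-nvidia-driver")
    return ";".join(parts)


print(min_ram_hint(15))
print(accelerator_hint("nvidia-tesla-k80", 1))
print(accelerator_hint("nvidia-tesla-t4", 1, machine_type="n1-standard-4"))
```

The resulting strings can be passed wherever a resource hint value is expected, for example as the min_ram and accelerator arguments shown in the Use resource hints section.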

Resource hint nesting

Resource hints are applied to the pipeline transform hierarchy as follows:

  • min_ram: The value on a transform is evaluated as the largest min_ram hint value among the values that are set on the transform itself and all of its parents in the transform's hierarchy.
    • Example: If an inner transform hint sets min_ram to 16 GB, and the outer transform hint in the hierarchy sets min_ram to 32 GB, a hint of 32 GB is used for all steps in the entire transform.
    • Example: If an inner transform hint sets min_ram to 16 GB, and the outer transform hint in the hierarchy sets min_ram to 8 GB, a hint of 8 GB is used for all steps in the outer transform that are not in the inner transform, and a 16 GB hint is used for all steps in the inner transform.
  • accelerator: The innermost value in the transform's hierarchy takes precedence.
    • Example: If an inner transform accelerator hint is different from an outer transform accelerator hint in a hierarchy, the inner transform accelerator hint is used for the inner transform.

Hints that are set for the entire pipeline are treated as if they are set on a separate outermost transform.
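The nesting rules above can be modeled in a few lines. The following is an illustrative sketch in plain Python, not Dataflow's implementation: the function effective_hints and the tuple layout are assumptions made for this example. Each level of the hierarchy is a (min_ram in GB, accelerator) pair, listed outermost first, with pipeline-level hints as the first entry.

```python
from typing import List, Optional, Tuple

# (min_ram_gb, accelerator) for one level of the transform hierarchy.
# None means the hint is not set at that level.
Level = Tuple[Optional[int], Optional[str]]


def effective_hints(levels: List[Level]) -> Level:
    """Resolve the hints for the innermost transform in `levels`.

    min_ram: largest value set on the transform or any of its parents.
    accelerator: innermost value that is set takes precedence.
    """
    min_ram: Optional[int] = None
    accelerator: Optional[str] = None
    for ram, acc in levels:  # outermost first
        if ram is not None:
            min_ram = ram if min_ram is None else max(min_ram, ram)
        if acc is not None:
            accelerator = acc  # a later (inner) level overrides an outer one
    return min_ram, accelerator


# Outer 32 GB, inner 16 GB: the inner transform still gets 32 GB.
print(effective_hints([(32, None), (16, None)]))
# Outer 8 GB, inner 16 GB: the inner transform gets 16 GB.
print(effective_hints([(8, None), (16, None)]))
# Steps only in the outer transform see just the outer hint.
print(effective_hints([(8, None)]))
```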

Use resource hints

You can set resource hints on the entire pipeline or on pipeline steps.

Pipeline resource hints

You can set resource hints on the entire pipeline when you run the pipeline from the command line.

To set up your Python environment, see the Python quickstart.

Example:

    python my_pipeline.py \
        --runner=DataflowRunner \
        --resource_hints=min_ram=NUMBERGB \
        --resource_hints=accelerator="type:GPU_TYPE;count:GPU_COUNT;install-nvidia-driver" \
        ...

Pipeline step resource hints

You can set resource hints on pipeline steps (transforms) programmatically.

Java

To install the Apache Beam SDK for Java, see Install the Apache Beam SDK.

You can set resource hints programmatically on pipeline transforms by using the ResourceHints class.

The following example demonstrates how to set resource hints programmatically on pipeline transforms.

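import org.apache.beam.sdk.transforms.resourcehints.ResourceHints;

// Composite transform: at least 15 GB of worker RAM and one GPU.
pcoll.apply(MyCompositeTransform.of(...)
    .setResourceHints(
        ResourceHints.create()
            .withMinRam("15GB")
            .withAccelerator(
                "type:nvidia-tesla-k80;count:1;install-nvidia-driver")));

// Single ParDo: at least 30 GB of worker RAM.
pcoll.apply(ParDo.of(new BigMemFn())
    .setResourceHints(
        ResourceHints.create().withMinRam("30GB")));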

To programmatically set resource hints on the entire pipeline, use the ResourceHintsOptions interface.

Python

To install the Apache Beam SDK for Python, see Install the Apache Beam SDK.

You can set resource hints programmatically on pipeline transforms by using the PTransform.with_resource_hints method. For more information, see the ResourceHint class.

The following example demonstrates how to set resource hints programmatically on pipeline transforms.

pcoll | MyPTransform().with_resource_hints(
    min_ram="4GB",
    accelerator="type:nvidia-tesla-k80;count:1;install-nvidia-driver")

pcoll | beam.ParDo(BigMemFn()).with_resource_hints(
    min_ram="30GB")

To set resource hints on the entire pipeline, use the --resource_hints pipeline option when you run your pipeline. For an example, see Pipeline resource hints.

Go

Resource hints aren't supported in Go.

Right fitting and fusion

In some cases, transforms set with different resource hints can be executed on workers in the same worker pool, as part of the process of fusion optimization. When transforms are fused, Dataflow executes them in an environment that satisfies the union of resource hints set on the transforms.

When resource hints can't be merged, fusion doesn't occur. For example, resource hints for different GPUs aren't mergeable, so those transforms aren't fused.
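The merge behavior described above can be sketched as a small model. This is an illustration in plain Python, not Dataflow's fusion logic, which is internal to the service. The function merge_for_fusion and the dictionary keys are hypothetical, and the sketch assumes that a transform without an accelerator hint can share an environment with one that has an accelerator hint.

```python
from typing import Dict, Optional

Hints = Dict[str, object]  # keys used here: "min_ram_gb", "accelerator"


def merge_for_fusion(a: Hints, b: Hints) -> Optional[Hints]:
    """Return hints that satisfy both transforms, or None if they can't merge."""
    acc_a = a.get("accelerator")
    acc_b = b.get("accelerator")
    # Hints for different GPUs aren't mergeable, so fusion doesn't occur.
    if acc_a and acc_b and acc_a != acc_b:
        return None
    merged: Hints = {}
    rams = [h["min_ram_gb"] for h in (a, b) if "min_ram_gb" in h]
    if rams:
        # The larger minimum satisfies both transforms.
        merged["min_ram_gb"] = max(rams)
    if acc_a or acc_b:
        merged["accelerator"] = acc_a or acc_b
    return merged


print(merge_for_fusion({"min_ram_gb": 16}, {"min_ram_gb": 30}))
print(merge_for_fusion(
    {"accelerator": "type:nvidia-tesla-k80;count:1"},
    {"accelerator": "type:nvidia-tesla-t4;count:1"},
))
```

In this model, a None result corresponds to the case where the transforms aren't fused and run in separate environments.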

You can also prevent fusion by adding an operation to your pipeline that forces Dataflow to materialize an intermediate PCollection. To learn more, see Prevent fusion.

Troubleshoot right fitting

This section provides instructions for troubleshooting common issues related to right fitting.

Invalid configuration

When you try to use right fitting, the following error occurs:

Workflow failed. Causes: One or more operations had an error: 'operation-OPERATION_ID':
[UNSUPPORTED_OPERATION] 'NUMBER vCpus with NUMBER MiB memory is
an invalid configuration for NUMBER count of 'GPU_TYPE' in family 'MACHINE_TYPE'.'.

This error occurs when the GPU type selected isn't compatible with the machine type selected. To resolve this error, select a compatible GPU type and machine type. For compatibility details, see GPU platforms.