Configuring Dataflow Prime Right Fitting

Overview

Dataflow Prime is a serverless platform that uses horizontal and vertical scaling to allocate workers and worker resources, so you do not specify the number, size, or shape of the workers used in your pipeline. To customize worker resources, you can use Apache Beam resource hints to specify resource requirements for an entire pipeline or for specific pipeline steps. The Dataflow Prime Right Fitting feature uses these resource hints to customize worker resources for the pipeline.

Limitations and requirements

In the Dataflow Prime Preview stage, resource hints can be used with Apache Beam 2.30.0 or later.

Available resource hints

The following resource hints are available during Dataflow Prime Preview:

  1. min_ram="numberGB": The minimum amount of RAM, in gigabytes, to allocate to workers. Dataflow Prime uses this value as a lower limit when allocating memory to new workers (horizontal scaling) or to existing workers (vertical scaling).

    • Set it to the maximum amount of worker memory that your pipeline or pipeline step requires.
    • min_ram is an aggregate, per-worker specification, not a per-vCPU one. For example, if you set min_ram=15GB, Dataflow sets the aggregate memory available across all vCPUs in the worker to at least 15GB.
  2. accelerator="type:type;count:number;configuration-options": The GPU type, number of GPUs, and GPU configuration options to use. To use NVIDIA GPUs with Dataflow, you must set the "install-nvidia-driver" configuration option.
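
The layout of the accelerator hint string can be illustrated with a small parser. This is only a sketch of the "type:type;count:number;configuration-options" format described above: parse_accelerator_hint is a hypothetical helper written for this page, not part of Apache Beam or Dataflow, and it assumes that options are separated by semicolons and that an option without a colon is a plain flag.

```python
def parse_accelerator_hint(hint):
    """Split an accelerator hint string into a dict of options.

    Hypothetical helper for illustration only; Dataflow parses the
    hint itself and may apply validation that this sketch does not.
    """
    options = {}
    for part in hint.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition(":")
        # "key:value" pairs keep their value as a string;
        # bare options such as install-nvidia-driver become True.
        options[key] = value if sep else True
    return options


parsed = parse_accelerator_hint(
    "type:nvidia-tesla-k80;count:1;install-nvidia-driver")
```

Here parsed maps "type" to "nvidia-tesla-k80", "count" to the string "1", and "install-nvidia-driver" to True.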

Resource hint nesting

Resource hints are applied to the pipeline transform hierarchy as follows:

  • min_ram: The value on a transform is evaluated as the largest min_ram hint value among values set on the transform itself and all its parents in the transform's hierarchy.
    • Example: If an inner transform hint sets min_ram to 16GB, and the outer transform hint in the hierarchy sets min_ram to 32GB, a hint of 32GB is used for all steps in the outer transform, including the steps in the inner transform.
    • Example: If an inner transform hint sets min_ram to 16GB, and the outer transform hint in the hierarchy sets min_ram to 8GB, a hint of 8GB will be used for all steps in the outer transform that are not in the inner transform, and a 16GB hint will be used for all steps in the inner transform.
  • accelerator: The innermost value in the transform's hierarchy takes precedence.
    • Example: If an inner transform accelerator hint is different from an outer transform accelerator hint in a hierarchy, the inner transform accelerator hint will be used for the inner transform.
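
The nesting rules above can be sketched in plain Python. This is a minimal model for illustration, not how Dataflow or Apache Beam implements hint resolution: effective_hints and parse_gb are hypothetical names, each transform's hints are represented as a dict, and only values written as "numberGB" are handled.

```python
import re


def parse_gb(value):
    """Convert a string such as '16GB' to a number of gigabytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)GB", value)
    if match is None:
        raise ValueError(f"expected a value like '16GB', got {value!r}")
    return float(match.group(1))


def effective_hints(hierarchy):
    """Resolve the hints that apply to the innermost transform.

    `hierarchy` lists one hint dict per transform, ordered from the
    outermost transform to the innermost one.
    """
    resolved = {}
    for hints in hierarchy:
        if "min_ram" in hints:
            current = resolved.get("min_ram")
            # min_ram: the largest value in the hierarchy wins.
            if current is None or parse_gb(hints["min_ram"]) > parse_gb(current):
                resolved["min_ram"] = hints["min_ram"]
        if "accelerator" in hints:
            # accelerator: the innermost value overrides outer ones.
            resolved["accelerator"] = hints["accelerator"]
    return resolved


# Outer transform asks for 32GB, inner for 16GB: steps in the inner
# transform resolve to 32GB.
inner_steps = effective_hints([{"min_ram": "32GB"}, {"min_ram": "16GB"}])
```

Calling effective_hints with only the outer transform's dict gives the hints for steps that sit outside the inner transform, which mirrors the 8GB and 16GB example above.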

Using resource hints

You can set resource hints on the entire pipeline or on pipeline steps.

Pipeline resource hints

You can set resource hints on the entire pipeline when you run the pipeline from the command line.

Example:

    python my_pipeline.py \
        --runner=DataflowRunner \
        --resource_hints=min_ram=numberGB \
        --resource_hints=accelerator="type:type;count:number;install-nvidia-driver" \
        ...

Pipeline step resource hints

You can set resource hints on pipeline steps (transforms) programmatically.

Java

You can set resource hints programmatically on pipeline transforms using ResourceHints.

Example:

pcoll.apply(MyCompositeTransform.of(...)
    .setResourceHints(
        ResourceHints.create()
            .withMinRam("15GB")
            .withAccelerator(
                "type:nvidia-tesla-k80;count:1;install-nvidia-driver")));

pcoll.apply(ParDo.of(new BigMemFn())
    .setResourceHints(
        ResourceHints.create().withMinRam("30GB")));

Python

You can set resource hints programmatically on pipeline transforms using PTransform.with_resource_hints (see also ResourceHint).

Example:

pcoll | MyPTransform().with_resource_hints(
    min_ram="4GB",
    accelerator="type:nvidia-tesla-k80;count:1;install-nvidia-driver")

pcoll | beam.ParDo(BigMemFn()).with_resource_hints(
    min_ram="30GB")