Improve performance on a shared GPU by using NVIDIA MPS

If you run multiple SDK processes on a shared Dataflow GPU, you can improve GPU efficiency and utilization by enabling the NVIDIA Multi-Process Service (MPS). MPS supports concurrent processing on a GPU by enabling processes to share CUDA contexts and scheduling resources. MPS can reduce context-switching costs, increase parallelism, and reduce storage requirements.

Target workflows are Python pipelines that run on workers with more than one vCPU.

MPS is an NVIDIA technology that implements the CUDA API, the NVIDIA platform for general-purpose GPU computing. For more information, see the NVIDIA Multi-Process Service user guide.

Benefits

  • Improves parallel processing and overall throughput for GPU pipelines, especially for workloads with low GPU resource usage.
  • Improves GPU utilization, which might reduce your costs.

Support and limitations

  • MPS is supported only on Dataflow workers that use a single GPU.
  • The pipeline can't use pipeline options that restrict parallelism.
  • Avoid exceeding the available GPU memory, especially for use cases that involve loading large machine learning models. Balance the number of vCPUs and SDK processes with the available GPU memory that these processes need.
  • MPS doesn't affect the concurrency of non-GPU operations.
  • Dataflow Prime doesn't support MPS.

Enable MPS

When you run a pipeline with GPUs, enable MPS by doing the following:

  • In the pipeline option --dataflow_service_options, append use_nvidia_mps to the worker_accelerator parameter.
  • Set the count parameter to 1.
  • Don't use the pipeline option --experiments=no_use_multiple_sdk_containers.

The pipeline option --dataflow_service_options looks like the following:

--dataflow_service_options="worker_accelerator=type:GPU_TYPE;count:1;install-nvidia-driver;use_nvidia_mps"
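For example, when you launch a Beam Python pipeline programmatically, you can pass the option as a flag to PipelineOptions. The following is a minimal sketch; the project, region, bucket, and the nvidia-tesla-t4 GPU type are illustrative placeholders, not recommendations:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket values; replace them with your own.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=us-central1",
    "--temp_location=gs://BUCKET_NAME/tmp",
    # count is 1 and use_nvidia_mps is appended to worker_accelerator.
    "--dataflow_service_options="
    "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver;use_nvidia_mps",
])

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3])  # Replace with your GPU transforms.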

If you use TensorFlow and enable MPS, do the following (a combined sketch of these settings appears after the list):

  1. Enable dynamic memory allocation on the GPU. Use either of the following TensorFlow options:
    • Turn on memory growth by calling tf.config.experimental.set_memory_growth(gpu, True).
    • Set the environment variable TF_FORCE_GPU_ALLOW_GROWTH to true.
  2. Use logical devices with appropriate memory limits.
  3. For optimal performance, enforce the use of the GPU when possible by using soft device placement or manual placement.
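The following is a minimal sketch that combines these settings, assuming TensorFlow 2.x. It uses the environment-variable option for step 1, and the 12288 MB memory limit is an illustrative assumption, not a recommended value:

import os

# Step 1: enable dynamic memory allocation before TensorFlow initializes
# the GPU (environment-variable option).
os.environ.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Step 2: define a logical device with a memory limit (in MB) so this
    # SDK process stays within its share of the GPU memory.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=12288)],
    )

    # Step 3: prefer the GPU, but let TensorFlow fall back to the CPU when
    # an operation has no GPU kernel (soft device placement).
    tf.config.set_soft_device_placement(True)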

What's next