Using GPUs

This page explains how to run an Apache Beam pipeline on Dataflow with GPUs. Jobs using GPUs incur charges as described on the Dataflow pricing page.

For more information about using GPUs with Dataflow, read Dataflow support for GPUs.

Using Apache Beam notebooks

If you already have a pipeline that you would like to run with GPUs on Dataflow, you can skip this section.

Apache Beam notebooks offer a convenient way to prototype and iteratively develop your pipeline with GPUs without setting up a development environment. To get started, read the Developing with Apache Beam notebooks guide, launch an Apache Beam notebooks instance, and follow the example notebook Use GPUs with Apache Beam.

Provisioning GPU quota

GPU devices are subject to your Google Cloud project's quota availability. Request GPU quota in the region of your choice.

Installing GPU drivers

You must instruct Dataflow to install NVIDIA drivers onto the workers by appending install-nvidia-driver to the worker_accelerator option. When install-nvidia-driver is specified, Dataflow installs NVIDIA drivers onto the Dataflow workers using the cos-extensions utility provided by Container-Optimized OS. By specifying install-nvidia-driver, you agree to accept the NVIDIA license agreement.

Binaries and libraries provided by the NVIDIA driver installer are mounted into the container running pipeline user code at /usr/local/nvidia/.
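A quick way to confirm that the driver mount is present is to list the shared libraries under it from inside your container. The following is a minimal sketch. It assumes the libraries sit in a lib64 subdirectory, which is the path this page later adds to LD_LIBRARY_PATH. The function name find_driver_libraries is hypothetical.

```python
import glob
import os

# Mount point for the NVIDIA driver files described above.
NVIDIA_ROOT = "/usr/local/nvidia"


def find_driver_libraries(root=NVIDIA_ROOT):
    """Return sorted shared-library file names found under <root>/lib64.

    Returns an empty list when the directory does not exist, for example when
    the container runs on a machine without the driver mount.
    """
    lib_dir = os.path.join(root, "lib64")
    if not os.path.isdir(lib_dir):
        return []
    # Match versioned and unversioned shared objects, such as libcuda.so.1.
    paths = glob.glob(os.path.join(lib_dir, "*.so*"))
    return sorted(os.path.basename(p) for p in paths)


if __name__ == "__main__":
    libs = find_driver_libraries()
    if not libs:
        print("No driver libraries found under", NVIDIA_ROOT)
    for name in libs:
        print(name)
```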

The GPU driver version depends on the Container-Optimized OS version currently used by Dataflow.

Building a custom container image

To interact with the GPUs, you might need additional NVIDIA software, such as GPU-accelerated libraries and the CUDA Toolkit. You must supply these libraries in the Docker container running user code.

You can customize the container image by supplying an image that fulfills the Apache Beam SDK container image contract and has the necessary GPU libraries.

To provide a custom container image, you must use Dataflow Runner v2 and supply the container image using the worker_harness_container_image pipeline option. If you are using Apache Beam 2.30.0 or later, you can use a shorter option name sdk_container_image instead for simplicity. For more information, see Using custom containers.
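The option name depends on the Beam version, so a launcher script might choose it programmatically. This sketch assumes version strings that begin with a plain release number, such as 2.29.0. The helper name container_image_args is hypothetical. The two option names and the Runner v2 requirement come from the paragraph above.

```python
import re


def container_image_args(beam_version, image):
    """Return pipeline flags that supply a custom container image.

    The shorter sdk_container_image name is used only for Apache Beam 2.30.0
    and later. Custom containers also require Dataflow Runner v2.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", beam_version)
    if match is None:
        raise ValueError(f"unrecognized Beam version: {beam_version!r}")
    version = tuple(int(part) for part in match.groups())
    if version >= (2, 30, 0):
        option = "sdk_container_image"
    else:
        option = "worker_harness_container_image"
    return [f"--{option}={image}", "--experiments=use_runner_v2"]


if __name__ == "__main__":
    # Hypothetical image path; replace it with your own.
    print(container_image_args("2.29.0", "gcr.io/my-project/beam-gpu:latest"))
```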

Approach 1. Using an existing image configured for GPU usage

You can build a Docker image that fulfills the Apache Beam SDK container contract from an existing base image that is preconfigured for GPU usage. For example, TensorFlow Docker images and NVIDIA container images are preconfigured for GPU usage.

A sample Dockerfile that builds on the TensorFlow GPU Docker image with Python 3.6 looks like the following:

ARG BASE=tensorflow/tensorflow:2.5.0-gpu
FROM $BASE

# Use bash so that the [[ ... ]] tests below work; /bin/sh in the base image may not support them.
SHELL ["/bin/bash", "-c"]

# Check that the chosen base image provides the expected version of Python interpreter.
ARG PY_VERSION=3.6
RUN [[ $PY_VERSION == `python -c 'import sys; print("%s.%s" % sys.version_info[0:2])'` ]] \
   || { echo "Could not find Python interpreter or Python version is different from ${PY_VERSION}"; exit 1; }

RUN pip install --no-cache-dir apache-beam[gcp]==2.29.0 \
    # Verify that there are no conflicting dependencies.
    && pip check

# Copy the Apache Beam worker dependencies from the Beam Python 3.6 SDK image.
COPY --from=apache/beam_python3.6_sdk:2.29.0 /opt/apache/beam /opt/apache/beam

# Apache Beam worker expects pip at /usr/local/bin/pip by default.
# Some images have pip in a different location. If necessary, make a symlink.
# This can be omitted in Beam 2.30.0 and later versions.
RUN [[ `which pip` == "/usr/local/bin/pip" ]] || ln -s `which pip` /usr/local/bin/pip

# Set the entrypoint to Apache Beam SDK worker launcher.
ENTRYPOINT [ "/opt/apache/beam/boot" ]

When using TensorFlow Docker images, use TensorFlow 2.5.0 or later. Earlier TensorFlow Docker images install the tensorflow-gpu package instead of the tensorflow package. The distinction is not important after the TensorFlow 2.1.0 release, but several downstream packages, such as tfx, require the tensorflow package.
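You can check which of the two distributions an image actually contains by querying the installed package metadata. This minimal sketch assumes Python 3.8 or later, because it uses importlib.metadata. The TensorFlow 2.5.0 image above ships Python 3.6; there, `pip show tensorflow tensorflow-gpu` gives the same information. The function name is hypothetical.

```python
from importlib import metadata


def installed_tensorflow_distributions():
    """Map "tensorflow" and "tensorflow-gpu" to their installed versions, or None."""
    found = {}
    for dist in ("tensorflow", "tensorflow-gpu"):
        try:
            found[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            found[dist] = None
    return found


if __name__ == "__main__":
    found = installed_tensorflow_distributions()
    print(found)
    if found["tensorflow"] is None and found["tensorflow-gpu"] is not None:
        # Downstream packages such as tfx declare a dependency on "tensorflow".
        print("Only tensorflow-gpu is installed; packages that require "
              "tensorflow may not install cleanly.")
```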

Large container images slow down worker startup. This can happen, for example, when you use Deep Learning Containers.

Installing a specific Python version

If you have strict requirements for the Python version, you can build your image from an NVIDIA base image that has the necessary GPU libraries and then install the Python interpreter.

The following example demonstrates selecting an NVIDIA image from the CUDA container image catalog that does not include the Python interpreter. You can adjust the example to install the desired version of Python 3 and pip. The example uses TensorFlow, so when choosing an image, we make sure the CUDA and cuDNN versions in the base image satisfy the requirements for the TensorFlow version.

A sample Dockerfile looks like the following:

# Select an NVIDIA base image with the desired GPU stack from the CUDA container image catalog.

FROM nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04

RUN \
    # Add Deadsnakes repository that has a variety of Python packages for Ubuntu.
    # See:
    apt-key adv --keyserver --recv-keys F23C5A6CF475977595C89F51BA6932366A755776 \
    && echo "deb focal main" >> /etc/apt/sources.list.d/custom.list \
    && echo "deb-src focal main" >> /etc/apt/sources.list.d/custom.list \
    && apt-get update \
    && apt-get install -y curl \
        python3.8 \
        # With python3.8 package, distutils need to be installed separately.
        python3-distutils \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.8 10 \
    && curl | python \
    # Install Apache Beam and Python packages that will interact with GPUs.
    && pip install --no-cache-dir apache-beam[gcp]==2.29.0 tensorflow==2.4.0 \
    # Verify that there are no conflicting dependencies.
    && pip check

# Copy the Apache Beam worker dependencies from the Beam Python 3.8 SDK image.
COPY --from=apache/beam_python3.8_sdk:2.29.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK worker launcher.
ENTRYPOINT [ "/opt/apache/beam/boot" ]

On some OS distributions, it might be difficult to install a specific Python version using the OS package manager. In this case, you can install the Python interpreter with tools such as Miniconda or pyenv.

A sample Dockerfile looks like the following:

FROM nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04

# The Python version of the Dockerfile must match the Python version you use
# to launch the Dataflow job.
ARG PYTHON_VERSION=3.8

# Update PATH so we find our new Conda and Python installations.
ENV PATH=/opt/python/bin:/opt/conda/bin:$PATH

RUN apt-get update \
    && apt-get install -y wget \
    && rm -rf /var/lib/apt/lists/* \
    # The NVIDIA image doesn't come with Python pre-installed.
    # We use Miniconda to install the Python version of our choice.
    && wget -q \
    && sh -b -p /opt/conda \
    && rm \
    # Create a new Python environment with desired version, and install pip.
    && conda create -y -p /opt/python python=$PYTHON_VERSION pip \
    # Remove unused Conda packages, install necessary Python packages via pip
    # to avoid mixing packages from pip and Conda.
    && conda clean -y --all --force-pkgs-dirs \
    # Install Apache Beam and Python packages that will interact with GPUs.
    && pip install --no-cache-dir apache-beam[gcp]==2.29.0 tensorflow==2.4.0 \
    # Verify that there are no conflicting dependencies.
    && pip check \
    # Apache Beam worker expects pip at /usr/local/bin/pip by default.
    # This can be omitted in Beam 2.30.0 and later versions.
    && ln -s $(which pip) /usr/local/bin/pip

# Copy the Apache Beam worker dependencies from the Apache Beam SDK for Python 3.8 image.
COPY --from=apache/beam_python3.8_sdk:2.29.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK worker launcher.
ENTRYPOINT [ "/opt/apache/beam/boot" ]

Approach 2. Using Apache Beam container images

You can configure a container image for GPU usage without using preconfigured images. This approach is not recommended unless preconfigured images do not work for you. Setting up your own container image requires selecting compatible libraries and configuring their execution environment.

A sample Dockerfile looks like the following:

FROM apache/beam_python3.7_sdk:2.24.0
ENV INSTALLER_DIR="/tmp/installer_dir"

# The base image has TensorFlow 2.2.0, which requires CUDA 10.1 and cuDNN 7.6.
# You can download cuDNN from the NVIDIA website.
COPY cudnn-10.1-linux-x64-v7.6.0.64.tgz $INSTALLER_DIR/cudnn.tgz
RUN \
    # Download CUDA toolkit.
    wget -q -O $INSTALLER_DIR/ && \
    # Install CUDA toolkit. Print logs upon failure.
    sh $INSTALLER_DIR/ --toolkit --silent || (egrep '^\[ERROR\]' /var/log/cuda-installer.log && exit 1) && \
    # Install cuDNN.
    mkdir $INSTALLER_DIR/cudnn && \
    tar xvfz $INSTALLER_DIR/cudnn.tgz -C $INSTALLER_DIR/cudnn && \
    cp $INSTALLER_DIR/cudnn/cuda/include/cudnn*.h /usr/local/cuda/include && \
    cp $INSTALLER_DIR/cudnn/cuda/lib64/libcudnn* /usr/local/cuda/lib64 && \
    chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn* && \
    rm -rf $INSTALLER_DIR

# A volume with GPU drivers will be mounted at runtime at /usr/local/nvidia.
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib64:/usr/local/cuda/lib64

The driver libraries in /usr/local/nvidia/lib64 must be discoverable as shared libraries inside the container. To make them discoverable, add that directory to the LD_LIBRARY_PATH environment variable, as the Dockerfile above does.
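The following sketch checks this from inside a running container. It verifies that LD_LIBRARY_PATH lists the directories the Dockerfile adds and that the dynamic loader can open a given library. The directory list mirrors the ENV line above. The library name libcuda.so.1 in the usage code is an assumption about the driver's file name, and the helper names are hypothetical.

```python
import ctypes
import os

# Directories that the sample Dockerfile appends to LD_LIBRARY_PATH.
REQUIRED_DIRS = ("/usr/local/nvidia/lib64", "/usr/local/cuda/lib64")


def missing_library_dirs(environ=None, required=REQUIRED_DIRS):
    """Return the required directories that do not appear in LD_LIBRARY_PATH."""
    environ = os.environ if environ is None else environ
    # Empty entries appear when LD_LIBRARY_PATH started out unset, e.g. ":/a:/b".
    entries = [p.rstrip("/") for p in environ.get("LD_LIBRARY_PATH", "").split(":") if p]
    return [d for d in required if d not in entries]


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    print("Missing from LD_LIBRARY_PATH:", missing_library_dirs())
    # Assumed driver library name; adjust it for your driver version.
    print("libcuda.so.1 loadable:", can_load("libcuda.so.1"))
```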

If you use TensorFlow, you must choose a compatible combination of CUDA Toolkit and cuDNN versions. For details, see the TensorFlow Software requirements and Tested build configurations pages.
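The tag of an nvidia/cuda base image encodes its CUDA and cuDNN versions, so a build script could check the tag before building. This sketch records only the two pairings that appear on this page: TensorFlow 2.2 with CUDA 10.1 and cuDNN 7, and TensorFlow 2.4 with CUDA 11.0 and cuDNN 8. It is not a substitute for the tested build configurations. The tag format it parses is an assumption based on the example tag used earlier, and the function names are hypothetical.

```python
import re

# CUDA (major.minor) and cuDNN (major) pairings used on this page only;
# see TensorFlow's tested build configurations for other versions.
KNOWN_PAIRS = {
    "2.2": ("10.1", "7"),
    "2.4": ("11.0", "8"),
}

# Matches tags such as "11.0.3-cudnn8-runtime-ubuntu20.04" (assumed tag format).
TAG_PATTERN = re.compile(r"^(\d+)\.(\d+)(?:\.\d+)?-cudnn(\d+)-")


def parse_cuda_tag(tag):
    """Return (cuda_major_minor, cudnn_major) parsed from an nvidia/cuda image tag."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized nvidia/cuda tag: {tag!r}")
    major, minor, cudnn = match.groups()
    return f"{major}.{minor}", cudnn


def tag_matches_tensorflow(tag, tf_version):
    """True or False for versions in KNOWN_PAIRS, None when no pairing is recorded."""
    expected = KNOWN_PAIRS.get(".".join(tf_version.split(".")[:2]))
    if expected is None:
        return None
    return parse_cuda_tag(tag) == expected


if __name__ == "__main__":
    print(tag_matches_tensorflow("11.0.3-cudnn8-runtime-ubuntu20.04", "2.4.0"))
```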

Selecting type and number of GPUs for Dataflow workers

Dataflow allows you to configure the type and number of GPUs to attach to Dataflow workers using the worker_accelerator parameter. You can select the type and number of GPUs based on your use case and how you plan to use the GPUs in your pipeline.

The following GPU types are supported with Dataflow:

  • NVIDIA® Tesla® T4
  • NVIDIA® Tesla® P4
  • NVIDIA® Tesla® V100
  • NVIDIA® Tesla® P100
  • NVIDIA® Tesla® K80

For more detailed information about each GPU type, including performance data, read the GPU comparison chart.
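The worker_accelerator value combines the GPU type and count with the install-nvidia-driver flag. A small helper can assemble the value and reject typos before you submit a job. In this sketch, the type names are the Compute Engine accelerator names for the GPUs listed above. Availability varies by region, so check which types your region offers. The function name is hypothetical.

```python
# Compute Engine accelerator type names for the GPUs listed above.
SUPPORTED_GPU_TYPES = (
    "nvidia-tesla-t4",
    "nvidia-tesla-p4",
    "nvidia-tesla-v100",
    "nvidia-tesla-p100",
    "nvidia-tesla-k80",
)


def worker_accelerator_experiment(gpu_type, count):
    """Build the experiment value, keeping the token order: type, count, driver flag."""
    if gpu_type not in SUPPORTED_GPU_TYPES:
        raise ValueError(f"unsupported GPU type: {gpu_type!r}")
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        raise ValueError(f"count must be a positive integer, got {count!r}")
    return f"worker_accelerator=type:{gpu_type};count:{count};install-nvidia-driver"


if __name__ == "__main__":
    # Pass the result as the value of --experiments.
    print(worker_accelerator_experiment("nvidia-tesla-t4", 1))
```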

Running your job with GPUs

To run a Dataflow job with GPUs, use the following command:


python PIPELINE \
  --runner "DataflowRunner" \
  --project "PROJECT" \
  --temp_location "gs://BUCKET/tmp" \
  --region "REGION" \
  --worker_harness_container_image "IMAGE" \
  --disk_size_gb "DISK_SIZE_GB" \
  --experiments "worker_accelerator=type:GPU_TYPE;count:GPU_COUNT;install-nvidia-driver" \
  --experiments "use_runner_v2"

Replace the following:

  • PIPELINE: your pipeline source code file
  • PROJECT: the Google Cloud project ID
  • BUCKET: the Cloud Storage bucket
  • REGION: a regional endpoint, for example, us-central1
  • IMAGE: the Container Registry path for your Docker image
  • DISK_SIZE_GB: the size of the boot disk, in GB, for each worker VM, for example, 50
  • GPU_TYPE: an available GPU type, for example, nvidia-tesla-t4
  • GPU_COUNT: the number of GPUs to attach to each worker VM, for example, 1
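If you launch jobs from a script, you can build the same command as an argument list. This sketch mirrors the flags in the command above, and the function name and sample values are hypothetical. You can also pass the list to Beam's PipelineOptions instead of printing it as a shell command.

```python
import shlex


def dataflow_gpu_args(project, bucket, region, image, disk_size_gb, gpu_type, gpu_count):
    """Return the pipeline arguments used in the command above, as a list."""
    accelerator = (
        f"worker_accelerator=type:{gpu_type};count:{gpu_count};install-nvidia-driver"
    )
    return [
        "--runner", "DataflowRunner",
        "--project", project,
        "--temp_location", f"gs://{bucket}/tmp",
        "--region", region,
        "--worker_harness_container_image", image,
        "--disk_size_gb", str(disk_size_gb),
        "--experiments", accelerator,
        "--experiments", "use_runner_v2",
    ]


if __name__ == "__main__":
    # Hypothetical values; replace them with your own.
    args = dataflow_gpu_args(
        project="my-project",
        bucket="my-bucket",
        region="us-central1",
        image="gcr.io/my-project/beam-gpu:latest",
        disk_size_gb=50,
        gpu_type="nvidia-tesla-t4",
        gpu_count=1,
    )
    # shlex.join quotes the semicolon-separated experiment so a shell keeps it as one word.
    print("python my_pipeline.py " + shlex.join(args))
```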

The considerations for running a Dataflow job with GPUs include the following:

If you use TensorFlow, consider configuring the workers to use a single process, either by setting the pipeline option --experiments=no_use_multiple_sdk_containers or by using workers with one vCPU. If the n1-standard-1 machine type does not provide sufficient memory, consider a custom machine type, such as n1-custom-1-NUMBER_OF_MB, or n1-custom-1-NUMBER_OF_MB-ext for extended memory. For more information, read GPUs and worker parallelism.
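The custom machine type name encodes the vCPU count and the memory in MB, with an -ext suffix for extended memory. The sketch below builds such a name. It assumes 256 MB memory granularity and a 6.5 GB-per-vCPU limit before extended memory applies to N1 custom machine types. Confirm both limits in the Compute Engine documentation. The helper name is hypothetical.

```python
# Assumptions about N1 custom machine types; confirm them in the Compute Engine docs.
MEMORY_STEP_MB = 256
MAX_STANDARD_MB_PER_VCPU = 6656  # 6.5 GB


def n1_custom_machine_type(memory_mb, vcpus=1):
    """Return a name such as n1-custom-1-6656, or n1-custom-1-15360-ext."""
    if memory_mb <= 0 or memory_mb % MEMORY_STEP_MB:
        raise ValueError(f"memory must be a positive multiple of {MEMORY_STEP_MB} MB")
    name = f"n1-custom-{vcpus}-{memory_mb}"
    if memory_mb > MAX_STANDARD_MB_PER_VCPU * vcpus:
        # More memory per vCPU than the standard limit requires extended memory.
        name += "-ext"
    return name


if __name__ == "__main__":
    # A one-vCPU worker, as suggested above for TensorFlow pipelines.
    print(["--worker_machine_type", n1_custom_machine_type(15360)])
```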

Verifying your Dataflow job

To confirm that the job uses worker VMs with GPUs, follow these steps:

  1. Verify that Dataflow workers for the job have started.
  2. While a job is running, find a worker VM associated with the job.
    1. Paste the job ID into the Search products and resources prompt.
    2. Select the Compute Engine VM instance associated with the job.

You can also find a list of all running instances in the Compute Engine console:

  1. In the Google Cloud Console, go to the VM instances page.

  2. Click the name of the VM instance to open its details page.

  3. Verify that the details page has a GPUs section and that your GPUs are attached.

If your job did not launch with GPUs, check that the worker_accelerator experiment is configured properly and appears under experiments in the Dataflow monitoring UI. The order of tokens in the accelerator metadata is important.

For example, an 'experiments' pipeline option in the Dataflow monitoring UI might look like the following:

['use_runner_v2','worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver', ...]
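When a job does not get GPUs, you can check the experiments list shown in the monitoring UI programmatically. This sketch treats the token order in the example above (type, count, install-nvidia-driver) as the expected order. It also flags a missing use_runner_v2, which custom containers require. The function name is hypothetical.

```python
EXPECTED_ORDER = ["type", "count", "install-nvidia-driver"]


def check_accelerator_experiment(experiments):
    """Return problems found in the worker_accelerator setting; [] means none found."""
    values = [e for e in experiments if e.startswith("worker_accelerator=")]
    if not values:
        return ["no worker_accelerator experiment found"]
    tokens = values[0].split("=", 1)[1].split(";")
    # "type:nvidia-tesla-t4" -> "type"; "install-nvidia-driver" stays as is.
    names = [token.split(":", 1)[0] for token in tokens]
    fields = dict(token.split(":", 1) for token in tokens if ":" in token)
    problems = []
    if names != EXPECTED_ORDER:
        problems.append(f"token order is {names}, expected {EXPECTED_ORDER}")
    count = fields.get("count", "")
    if not count.isdigit() or int(count) < 1:
        problems.append(f"count must be a positive integer, got {count!r}")
    if "use_runner_v2" not in experiments:
        problems.append("use_runner_v2 is missing")
    return problems


if __name__ == "__main__":
    shown_in_ui = [
        "use_runner_v2",
        "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
    ]
    print(check_accelerator_experiment(shown_in_ui) or "No problems found")
```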

Viewing GPU utilization

To see GPU utilization on the worker VMs, follow these steps:

  1. In the Google Cloud Console, go to Monitoring.

  2. In the Monitoring navigation pane, click Metrics Explorer.

  3. Specify Dataflow Job as the resource type and GPU utilization or GPU memory utilization as the metric, depending on which metric you want to monitor.

For more information, read the Metrics Explorer guide.

Using GPUs with Dataflow Prime

Dataflow Prime lets you request accelerators for a specific step of your pipeline. To use GPUs with Dataflow Prime, do not use the --experiments=worker_accelerator pipeline option. Instead, request the GPUs with the accelerator resource hint. For more information, see Using resource hints.
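The accelerator hint uses the same type, count and install-nvidia-driver format as the worker_accelerator experiment. The sketch below builds two forms of the hint: a pipeline-wide --resource_hints flag, and keyword arguments for a single transform's with_resource_hints call. Verify both names against the resource hints documentation for your Beam version. The helper name is hypothetical.

```python
def gpu_resource_hints(gpu_type, count):
    """Return (flag, kwargs) that request GPUs through the accelerator resource hint."""
    # Same token format and order as the worker_accelerator experiment.
    value = f"type:{gpu_type};count:{count};install-nvidia-driver"
    # Pipeline-wide form, passed on the command line.
    flag = f"--resource_hints=accelerator={value}"
    # Per-transform form, for a call such as transform.with_resource_hints(**kwargs).
    kwargs = {"accelerator": value}
    return flag, kwargs


if __name__ == "__main__":
    flag, kwargs = gpu_resource_hints("nvidia-tesla-t4", 1)
    print(flag)
    print(kwargs)
```

The per-transform form requests GPUs only for the step that needs them.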

Troubleshooting your Dataflow job

If you run into problems running your Dataflow job with GPUs, try the following troubleshooting steps.

Workers don't start

If your job is stuck and the Dataflow workers never start processing data, it is likely that you have a problem related to using a custom container with Dataflow. For more details, read the custom containers troubleshooting guide.

If you are a Python user, verify that the following conditions are met:

  • The Python interpreter minor version in your container image is the same as the version you use to launch your pipeline. If the versions don't match, you might see errors like SystemError: unknown opcode with a stack trace involving apache_beam/internal/
  • If you are using the Apache Beam SDK 2.29.0 or earlier, pip must be accessible on the image in /usr/local/bin/pip.
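To compare the two Python versions, print the launch environment's interpreter version and compare it with the one inside the image. For example, use the value the first sample Dockerfile checks. This sketch also checks the pip location that Beam 2.29.0 and earlier expect; run that part inside the container. The function names are hypothetical.

```python
import os
import sys


def local_python_minor():
    """Return the running interpreter's major.minor version, for example "3.8"."""
    return "%d.%d" % sys.version_info[:2]


def matches_local(container_version):
    """Compare a version reported inside the image, such as "3.8" or "3.8.10"."""
    return ".".join(container_version.strip().split(".")[:2]) == local_python_minor()


def pip_at_expected_path(path="/usr/local/bin/pip"):
    """For Beam 2.29.0 and earlier: run inside the container to confirm pip's location."""
    return os.path.exists(path)


if __name__ == "__main__":
    print("Launch environment Python:", local_python_minor())
    print("pip at /usr/local/bin/pip:", pip_at_expected_path())
```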

We recommend that you reduce the customizations to a minimal working configuration the first time you use a custom image. Use the sample custom container images provided in the examples on this page, make sure you can run a simple Dataflow pipeline with this container image without requesting GPUs, and then iterate on the solution.

Verify that workers have sufficient disk space to download your container image and adjust disk size if necessary. Large images take longer to download, which increases worker startup time.

Job fails immediately at startup

If you encounter the ZONE_RESOURCE_POOL_EXHAUSTED or ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS errors, you can take the following steps:

  • Don't specify the worker zone so that Dataflow selects the optimal zone for you.

  • Launch the pipeline in a different zone or with a different accelerator type.

Job fails at runtime

If the job fails at runtime, rule out out-of-memory (OOM) errors on the worker machine and on the GPU. GPU OOM errors might appear as cudaErrorMemoryAllocation out of memory errors in the worker logs. If you are using TensorFlow, verify that only one TensorFlow process accesses each GPU device. For more information, read GPUs and worker parallelism.

No GPU usage

If your pipeline runs successfully, but GPUs are not used, verify the following:

  • NVIDIA libraries installed in the container image match the requirements of pipeline user code and libraries it uses.
  • Installed NVIDIA libraries in container images are accessible as shared libraries.

If the devices are not available, you might be using an incompatible software configuration. For example, if you are using TensorFlow, verify that you have a compatible combination of TensorFlow, cuDNN version, and CUDA Toolkit version.

To verify the image configuration, consider running a simple pipeline that just checks that GPUs are available and accessible to the workers.
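One way to write such a check without depending on a specific ML framework is to call nvidia-smi from inside a pipeline step. The sketch below assumes nvidia-smi is available either at /usr/local/nvidia/bin, where the driver installer mounts binaries, or on PATH. The query flags are standard nvidia-smi options, and the function names are hypothetical.

```python
import os
import shutil
import subprocess

# Assumed location: the driver installer mounts its binaries under /usr/local/nvidia.
MOUNTED_NVIDIA_SMI = "/usr/local/nvidia/bin/nvidia-smi"


def parse_gpu_query(output):
    """Parse "name, memory.total" CSV rows printed by nvidia-smi into dictionaries."""
    gpus = []
    for line in output.strip().splitlines():
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append({"name": name, "memory_total": memory})
    return gpus


def list_gpus():
    """Return the GPUs that nvidia-smi reports, or [] if it is missing or fails."""
    if os.path.exists(MOUNTED_NVIDIA_SMI):
        binary = MOUNTED_NVIDIA_SMI
    else:
        binary = shutil.which("nvidia-smi")
    if binary is None:
        return []
    result = subprocess.run(
        [binary, "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_query(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    print(gpus if gpus else "No GPUs visible to nvidia-smi")
```

In a pipeline, you might call list_gpus from a beam.Map applied to a single element created with beam.Create, then log the result. An empty list on a Dataflow worker suggests that the driver mount or the accelerator configuration is not working.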

Debug with a standalone VM

While you are designing and iterating on a container image that works for you, you can shorten the feedback loop by trying out your container image on a standalone VM.

You can debug your custom container on a standalone VM. To do so, create a Compute Engine VM that has GPUs attached and runs Container-Optimized OS, install the drivers, and start your container as follows.

  1. Create a VM instance.

    gcloud compute instances create INSTANCE_NAME \
      --project "PROJECT" \
      --image-family cos-stable \
      --image-project=cos-cloud  \
      --zone=us-central1-f \
      --accelerator type=nvidia-tesla-t4,count=1 \
      --maintenance-policy TERMINATE \
      --restart-on-failure  \
      --boot-disk-size=200G
  2. Use ssh to connect to the VM.

    gcloud compute ssh INSTANCE_NAME --project "PROJECT"
  3. Install the GPU drivers. After connecting to the VM via ssh, run the following commands on the VM:

    # Run these commands on the virtual machine
    sudo cos-extensions install gpu
    sudo mount --bind /var/lib/nvidia /var/lib/nvidia
    sudo mount -o remount,exec /var/lib/nvidia
  4. Launch your custom container.

    Apache Beam SDK containers use the /opt/apache/beam/boot entrypoint. For debugging purposes you can launch your container manually with a different entrypoint, as shown below:

    docker-credential-gcr configure-docker
    docker run --rm \
      -it \
      --entrypoint=/bin/bash \
      --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
      --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
      --privileged \
      IMAGE

    Replace IMAGE with the Container Registry path for your Docker image.

  5. Verify that the GPU libraries installed in your container can access the GPU devices.

    If you are using TensorFlow, you can print available devices in the Python interpreter with the following:

    >>> import tensorflow as tf
    >>> print(tf.config.list_physical_devices("GPU"))

    If you are using PyTorch, you can inspect available devices in the Python interpreter with the following:

    >>> import torch
    >>> print(torch.cuda.is_available())
    >>> print(torch.cuda.device_count())
    >>> print(torch.cuda.get_device_name(0))

To iterate on your pipeline, you can launch it on the Direct Runner. You can also launch pipelines on the Dataflow Runner from this environment.


What's next