Troubleshoot your Dataflow GPU job

If you encounter problems running your Dataflow job with GPUs, follow these steps:

  1. Follow the workflow in Best practices for working with Dataflow GPUs to ensure that your pipeline is configured correctly.
  2. Confirm that your Dataflow job is using GPUs. See Verify your Dataflow job in "Run a pipeline with GPUs."
  3. Debug with a standalone VM.
  4. If the problem persists, follow the rest of the troubleshooting steps on this page.

Debug with a standalone VM

While you're designing and iterating on a container image, you can shorten the feedback loop by trying out the image on a standalone VM.

You can debug your custom container on a standalone VM with GPUs by creating a Compute Engine VM that runs Container-Optimized OS with GPUs, installing the GPU drivers, and then starting your container, as follows.

  1. Create a VM instance. In the following command, replace INSTANCE_NAME with a name for the VM and PROJECT with your Google Cloud project ID.

    gcloud compute instances create INSTANCE_NAME \
      --project=PROJECT \
      --image-family=cos-stable \
      --image-project=cos-cloud \
      --zone=us-central1-f \
      --accelerator=type=nvidia-tesla-t4,count=1 \
      --maintenance-policy=TERMINATE \
      --restart-on-failure \
      --boot-disk-size=200G \
      --scopes=cloud-platform
    
  2. Use ssh to connect to the VM.

    gcloud compute ssh INSTANCE_NAME --project "PROJECT"
    
  3. Install the GPU drivers. After connecting to the VM by using ssh, run the following commands on the VM:

    # Run these commands on the VM.
    # Install the GPU drivers.
    cos-extensions install gpu
    # Remount the driver directory with exec permissions so the binaries can run.
    sudo mount --bind /var/lib/nvidia /var/lib/nvidia
    sudo mount -o remount,exec /var/lib/nvidia
    # Verify the installation.
    /var/lib/nvidia/bin/nvidia-smi
    
  4. Launch your custom container.

    Apache Beam SDK containers use the /opt/apache/beam/boot entrypoint. For debugging purposes, you can launch your container manually with a different entrypoint:

    docker-credential-gcr configure-docker
    docker run --rm \
      -it \
      --entrypoint=/bin/bash \
      --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
      --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
      --privileged \
      IMAGE
    

    Replace IMAGE with the Artifact Registry path for your Docker image.
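
    After the container starts, you can confirm that the driver mounts are visible inside it. For example, the nvidia-smi binary is available under the mount path used in the preceding command:

    # Inside the container: the path follows the --volume flags above.
    /usr/local/nvidia/bin/nvidia-smi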

  5. Verify that the GPU libraries installed in your container can access the GPU devices.

    If you're using TensorFlow, you can print the available devices in the Python interpreter:

    >>> import tensorflow as tf
    >>> print(tf.config.list_physical_devices("GPU"))
    

    If you're using PyTorch, you can inspect the available devices in the Python interpreter:

    >>> import torch
    >>> print(torch.cuda.is_available())
    >>> print(torch.cuda.device_count())
    >>> print(torch.cuda.get_device_name(0))
    

To iterate on your pipeline, you can launch it on the Direct Runner. You can also launch pipelines on the Dataflow Runner from this environment.
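
For example, the following minimal pipeline runs on the Direct Runner inside the container. This is a sketch that assumes the Apache Beam Python SDK is installed in the image:

    # Minimal sanity-check pipeline on the Direct Runner.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(["--runner=DirectRunner"])
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create([1, 2, 3])
         | "Print" >> beam.Map(print))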

Workers don't start

If your job is stuck and the Dataflow workers never start processing data, it's likely that you have a problem related to using a custom container with Dataflow. For more details, read the custom containers troubleshooting guide.

If you're a Python user, verify that the following conditions are met:

  • The Python interpreter minor version in your container image is the same as the version you use when launching your pipeline; you can compare the two versions as shown in the sketch after this list. If there's a mismatch, you might see errors like SystemError: unknown opcode with a stack trace involving apache_beam/internal/pickler.py.
  • If you're using Apache Beam SDK 2.29.0 or earlier, pip must be accessible in the image at /usr/local/bin/pip.
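
For example, you can compare the two interpreter versions with commands like the following. IMAGE is the path of your container image, as before:

    # Interpreter used to launch the pipeline.
    python3 --version

    # Interpreter inside the container image.
    docker run --rm --entrypoint=python3 IMAGE --version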

We recommend that you reduce the customizations to a minimal working configuration the first time you use a custom image. Use the sample custom container images provided in the examples on this page. Make sure you can run a straightforward Dataflow pipeline with this container image without requesting GPUs. Then, iterate on the solution.

Verify that workers have sufficient disk space to download your container image, and adjust the disk size if necessary. Large images take longer to download, which increases worker startup time.
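
For example, when launching a Python pipeline, you can set the worker boot disk size with the disk_size_gb pipeline option. The script name here is a placeholder:

    # Hypothetical launch command; my_pipeline.py stands in for your pipeline.
    python my_pipeline.py \
      --runner=DataflowRunner \
      --disk_size_gb=100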

Job fails immediately at startup

If you encounter a ZONE_RESOURCE_POOL_EXHAUSTED or ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS error, you can take the following steps:

  • Don't specify the worker zone so that Dataflow selects the optimal zone for you.

  • Launch the pipeline in a different zone or with a different accelerator type, as in the sketch after this list.
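
For example, with the Python SDK you can control placement through pipeline options like the following. The script name and flag values are placeholders:

    # Omit --worker_zone entirely to let Dataflow choose the zone.
    python my_pipeline.py \
      --runner=DataflowRunner \
      --worker_zone=us-west1-b \
      --dataflow_service_options="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"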

Job fails at runtime

If the job fails at runtime, check for out of memory (OOM) errors on the worker machine and on the GPU. GPU OOM errors might appear as cudaErrorMemoryAllocation out of memory errors in worker logs. If you're using TensorFlow, verify that you use only one TensorFlow process to access each GPU device. For more information, read GPUs and worker parallelism.
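
One common cause is a TensorFlow process reserving all GPU memory up front. As a sketch, you can enable memory growth so that TensorFlow allocates GPU memory incrementally; the exact API can vary by TensorFlow version:

    import tensorflow as tf

    # Allocate GPU memory as needed instead of reserving it all at startup.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)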

No GPU usage

If your pipeline runs successfully, but GPUs are not used, verify the following:

  • The NVIDIA libraries installed in the container image match the requirements of the pipeline user code and the libraries that it uses.
  • The installed NVIDIA libraries are accessible as shared libraries; one way to check is shown after this list.
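
For example, from a shell inside the container you can check whether the CUDA libraries are visible to the dynamic linker. This is one possible check, not an exhaustive one:

    # Inside the container: list CUDA libraries known to the dynamic linker.
    ldconfig -p | grep -i cuda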

If the devices are not available, you might be using an incompatible software configuration. For example, if you're using TensorFlow, verify that you have a compatible combination of TensorFlow, cuDNN version, and CUDA Toolkit version.

To verify the image configuration, consider running a straightforward pipeline that just checks that GPUs are available and accessible to the workers.
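
A minimal sketch of such a check, assuming a Python pipeline and TensorFlow installed in the worker image, might look like the following:

    import logging

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def log_gpu_devices(_):
        # Import inside the function so that the check runs on the worker.
        import tensorflow as tf
        logging.info("GPUs visible to the worker: %s",
                     tf.config.list_physical_devices("GPU"))

    options = PipelineOptions()  # Pass your Dataflow and GPU options here.
    with beam.Pipeline(options=options) as p:
        p | beam.Create([None]) | beam.Map(log_gpu_devices)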

What's next