Develop with GPUs

This page describes an example of a developer workflow for building pipelines using GPUs.

For more information about using GPUs with Dataflow, see Dataflow support for GPUs. For information and examples on how to enable GPUs in your Dataflow jobs, see Using GPUs and Processing Landsat satellite images with GPUs.

Using Apache Beam in combination with NVIDIA GPUs, you can create large-scale data processing pipelines that handle preprocessing and inference. There are a few things to be aware of when you're using GPUs for local development:

  • Data processing workflows often use additional libraries that must be installed both in the launch environment and in the execution environment on Dataflow workers. This adds steps to the development workflow for configuring pipeline requirements or using custom containers in Dataflow, because you might want a local development environment that mimics the production environment as closely as possible.

  • If you're using a library that implicitly makes use of NVIDIA GPUs, and your code doesn't require any changes to support GPUs, you don't need to change your development workflow for configuring pipeline requirements or building custom containers.

  • Some libraries do not switch transparently between CPU and GPU usage, and therefore require specific builds and different code paths. Replicating the code-run-code development life cycle for this scenario requires additional steps.

  • When running local experiments, it is useful to replicate the Dataflow worker's environment as closely as possible. Depending on the library, you might need a machine with a GPU and the required GPU libraries installed, which might not be available in your local environment. You can emulate the Dataflow runner environment using a container running on a GPU-equipped Google Cloud virtual machine.

  • A pipeline is unlikely to be composed entirely of transformations that require a GPU. A typical pipeline has an ingestion stage (using one of the many sources provided by Apache Beam), followed by data manipulation or shaping transforms, which then feed into a GPU transform, as in the sketch after this list.

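For illustration, the following minimal sketch shows that shape using the Apache Beam Python SDK. The bucket paths and the RunInferenceOnGpu transform are hypothetical placeholders rather than part of the Beam or Dataflow APIs, and the GPU stage starts as a simple pass-through stub, as described in the workflow that follows.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class RunInferenceOnGpu(beam.DoFn):
        """Hypothetical GPU stage. While the rest of the pipeline is being
        built, this can remain a pass-through stub that needs no GPU."""

        def process(self, element):
            # TODO: replace with real GPU preprocessing or inference.
            yield element

    def run(argv=None):
        options = PipelineOptions(argv)
        with beam.Pipeline(options=options) as p:
            (
                p
                # Ingestion stage: any Beam source works here.
                | "Read" >> beam.io.ReadFromText("gs://your-bucket/input/*.txt")
                # Data manipulation or shaping transforms that do not need a GPU.
                | "Shape" >> beam.Map(lambda line: line.strip())
                # GPU transform, stubbed out for now.
                | "GpuStage" >> beam.ParDo(RunInferenceOnGpu())
                | "Write" >> beam.io.WriteToText("gs://your-bucket/output/results")
            )

    if __name__ == "__main__":
        run()
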
The following two-stage workflow shows how to build a pipeline using GPUs. This flow addresses GPU-related and non-GPU-related issues separately and shortens the feedback loop.

  1. Create a pipeline

    Create a pipeline that can run on Dataflow. Replace the transforms that require GPUs with transforms that don't use GPUs but are functionally equivalent. The following steps describe how to do this:

    1. Create all transformations that surround the GPU usage, such as data ingestion and manipulation.

    2. Create a stub for the GPU transform with a simple pass-through or schema change.

  2. Test locally

    Test the GPU portion of the pipeline code in the environment that mimics the Dataflow worker execution environment. The following steps describe one of the methods to achieve this:

    1. Create a Docker image with all necessary libraries.

    2. Start development of the GPU code.

    3. Begin the code-run-code cycle using a Google Cloud virtual machine with the Docker image. Run the GPU code in a local Python process separately from the Apache Beam pipeline to rule out library incompatibilities, and then run the entire pipeline on the direct runner or launch the pipeline on Dataflow, as in the sketch that follows these steps.

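The following sketch, which assumes PyTorch as the GPU library, shows one way to keep the GPU logic in a plain Python function so that it can first be exercised on its own inside the container, and only then wrapped in a Beam DoFn. The module and function names are hypothetical.

    # gpu_inference.py (hypothetical module name)
    import torch

    def run_on_gpu(batch):
        """Runs a trivial computation on the GPU if one is visible, otherwise on the CPU."""
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tensor = torch.tensor(batch, dtype=torch.float32, device=device)
        return (tensor * 2).cpu().tolist()

    if __name__ == "__main__":
        # Run in a plain Python process on the GPU-equipped VM, with no Beam
        # involved, to rule out library incompatibilities first.
        print(run_on_gpu([1.0, 2.0, 3.0]))

Once this behaves as expected, wrap the function in the pass-through stub from the earlier sketch, run the whole pipeline with --runner=DirectRunner, and then launch it on Dataflow.
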
Using a VM running Container-Optimized OS

For a minimal environment, use a virtual machine (VM) running Container-Optimized OS. For more information, see Create a VM with attached GPUs.

The general flow is:

  1. Create a VM.

  2. Connect to the VM and run the following commands:

    cos-extensions install gpu
    sudo mount --bind /var/lib/nvidia /var/lib/nvidia
    sudo mount -o remount,exec /var/lib/nvidia
    
  3. Confirm that GPUs are available:

    /var/lib/nvidia/bin/nvidia-smi
    
  4. Start a Docker container with GPU drivers from the VM mounted as volumes. For example:

    docker run --rm -it --entrypoint /bin/bash \
      --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
      --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
      --privileged gcr.io/bigdatapivot/image_process_example:latest
    

For a sample Dockerfile, see Building a custom container image. Make sure that you add all of the dependencies that your pipeline needs to the Dockerfile.

For more information about using a Docker image that is pre-configured for GPU usage, see Using an existing image configured for GPU usage.

Useful tools when working with container-optimized systems

  • To configure the Docker CLI to use docker-credential-gcr as a credential helper for the default set of Google Container Registry (GCR) repositories, run:

    docker-credential-gcr configure-docker
    

    For more information about setting up Docker credentials, see docker-credential-gcr.

  • To copy files, such as pipeline code, to or from a VM, use toolbox. This is especially useful when using a Container-Optimized OS image. For example:

    toolbox /google-cloud-sdk/bin/gsutil cp gs://bucket/gpu/image_process/* /media/root/home/<userid>/opencv/
    

    For more information, see Debugging node issues using toolbox.

What's next