This page provides background information on how GPUs work with Dataflow.
Before you start developing with GPUs, review the developer workflow and tips for building pipelines that use GPUs, described in Developing with GPUs.
For information and examples that show how to enable GPUs in your Dataflow jobs, see Using GPUs and Processing Landsat satellite images with GPUs.
Using GPUs in Dataflow jobs lets you accelerate some data processing tasks. GPUs can perform certain computations faster than CPUs. These computations are usually numeric or linear algebra workloads, such as those in image processing and machine learning use cases. The extent of the performance improvement varies by the use case, the type of computation, and the amount of data processed.
Prerequisites for using GPUs in Dataflow
Dataflow executes user code on worker VMs inside a Docker container. These worker VMs run Container-Optimized OS. For Dataflow jobs to use GPUs, the following installations are required:
- GPU drivers are installed on worker VMs and accessible to the Docker container. For more information, read Installing GPU drivers.
- GPU libraries required by your pipeline, such as NVIDIA CUDA-X libraries or the NVIDIA CUDA Toolkit, are installed in the custom container image. For more information, read Configuring your container image. A sketch for verifying both installations follows this list.
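If both installations are in place, GPU devices are visible to code running inside the container. The following is a minimal sketch for checking this from inside a custom container, assuming that TensorFlow is one of the GPU libraries installed in the image:

```python
# Minimal GPU visibility check (a sketch, assuming TensorFlow is installed
# in the custom container image). If the driver or the CUDA libraries are
# missing, the list of visible GPUs is empty.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print(f"Visible GPUs: {gpus}")
else:
    print("No GPUs visible; check the driver installation and container image.")
```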
Pricing
Jobs using GPUs incur charges as specified in the Dataflow pricing page.
Considerations
Machine type specifications
For details about machine type support for each GPU model, see GPU platforms. GPUs that are supported with N1 machine types are also supported with custom N1 machine types.
The type and number of GPUs define the upper bounds on the amount of vCPU and memory available to workers. Refer to the Availability section for the corresponding restrictions.
Specifying a higher number of vCPUs or more memory might require that you specify a higher number of GPUs.
For more details, read GPUs on Compute Engine.
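As an illustration, the following sketch pairs an N1 machine type with a single GPU at launch time. This is a hedged example: the machine type, GPU model, and count are placeholder choices, and the worker_accelerator service option shown here follows the type:GPU_TYPE;count:GPU_COUNT;install-nvidia-driver syntax; see Using GPUs for the authoritative way to specify accelerators.

```python
# A sketch of launching a job on an N1 machine type with one attached GPU.
# The machine type and GPU model below are placeholder choices.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--machine_type=n1-standard-4",  # custom N1 types are also supported
    "--dataflow_service_options="
    "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
])
```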
GPUs and worker parallelism
For Python pipelines using the Dataflow Runner v2 architecture, Dataflow launches one Apache Beam SDK process per VM core. Each SDK process runs in its own Docker container and in turn spawns many threads, each of which processes incoming data.
Because of this multi-process architecture, and because GPUs on Dataflow workers are visible to all processes and threads, you might need to deliberately manage GPU access to avoid oversubscribing GPU memory. If you are using TensorFlow, either of the following suggestions can help you avoid GPU memory oversubscription:
- Configure the Dataflow workers to start only one containerized Python process, regardless of the worker vCPU count. To configure this, when you launch your job, use the following pipeline options:

  --experiments=no_use_multiple_sdk_containers
  --number_of_worker_harness_threads

  For more information about how many threads to use, see Reduce the number of threads. A launch sketch that uses these options follows this list.
- Use a machine type with only one vCPU.
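The following is a sketch of the first suggestion applied when launching a Python pipeline. The project and region values are placeholders, and the thread count of 1 is only an example; see Reduce the number of threads for guidance:

```python
# A sketch of launching a Beam Python pipeline with a single SDK process
# per worker and a reduced thread count. Project and region are
# placeholders; the thread count of 1 is an example value.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",    # placeholder project ID
    "--region=us-central1",    # placeholder region
    "--experiments=no_use_multiple_sdk_containers",
    "--number_of_worker_harness_threads=1",
])

with beam.Pipeline(options=options) as pipeline:
    ...  # your GPU-accelerated transforms go here
```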
When multiple TensorFlow processes use the same GPU, you might need to configure each process to take only a portion of GPU memory so that all processes together don't oversubscribe it. Because making such a configuration is not straightforward, we recommend limiting the number of TensorFlow processes as suggested above.
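If you do need several TensorFlow processes to share one GPU, capping per-process memory is possible through TensorFlow's logical device configuration. The following is a sketch; the 2048 MB limit is an arbitrary example value that you would size so that all processes together fit in GPU memory:

```python
# A sketch of capping the GPU memory one TensorFlow process can use.
# Run this before any tensors are placed on the GPU; the 2048 MB limit
# is an arbitrary example value.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=2048)],
    )
```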
Availability
For information on available GPU types and worker VM configuration, read Dataflow locations.
What's next
- Learn how to enable GPUs in your jobs in Using GPUs.
- Work through Processing Landsat satellite images with GPUs.