Dataflow support for GPUs

Using GPUs in Dataflow jobs allows you to accelerate some data processing tasks. GPUs can perform certain computations faster than CPUs; these computations are typically numeric or linear-algebra workloads, such as those found in image processing and machine learning use cases. The extent of the performance improvement varies by use case, type of computation, and amount of data processed.

Prerequisites for using GPUs in Dataflow

Dataflow executes user code on worker VMs inside a Docker container. These worker VMs run Container-Optimized OS. For Dataflow jobs to use GPUs, the following installations must happen:

  • GPU drivers are installed on the worker VMs and are accessible to the Docker container.
  • GPU libraries required by your pipeline, such as NVIDIA CUDA-X libraries or the NVIDIA CUDA Toolkit, are installed in the custom container image.

To provide a custom container image, you must use Dataflow Runner v2 and supply the container image using the worker_harness_container_image pipeline option.
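As a minimal sketch of what setting these options might look like in a Python pipeline (the project ID, region, and image path below are placeholders, not values from this page):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",            # placeholder project ID
    region="us-central1",            # placeholder region
    experiments=["use_runner_v2"],   # enables Dataflow Runner v2
    worker_harness_container_image=(
        "gcr.io/my-project/beam-gpu-worker:latest"  # placeholder image
    ),
)
```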

The GPU driver version depends on the Container-Optimized OS version currently used by Dataflow.

Pricing

Jobs using GPUs incur charges as specified on the Dataflow pricing page. GPU resources in Dataflow are not discounted during the Preview offering.

Considerations

Machine type specifications

GPUs are supported with N1 machine types, including custom N1 machine types.

The type and number of GPUs define upper bounds on the amounts of vCPU and memory that workers can have. Refer to the Availability section to find the corresponding restrictions.

Specifying a higher number of vCPUs or a larger amount of memory might require that you specify a higher number of GPUs.

For more details, read GPUs on Compute Engine.

GPUs and worker parallelism

For Python pipelines using the Runner v2 architecture, Dataflow launches one Apache Beam SDK process per VM core. Each SDK process runs in its own Docker container and in turn spawns many threads, each of which processes incoming data.

Because of this multi-process architecture, and because GPUs on Dataflow workers are visible to all processes and threads, deliberate management of GPU access might be required to avoid GPU oversubscription. For example, if you are using TensorFlow, you must configure each TensorFlow process to take only a portion of GPU memory, so that all processes together do not oversubscribe it. For more information, see Limiting GPU memory growth.
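The following is a minimal sketch of one way to do this, assuming TensorFlow 2.x; the 4096 MB limit is an arbitrary example value, not a recommendation:

```python
import tensorflow as tf

# Must run before TensorFlow initializes the GPUs, for example in a
# DoFn's setup() method.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Cap this process at a fixed slice of GPU memory (in MB).
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )

# Alternatively, let each process allocate GPU memory on demand rather
# than reserving it all up front:
# for gpu in gpus:
#     tf.config.experimental.set_memory_growth(gpu, True)
```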

Alternatively, you can use workers with one vCPU to limit the number of concurrent processes that access the GPU. Note that you can increase the amount of memory for machines with one vCPU by using a custom machine type.
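For example, a custom N1 machine type with one vCPU and 6.5 GB of memory (custom machine types are named custom-<vCPUs>-<memory-in-MB>) might be requested as in this sketch:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder sketch: one vCPU with 6.5 GB (6656 MB) of memory, which
# limits the worker to a single SDK process.
options = PipelineOptions(machine_type="custom-1-6656")
```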

Availability

The following list shows the available GPU types and, for each type, the supported worker VM configurations and zones.

NVIDIA® Tesla® T4
  1 GPU: 16 GB GDDR6 GPU memory, 1‑24 vCPUs, 1‑156 GB worker memory
  2 GPUs: 32 GB GDDR6 GPU memory, 1‑48 vCPUs, 1‑312 GB worker memory
  4 GPUs: 64 GB GDDR6 GPU memory, 1‑96 vCPUs, 1‑624 GB worker memory
  Available zones:
  • asia-east1-a
  • asia-east1-c
  • asia-northeast1-a
  • asia-northeast1-c
  • asia-northeast3-b
  • asia-northeast3-c
  • asia-south1-a
  • asia-south1-b
  • asia-southeast1-b
  • asia-southeast1-c
  • australia-southeast1-a
  • europe-west2-a
  • europe-west2-b
  • europe-west3-b
  • europe-west4-b
  • europe-west4-c
  • southamerica-east1-c
  • us-central1-a
  • us-central1-b
  • us-central1-f
  • us-east1-c
  • us-east1-d
  • us-east4-b
  • us-west1-a
  • us-west1-b

NVIDIA® Tesla® P4
  1 GPU: 8 GB GDDR5 GPU memory, 1‑24 vCPUs, 1‑156 GB worker memory
  2 GPUs: 16 GB GDDR5 GPU memory, 1‑48 vCPUs, 1‑312 GB worker memory
  4 GPUs: 32 GB GDDR5 GPU memory, 1‑96 vCPUs, 1‑624 GB worker memory
  Available zones:
  • asia-southeast1-b
  • asia-southeast1-c
  • australia-southeast1-a
  • australia-southeast1-b
  • europe-west4-b
  • europe-west4-c
  • northamerica-northeast1-a
  • northamerica-northeast1-b
  • northamerica-northeast1-c
  • us-central1-a
  • us-central1-c
  • us-east4-a
  • us-east4-b
  • us-east4-c
  • us-west2-b
  • us-west2-c

NVIDIA® Tesla® V100
  1 GPU: 16 GB HBM2 GPU memory, 1‑12 vCPUs, 1‑78 GB worker memory
  2 GPUs: 32 GB HBM2 GPU memory, 1‑24 vCPUs, 1‑156 GB worker memory
  4 GPUs: 64 GB HBM2 GPU memory, 1‑48 vCPUs, 1‑312 GB worker memory
  8 GPUs: 128 GB HBM2 GPU memory, 1‑96 vCPUs, 1‑624 GB worker memory
  Available zones:
  • asia-east1-c
  • europe-west4-a
  • europe-west4-b
  • europe-west4-c
  • us-east1-c
  • us-central1-a
  • us-central1-b
  • us-central1-c
  • us-central1-f
  • us-west1-a
  • us-west1-b

NVIDIA® Tesla® P100
  1 GPU: 16 GB HBM2 GPU memory, 1‑16 vCPUs, 1‑104 GB worker memory
  2 GPUs: 32 GB HBM2 GPU memory, 1‑32 vCPUs, 1‑208 GB worker memory
  4 GPUs: 64 GB HBM2 GPU memory; 1‑64 vCPUs and 1‑208 GB worker memory in us-east1-c, europe-west1-d, and europe-west1-b; 1‑96 vCPUs and 1‑624 GB worker memory in all other zones
  Available zones:
  • asia-east1-a
  • asia-east1-c
  • australia-southeast1-c
  • us-central1-c
  • us-central1-f
  • us-east1-b
  • us-east1-c
  • us-west1-a
  • us-west1-b
  • europe-west1-b
  • europe-west1-d
  • europe-west4-a

NVIDIA® Tesla® K80
  1 GPU: 12 GB GDDR5 GPU memory, 1‑8 vCPUs, 1‑52 GB worker memory
  2 GPUs: 24 GB GDDR5 GPU memory, 1‑16 vCPUs, 1‑104 GB worker memory
  4 GPUs: 48 GB GDDR5 GPU memory, 1‑32 vCPUs, 1‑208 GB worker memory
  8 GPUs: 96 GB GDDR5 GPU memory, 1‑64 vCPUs; 1‑416 GB worker memory in asia-east1-a and us-east1-d; 1‑208 GB worker memory in all other zones
  Available zones:
  • asia-east1-a
  • asia-east1-b
  • europe-west1-b
  • europe-west1-d
  • us-central1-a
  • us-central1-c
  • us-east1-c
  • us-east1-d
  • us-west1-b

Note:
  • For a more detailed description of zones, see Regions and zones.
  • NVIDIA® K80® boards contain two GPUs each. Pricing for K80 GPUs is per individual GPU, not per board.

What's next