Dataproc on GKE lets you run big data applications with the
Dataproc jobs API on GKE clusters.
Use the Google Cloud console, the Google Cloud CLI, or the Dataproc API
(HTTP request or Cloud Client Libraries) to
create a Dataproc on GKE virtual cluster,
and then submit a Spark, PySpark, SparkR, or Spark-SQL job to the Dataproc
service.
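For example, after a virtual cluster exists, you can submit a Spark job to it with the Google Cloud CLI. This is a minimal sketch: the cluster name, region, and project are placeholder values you would replace with your own, and the example assumes the Spark examples jar shipped in the Dataproc Spark image.

```shell
# Submit the SparkPi example to a Dataproc on GKE virtual cluster.
# DP_CLUSTER, REGION, and PROJECT are placeholders for your environment.
gcloud dataproc jobs submit spark \
    --cluster=DP_CLUSTER \
    --region=REGION \
    --project=PROJECT \
    --class=org.apache.spark.examples.SparkPi \
    --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```

The job runs as pods on the GKE cluster backing the virtual cluster, but you monitor it with the same Dataproc jobs API and console pages you would use for a Dataproc on Compute Engine job.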
Dataproc on GKE deploys Dataproc virtual clusters on
a GKE cluster. Unlike
Dataproc on Compute Engine clusters,
Dataproc on GKE virtual clusters do not include separate
master and worker VMs. Instead, when you create a Dataproc on GKE virtual cluster,
Dataproc on GKE creates node pools within a GKE cluster. Dataproc on GKE
jobs run as pods on these node pools, and GKE manages the node pools
and the scheduling of pods on them.
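You can see this node-pool model when creating a virtual cluster with the Google Cloud CLI. The sketch below uses placeholder names throughout (cluster names, region, namespace, and the Spark engine version are assumptions you would replace with valid values for your project), and it assumes the target GKE cluster already exists.

```shell
# Create a Dataproc on GKE virtual cluster on an existing GKE cluster.
# All uppercase values are placeholders for your environment.
gcloud dataproc clusters gke create DP_CLUSTER \
    --region=REGION \
    --project=PROJECT \
    --gke-cluster=GKE_CLUSTER \
    --namespace=NAMESPACE \
    --spark-engine-version=latest \
    --pools="name=DP_POOL,roles=default"
```

The `--pools` flag describes the GKE node pools that Dataproc on GKE creates (or reuses) for the virtual cluster; job driver and executor pods are then scheduled onto those pools by GKE.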
Last updated 2025-01-31 UTC.