What is Dataproc Serverless?


Dataproc Serverless lets you run Spark batch workloads without requiring you to provision and manage your own cluster. Specify workload parameters, and then submit the workload to the Dataproc Serverless service. The service runs the workload on managed compute infrastructure, autoscaling resources as needed. Dataproc Serverless charges apply only to the time when the workload is executing.

Schedule Dataproc Serverless for Spark batch workloads: You can schedule a Spark batch workload as part of an Airflow or Cloud Composer workflow using an Airflow batch operator. See Run Dataproc Serverless workloads with Cloud Composer for more information.
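For example, a PySpark batch workload can be submitted from the command line with the gcloud CLI. The following is a minimal sketch; the project ID, region, and Cloud Storage paths are placeholders you would replace with your own values.

```shell
# Submit a PySpark batch workload to Dataproc Serverless.
# my-project, us-central1, and the gs:// paths are placeholders.
gcloud dataproc batches submit pyspark gs://my-bucket/wordcount.py \
    --project=my-project \
    --region=us-central1 \
    -- gs://my-bucket/input/ gs://my-bucket/output/
```

Arguments after the `--` separator are passed through to the workload script. The service provisions the execution environment when the batch is submitted, so no cluster needs to exist beforehand.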

Dataproc Serverless for Spark compared to Dataproc on Compute Engine

Dataproc on Compute Engine is ideal for users who want to provision and manage infrastructure, then execute workloads on Spark and other open source processing frameworks. The following table lists key differences between Dataproc on Compute Engine and Dataproc Serverless for Spark.

Capability                   | Dataproc Serverless for Spark | Dataproc on Compute Engine
Processing frameworks        | Spark 3.2                     | Spark 3.1 and earlier versions; other open source frameworks, such as Hive
Serverless                   | Yes                           | No
Startup time                 | 60s                           | 90s
Infrastructure control       | No                            | Yes
Resource management          | Spark based                   | YARN based
GPU support                  | Planned                       | Yes
Interactive sessions         | Planned (Google managed)      | Yes (customer managed)
Custom containers            | Yes                           | No
VM access (for example, SSH) | No                            | Yes
Java versions                | Java 11                       | Previous versions supported
OS Login support *           | No                            | Yes

Notes:

  • An OS Login policy is not applicable to or supported by Dataproc Serverless. If your organization enforces an OS Login policy, its Dataproc Serverless workloads fail.

Dataproc Serverless for Spark workload capabilities

You can run the following Spark workload types on the Dataproc Serverless for Spark service:

  • PySpark
  • Spark SQL
  • SparkR
  • Spark Java/Scala
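As an illustration of the first workload type, the following is a minimal sketch of a PySpark batch script (a word count). The Cloud Storage input and output paths are placeholders; the script itself uses only standard PySpark APIs.

```python
# Minimal PySpark batch script suitable for Dataproc Serverless.
# The gs:// paths are placeholders for your own buckets.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordcount-batch").getOrCreate()

# Read text files, split each line into words, and count occurrences.
lines = spark.read.text("gs://my-bucket/input/*.txt")
counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
         .groupBy("word")
         .count()
)

counts.write.mode("overwrite").parquet("gs://my-bucket/output/")
spark.stop()
```

Because the service manages the Spark runtime, the script does not configure executors or cluster resources; autoscaling is handled by Dataproc Serverless based on the workload.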