What is Dataproc Serverless?

Dataproc Serverless lets you run Spark workloads without provisioning and managing your own Dataproc cluster.

There are two ways to run Dataproc Serverless workloads:

Dataproc Serverless for Spark Batch: Use the Google Cloud console, Google Cloud CLI, or Dataproc API to submit a batch workload to the Dataproc Serverless service. The service runs the workload on managed compute infrastructure and autoscales resources as needed. Dataproc Serverless charges apply only while the workload is executing.

To get started, see Run an Apache Spark batch workload.

You can schedule a Spark batch workload as part of an Airflow or Cloud Composer workflow using an Airflow batch operator. See Run Dataproc Serverless workloads with Cloud Composer for more information.
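As a rough illustration of the Google Cloud CLI path described above, the following Python sketch assembles a `gcloud dataproc batches submit pyspark` command line. The bucket, file names, region, and batch ID are hypothetical placeholders, and the helper `build_submit_command` is an illustrative name, not part of any Google library. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_submit_command(main_python_file_uri, region, batch_id=None, job_args=()):
    """Return the argv list for `gcloud dataproc batches submit pyspark`.

    Illustrative helper only: it assembles the command and leaves
    execution (for example, with subprocess.run) to the caller.
    """
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_python_file_uri,
        f"--region={region}",
    ]
    if batch_id:
        # Optional ID for the batch; the service generates one if omitted.
        cmd.append(f"--batch={batch_id}")
    if job_args:
        # Everything after "--" is passed through to the PySpark program.
        cmd.append("--")
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    command = build_submit_command(
        "gs://example-bucket/wordcount.py",          # hypothetical script location
        region="us-central1",                        # hypothetical region
        batch_id="wordcount-demo",                   # hypothetical batch ID
        job_args=["gs://example-bucket/input.txt"],  # hypothetical job argument
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell-quoting problems if you later pass it to `subprocess.run`.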

Dataproc Serverless for Spark Interactive: Write and run code in Jupyter notebooks during a Dataproc Serverless for Spark Interactive session. You can create a notebook session in the following ways:

- Use the Dataproc JupyterLab plugin to create multiple notebook sessions from templates that you create and manage. When you install the plugin on a local machine or Compute Engine VM, different cards that correspond to different Spark kernel configurations appear on the JupyterLab launcher page. Click a card to create a Dataproc Serverless notebook session, then start writing and testing your code in the notebook.

  The Dataproc JupyterLab plugin also lets you use the JupyterLab launcher page to take the following actions:
  - Create Dataproc on Compute Engine clusters.
  - Submit jobs to Dataproc on Compute Engine clusters.
  - View Google Cloud and Spark logs.

To get started, see Use JupyterLab for serverless batch and interactive notebook sessions.

Dataproc Serverless for Spark compared to Dataproc on Compute Engine

Dataproc on Compute Engine is ideal if you want to provision and manage infrastructure, then execute workloads on Spark and other open source processing frameworks. The following table lists key differences between Dataproc on Compute Engine and Dataproc Serverless for Spark.

Capability	Dataproc Serverless for Spark	Dataproc on Compute Engine
Processing frameworks	Batch: Spark 3.4 and earlier versions. Interactive: PySpark kernels for Spark 3.4 and earlier versions	Spark 3.3 and earlier versions. Other open source frameworks, such as Hive, Flink, Trino, and Kafka
Serverless	Yes	No
Startup time	60s	90s
Infrastructure control	No	Yes
Resource management	Spark based	YARN based
GPU support	Planned	Yes
Interactive sessions	Yes	No
Custom containers	Yes	No
VM access (for example, SSH)	No	Yes
Java versions	Java 17, 11	Previous versions supported
`OS Login` support *	No	Yes

Notes:

* An OS Login policy is not applicable to or supported by Dataproc Serverless. If your organization enforces an OS Login policy, its Dataproc Serverless workloads will fail.

Dataproc Serverless security compliance

Dataproc Serverless adheres to all data residency, CMEK, VPC-SC, and other security requirements that Dataproc is compliant with.

Dataproc Serverless for Spark batch workload capabilities

You can run the following Dataproc Serverless for Spark batch workload types:

PySpark
Spark SQL
Spark R
Spark (Java or Scala)
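
To make the four workload types above concrete, the following sketch maps each one to the field that, to the best of my understanding, carries it in a Dataproc API `Batch` request body (`pysparkBatch`, `sparkSqlBatch`, `sparkRBatch`, `sparkBatch`). The type keys reuse the `gcloud dataproc batches submit` subcommand names. The helper `build_batch_body` and the example URIs are hypothetical, and the result is a minimal body, not a complete request.

```python
# Workload type -> (Batch field, field that names the main file).
# Field names are assumed from the Dataproc v1 REST Batch resource;
# verify them against the API reference before relying on them.
WORKLOAD_FIELDS = {
    "pyspark": ("pysparkBatch", "mainPythonFileUri"),
    "spark-sql": ("sparkSqlBatch", "queryFileUri"),
    "spark-r": ("sparkRBatch", "mainRFileUri"),
    # A Java or Scala workload can instead name a main class plus jar files;
    # this sketch covers only the main-jar form.
    "spark": ("sparkBatch", "mainJarFileUri"),
}


def build_batch_body(workload_type, main_uri):
    """Return a minimal Batch request body for the given workload type."""
    try:
        batch_field, uri_field = WORKLOAD_FIELDS[workload_type]
    except KeyError:
        supported = ", ".join(sorted(WORKLOAD_FIELDS))
        raise ValueError(
            f"unsupported workload type {workload_type!r}; expected one of: {supported}"
        ) from None
    return {batch_field: {uri_field: main_uri}}


if __name__ == "__main__":
    # Hypothetical Cloud Storage locations, used only for illustration.
    print(build_batch_body("pyspark", "gs://example-bucket/etl.py"))
    print(build_batch_body("spark-sql", "gs://example-bucket/report.sql"))
```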

You can specify Spark properties when you submit a Dataproc Serverless for Spark batch workload.
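
One way to pass Spark properties at submission time is the `--properties` flag of `gcloud dataproc batches submit`, which takes comma-separated `KEY=VALUE` pairs. The sketch below formats a Python dictionary into that flag. The helper names and the chosen property values are illustrative assumptions, not recommendations, and the sketch rejects values that contain commas rather than handling gcloud's alternate-delimiter syntax.

```python
def normalize_properties(props):
    """Convert property values to strings, with booleans as 'true'/'false'."""
    normalized = {}
    for key, value in props.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        normalized[key] = str(value)
    return normalized


def properties_flag(props):
    """Format properties as a single --properties=K1=V1,K2=V2 argument."""
    normalized = normalize_properties(props)
    for key, value in normalized.items():
        if "," in key or "," in value:
            # Commas would be read as pair separators; this sketch does not
            # implement gcloud's custom-delimiter escaping.
            raise ValueError(f"property {key!r} contains a comma")
    pairs = ",".join(f"{key}={value}" for key, value in normalized.items())
    return f"--properties={pairs}"


if __name__ == "__main__":
    # Example values only; choose settings that suit your workload.
    print(properties_flag({
        "spark.executor.cores": 4,
        "spark.dynamicAllocation.enabled": False,
    }))
```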