This page answers frequently asked questions about Dataproc Serverless for Spark.
When should I use Dataproc Serverless for Spark instead of Dataproc on Compute Engine?
Dataproc Serverless:
- Supports Spark batch workloads and interactive sessions in PySpark kernel Jupyter notebooks.
- Creates and manages your workload and interactive session infrastructure.
Dataproc on Compute Engine:
- Supports the submission of different types of Spark jobs, and jobs based on other open source components, such as Flink, Hadoop, Hive, Pig, and Presto.
- Does not create and manage infrastructure. You create and manage your Dataproc clusters.
What can I do with Dataproc Serverless for Spark?
- Use the Dataproc JupyterLab plugin for serverless batch and interactive notebook sessions.
- Run streaming jobs using Spark streaming libraries. Note: Streaming is not a managed service, so you must manage checkpointing and restarts (see the sketch after this list).
- Train models using Spark MLlib.
- Use interactive SQL notebooks for data exploration, graph, time series, and geospatial analytics.
- Orchestrate Dataproc Serverless for Spark workloads with Cloud Composer, a managed Apache Airflow service.
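For the streaming point above, here is a minimal PySpark Structured Streaming sketch that persists checkpoints to Cloud Storage so a query can be restarted safely. The bucket paths and the rate source are illustrative placeholders, not values from this page.

```python
from pyspark.sql import SparkSession

# Dataproc Serverless provides the Spark runtime; this sketch assumes a
# PySpark batch workload.
spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Read a stream. A rate source is used here purely for illustration.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Because streaming is not managed by Dataproc Serverless, persist
# checkpoints to Cloud Storage so the query can be restarted.
query = (
    stream_df.writeStream
    .format("parquet")
    .option("path", "gs://your-bucket/streaming-output/")           # placeholder bucket
    .option("checkpointLocation", "gs://your-bucket/checkpoints/")  # placeholder bucket
    .start()
)

query.awaitTermination()
```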
How should I set up a workload execution plan?
You can run workloads concurrently or sequentially. Your execution plan affects how much of your Google Cloud resource quota the workloads consume: you can run as many workloads in parallel as your batch resource quotas allow.
Can I use a custom image with Dataproc Serverless for Spark?
Yes. You can use a custom container image instead of the default container image. See Use custom containers with Dataproc Serverless for Spark.
Can I specify memory and disk resources for Dataproc Serverless Spark workloads?
Yes. You can specify premium executor and driver compute and disk tiers and the amount of driver and executor compute and disk resources to allocate when you submit a workload (see Resource allocation properties).
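As a minimal sketch, assuming the google-cloud-dataproc Python client, the resource allocation properties can be passed in the batch's runtime configuration at submission time. The project, region, URIs, and property values below are illustrative placeholders; consult the Resource allocation properties documentation for the allowed values.

```python
from google.cloud import dataproc_v1

# Illustrative sketch: submit a Serverless batch with explicit driver and
# executor resources. All names and values are placeholders.
region = "us-central1"
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://your-bucket/jobs/etl.py"  # placeholder
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            "spark.driver.cores": "8",
            "spark.driver.memory": "16g",
            "spark.executor.cores": "8",
            "spark.executor.memory": "16g",
            # Optional premium compute and disk tiers.
            "spark.dataproc.executor.compute.tier": "premium",
            "spark.dataproc.executor.disk.tier": "premium",
        }
    ),
)

request = dataproc_v1.CreateBatchRequest(
    parent=f"projects/your-project/locations/{region}",  # placeholder project
    batch=batch,
    batch_id="resource-allocation-example",
)
operation = client.create_batch(request=request)
print(operation.result().state)
```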
How can I specify the IP address range for my Dataproc Serverless VPC network?
Dataproc Serverless for Spark workloads run within your environment. Each Spark driver and Spark executor in a Serverless Spark workload consumes one internal IP address in your Dataproc Serverless VPC network. For example, a workload that runs one driver and 100 executors consumes 101 internal IP addresses. /16 is a typical user-specified CIDR address range for a Dataproc Serverless VPC network. You can limit your network's IP address range based on the number of concurrent workloads you plan to run.
Does Dataproc Serverless support data residency?
Yes. You specify the region where your workload is processed. Locate your input and output datasets in the specified region.
How does Dataproc Serverless select a zone within your specified region to run the workload?
Dataproc Serverless selects the Compute Engine zone where it executes a workload based on capacity and availability. If a zone becomes unavailable after a workload starts, the workload fails, and you must resubmit the failed workload.
How do Dataproc Serverless workloads use compute resources?
Each workload executes on its own compute resources. Multiple batch submissions don't share or reuse compute resources.
Best Practices:
- Optimize your workload for medium-running jobs, not short-running jobs.
- Persist data that is accessed by multiple workloads in Cloud Storage.
Where can I find information on Dataproc Serverless announcements, features, bug fixes, known issues, and deprecations?
See the Dataproc Serverless release notes.
Do concurrent workloads compete for resources?
Dataproc Serverless workloads only compete for resources if your resource quota is insufficient to run all concurrently running workloads. Otherwise, workloads are fully isolated from each other.
How is Dataproc Serverless quota allocated?
Dataproc Serverless batches consume Google Cloud resources. See Dataproc Serverless quotas for more information.
Do I need to set up a Dataproc Persistent History Server?
Setting up a Persistent History Server (PHS) to use with Dataproc Serverless is optional. You can use the PHS to view Spark event logs and other logs in a specified Cloud Storage bucket during and after the standard 90-day retention (TTL) period of the Dataproc Serverless staging and temp buckets.
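As a hedged sketch, a Serverless batch can be pointed at an existing PHS cluster through the batch's environment configuration. The field names below follow the Dataproc v1 Batch API as I understand it; the project, region, and cluster names are placeholders, so verify them against the PHS documentation.

```python
from google.cloud import dataproc_v1

# Associate a Serverless batch with an existing Persistent History Server
# cluster. Names are placeholders.
environment_config = dataproc_v1.EnvironmentConfig(
    peripherals_config=dataproc_v1.PeripheralsConfig(
        spark_history_server_config=dataproc_v1.SparkHistoryServerConfig(
            dataproc_cluster=(
                "projects/your-project/regions/us-central1/clusters/your-phs-cluster"
            )
        )
    )
)
# Pass environment_config when constructing dataproc_v1.Batch, as in the
# earlier submission sketch.
```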
What Dataproc Serverless Spark logs are available?
Spark executor and driver logs are available in Cloud Logging during and after Spark workload execution. Also, Spark applications are visible in the Persistent History Server (PHS) web interface while the workload is running (select PHS > Incomplete Applications in the PHS UI).
If you set up a Dataproc PHS, it provides persistent access to Spark event logs saved in Cloud Storage, which provide insight into Spark app execution, such as DAG and executor events.
Can I set the number of executors for my Spark workload?
Yes. You can set the number of executors for a Spark workload using the spark.executor.instances property. However, the total number of cores that a workload can use matters more than the number of executors because Spark runs one task per core. For example, a workload with four executors of two cores each runs 4 * 2 = 8 tasks at a time, and so does a workload with two executors of four cores each. Because the total number of cores is the same, both workloads run the same number of concurrent tasks. You can use the spark.executor.cores property to set the number of cores per executor for your Dataproc Serverless workload.
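To make the arithmetic concrete, here is a sketch of two property sets that yield the same concurrency. The values are illustrative only (Dataproc Serverless constrains the allowed core counts); either dict would be passed as runtime properties when submitting a batch.

```python
# Spark runs one task per core, so total cores, not executor count, bounds
# concurrency: both configurations allow 4 * 4 = 2 * 8 = 16 concurrent tasks.
four_small_executors = {
    "spark.executor.instances": "4",
    "spark.executor.cores": "4",
}
two_large_executors = {
    "spark.executor.instances": "2",
    "spark.executor.cores": "8",
}
```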
What Spark metrics does Dataproc Serverless use for autoscaling?
Dataproc Serverless for Spark looks at Spark's dynamic allocation maximum-needed and running executor metrics to determine whether to scale up or down. See Dataproc Serverless for Spark autoscaling.
Can I configure Dataproc Serverless autoscaling behavior using Spark properties?
Yes. Dataproc Serverless autoscaling is based on Spark dynamic allocation and is enabled by default. You can adjust the following Spark properties and Spark dynamic allocation properties (a sketch follows this list):
spark.executor.instances
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.maxExecutors
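For example, a minimal sketch of overriding the autoscaling bounds through batch runtime properties; the values are placeholders, and since dynamic allocation is enabled by default you only need to set the properties you want to change.

```python
# Illustrative autoscaling bounds passed as batch runtime properties.
autoscaling_properties = {
    "spark.dynamicAllocation.initialExecutors": "2",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "100",
    # spark.executor.instances can also be set; see the autoscaling docs.
}
```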
Why do I need to package my code in a JAR file to submit my Spark workload?
Spark is written in Scala, which means that both the driver and the worker processes operate as JVM processes. In JVM languages, the JAR file is the primary way to package code. You pass the JAR file to Dataproc Serverless when you submit a workload.