Spark properties

Dataproc Serverless uses Spark properties to determine the compute, memory, and disk resources to allocate to your batch workload. These property settings can affect workload quota consumption and cost (see Dataproc Serverless quotas and Dataproc Serverless pricing for more information).

Set Spark batch workload properties

You can specify Spark properties when you submit a Dataproc Serverless Spark batch workload using the Google Cloud console, gcloud CLI, or the Dataproc API.

Console

  1. Go to the Dataproc Create batch page in the Google Cloud console.

  2. In the Properties section, click Add Property, then enter the Key (name) and Value of a supported Spark property.

gcloud

gcloud CLI batch submission example:

gcloud dataproc batches submit spark \
    --properties=spark.checkpoint.compress=true \
    --region=region \
    other args ...

API

Set RuntimeConfig.properties with supported Spark properties as part of a batches.create request.
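The request body can be sketched as follows. This builds a minimal batches.create payload that sets Spark properties through RuntimeConfig.properties; the main class and jar URI are placeholders, not real artifacts.

```python
import json

# Minimal sketch of a batches.create request body. The mainClass and
# jarFileUris values below are hypothetical placeholders.
batch = {
    "sparkBatch": {
        "mainClass": "org.example.WordCount",             # placeholder class
        "jarFileUris": ["gs://my-bucket/wordcount.jar"],  # placeholder URI
    },
    "runtimeConfig": {
        # Supported Spark properties go here as string key-value pairs.
        "properties": {
            "spark.checkpoint.compress": "true",
            "spark.executor.cores": "8",
        }
    },
}

print(json.dumps(batch, indent=2))
```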

Supported Spark properties

Dataproc Serverless for Spark supports most Spark properties, but it does not support YARN-related and shuffle-related Spark properties, such as spark.master=yarn and spark.shuffle.service.enabled. If Spark application code sets a YARN or shuffle property, the application will fail.

Runtime environment properties

Dataproc Serverless for Spark supports the following custom Spark properties for configuring the runtime environment:

| Property | Description |
|---|---|
| spark.dataproc.driverEnv.EnvironmentVariableName | Adds EnvironmentVariableName to the driver process environment. You can specify multiple environment variables. |
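As a sketch, a hypothetical helper (not part of any Google library) can map plain environment variable names onto the spark.dataproc.driverEnv prefix expected by the batch submission:

```python
# Hypothetical helper: turn {"NAME": "value"} pairs into the
# spark.dataproc.driverEnv.<NAME> property keys for the driver process.
def driver_env_properties(env: dict[str, str]) -> dict[str, str]:
    return {f"spark.dataproc.driverEnv.{name}": value
            for name, value in env.items()}

props = driver_env_properties({
    "LOG_LEVEL": "debug",
    "API_ENDPOINT": "https://example.com",  # placeholder value
})
print(props["spark.dataproc.driverEnv.LOG_LEVEL"])  # -> debug
```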

Resource allocation properties

Dataproc Serverless for Spark supports the following Spark properties for configuring resource allocation:

| Property | Description | Default | Examples |
|---|---|---|---|
| spark.driver.cores | The number of cores (vCPUs) to allocate to the Spark driver. Valid values: 4, 8, 16. | 4 | |
| spark.driver.memory | The amount of memory to allocate to the Spark driver process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). Total driver memory per driver core, including driver memory overhead, must be between 1024m and 7424m for the Standard compute tier (24576m for the Premium compute tier). For example, if spark.driver.cores = 4, then 4096m <= spark.driver.memory + spark.driver.memoryOverhead <= 29696m. | | 512m, 2g |
| spark.driver.memoryOverhead | The amount of additional JVM memory to allocate to the Spark driver process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). This is non-heap memory associated with JVM overheads, internal strings, and other native overheads, and it includes memory used by other driver processes, such as PySpark driver processes, and memory used by other non-driver processes running in the container. The maximum memory size of the container in which the driver runs is determined by the sum of spark.driver.memoryOverhead and spark.driver.memory. Total driver memory per driver core, including driver memory overhead, must be between 1024m and 7424m for the Standard compute tier (24576m for the Premium compute tier). For example, if spark.driver.cores = 4, then 4096m <= spark.driver.memory + spark.driver.memoryOverhead <= 29696m. | 10% of driver memory, except for PySpark batch workloads, which default to 40% of driver memory | 512m, 2g |
| spark.dataproc.driver.compute.tier | The compute tier to use on the driver. The Premium compute tier offers higher per-core performance, but it is billed at a higher rate. | standard | standard, premium |
| spark.dataproc.driver.disk.size | The amount of disk space allocated to the driver, specified with a size unit suffix ("k", "m", "g", or "t"). Must be at least 250GiB. If the Premium disk tier is selected on the driver, valid sizes are 375g, 750g, 1500g, 3000g, 6000g, or 9000g. | 100GiB per core | 1024g, 2t |
| spark.dataproc.driver.disk.tier | The disk tier to use for local and shuffle storage on the driver. The Premium disk tier offers better performance in IOPS and throughput, but it is billed at a higher rate. If the Premium disk tier is selected on the driver, the Premium compute tier must also be selected using spark.dataproc.driver.compute.tier=premium, and the amount of disk space must be specified using spark.dataproc.driver.disk.size. If the Premium disk tier is selected, the driver allocates an additional 50GiB of disk space for system storage, which is not usable by user applications. | standard | standard, premium |
| spark.executor.cores | The number of cores (vCPUs) to allocate to each Spark executor. Valid values: 4, 8, 16. | 4 | |
| spark.executor.memory | The amount of memory to allocate to each Spark executor process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). Total executor memory per executor core, including executor memory overhead, must be between 1024m and 7424m for the Standard compute tier (24576m for the Premium compute tier). For example, if spark.executor.cores = 4, then 4096m <= spark.executor.memory + spark.executor.memoryOverhead <= 29696m. | | 512m, 2g |
| spark.executor.memoryOverhead | The amount of additional JVM memory to allocate to the Spark executor process, specified in JVM memory string format with a size unit suffix ("m", "g", or "t"). This is non-heap memory used for JVM overheads, internal strings, and other native overheads, and it includes PySpark executor memory and memory used by other non-executor processes running in the container. The maximum memory size of the container in which the executor runs is determined by the sum of spark.executor.memoryOverhead and spark.executor.memory. Total executor memory per executor core, including executor memory overhead, must be between 1024m and 7424m for the Standard compute tier (24576m for the Premium compute tier). For example, if spark.executor.cores = 4, then 4096m <= spark.executor.memory + spark.executor.memoryOverhead <= 29696m. | 10% of executor memory, except for PySpark batch workloads, which default to 40% of executor memory | 512m, 2g |
| spark.dataproc.executor.compute.tier | The compute tier to use on the executors. The Premium compute tier offers higher per-core performance, but it is billed at a higher rate. | standard | standard, premium |
| spark.dataproc.executor.disk.size | The amount of disk space allocated to each executor, specified with a size unit suffix ("k", "m", "g", or "t"). Executor disk space may be used for shuffle data and to stage dependencies. Must be at least 250GiB. If the Premium disk tier is selected on the executor, valid sizes are 375g, 750g, 1500g, 3000g, 6000g, or 9000g. | 100GiB per core | 1024g, 2t |
| spark.dataproc.executor.disk.tier | The disk tier to use for local and shuffle storage on executors. The Premium disk tier offers better performance in IOPS and throughput, but it is billed at a higher rate. If the Premium disk tier is selected on the executor, the Premium compute tier must also be selected using spark.dataproc.executor.compute.tier=premium, and the amount of disk space must be specified using spark.dataproc.executor.disk.size. If the Premium disk tier is selected, each executor is allocated an additional 50GiB of disk space for system storage, which is not usable by user applications. | standard | standard, premium |
| spark.executor.instances | The initial number of executors to allocate. After a batch workload starts, autoscaling may change the number of active executors. Must be at least 2 and at most 2000. | | |
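The per-core memory limits above can be checked before submission. The following is a minimal sketch (not an official tool) that verifies whether a memory plus memory-overhead setting fits the Standard or Premium tier bounds from the table:

```python
# Per-core bounds from the table: total memory (memory + overhead) per
# core must be between 1024m and 7424m on the Standard compute tier,
# or up to 24576m on the Premium compute tier.
PER_CORE_MIN_MB = 1024
PER_CORE_MAX_MB = {"standard": 7424, "premium": 24576}

def parse_jvm_size(size: str) -> int:
    """Convert a JVM memory string ("512m", "2g", "1t") to MiB."""
    units = {"m": 1, "g": 1024, "t": 1024 * 1024}
    return int(size[:-1]) * units[size[-1].lower()]

def memory_fits(cores: int, memory: str, overhead: str,
                tier: str = "standard") -> bool:
    total_mb = parse_jvm_size(memory) + parse_jvm_size(overhead)
    return cores * PER_CORE_MIN_MB <= total_mb <= cores * PER_CORE_MAX_MB[tier]

# With 4 driver cores on the Standard tier, the total must fall between
# 4096m and 29696m, matching the example in the table.
print(memory_fits(4, "16g", "2g"))  # 18432m total -> True
print(memory_fits(4, "28g", "4g"))  # 32768m total -> False
```

The same check applies to executors by substituting spark.executor.cores, spark.executor.memory, and spark.executor.memoryOverhead.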

Autoscaling properties

See Spark dynamic allocation properties for a list of Spark properties you can use to configure Dataproc Serverless autoscaling.

Other properties

| Property | Description |
|---|---|
| dataproc.diagnostics.enabled | Enable this property to run diagnostics on a batch workload failure or cancellation. If diagnostics are enabled, your batch workload continues to use compute resources after the workload is complete, until diagnostics are finished. A URI pointing to the location of the diagnostics tarball is listed in the Batch.RuntimeInfo.diagnosticOutputUri API field. |
| dataproc.gcsConnector.version | Use this property to upgrade to a Cloud Storage connector version that is different from the version installed with your batch workload's runtime version. |
| dataproc.sparkBqConnector.version | Use this property to upgrade to a Spark BigQuery connector version that is different from the version installed with your batch workload's runtime version (see Use the BigQuery connector with Dataproc Serverless for Spark). |
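These properties are passed at submission like any other. As an illustrative sketch, the snippet below assembles them into a single gcloud --properties flag; the connector version strings are placeholders, not recommendations, and comma-joining key=value pairs follows the flag's documented syntax:

```python
# Illustrative only: build a --properties flag value from a property map.
# The connector version strings below are placeholders.
other_props = {
    "dataproc.diagnostics.enabled": "true",
    "dataproc.gcsConnector.version": "2.2.x",    # placeholder version
    "dataproc.sparkBqConnector.version": "0.x",  # placeholder version
}
flag = "--properties=" + ",".join(
    f"{key}={value}" for key, value in sorted(other_props.items()))
print(flag)
```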