This document shows you how to enable Dataproc Spark performance
enhancements to help your Dataproc Spark jobs process more data in less
time with reduced costs.
Dataproc Spark performance enhancements include:

- Spark Optimizer enhancements:
  - Optimizer rules written for better Spark plans
  - Improved performance of the Dataproc BigQuery connector when used in Spark jobs
- Spark Execution enhancements:
  - Spark execution engine improvements

Other Dataproc performance improvements: See Dataproc cluster caching,
which helps reduce the amount of time spent accessing data in Cloud Storage.
You can enable Spark performance enhancements on a cluster or on a Spark job:

- Spark performance enhancements enabled on a cluster apply, by default, to
  all Spark jobs run on the cluster, whether submitted to the
  Dataproc service or submitted directly to the cluster.
- Spark performance enhancements can also be enabled or disabled on a job
  that is submitted to the Dataproc service. Spark performance
  enhancement settings applied to a job override any conflicting settings set
  at the cluster level for the specified job only, as shown in the example
  after this list.
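For example, if you created a cluster with
spark.dataproc.enhanced.optimizer.enabled=true, the following command turns
the optimizer enhancements off for a single job, while other jobs on the
cluster keep the cluster-level setting. This is a minimal sketch: the cluster
name, region, and SparkPi workload are placeholder values, not requirements.

    # example-cluster and us-central1 are placeholder values.
    gcloud dataproc jobs submit spark \
        --cluster=example-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        --properties=spark.dataproc.enhanced.optimizer.enabled=false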
Pricing

Spark performance enhancements don't incur additional charges. Standard
Dataproc on Compute Engine pricing applies.

Considerations

Spark performance enhancements adjust Spark properties, including the
following properties. If a default causes problems for a workload, you can
override it at job submission, as shown in the example after this list.

- spark.sql.shuffle.partitions: Spark performance enhancements set this
  property to 1000 for 2.2 image version clusters. This setting can slow
  small jobs.
- spark.dataproc.sql.catalog.file.index.stats.enabled: This setting can
  result in driver OOM (Out-Of-Memory) conditions if the Hive partition
  count is high. Disabling this property can fix the OOM condition.
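For example, if the 1000-partition default slows a small job, you can
override these properties when you submit it. In the following sketch, the
script path, cluster name, and region are placeholders, and the value 200
(the Spark default for spark.sql.shuffle.partitions) is an illustrative
assumption to tune for your data volume:

    # Placeholder script, cluster, and region; tune the partition count to your workload.
    gcloud dataproc jobs submit pyspark gs://example-bucket/small-job.py \
        --cluster=example-cluster \
        --region=us-central1 \
        --properties=spark.sql.shuffle.partitions=200,spark.dataproc.sql.catalog.file.index.stats.enabled=false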
Enable enhancements at cluster creation
You can use the Google Cloud console, the Google Cloud CLI, or the Dataproc
API to enable Dataproc Spark performance enhancements
when you create a Dataproc cluster with image versions
2.0.69+, 2.1.17+, 2.2.0+, and later image releases.
Console

1. In the Google Cloud console, open the Dataproc Create a cluster page.

2. On the Create Dataproc cluster form, click Create on the
   Cluster on Compute Engine line.

3. On the Create a Dataproc cluster on Compute Engine page,
   click the Customize cluster panel, then scroll to the
   Cluster properties section.

   To enable Spark optimization enhancements:

   1. Click + ADD PROPERTIES.
   2. Select spark in the Prefix list, then add
      "spark.dataproc.enhanced.optimizer.enabled" in the Key field and
      "true" in the Value field.

   To enable Spark execution enhancements:

   1. Click + ADD PROPERTIES.
   2. Select spark in the Prefix list, then add
      "spark.dataproc.enhanced.execution.enabled" in the Key field and
      "true" in the Value field.

4. Complete or confirm the other cluster creation fields, then
   click Create.
gcloud

Run the following gcloud dataproc clusters create command locally in a
terminal window or in Cloud Shell:

    gcloud dataproc clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --image-version=IMAGE \
        --properties=PROPERTIES

Notes:

- CLUSTER_NAME: The cluster name, which must be unique within a project.
  The name must start with a lowercase letter, and can contain up to 51 lowercase
  letters, numbers, and hyphens. It cannot end with a hyphen. The name of a
  deleted cluster can be reused.
- PROJECT_ID: The project to associate with the cluster.
- REGION: The Compute Engine region where the cluster will be located,
  such as us-central1. You can add the optional --zone=ZONE flag
  to specify a zone within the specified region, such as us-central1-a.
  If you don't specify a zone, the Dataproc autozone placement
  feature selects a zone within the specified region.
- IMAGE: The Dataproc Spark optimizer and execution performance
  enhancements are available in Dataproc image versions 2.0.69+, 2.1.17+,
  and later releases. If you omit this flag, Dataproc selects
  the latest subminor version of the default Dataproc on Compute Engine
  image for the cluster (see Default Dataproc image version).
- PROPERTIES:

  To enable Spark optimization enhancements, specify:

      spark:spark.dataproc.enhanced.optimizer.enabled=true

  To enable Spark execution enhancements, specify:

      spark:spark.dataproc.enhanced.execution.enabled=true

  To enable Spark optimization and execution enhancements, specify:

      spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true
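For example, the following command creates a cluster with both enhancements
enabled. This is a minimal sketch: the cluster name, project, region, and
image version are placeholder values to adapt to your environment.

    # All values below are placeholders.
    gcloud dataproc clusters create example-cluster \
        --project=example-project \
        --region=us-central1 \
        --image-version=2.2-debian12 \
        --properties=spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true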
API

Specify the following SoftwareConfig.properties as part of a
clusters.create request:

- To enable Spark optimization enhancements, specify:

      "spark:spark.dataproc.enhanced.optimizer.enabled": "true"

- To enable Spark execution enhancements, specify:

      "spark:spark.dataproc.enhanced.execution.enabled": "true"

- To enable Spark optimization and execution enhancements, specify:

      "spark:spark.dataproc.enhanced.optimizer.enabled": "true","spark:spark.dataproc.enhanced.execution.enabled": "true"

Enable or disable enhancements at job submission

You can use the Google Cloud console, the Google Cloud CLI, or the Dataproc
API to enable or disable Spark performance enhancements on a Spark job submitted
to the Dataproc service.
Console
1. In the Google Cloud console, open the Dataproc Jobs page.

2. On the Jobs page, click Submit job, then scroll to the job
   Properties section.

   To enable Spark optimization enhancements:

   1. Click + ADD PROPERTIES.
   2. Add "spark.dataproc.enhanced.optimizer.enabled" in the Key field and
      "true" in the Value field.

   To enable Spark execution enhancements:

   1. Click + ADD PROPERTIES.
   2. Add "spark.dataproc.enhanced.execution.enabled" in the Key field
      and "true" in the Value field.

3. Complete or confirm the other job submission fields, then
   click Submit.

gcloud

Run the following gcloud dataproc jobs submit command locally in a
terminal window or in Cloud Shell:

    gcloud dataproc jobs submit SPARK_JOB_TYPE \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --properties=PROPERTIES

Notes:

- SPARK_JOB_TYPE: Specify spark, pyspark, spark-sql, or spark-r.
- CLUSTER_NAME: The name of the cluster where the job will run.
- REGION: The region where the cluster is located.
- PROPERTIES:

  To enable Spark optimization enhancements, specify:

      spark.dataproc.enhanced.optimizer.enabled=true

  To enable Spark execution enhancements, specify:

      spark.dataproc.enhanced.execution.enabled=true

  To enable Spark optimization and execution enhancements, specify:

      spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true

API

Specify the following properties for a SparkJob, PySparkJob, SparkSqlJob,
or SparkRJob as part of a jobs.submit request:

- To enable Spark optimization enhancements, specify:

      "spark.dataproc.enhanced.optimizer.enabled=true"

- To enable Spark execution enhancements, specify:

      "spark.dataproc.enhanced.execution.enabled=true"

- To enable Spark optimization and execution enhancements, specify:

      "spark.dataproc.enhanced.execution.enabled=true,spark.dataproc.enhanced.optimizer.enabled=true"
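For example, the following command submits a PySpark job with both
enhancements enabled. This is a minimal sketch: the Cloud Storage script
path, cluster name, and region are placeholder values.

    # gs://example-bucket/word-count.py is a placeholder script path.
    gcloud dataproc jobs submit pyspark gs://example-bucket/word-count.py \
        --cluster=example-cluster \
        --region=us-central1 \
        --properties=spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true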
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-09-03 UTC."],[[["\u003cp\u003eDataproc Spark performance enhancements help process more data faster and at a reduced cost, featuring Spark Optimizer and Spark Execution improvements.\u003c/p\u003e\n"],["\u003cp\u003eThese enhancements can be enabled by default for all jobs on a cluster or selectively enabled/disabled for individual jobs submitted to the Dataproc service.\u003c/p\u003e\n"],["\u003cp\u003eEnabling Spark performance enhancements on a specific job will override any conflicting settings that were configured at the cluster level for that specific job.\u003c/p\u003e\n"],["\u003cp\u003eThere are no additional charges for Spark performance enhancements, and standard Dataproc on Compute Engine pricing applies to this feature.\u003c/p\u003e\n"],["\u003cp\u003eSpark performance enhancements can be enabled through the Google Cloud console, gcloud CLI, or the Dataproc API, and they are available on Dataproc image versions 2.0.69+, 2.1.17+, 2.2.0+, and later.\u003c/p\u003e\n"]]],[],null,["This document shows you how to enable the Dataproc\nSpark performance enhancements to help your Dataproc Spark jobs\nprocess more data in less time with reduced costs.\n\nDataproc Spark performance enhancements include:\n\n- Spark Optimizer enhancements:\n - Optimizer rules written for better Spark plans\n - Improved performance of the Dataproc BigQuery connector when used in Spark jobs\n- Spark Execution enhancements:\n - Spark execution engine improvements\n\n**Other Dataproc performance improvements:** See Dataproc\n[cluster caching](/dataproc/docs/concepts/cluster-caching),\nwhich helps reduce the amount of time spent accessing data in Cloud Storage.\n\nYou can enable Spark performance enhancements on a cluster or on a Spark job:\n\n- Spark performance enhancements [enabled on a cluster](#enable_enhancements_at_cluster_creation)\n apply, by default, to all Spark jobs run on the cluster, whether submitted to the\n Dataproc service or\n [submitted directly to the cluster](/dataproc/docs/guides/submit-job#submit_a_job_directly_on_your_cluster).\n\n- Spark performance enhancements can also be\n [enabled or disabled on a job](#enable_or_disable_enhancements_at_job_submission)\n that is submitted to the Dataproc service. Spark performance\n enhancements settings applied to a job override any conflicting settings set\n at the cluster level for the specified job only.\n\nPricing\n\nSpark performance enhancements don't incur additional charges. Standard\n[Dataproc on Compute Engine pricing](/dataproc/pricing#on_pricing) applies.\n\nConsiderations\n\nSpark performance enhancements adjusts Spark properties, including the following\nproperties:\n\n- [`spark.sql.shuffle.partitions`](/dataproc/docs/support/spark-job-tuning#configuring_partitions): Spark performance enhancements set this property to `1000` for `2.2` image version clusters. This setting can slow small jobs.\n- `spark.dataproc.sql.catalog.file.index.stats.enabled`: This setting can result in driver OOM (Out-Of-Memory) conditions if the Hive partition count is high. 
Disabling this property can fix the OOM condition.\n\nEnable enhancements at cluster creation\n\nYou can use the Google Cloud console, Google Cloud CLI, and the Dataproc\nAPI to enable Dataproc Spark performance enhancements\nwhen you create a Dataproc cluster with image versions\n2.0.69+, 2.1.17+, 2.2.0+, and later image releases. \n\nConsole\n\n1. In the Google Cloud console, open the Dataproc [Create a cluster](https://console.cloud.google.com/dataproc/clustersAdd) page.\n2. On the **Create Dataproc cluster** form, click **Create** on the **Cluster on Compute Engine** line.\n3. On the **Create a Dataproc cluster on Compute Engine** page, click the **Customize cluster** panel, then scroll to the **Cluster properties** section.\n 1. To enable Spark optimization enhancements:\n 1. Click **+ ADD PROPERTIES**.\n 2. Select **spark** in the **Prefix** list, then add \"spark.dataproc.enhanced.optimizer.enabled\" in the **Key** field and \"true\" in **Value** field.\n 2. To enable Spark execution enhancements:\n 1. Click **+ ADD PROPERTIES**.\n 2. Select **spark** in the **Prefix** list, then add \"spark.dataproc.enhanced.execution.enabled\" in the **Key** field and \"true\" in **Value** field.\n4. Complete filling in or confirming the other cluster creation fields, then click **Create**.\n\ngcloud\n\n1. Run the following\n [gcloud dataproc clusters create](/sdk/gcloud/reference/dataproc/clusters/create)\n command locally in a terminal window or in\n [Cloud Shell](https://console.cloud.google.com/?cloudshell=true%22).\n\n ```\n gcloud dataproc clusters create CLUSTER_NAME \\\n --project=PROJECT_ID \\\n --region=REGION \\\n --image-version=IMAGE \\\n --properties=PROPERTIES\n ```\n\n Notes:\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.\n - \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: The project to associate with the cluster.\n - \u003cvar translate=\"no\"\u003eREGION\u003c/var\u003e: The [Compute Engine region](/compute/docs/regions-zones#available) where the cluster will be located, such as `us-central1`.\n - You can add the optional `--zone=`\u003cvar translate=\"no\"\u003eZONE\u003c/var\u003e flag to specify a zone within the specified region, such as `us-central1-a`. If you do not specify a zone, the Dataproc [autozone placement](/dataproc/docs/concepts/configuring-clusters/auto-zone) feature selects a zone with the specified region.\n - \u003cvar translate=\"no\"\u003eIMAGE\u003c/var\u003e: The Dataproc Spark optimizer and execution performance enhancements are available in Dataproc image versions `2.0.69+` and `2.1.17+` and later releases. 
If you omit this flag, Dataproc will select the latest subminor version of the default Dataproc on Compute Engine image for the cluster (see [Default Dataproc image version](/dataproc/docs/concepts/versioning/dataproc-version-clusters#default_dataproc_image_version)).\n - \u003cvar translate=\"no\"\u003ePROPERTIES\u003c/var\u003e:\n\n - To enable Spark optimization enhancements, specify:\n\n spark:spark.dataproc.enhanced.optimizer.enabled=true\n\n - To enable Spark execution enhancements, specify:\n\n spark:spark.dataproc.enhanced.execution.enabled=true\n\n - To enable Spark optimization and execution enhancements, specify:\n\n spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true\n\nAPI\n\n1. Specify the following\n [`SoftwareConfig.properties`](/static/dataproc/docs/reference/rest/v1/ClusterConfig#SoftwareConfig.FIELDS.properties)\n as part of a [`clusters.create`](/dataproc/docs/reference/rest/v1/projects.regions.clusters/create) request:\n\n - To enable Spark optimization enhancements, specify:\n\n \"spark:spark.dataproc.enhanced.optimizer.enabled\": \"true\"\n\n - To enable Spark execution enhancements, specify:\n\n \"spark:spark.dataproc.enhanced.execution.enabled\": \"true\"\n\n - To enable Spark optimization and execution enhancements, specify:\n\n \"spark:spark.dataproc.enhanced.optimizer.enabled\": \"true\",\"spark:spark.dataproc.enhanced.execution.enabled\": \"true\"\n\n\u003cbr /\u003e\n\nEnable or disable enhancements at job submission\n\nYou can use the Google Cloud console, Google Cloud CLI, and the Dataproc\nAPI to enable or disable Spark performance enhancements on a Spark job submitted\nto the Dataproc service. \n\nConsole\n\n1. In the Google Cloud console, open the Dataproc [Jobs](https://console.cloud.google.com/dataproc/jobs) page.\n2. On the **Jobs** page, click **Submit job** , then scroll to the job **Properties** section.\n 1. To enable Spark optimization enhancements:\n 1. Click **+ ADD PROPERTIES** . Add \"spark.dataproc.enhanced.optimizer.enabled\" in the **Key** field and \"true\" in **Value** field.\n 2. To enable Spark execution enhancements:\n 1. Click **+ ADD PROPERTIES**.\n 2. Add \"spark.dataproc.enhanced.execution.enabled\" in the **Key** field and \"true\" in **Value** field.\n3. Complete filling in or confirming the other job submission fields, then click **Submit**.\n\ngcloud\n\n1. 
Run the following\n [gcloud dataproc jobs submit](/sdk/gcloud/reference/dataproc/jobs/submit)\n command locally in a terminal window or in\n [Cloud Shell](https://console.cloud.google.com/?cloudshell=true%22).\n\n ```\n gcloud dataproc jobs submit SPARK_JOB_TYPE \\\n --cluster=CLUSTER_NAME \\\n --region=REGION \\\n --properties=PROPERTIES\n ```\n\n Notes:\n - \u003cvar translate=\"no\"\u003eSPARK_JOB_TYPE\u003c/var\u003e: Specify `spark`, `pyspark`, `spark-sql` or `spark-r` .\n - \u003cvar translate=\"no\"\u003eCLUSTER_NAME\u003c/var\u003e: The name of the job where the job will run.\n - \u003cvar translate=\"no\"\u003eREGION\u003c/var\u003e: The region where the cluster is located.\n - \u003cvar translate=\"no\"\u003ePROPERTIES\u003c/var\u003e:\n\n - To enable Spark optimization enhancements, specify:\n\n spark.dataproc.enhanced.optimizer.enabled=true\n\n - To enable Spark execution enhancements, specify:\n\n spark.dataproc.enhanced.execution.enabled=true\n\n - To enable Spark optimization and execution enhancements, specify:\n\n spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true\n\nAPI\n\n1. Specify the following `properties` for a\n [SparkJob](/dataproc/docs/reference/rest/v1/SparkJob),\n [PySparkJob](/dataproc/docs/reference/rest/v1/PySparkJob),\n [SparkSqlJob](/dataproc/docs/reference/rest/v1/SparkSqlJob), or\n [SparkRJob](/dataproc/docs/reference/rest/v1/SparkRJob)\n as part of a\n [`jobs.submit`](/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit)\n request:\n\n - To enable Spark optimization enhancements, specify:\n\n \"spark.dataproc.enhanced.optimizer.enabled=true\"\n\n - To enable Spark execution enhancements, specify:\n\n \"spark.dataproc.enhanced.execution.enabled=true\"\n\n - To enable Spark optimization and execution enhancements, specify:\n\n \"spark.dataproc.enhanced.execution.enabled=true,spark.dataproc.enhanced.optimizer.enabled=true\"\n\n\u003cbr /\u003e\n\n| **Note:** You can click **Equivalent Command Line** or **Equivalent REST** at the bottom of the left panel of the [**Create a Dataproc cluster on Compute Engine**](https://console.cloud.google.com/dataproc/clustersAdd) page in the Google Cloud console to have the console construct an equivalent `gcloud` tool command or API REST request that you can use from the command line or in your code to create a cluster."]]