This document shows you how to enable the Dataproc Spark performance enhancements to help your Dataproc Spark jobs process more data in less time with reduced costs.
Dataproc Spark performance enhancements include:
- Spark Optimizer enhancements:
  - Optimizer rules written for better Spark plans
  - Improved performance of the Dataproc BigQuery connector when used in Spark jobs
- Spark Execution enhancements:
  - Spark execution engine improvements
Other Dataproc performance improvements: See Dataproc cluster caching, which helps reduce the amount of time spent accessing data in Cloud Storage.
You can enable Spark performance enhancements on a cluster or on a Spark job:
Spark performance enhancements enabled on a cluster apply, by default, to all Spark jobs run on the cluster, whether submitted to the Dataproc service or submitted directly to the cluster.
Spark performance enhancements can also be enabled or disabled on a job that is submitted to the Dataproc service. Enhancement settings applied to a job override any conflicting cluster-level settings, for that job only.
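For example, to turn off the optimizer enhancement for one job on a cluster where it was enabled at creation, set the property to false at submission time. A minimal sketch, assuming placeholder cluster and region names and the SparkPi sample bundled on Dataproc images:

# Placeholder cluster/region; SparkPi is the Spark sample bundled on Dataproc images.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties=spark.dataproc.enhanced.optimizer.enabled=false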
Pricing
Spark performance enhancements don't incur additional charges. Standard Dataproc on Compute Engine pricing applies.
Considerations
Spark performance enhancements adjust Spark properties, including the following:
- spark.sql.shuffle.partitions: Spark performance enhancements set this property to 1000 for 2.2 image version clusters. This setting can slow small jobs.
- spark.dataproc.sql.catalog.file.index.stats.enabled: This setting can result in driver OOM (Out-Of-Memory) conditions if the Hive partition count is high. Disabling this property can fix the OOM condition.
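If either behavior is a problem for a particular job, you can override the affected properties at job submission rather than disabling the enhancements entirely. A minimal sketch, assuming placeholder names and a trivial query; 200 is the standard Spark default for spark.sql.shuffle.partitions:

# Placeholder cluster/region; lowers the shuffle partition count and disables file index stats.
gcloud dataproc jobs submit spark-sql \
    --cluster=my-cluster \
    --region=us-central1 \
    --execute="SELECT 1" \
    --properties=spark.sql.shuffle.partitions=200,spark.dataproc.sql.catalog.file.index.stats.enabled=false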
Enable enhancements at cluster creation
You can use the Google Cloud console, Google Cloud CLI, and the Dataproc API to enable Dataproc Spark performance enhancements when you create a Dataproc cluster with image versions 2.0.69+, 2.1.17+, 2.2.0+, and later image releases.
Console
- In the Google Cloud console, open the Dataproc Create a cluster page.
- On the Create Dataproc cluster form, click Create on the Cluster on Compute Engine line.
- On the Create a Dataproc cluster on Compute Engine page,
click the Customize cluster panel, then scroll to the
Cluster properties section.
- To enable Spark optimization enhancements:
  - Click + ADD PROPERTIES.
  - Select spark in the Prefix list, then add "spark.dataproc.enhanced.optimizer.enabled" in the Key field and "true" in the Value field.
- To enable Spark execution enhancements:
  - Click + ADD PROPERTIES.
  - Select spark in the Prefix list, then add "spark.dataproc.enhanced.execution.enabled" in the Key field and "true" in the Value field.
- Fill in or confirm the other cluster creation fields, then click Create.
gcloud
Run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --image-version=IMAGE \
    --properties=PROPERTIES
Notes:
- CLUSTER_NAME: The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.
- PROJECT_ID: The project to associate with the cluster.
- REGION: The Compute Engine region where the cluster will be located, such as us-central1.
  - You can add the optional --zone=ZONE flag to specify a zone within the specified region, such as us-central1-a. If you do not specify a zone, the Dataproc autozone placement feature selects a zone within the specified region.
- IMAGE: The Dataproc Spark optimizer and execution performance enhancements are available in Dataproc image versions 2.0.69+, 2.1.17+, 2.2.0+, and later image releases. If you omit this flag, Dataproc selects the latest subminor version of the default Dataproc on Compute Engine image for the cluster (see Default Dataproc image version).
- PROPERTIES:
- To enable Spark optimization enhancements, specify:
spark:spark.dataproc.enhanced.optimizer.enabled=true
- To enable Spark execution enhancements, specify:
spark:spark.dataproc.enhanced.execution.enabled=true
- To enable Spark optimization and execution enhancements, specify:
spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true
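For example, the following sketch creates a cluster with both enhancements enabled, using placeholder cluster, project, and region names:

# Placeholder cluster, project, and region names.
gcloud dataproc clusters create my-cluster \
    --project=my-project \
    --region=us-central1 \
    --image-version=2.2 \
    --properties=spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true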
API
Specify the following SoftwareConfig.properties as part of a clusters.create request:
- To enable Spark optimization enhancements, specify:
"spark:spark.dataproc.enhanced.optimizer.enabled": "true"
- To enable Spark execution enhancements, specify:
"spark:spark.dataproc.enhanced.execution.enabled": "true"
- To enable Spark optimization and execution enhancements, specify:
"spark:spark.dataproc.enhanced.optimizer.enabled": "true","spark:spark.dataproc.enhanced.execution.enabled": "true"
Enable or disable enhancements at job submission
You can use the Google Cloud console, Google Cloud CLI, and the Dataproc API to enable or disable Spark performance enhancements on a Spark job submitted to the Dataproc service.
Console
- In the Google Cloud console, open the Dataproc Jobs page.
- On the Jobs page, click Submit job, then scroll to the job
Properties section.
- To enable Spark optimization enhancements:
  - Click + ADD PROPERTIES.
  - Add "spark.dataproc.enhanced.optimizer.enabled" in the Key field and "true" in the Value field.
- To enable Spark execution enhancements:
  - Click + ADD PROPERTIES.
  - Add "spark.dataproc.enhanced.execution.enabled" in the Key field and "true" in the Value field.
- Fill in or confirm the other job submission fields, then click Submit.
gcloud
Run the following gcloud dataproc jobs submit command locally in a terminal window or in Cloud Shell.
gcloud dataproc jobs submit SPARK_JOB_TYPE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=PROPERTIES
Notes:
- SPARK_JOB_TYPE: Specify spark, pyspark, spark-sql, or spark-r.
- CLUSTER_NAME: The name of the cluster where the job will run.
- REGION: The region where the cluster is located.
PROPERTIES:
- To enable Spark optimization enhancements, specify:
spark.dataproc.enhanced.optimizer.enabled=true
- To enable Spark execution enhancements, specify:
spark.dataproc.enhanced.execution.enabled=true
- To enable Spark optimization and execution enhancements, specify:
spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true
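For example, the following sketch submits the SparkPi sample that ships on Dataproc images with both enhancements enabled; the cluster and region names are placeholders:

# Placeholder cluster/region; SparkPi is the Spark sample bundled on Dataproc images.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties=spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true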
API
Specify the following properties for a SparkJob, PySparkJob, SparkSqlJob, or SparkRJob as part of a jobs.submit request:
- To enable Spark optimization enhancements, specify:
"spark.dataproc.enhanced.optimizer.enabled=true"
- To enable Spark execution enhancements, specify:
"spark.dataproc.enhanced.execution.enabled=true"
- To enable Spark optimization and execution enhancements, specify:
"spark.dataproc.enhanced.execution.enabled=true,spark.dataproc.enhanced.optimizer.enabled=true"