Dataproc Spark performance enhancements

This document shows you how to enable the Dataproc Spark performance enhancements to help your Dataproc Spark jobs process more data in less time with reduced costs.

Dataproc Spark performance enhancements include:

  • Spark Optimizer enhancements:
    • Optimizer rules written for better Spark plans
    • Improved performance of the Dataproc BigQuery connector when used in Spark jobs
  • Spark Execution enhancements:
    • Spark execution engine improvements

Other Dataproc performance improvements: See Dataproc cluster caching, which helps reduce the amount of time spent accessing data in Cloud Storage.

How to enable Dataproc Spark performance enhancements

You can use the Google Cloud console, the Google Cloud CLI, or the Dataproc API to enable Dataproc Spark performance enhancements when you create a Dataproc on Compute Engine cluster or when you submit your Spark job to your cluster.

Console

Enable enhancements at cluster creation

Perform the following steps to enable the Spark optimization and execution enhancements when you create a Dataproc cluster. Enabled enhancements remain in effect for all Spark jobs submitted to the cluster unless you disable the enhancements for a specific job when you submit the job. By default, Dataproc Spark performance enhancements are disabled on a Dataproc cluster.

  1. In the Google Cloud console, open the Dataproc Create a cluster page.
  2. On the Create Dataproc cluster form, click Create on the Cluster on Compute Engine line.
  3. On the Create a Dataproc cluster on Compute Engine page, click the Customize cluster panel, then scroll to the Cluster properties section.
    1. To enable Spark optimization enhancements:
      1. Click + ADD PROPERTIES.
      2. Select spark in the Prefix list, then add "spark.dataproc.enhanced.optimizer.enabled" in the Key field and "true" in the Value field.
    2. To enable Spark execution enhancements:
      1. Click + ADD PROPERTIES.
      2. Select spark in the Prefix list, then add "spark.dataproc.enhanced.execution.enabled" in the Key field and "true" in the Value field.
  4. Fill in or confirm the other cluster creation fields, then click Create.

Enable enhancements at job submission

Perform the following steps to enable the Spark optimization and execution enhancements for a Spark job. By default, Dataproc Spark performance enhancements are disabled on Spark jobs unless you enable the enhancements when you create a job cluster or enable them for a specific job.

  1. In the Google Cloud console, open the Dataproc Jobs page.
  2. On the Jobs page, click Submit job, then scroll to the job Properties section.
    1. To enable Spark optimization enhancements:
      1. Click + ADD PROPERTIES.
      2. Add "spark.dataproc.enhanced.optimizer.enabled" in the Key field and "true" in the Value field.
    2. To enable Spark execution enhancements:
      1. Click + ADD PROPERTIES.
      2. Add "spark.dataproc.enhanced.execution.enabled" in the Key field and "true" in Value field.
  3. Fill in or confirm the other job submission fields, then click Submit.

gcloud

Enable enhancements at cluster creation

Perform the following steps to enable the Spark optimization and execution enhancements when you create a Dataproc cluster. Enabled enhancements remain in effect for all Spark jobs submitted to the cluster unless you disable the enhancements for a specific job when you submit the job. By default, Dataproc Spark performance enhancements are disabled on a Dataproc cluster.

  1. Run the following gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

    gcloud dataproc clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --image-version=IMAGE \
        --properties=PROPERTIES
    

    Notes:

    • CLUSTER_NAME: The cluster name, which must be unique within a project. The name must start with a lowercase letter, and can contain up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen. The name of a deleted cluster can be reused.
    • PROJECT_ID: The project to associate with the cluster.
    • REGION: The Compute Engine region where the cluster will be located, such as us-central1.
      • You can add the optional --zone=ZONE flag to specify a zone within the specified region, such as us-central1-a. If you do not specify a zone, the Dataproc autozone placement feature selects a zone within the specified region.
    • IMAGE: The Dataproc Spark optimizer and execution performance enhancements are available in Dataproc image versions 2.0.69+, 2.1.17+, and later image releases. If you omit this flag, Dataproc selects the latest subminor version of the default Dataproc on Compute Engine image for the cluster (see Default Dataproc image version).
    • PROPERTIES:

      • To enable Spark optimization enhancements, specify:
      spark:spark.dataproc.enhanced.optimizer.enabled=true
      
      • To enable Spark execution enhancements, specify:
      spark:spark.dataproc.enhanced.execution.enabled=true
      
      • To enable Spark optimization and execution enhancements, specify:
      spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true
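
    For example, the following command creates a cluster with both enhancements enabled. The cluster name, project ID, and image version shown are hypothetical placeholders; substitute your own values:

    gcloud dataproc clusters create my-enhanced-cluster \
        --project=my-project-id \
        --region=us-central1 \
        --image-version=2.1 \
        --properties=spark:spark.dataproc.enhanced.optimizer.enabled=true,spark:spark.dataproc.enhanced.execution.enabled=true

    One way to confirm that the properties are set on the cluster is to inspect its configuration:

    gcloud dataproc clusters describe my-enhanced-cluster \
        --region=us-central1 \
        --format="value(config.softwareConfig.properties)"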
      
      

Enable enhancements at job submission

Perform the following steps to enable the Spark optimization and execution enhancements for a Spark job. By default, Dataproc Spark performance enhancements are disabled on Spark jobs unless you enable the enhancements when you create a job cluster or enable them for a specific job.

  1. Run the following gcloud dataproc jobs submit command locally in a terminal window or in Cloud Shell.

    gcloud dataproc jobs submit SPARK_JOB_TYPE \
        --cluster=CLUSTER_NAME \
        --region=REGION \
        --properties=PROPERTIES
    

    Notes:

    • SPARK_JOB_TYPE: Specify spark, pyspark, spark-sql, or spark-r.
    • CLUSTER_NAME: The name of the cluster where the job will run.
    • REGION: The region where the cluster is located.
    • PROPERTIES:

      • To enable Spark optimization enhancements, specify:
      spark.dataproc.enhanced.optimizer.enabled=true
      
      • To enable Spark execution enhancements, specify:
      spark.dataproc.enhanced.execution.enabled=true
      
      • To enable Spark optimization and execution enhancements, specify:
      spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true
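
    For example, the following command submits a PySpark job with both enhancements enabled. The main file URI, cluster name, and region are hypothetical placeholders:

    gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
        --cluster=my-enhanced-cluster \
        --region=us-central1 \
        --properties=spark.dataproc.enhanced.optimizer.enabled=true,spark.dataproc.enhanced.execution.enabled=true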
      

API

Enable enhancements at cluster creation

Perform the following steps to enable the Spark optimization and execution enhancements when you create a Dataproc cluster. Enabled enhancements remain in effect for all Spark jobs submitted to the cluster unless you disable the enhancements for a specific job when you submit the job. By default, Dataproc Spark performance enhancements are disabled on a Dataproc cluster.

  1. Specify the following SoftwareConfig.properties as part of a clusters.create request:

    • To enable Spark optimization enhancements, specify:
    "spark:spark.dataproc.enhanced.optimizer.enabled": "true"
    
    • To enable Spark execution enhancements, specify:
    "spark:spark.dataproc.enhanced.execution.enabled": "true"
    
    • To enable Spark optimization and execution enhancements, specify:
    "spark:spark.dataproc.enhanced.optimizer.enabled": "true","spark:spark.dataproc.enhanced.execution.enabled": "true"
    

Enable enhancements at job submission

Perform the following steps to enable the Spark optimization and execution enhancements for a Spark job. By default, Dataproc Spark performance enhancements are disabled on Spark jobs unless you enable the enhancements when you create a job cluster or enable them for a specific job.

  1. Specify the following properties for a SparkJob, PySparkJob, SparkSqlJob, or SparkRJob as part of a jobs.submit request:

    • To enable Spark optimization enhancements, specify:
    "spark.dataproc.enhanced.optimizer.enabled=true"
    
    • To enable Spark execution enhancements, specify:
    "spark.dataproc.enhanced.execution.enabled=true"
    
    • To enable Spark optimization and execution enhancements, specify:
    "spark.dataproc.enhanced.execution.enabled=true,spark.dataproc.enhanced.optimizer.enabled=true"