Profile Dataproc Serverless for Spark resource usage

This document describes how to profile Dataproc Serverless for Spark resource usage. Cloud Profiler continuously gathers and reports application CPU usage and memory allocation information. You can enable profiling when you submit a batch or create a session workload by using the profiling properties listed in the following table. Dataproc Serverless for Spark appends related JVM options to the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions configurations used for the workload.

Option | Description | Value | Default
dataproc.profiling.enabled | Enable profiling of the workload | true or false | false
dataproc.profiling.name | Profile name on the Profiler service | PROFILE_NAME | spark-WORKLOAD_TYPE-WORKLOAD_ID, where WORKLOAD_TYPE is set to batch or session, and WORKLOAD_ID is set to batchId or sessionId

Notes:

  • Dataproc Serverless for Spark sets the profiler version to either the batch UUID or the session UUID.
  • Profiler supports the following Spark workload types: Spark, PySpark, SparkSql, and SparkR.
  • A workload must run for more than three minutes to allow Profiler to collect and upload data to a project.
  • You can override profiling options submitted with a workload by constructing a SparkConf in your code and setting extraJavaOptions on it, as shown in the sketch after this list. Note that setting extraJavaOptions properties when the workload is submitted doesn't override profiling options submitted with the workload.

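As a minimal sketch of overriding the options in code, assuming a PySpark workload and placeholder JVM flags (substitute the options you actually need):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Placeholder JVM flags for illustration. Setting extraJavaOptions here
# replaces the profiling options that Dataproc Serverless for Spark
# appended when the workload was submitted.
conf = (
    SparkConf()
    .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
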
For an example of profiler options used with a batch submission, see the PySpark batch workload example.

Enable profiling

Complete the following steps to enable profiling on a workload:

  1. Enable the Profiler.
  2. If you are using a custom VM service account, grant the Cloud Profiler Agent role (roles/cloudprofiler.agent) to the custom VM service account. This role contains the required Profiler permissions. Example commands for steps 1 and 2 follow this list.
  3. Set profiling properties when you submit a batch workload or create a session template.

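For example, assuming a project PROJECT_ID and a custom VM service account SERVICE_ACCOUNT_EMAIL (both placeholders), steps 1 and 2 can be performed with the following gcloud commands:

# Enable the Cloud Profiler API in the project.
gcloud services enable cloudprofiler.googleapis.com \
    --project=PROJECT_ID

# Grant the Cloud Profiler Agent role to the custom VM service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/cloudprofiler.agent"
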
PySpark batch workload example

The following example uses the gcloud CLI to submit a PySpark batch workload with profiling enabled.

gcloud dataproc batches submit pyspark PYTHON_WORKLOAD_FILE \
    --region=REGION \
    --properties=dataproc.profiling.enabled=true,dataproc.profiling.name=PROFILE_NAME \
    --  other args

Two profiles are created:

  • PROFILE_NAME-driver to profile Spark driver tasks
  • PROFILE_NAME-executor to profile Spark executor tasks

View profiles

You can view the generated profiles on the Profiler page in the Google Cloud console.

What's next