This document describes how to profile Dataproc Serverless for Spark resource
usage. Cloud Profiler continuously gathers and reports
application CPU usage and memory allocation information. You can enable
profiling when you submit a batch or create a session workload
by using the profiling properties listed in the following table.
Dataproc Serverless for Spark appends related JVM options to
the spark.driver.extraJavaOptions
and spark.executor.extraJavaOptions
configurations used for the workload.
| Option | Description | Value | Default |
|---|---|---|---|
| dataproc.profiling.enabled | Enable profiling of the workload | true or false | false |
| dataproc.profiling.name | Profile name on the Profiler service | PROFILE_NAME | spark-WORKLOAD_TYPE-WORKLOAD_ID, where WORKLOAD_TYPE is batch or session and WORKLOAD_ID is the batch or session ID |
Notes:
- Dataproc Serverless for Spark sets the profiler version to either the batch UUID or the session UUID.
- Profiler supports the following Spark workload types: Spark, PySpark, SparkSql, and SparkR.
- A workload must run for more than three minutes to allow Profiler to collect and upload data to a project.
- You can override profiling options submitted with a workload by constructing a SparkConf and then setting extraJavaOptions in your code. Note that setting extraJavaOptions properties when the workload is submitted doesn't override profiling options submitted with the workload.
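For illustration, the following is a minimal PySpark sketch of overriding the submitted profiling options in code. The JVM option values are hypothetical placeholders, not the actual profiler agent flags that Dataproc Serverless for Spark appends.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Minimal sketch: override the profiling JVM options from inside the workload
# by building a SparkConf before the SparkSession is created. The option
# values below are hypothetical placeholders, not the actual profiler agent
# flags that Dataproc Serverless for Spark appends at submission time.
conf = SparkConf()
conf.set("spark.driver.extraJavaOptions", "-Dcustom.profiling.flag=driver")
conf.set("spark.executor.extraJavaOptions", "-Dcustom.profiling.flag=executor")

# extraJavaOptions set through this SparkConf take precedence over the
# profiling options that were submitted with the workload.
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```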
For an example of profiler options used with a batch submission, see the PySpark batch workload example.
Enable profiling
Complete the following steps to enable profiling on a workload:
- Enable the Profiler.
- If you are using a custom VM service account, grant the Cloud Profiler Agent role to the custom VM service account. This role contains required Profiler permissions.
- Set profiling properties when you submit a batch workload or create a session template.
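As an alternative to the gcloud CLI example in the next section, the following sketch sets the profiling properties when submitting a batch workload through the google-cloud-dataproc Python client. The project, region, file URI, profile name, and batch ID values are placeholders.

```python
from google.cloud import dataproc_v1

# Placeholder values; replace with your own project, region, and workload file.
project_id = "my-project"
region = "us-central1"
main_python_file_uri = "gs://my-bucket/workload.py"

# Dataproc batches use a regional endpoint.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Set the profiling properties in the batch runtime configuration.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri=main_python_file_uri
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            "dataproc.profiling.enabled": "true",
            "dataproc.profiling.name": "my-profile",  # placeholder PROFILE_NAME
        }
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="profiled-batch",  # placeholder batch ID
)
operation.result()  # waits for the batch to reach a terminal state
```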
PySpark batch workload example
The following example uses the gcloud CLI to submit a PySpark batch workload with profiling enabled.
gcloud dataproc batches submit pyspark PYTHON_WORKLOAD_FILE \
    --region=REGION \
    --properties=dataproc.profiling.enabled=true,dataproc.profiling.name=PROFILE_NAME \
    -- other args
Two profiles are created:
- PROFILE_NAME-driver to profile Spark driver tasks
- PROFILE_NAME-executor to profile Spark executor tasks
View profiles
You can view profiles from Profiler in the Google Cloud console.