Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.
Requirements:
Profiler supports only Dataproc Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR).
Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project.
Dataproc recognizes cloud.profiler.enable and the other cloud.profiler.* properties (see Profiler options), and then appends the relevant profiler JVM options to the following configurations:
- Spark: spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
- MapReduce: mapreduce.task.profile and other mapreduce.task.profile.* properties
Enable profiling
Complete the following steps to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.
Create a Dataproc cluster with service account scopes set to monitoring to allow the cluster to talk to the profiler service. If you are using a custom VM service account, grant the Cloud Profiler Agent role to it. This role contains the required profiler service permissions.
gcloud
gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    --region=region \
    other args ...
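If you are using a custom VM service account, a minimal sketch of granting it the Cloud Profiler Agent role (roles/cloudprofiler.agent) with an IAM policy binding, assuming project-id and the service account email stand in for your own values:

# Grant the Cloud Profiler Agent role to a custom VM service account.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:service-account-name@project-id.iam.gserviceaccount.com" \
    --role="roles/cloudprofiler.agent"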
Submit a Dataproc job with Profiler options
- Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:
| Option | Description | Value | Required/Optional | Default | Notes |
|---|---|---|---|---|---|
| cloud.profiler.enable | Enable profiling of the job | true or false | Required | false | |
| cloud.profiler.name | Name used to create the profile on the Profiler Service | profile-name | Optional | Dataproc job UUID | |
| cloud.profiler.service.version | A user-supplied string to identify and distinguish profiler results | Profiler Service Version | Optional | Dataproc job UUID | |
| mapreduce.task.profile.maps | Numeric range of map tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
| mapreduce.task.profile.reduces | Numeric range of reducer tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
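These properties can be attached to other supported job types in the same way. As an illustrative sketch (not one of the examples below), a Spark jar job might be submitted with profiling enabled as follows, where main-class, jar-file, and the other placeholders stand in for your own values:

gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=main-class \
    --jars=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args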
PySpark Example
Google Cloud CLI
PySpark job submit with profiling example:
gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args
Two profiles will be created:
- profiler_name-driver to profile Spark driver tasks
- profiler_name-executor to profile Spark executor tasks

For example, if the profiler_name is "spark_word_count_job", the spark_word_count_job-driver and spark_word_count_job-executor profiles are created.
Hadoop Example
gcloud CLI
Hadoop (teragen mapreduce) job submit with profiling example:
gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- teragen 100000 gs://bucket-name
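To profile only a subset of map and reducer tasks, the mapreduce.task.profile.maps and mapreduce.task.profile.reduces options from the table above can be added to the same --properties flag. An illustrative sketch using the same placeholders as the preceding example, with hypothetical task ranges:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,mapreduce.task.profile.maps=0-100,mapreduce.task.profile.reduces=0-10 \
    -- teragen 100000 gs://bucket-name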
View profiles
View profiles from the Profiler on the Google Cloud console.
What's next
- See the Monitoring documentation
- See the Logging documentation
- Explore Google Cloud Observability