Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.
Requirements:
Profiler supports only Dataproc Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR).
Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project.
Dataproc recognizes cloud.profiler.enable and the other cloud.profiler.* properties (see Profiler options), and then appends the relevant profiler JVM options to the following configurations:
- Spark: spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
- MapReduce: mapreduce.task.profile and other mapreduce.task.profile.* properties
Enable profiling
Complete the following steps to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.
Create a Dataproc cluster with service account scopes set to monitoring to allow the cluster to talk to the profiler service. If you are using a custom VM service account, grant the Cloud Profiler Agent role to it. This role contains the required profiler service permissions.
gcloud
gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    --region=region \
    other args ...
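If you are using a custom VM service account, a minimal sketch of granting it the Cloud Profiler Agent role (roles/cloudprofiler.agent) with an IAM policy binding, assuming project-id and the service account email stand in for your own values:

# Grant the Cloud Profiler Agent role to a custom VM service account.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:service-account-name@project-id.iam.gserviceaccount.com" \
    --role="roles/cloudprofiler.agent"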
Submit a Dataproc job with Profiler options
- Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:
| Option | Description | Value | Required/Optional | Default | Notes |
|---|---|---|---|---|---|
| cloud.profiler.enable | Enable profiling of the job | true or false | Required | false | |
| cloud.profiler.name | Name used to create the profile on the Profiler Service | profile-name | Optional | Dataproc job UUID | |
| cloud.profiler.service.version | A user-supplied string to identify and distinguish profiler results | Profiler Service Version | Optional | Dataproc job UUID | |
| mapreduce.task.profile.maps | Numeric range of map tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
| mapreduce.task.profile.reduces | Numeric range of reducer tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
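These properties can be attached to other supported job types in the same way. As an illustrative sketch (not one of the examples below), a Spark jar job might be submitted with profiling enabled as follows, where main-class, jar-file, and the other placeholders stand in for your own values:

gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=main-class \
    --jars=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args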
PySpark Example
Google Cloud CLI
PySpark job submit with profiling example:
gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args
Two profiles will be created:
- profiler_name-driver to profile Spark driver tasks
- profiler_name-executor to profile Spark executor tasks

For example, if the profiler_name is "spark_word_count_job", the spark_word_count_job-driver and spark_word_count_job-executor profiles are created.
Hadoop Example
gcloud CLI
Hadoop (teragen mapreduce) job submit with profiling example:
gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- teragen 100000 gs://bucket-name
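To profile only a subset of map and reducer tasks, the mapreduce.task.profile.maps and mapreduce.task.profile.reduces options from the table above can be added to the same --properties flag. An illustrative sketch using the same placeholders as the preceding example, with hypothetical task ranges:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,mapreduce.task.profile.maps=0-100,mapreduce.task.profile.reduces=0-10 \
    -- teragen 100000 gs://bucket-name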
View profiles
View profiles from the Profiler on the Google Cloud console.
What's next
- See the Monitoring documentation
- See the Logging documentation
- Explore Google Cloud Observability