Cloud Profiler

Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information. Follow the steps below to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.

Enable profiling

  1. Enable the Profiler (a sample command for enabling the Profiler API is shown after this list).

  2. Create a Dataproc cluster with service account scopes that allow the cluster to talk to the Profiler service. The cloud-platform scope used in the cluster create example below is a broad scope that includes the monitoring scope the Profiler requires.
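
If the Profiler API is not already enabled in your project, you can enable it from the console; as a minimal sketch, assuming the gcloud CLI is authenticated against the target project, the API can also be enabled on the command line:

gcloud services enable cloudprofiler.googleapis.com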

Cluster create example:

gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    other args ...

Submit a Dataproc job with Profiler options

  1. Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:
    cloud.profiler.enable
        Description: Enable profiling of the job
        Value: true or false
        Required/Optional: Required
        Default: false

    cloud.profiler.name
        Description: Name used to create the profile on the Profiler service
        Value: profile-name
        Required/Optional: Optional
        Default: Dataproc job UUID

    cloud.profiler.service.version
        Description: A user-supplied string to identify and distinguish profiler results
        Value: Profiler service version
        Required/Optional: Optional
        Default: Dataproc job UUID

    mapreduce.task.profile.maps
        Description: Numeric range of map tasks to profile (example: for up to 100, specify "0-100")
        Value: number range
        Required/Optional: Optional
        Default: 0-10000
        Notes: Applies to Hadoop mapreduce jobs only

    mapreduce.task.profile.reduces
        Description: Numeric range of reduce tasks to profile (example: for up to 100, specify "0-100")
        Value: number range
        Required/Optional: Optional
        Default: 0-10000
        Notes: Applies to Hadoop mapreduce jobs only

PySpark Example

Example of submitting a PySpark job with profiling enabled:

gcloud dataproc jobs submit pyspark python-job-file \
    --cluster cluster-name \
    --jars jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    --  job args

Two profiles will be created:

  1. profiler_name-driver to profile Spark driver tasks
  2. profiler_name-executor to profile Spark executor tasks

For example, if profiler_name is "spark_word_count_job", the spark_word_count_job-driver and spark_word_count_job-executor profiles are created.

Hadoop Example

Example of submitting a Hadoop (teragen mapreduce) job with profiling enabled:

gcloud dataproc jobs submit hadoop \
    --cluster cluster-name \
    --jars jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    --  teragen 100000 gs://bucket-name
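
For Hadoop mapreduce jobs, the mapreduce.task.profile.maps and mapreduce.task.profile.reduces options listed above can restrict profiling to a range of tasks. As a sketch reusing the placeholders from the example above (teragen is map-only, so only a map range is shown), profiling could be limited to the first three map tasks:

gcloud dataproc jobs submit hadoop \
    --cluster cluster-name \
    --jars jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version,mapreduce.task.profile.maps=0-2 \
    --  teragen 100000 gs://bucket-name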

View profiles

View profiles on the Profiler page in the Google Cloud Console (https://console.cloud.google.com/profiler).

What's next