Cloud Profiler

Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.

Requirements:

  • Profiler supports only Dataproc Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR).

  • Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project.

Dataproc recognizes cloud.profiler.enable and the other cloud.profiler.* properties (see Profiler options), and then appends the relevant profiler JVM options to the following configurations:

  • Spark: spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
  • MapReduce: mapreduce.task.profile and other mapreduce.task.profile.* properties
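
In effect, enabling the profiler adds an -agentpath entry for the Cloud Profiler Java agent to those Spark properties. As a rough sketch (the agent path and flag values below are illustrative assumptions, not the exact strings Dataproc generates), the resulting Spark properties look similar to:

spark.driver.extraJavaOptions=-agentpath:/opt/cprof/profiler_java_agent.so=-cprof_service=profile-name,-cprof_service_version=version
spark.executor.extraJavaOptions=-agentpath:/opt/cprof/profiler_java_agent.so=-cprof_service=profile-name,-cprof_service_version=version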

Enable profiling

Complete the following steps to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.

  1. Enable the Profiler (an example command for enabling the Cloud Profiler API follows the cluster creation command below).

  2. Create a Dataproc cluster with service account scopes that allow the cluster to talk to the Profiler service, such as the monitoring scope. The following example uses the broader cloud-platform scope.

gcloud

gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    --region=region \
    other args ...
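
For step 1, the Cloud Profiler API must be enabled on the project that runs your jobs. One way to do this with the gcloud CLI is:

gcloud services enable cloudprofiler.googleapis.com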

Submit a Dataproc job with Profiler options

  1. Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:
    Option: cloud.profiler.enable
      Description: Enable profiling of the job
      Value: true or false
      Required/Optional: Required
      Default: false

    Option: cloud.profiler.name
      Description: Name used to create the profile on the Profiler service
      Value: profile-name
      Required/Optional: Optional
      Default: Dataproc job UUID

    Option: cloud.profiler.service.version
      Description: A user-supplied string to identify and distinguish profiler results
      Value: Profiler Service Version
      Required/Optional: Optional
      Default: Dataproc job UUID

    Option: mapreduce.task.profile.maps
      Description: Numeric range of map tasks to profile (example: for up to 100, specify "0-100")
      Value: number range
      Required/Optional: Optional
      Default: 0-10000
      Notes: Applies to Hadoop mapreduce jobs only

    Option: mapreduce.task.profile.reduces
      Description: Numeric range of reduce tasks to profile (example: for up to 100, specify "0-100")
      Value: number range
      Required/Optional: Optional
      Default: 0-10000
      Notes: Applies to Hadoop mapreduce jobs only

PySpark Example

gcloud

Example PySpark job submission with profiling enabled:

gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    --  job args

Two profiles will be created:

  1. profiler_name-driver, which profiles Spark driver tasks
  2. profiler_name-executor, which profiles Spark executor tasks

For example, if the profiler_name is "spark_word_count_job", spark_word_count_job-driver and spark_word_count_job-executor profiles are created.

Hadoop Example

gcloud

Example Hadoop (teragen MapReduce) job submission with profiling enabled:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    --  teragen 100000 gs://bucket-name
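
To profile only a subset of map and reduce tasks, you can also pass the mapreduce.task.profile.maps and mapreduce.task.profile.reduces options described above. The following variant of the preceding command is a sketch; the task ranges shown are illustrative, not required values:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version,mapreduce.task.profile.maps=0-2,mapreduce.task.profile.reduces=0-2 \
    --  teragen 100000 gs://bucket-name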

View profiles

View profiles from the Profiler page in the Google Cloud console.

What's next