Dataproc Persistent History Server

Overview

After Dataproc clusters are deleted, users often wish to view job history files for diagnostic or other purposes. The Dataproc Persistent History Server (PHS) provides a UI to view job history for jobs run on active or deleted Dataproc clusters.

The Persistent History Server runs on a single node Dataproc cluster, stores and accesses job history files in Cloud Storage, and supports MapReduce, Spark, and Pig jobs. This feature is available in Dataproc image version 1.4-debian10 and later.

Set up a Dataproc job cluster

You specify the following flag and cluster properties when creating a Dataproc cluster that will run jobs and store job logs to be accessed and displayed by a Persistent History Server cluster.

  • --enable-component-gateway: Required flag. This flag must be used to enable the Component Gateway.
  • dataproc:job.history.to-gcs.enabled: Required cluster property. This property must be set to "true" to enable job history storage in Cloud Storage.
  • spark:spark.history.fs.logDirectory and spark:spark.eventLog.dir: Optional cluster properties. These properties specify the Cloud Storage locations to write Spark job history and event logs, respectively. If used, both properties must be set and point to directories within the same bucket.
    Sample properties:
    spark:spark.history.fs.logDirectory=gs://bucket-name/directory-name/spark-job-history,
    spark:spark.eventLog.dir=gs://bucket-name/directory-name/spark-job-history
    
  • mapred:mapreduce.jobhistory.intermediate-done-dir and mapred:mapreduce.jobhistory.done-dir: Optional cluster properties. These properties specify the Cloud Storage locations to write intermediate and final MapReduce job history files, respectively. If used, both properties must be set and point to directories within the same bucket. The intermediate mapreduce.jobhistory.intermediate-done-dir location is temporary storage; the intermediate files are moved to the mapreduce.jobhistory.done-dir location when the MapReduce job completes.
    Sample properties:
    mapred:mapreduce.jobhistory.done-dir=gs://bucket-name/directory-name/mapreduce-job-history/done,
    mapred:mapreduce.jobhistory.intermediate-done-dir=gs://bucket-name/directory-name/mapreduce-job-history/intermediate-done
    
  • The following properties control Cloud Storage flush behavior for event logs in 1.4 and later image versions. Note: The default configuration of these properties enables the display of running jobs in the Spark History Server UI for clusters that use Cloud Storage to store Spark event logs.
    spark:spark.history.fs.gs.outputstream.type (default: BASIC)
    spark:spark.history.fs.gs.outputstream.sync.min.interval.ms (default: 5000ms)
    
    
  1. Run the gcloud dataproc clusters create command to create a job cluster. The cluster must be created with image 1.4-debian10 or later. Note: For readability, the --properties flag values are displayed below on separate lines; when you run the command, all comma-separated --properties flag values must be specified on one line.
    gcloud dataproc clusters create cluster-name \
        --region=region \
        --image-version=1.4-debian10 \
        --enable-component-gateway \
        --properties='dataproc:job.history.to-gcs.enabled=true,
    spark:spark.history.fs.logDirectory=gs://bucket-name/directory-name/spark-job-history,
    spark:spark.eventLog.dir=gs://bucket-name/directory-name/spark-job-history,
    mapred:mapreduce.jobhistory.done-dir=gs://bucket-name/directory-name/mapreduce-job-history/done,
    mapred:mapreduce.jobhistory.intermediate-done-dir=gs://bucket-name/directory-name/mapreduce-job-history/intermediate-done'
    

You can run the following simplified command to have Dataproc set up the Cloud Storage locations for job history files.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --image-version=1.4-debian10 \
    --enable-component-gateway \
    --properties='dataproc:job.history.to-gcs.enabled=true'

If you use the simplified command shown above, job history files will be saved in the Dataproc temp bucket in default directories: /spark-job-history, /mapreduce-job-history/done, and /mapreduce-job-history/intermediate-done. The temp bucket Cloud Storage location is listed in the output of the gcloud dataproc clusters describe cluster-name --region=region command. The Cloud Storage location of job history files is also listed in the cluster's /etc/spark/conf/spark-defaults.conf and /etc/hadoop/conf/mapred-site.xml files.
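
For example, to print just the temp bucket name, you can filter the describe output (a minimal sketch; it assumes the bucket is reported under config.tempBucket, which is where the describe output normally lists it):

# Print the cluster's temp bucket, where the default job history directories are created.
gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format='value(config.tempBucket)'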

Examples after SSHing into the job cluster master node:

cat /etc/spark/conf/spark-defaults.conf
...
spark.history.fs.logDirectory=gs://temp-bucket/spark-job-history
spark.eventLog.dir=gs://temp-bucket/spark-job-history

cat /etc/hadoop/conf/mapred-site.xml
...
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>gs://temp-bucket/mapreduce-job-history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>gs://temp-bucket/mapreduce-job-history/intermediate-done</value>
</property>
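
With the job cluster created, jobs submitted to it write their history files to the configured Cloud Storage locations, where the PHS can later read them. As a sketch, the following submits the SparkPi example that ships with Dataproc images (the jar path shown is the standard location on Dataproc images; adjust it if your image differs):

# Submit a sample Spark job to the job cluster; its event logs are written
# to the spark.eventLog.dir location configured above.
gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000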

Set up a Persistent History Server

You specify the following flag and cluster properties when creating a PHS single node cluster:

  • --enable-component-gateway: Required flag. This flag must be used to enable the Component Gateway.
  • spark:spark.history.fs.logDirectory: Required cluster property to enable persistent Spark job history. This property specifies the Cloud Storage bucket and directory (or directories) where the PHS will access Spark job history logs written by job clusters (see Set up a Dataproc job cluster). Instead of specifying specific bucket directories, use asterisks as wildcards (for example, gs://bucket-name/*/spark-job-history) to allow the PHS to match multiple directories in the specified bucket written to by different job clusters (but see Efficiency Consideration: Using Mid-Path Wildcards).
  • mapred:mapreduce.jobhistory.read-only.dir-pattern: Required cluster property to enable persistent MapReduce job history. This property specifies the Cloud Storage bucket directory (or directories) where the PHS will access MapReduce job history logs written by job clusters (see Set up a Dataproc job cluster). Instead of specifying specific bucket directories, use asterisks as wildcards (for example, gs://bucket-name/*/mapreduce-job-history/done) to allow the PHS to match multiple directories in the specified bucket written to by different job clusters (but see Efficiency Consideration: Using Mid-Path Wildcards).
  1. Run the gcloud dataproc clusters create command to create a single-node Dataproc PHS cluster.
    gcloud dataproc clusters create cluster-name \
        --single-node \
        --region=region \
        --image-version=1.4-debian10 \
        --enable-component-gateway \
        --properties='spark:spark.history.fs.logDirectory=gs://bucket-name/*/spark-job-history,mapred:mapreduce.jobhistory.read-only.dir-pattern=gs://bucket-name/*/mapreduce-job-history/done'
    

Go to the PHS single-node cluster's Cluster details page in the Cloud Console, then click the WEB INTERFACES tab. Under Component gateway, click "MapReduce Job History" or "Spark History Server" to view the MapReduce and Spark job history UIs.
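
You can also retrieve the Component Gateway URLs from the command line (a sketch; it assumes the URLs are exposed under config.endpointConfig.httpPorts in the describe output, which is where the Component Gateway normally lists them):

# Print the Component Gateway web interface URLs for the PHS cluster.
gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format='yaml(config.endpointConfig.httpPorts)'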

Spark History Server UI

For example, after setting the job clusters' spark.history.fs.logDirectory and spark.eventLog.dir locations and the PHS cluster's spark.history.fs.logDirectory location as follows, the Spark History Server UI displays links to Spark jobs run on job-cluster-1 and job-cluster-2:

job-cluster-1 gs://example-cloud-storage-bucket/job-cluster-1/spark-job-history
job-cluster-2 gs://example-cloud-storage-bucket/job-cluster-2/spark-job-history
phs-cluster gs://example-cloud-storage-bucket/*/spark-job-history
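
To confirm which job cluster directories the PHS wildcard pattern matches, you can expand it with gsutil (a sketch using the example bucket above):

# Lists both job clusters' Spark job history directories matched by the wildcard.
gsutil ls -d gs://example-cloud-storage-bucket/*/spark-job-history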

You can list jobs by App Name in the Spark History Server UI by entering the name in the search box. The App Name can be set in one of the following ways (listed by priority):

  1. Set inside the application code when creating the SparkContext
  2. Set by the spark.app.name property when the job is submitted (see the example after this list)
  3. Set by Dataproc to the full REST resource name of the job (projects/project-id/regions/region/jobs/job-id)
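
For example, the spark.app.name property can be set at submission time (a sketch; my-spark-app is a placeholder name, and the class and jar refer to the SparkPi example that ships with Dataproc images):

# Submit a Spark job with an explicit App Name to search for in the Spark History Server UI.
gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties=spark.app.name=my-spark-app \
    -- 1000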

Event logs

The Spark History Server UI provides an Event Log button you can click to download Spark event logs. These logs are useful for examining the lifecycle of the Spark application.
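
You can also list and download event log files directly from Cloud Storage (a sketch, assuming the log directory layout shown earlier; the application_... file name is a placeholder for an actual application ID):

# List the event log files written by a job cluster, then download one for offline inspection.
gsutil ls gs://bucket-name/directory-name/spark-job-history/
gsutil cp gs://bucket-name/directory-name/spark-job-history/application_1234567890123_0001 .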

Spark jobs

Spark applications are broken down into multiple jobs, which are further broken down into multiple stages. Each stage can have multiple tasks, which are run on executor nodes (workers).

  • Click on a Spark App ID in the UI to open the Spark Jobs page, which provides an event timeline and summary of jobs within the application.

  • Click a job to open a Job Details page with a Directed Acyclic Graph (DAG) and summary of job stages.

  • Click on a stage or use the Stages tab to select a stage to open the Stage Details page.

    Stage Details includes a DAG visualization, an event timeline, and metrics for the tasks within the stage. You can use this page to troubleshoot issues related to straggler tasks, scheduler delays, and out-of-memory errors. The DAG visualization shows the line of code from which the stage is derived, helping you track issues back to the code.

  • Click on the Executors tab for information about the Spark application's driver and executor nodes.

    Important pieces of information on this page include the number of cores and the number of tasks that were run on each executor.