Dataproc Persistent History Server

Overview

After Dataproc clusters are deleted, users often wish to view job history files for diagnostic or other purposes. The Dataproc Persistent History Server (PHS) provides a UI to view job history for jobs run on active and deleted Dataproc clusters.

The Persistent History Server runs on a single-node Dataproc cluster, stores and accesses job history files in Cloud Storage, and supports MapReduce, Spark, and Pig jobs. This feature is available in Dataproc image version 1.4-debian10 and later.

Setting up job clusters

You specify the following flag and cluster properties when creating a Dataproc job cluster that will store job logs to be accessed and displayed by a Persistent History Server cluster.

  • --enable-component-gateway: Required flag. This flag must be used to enable the Component Gateway.
  • dataproc:job.history.to-gcs.enabled: Required cluster property. This property must be set to "true" to enable job history storage in Cloud Storage.
  • spark:spark.history.fs.logDirectory and spark:spark.eventLog.dir: Optional cluster properties. These properties specify the Cloud Storage locations to write Spark job history and event logs, respectively. If used, both properties must be set and point to directories within the same bucket.
    Sample properties:
    spark:spark.history.fs.logDirectory=gs://bucket-name/directory-name/spark-job-history,
    spark:spark.eventLog.dir=gs://bucket-name/directory-name/spark-job-history/events
    
  • mapred:mapreduce.jobhistory.intermediate-done-dir and mapred:mapreduce.jobhistory.done-dir: Optional cluster properties. These properties specify the Cloud Storage locations to write intermediate and final MapReduce job history files, respectively. If used, both properties must be set and point to directories within the same bucket. The mapreduce.jobhistory.intermediate-done-dir location is temporary storage; the intermediate files are moved to the mapreduce.jobhistory.done-dir location when the MapReduce job completes.
    Sample properties:
    mapred:mapreduce.jobhistory.done-dir=gs://bucket-name/directory-name/mapreduce-job-history/done,
    mapred:mapreduce.jobhistory.intermediate-done-dir=gs://bucket-name/directory-name/mapreduce-job-history/intermediate-done
    
  1. Run the gcloud dataproc clusters create command to create a job cluster. The cluster must be created with image version 1.4-debian10 or later. Note: For readability, the --properties flag values are displayed below on separate lines; when you run the command, all comma-separated --properties flag values must be specified on one line.
    gcloud dataproc clusters create cluster-name \
        --region=region \
        --image-version=1.4-debian10 \
        --enable-component-gateway \
        --properties='dataproc:job.history.to-gcs.enabled=true,
    spark:spark.history.fs.logDirectory=gs://bucket-name/directory-name/spark-job-history,
    spark:spark.eventLog.dir=gs://bucket-name/directory-name/spark-job-history/events,
    mapred:mapreduce.jobhistory.done-dir=gs://bucket-name/directory-name/mapreduce-job-history/done,
    mapred:mapreduce.jobhistory.intermediate-done-dir=gs://bucket-name/directory-name/mapreduce-job-history/intermediate-done'
    
Alternatively, run a simplified command that sets only the required property; Dataproc then saves the job history files to default Cloud Storage locations, as described below.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --image-version=1.4-debian10 \
    --enable-component-gateway \
    --properties='dataproc:job.history.to-gcs.enabled=true'

If you use the simplified command shown above, job history files will be saved in the Dataproc temp bucket in default directories: /spark-job-history, /mapreduce-job-history/done, and /mapreduce-job-history/intermediate-done. The temp bucket Cloud Storage location is listed in the output of the gcloud dataproc clusters describe cluster-name --region=region command. The Cloud Storage location of job history files is also listed in the cluster's /etc/spark/conf/spark-defaults.conf and /etc/hadoop/conf/mapred-site.xml files.
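For example, a minimal sketch that prints only the temp bucket name (the config.tempBucket field path is an assumption; if it differs in your gcloud version, inspect the full describe output instead):

# Print the Cloud Storage temp bucket that holds the default job history directories.
# The config.tempBucket field path is assumed; verify it against the full describe output.
gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format='value(config.tempBucket)'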

Examples after SSHing into the job cluster master node:

cat /etc/spark/conf/spark-defaults.conf
...
spark.history.fs.logDirectory=gs://temp-bucket/spark-job-history
spark.eventLog.dir=gs://temp-bucket/spark-job-history

cat /etc/hadoop/conf/mapred-site.xml
...
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>gs://temp-bucket/mapreduce-job-history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>gs://temp-bucket/mapreduce-job-history/intermediate-done</value>
</property>
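
To confirm that a job cluster is writing history files to Cloud Storage, you can run a sample job and then list the history directory. The following is a minimal sketch, assuming the Spark examples jar path that typically ships on Dataproc images and the temp-bucket name shown above:

# Run the SparkPi example on the job cluster; its event log should land in the
# configured Spark job history location.
gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# After the job finishes, list the history files (replace temp-bucket with your bucket name).
gsutil ls gs://temp-bucket/spark-job-history/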

Setting up a Persistent History Server

You specify the following flag and cluster properties when creating a PHS single-node cluster:

  • --enable-component-gateway: Required flag. This flag must be used to enable the Component Gateway.
  • spark:spark.history.fs.logDirectory: Required cluster property to enable persistent Spark job history. This property specifies the Cloud Storage bucket and directory(ies) where the PHS will access Spark job history logs written by job clusters (see Setting up job clusters). Instead of specifying specific bucket directories, you can use asterisks as wildcards (for example, gs://bucket-name/*/spark-job-history) to allow the PHS to match multiple directories in the specified bucket written to by different job clusters (but see Efficiency Consideration: Using Mid-Path Wildcards).
  • mapred:mapreduce.jobhistory.read-only.dir-pattern: Required cluster property to enable persistent MapReduce job history. This property specifies the Cloud Storage bucket directory(ies) where the PHS will access MapReduce job history logs written by job clusters (see Setting up job clusters). Instead of specifying specific bucket directories, you can use asterisks as wildcards (for example, gs://bucket-name/*/mapreduce-job-history/done) to allow the PHS to match multiple directories in the specified bucket written to by different job clusters (but see Efficiency Consideration: Using Mid-Path Wildcards).
  1. Run the gcloud dataproc clusters create command to create a single-node Dataproc PHS cluster.
    gcloud dataproc clusters create cluster-name \
        --single-node \
        --region=region \
        --image-version=1.4-debian10 \
        --enable-component-gateway \
        --properties='spark:spark.history.fs.logDirectory=gs://bucket-name/*/spark-job-history,mapred:mapreduce.jobhistory.read-only.dir-pattern=gs://bucket-name/*/mapreduce-job-history/done'
    

Viewing job history files

  1. Go to the Cluster details page for the PHS single-node cluster in the Cloud Console, then click the "WEB INTERFACES" tab.

  2. Click "MapReduce Job History" or "Spark History Server" to view the MapReduce and Spark job history UIs (a command-line alternative for finding these interface URLs is sketched after the example below).

    Example:

    The following screenshot shows the PHS Spark History Server UI displaying links to Spark jobs run on job-cluster-1 and job-cluster-2 after setting the job clusters' spark:spark.history.fs.logDirectory and spark:spark.eventLog.dir properties and the PHS cluster's spark:spark.history.fs.logDirectory property to the following locations:

    Cluster         spark.history.fs.logDirectory setting
    job-cluster-1   gs://example-cloud-storage-bucket/job-cluster-1/spark-job-history
    job-cluster-2   gs://example-cloud-storage-bucket/job-cluster-2/spark-job-history
    phs-cluster     gs://example-cloud-storage-bucket/*/spark-job-history
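
As an alternative to opening the Cloud Console, you can look up the PHS cluster's Component Gateway URLs from the command line. The following is a minimal sketch; the config.endpointConfig.httpPorts field path is an assumption, so verify it against the full describe output if your gcloud version differs:

# List the Component Gateway web interface URLs (including the Spark History Server
# and MapReduce Job History UIs) exposed by the PHS cluster.
gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format='yaml(config.endpointConfig.httpPorts)'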