Dataproc job output and logs

When you submit a Dataproc job, Dataproc automatically gathers the job output and makes it available to you. This means you can quickly review job output without having to maintain a connection to the cluster while your jobs run or search through complicated log files.

Spark logs

There are two types of Spark logs: Spark driver logs and Spark executor logs. Spark driver logs contain job output. Spark executor logs contain job executable or launcher output, such as a spark-submit "Submitted application xxx" message, and can be helpful for debugging job failures.

The Dataproc job driver, which is distinct from the Spark driver, is a launcher for many job types. When launching Spark jobs, it runs as a wrapper on the underlying spark-submit executable, which launches the Spark driver. The Spark driver runs the job on the Dataproc cluster in Spark client or cluster mode:

  • client mode: the Spark driver runs the job in the spark-submit process, and Spark logs are sent to the Dataproc job driver.

  • cluster mode: the Spark driver runs the job in a YARN container. Spark driver logs are not available to the Dataproc job driver.

Dataproc and Spark job properties overview

Property: dataproc:dataproc.logging.stackdriver.job.driver.enable
Values: true or false
Default: false
Description: Must be set at cluster creation time. When true, job driver output is in Logging, associated with the job resource; when false, job driver output is not in Logging.
Note: The following cluster property settings are also required to enable job driver logs in Logging, and are set by default when a cluster is created: dataproc:dataproc.logging.stackdriver.enable=true and dataproc:jobs.file-backed-output.enable=true

Property: dataproc:dataproc.logging.stackdriver.job.yarn.container.enable
Values: true or false
Default: false
Description: Must be set at cluster creation time. When true, job YARN container logs are associated with the job resource; when false, job YARN container logs are associated with the cluster resource.

Property: spark:spark.submit.deployMode
Values: client or cluster
Default: client
Description: Controls Spark client or cluster mode.
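As a sketch, the cluster-level logging properties above can be set with the --properties flag at cluster creation; the cluster name and region below are placeholders:

```shell
# Create a cluster whose job driver output and job YARN container logs
# go to Logging, associated with the job resource.
# "example-cluster" and "us-central1" are placeholders.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true'
```

Both properties must be set at cluster creation time; they cannot be changed per job.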

Spark jobs submitted using the Dataproc jobs API

The tables in this section list the effect of different property settings on the destination of Dataproc job driver output when jobs are submitted through the Dataproc jobs API, which includes job submission through the Google Cloud console, gcloud CLI, and Cloud Client Libraries.

The listed Dataproc and Spark properties can be set with the --properties flag when a cluster is created; they then apply to all Spark jobs run on the cluster. Spark properties can also be set with the --properties flag (without the "spark:" prefix) when a job is submitted to the Dataproc jobs API; they then apply only to that job.
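For example, here is a sketch of setting a Spark property for a single job at submission time; the cluster name, region, and example jar path are assumptions, and note the property is given without the "spark:" prefix:

```shell
# Submit a Spark job that runs in cluster mode; the deployMode property
# applies only to this job. Cluster name, region, and jar path are
# placeholders.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties=spark.submit.deployMode=cluster \
    -- 1000
```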

Dataproc job driver output

The following table lists the effect of different property settings on the destination of Dataproc job driver output.

dataproc:dataproc.logging.stackdriver.job.driver.enable = false (default)
Output:
  • Streamed to client
  • In Cloud Storage at the Dataproc-generated driverOutputResourceUri
  • Not in Logging

dataproc:dataproc.logging.stackdriver.job.driver.enable = true
Output:
  • Streamed to client
  • In Cloud Storage at the Dataproc-generated driverOutputResourceUri
  • In Logging: dataproc.job.driver under the job resource
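With job driver logging enabled, one way to read those entries is with gcloud logging read; the sketch below assumes the project ID is a placeholder and that your filter may need adjusting for your environment:

```shell
# List recent job driver log entries. With
# dataproc.logging.stackdriver.job.driver.enable=true, these are recorded
# under the cloud_dataproc_job resource with log ID dataproc.job.driver.
# "my-project-id" is a placeholder.
gcloud logging read \
    'resource.type="cloud_dataproc_job" AND log_id("dataproc.job.driver")' \
    --project=my-project-id \
    --limit=10
```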

Spark driver logs

The following table lists the effect of different property settings on the destination of Spark driver logs.

spark:spark.submit.deployMode = client
dataproc:dataproc.logging.stackdriver.job.driver.enable = false (default)
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable = true or false
Driver output:
  • Streamed to client
  • In Cloud Storage at the Dataproc-generated driverOutputResourceUri
  • Not in Logging

spark:spark.submit.deployMode = client
dataproc:dataproc.logging.stackdriver.job.driver.enable = true
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable = true or false
Driver output:
  • Streamed to client
  • In Cloud Storage at the Dataproc-generated driverOutputResourceUri
  • In Logging: dataproc.job.driver under the job resource

spark:spark.submit.deployMode = cluster
dataproc:dataproc.logging.stackdriver.job.driver.enable = false (default)
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable = false
Driver output:
  • Not streamed to client
  • Not in Cloud Storage
  • In Logging: yarn-userlogs under the cluster resource

spark:spark.submit.deployMode = cluster
dataproc:dataproc.logging.stackdriver.job.driver.enable = true
dataproc:dataproc.logging.stackdriver.job.yarn.container.enable = true
Driver output:
  • Not streamed to client
  • Not in Cloud Storage
  • In Logging: dataproc.job.yarn.container under the job resource

Spark executor logs

The following table lists the effect of different property settings on the destination of Spark executor logs.

dataproc:dataproc.logging.stackdriver.job.yarn.container.enable = false (default)
Executor logs: In Logging: yarn-userlogs under the cluster resource

dataproc:dataproc.logging.stackdriver.job.yarn.container.enable = true
Executor logs: In Logging: dataproc.job.yarn.container under the job resource

Spark jobs submitted without using the Dataproc jobs API

This section lists the effect of different property settings on the destination of Spark job logs when jobs are submitted without using the Dataproc jobs API, for example when submitting a job directly on a cluster node using spark-submit or when using a Jupyter or Zeppelin notebook. These jobs do not have Dataproc job IDs or drivers.
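For example, here is a sketch of submitting a job directly on a cluster node, bypassing the Dataproc jobs API; the jar path and argument are illustrative:

```shell
# Run SparkPi from a cluster node with spark-submit. In cluster mode,
# driver logs go to YARN container logs (yarn-userlogs under the cluster
# resource in Logging), and no Dataproc driverOutputResourceUri is created.
spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --deploy-mode cluster \
    /usr/lib/spark/examples/jars/spark-examples.jar 1000
```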

Spark driver logs

The following table lists the effect of different property settings on the destination of Spark driver logs for jobs not submitted through the Dataproc jobs API.

spark:spark.submit.deployMode = client
Driver output:
  • Streamed to client
  • Not in Cloud Storage
  • Not in Logging

spark:spark.submit.deployMode = cluster
Driver output:
  • Not streamed to client
  • Not in Cloud Storage
  • In Logging: yarn-userlogs under the cluster resource

Spark executor logs

When Spark jobs are not submitted through the Dataproc jobs API, executor logs are in Logging: yarn-userlogs under the cluster resource.
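One way to read those YARN userlog entries is with gcloud logging read; in the sketch below, the cluster name and project ID are placeholders and the filter may need adjusting for your environment:

```shell
# List recent YARN userlog entries for a cluster. These are recorded
# under the cloud_dataproc_cluster resource with log ID yarn-userlogs.
# "example-cluster" and "my-project-id" are placeholders.
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND log_id("yarn-userlogs") AND resource.labels.cluster_name="example-cluster"' \
    --project=my-project-id \
    --limit=10
```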

View job output

You can access Dataproc job output in the Google Cloud console, the gcloud CLI, Cloud Storage, or Logging.

Console

To view job output, go to your project's Dataproc Jobs section, then click the Job ID.

If the job is running, job output periodically refreshes with new content.

gcloud command

When you submit a job with the gcloud dataproc jobs submit command, job output is displayed on the console. You can "rejoin" output later, on a different computer, or in a new window by passing your job's ID to the gcloud dataproc jobs wait command. The Job ID is a GUID, such as 5c1754a5-34f7-4553-b667-8a1199cb9cab. Here's an example:

gcloud dataproc jobs wait 5c1754a5-34f7-4553-b667-8a1199cb9cab \
    --project my-project-id --region my-cluster-region
Waiting for job output...
... INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2
... 16:47:45 INFO client.RMProxy: Connecting to ResourceManager at my-test-cluster-m/
...

Cloud Storage

Job output is stored in Cloud Storage in either the staging bucket or the bucket you specified when you created your cluster. A link to job output in Cloud Storage is provided in the Job.driverOutputResourceUri field returned by:

  • a jobs.get API request.
  • a gcloud dataproc jobs describe job-id command.
    $ gcloud dataproc jobs describe spark-pi
    ...
    driverOutputResourceUri: gs://dataproc-nnn/jobs/spark-pi/driveroutput
    ...
    
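Driver output is stored under that URI as one or more objects; a hedged sketch of fetching it with gsutil, reusing the placeholder bucket path from the example above:

```shell
# Print the driver output stored under driverOutputResourceUri.
# The bucket and job path are placeholders from the example above;
# the wildcard matches the chunked output objects.
gsutil cat 'gs://dataproc-nnn/jobs/spark-pi/driveroutput*'
```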

Logging

See Dataproc Logs for information on how to view Dataproc job output in Logging.