Job driver output

You can easily submit, monitor, and control jobs on Dataproc clusters using the gcloud command-line tool, the Google Cloud Console, or the Cloud Dataproc REST API. When you use one of these mechanisms to submit your job, Cloud Dataproc automatically gathers the driver (console) output from your job, and makes it available to you. This means you can quickly review driver output without having to maintain a connection to the cluster while your jobs run or look through complicated log files.
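For example, a minimal submission with the gcloud command-line tool looks like the following sketch; the cluster name, region, and jar path are placeholder values you would replace with your own:

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

The job's driver output streams back to your terminal as the job runs.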

Configuring logging

By default, Cloud Dataproc uses a logging level of WARN for driver programs. You can adjust this setting from the command line by submitting a job with the --driver-log-levels option.

The special root package controls the root logger level. For example:

gcloud dataproc jobs submit hadoop ... \
  --driver-log-levels root=FATAL,com.example=INFO

Logging can be set at a more granular level for each job. For example, to assist in debugging issues when reading files from Cloud Storage, you can submit a job with the --driver-log-levels option, specifying the DEBUG log level as follows:

gcloud dataproc jobs submit hadoop ... \
  --driver-log-levels com.google=DEBUG

Setting executor log levels

You can set Spark, Hadoop, Flink, and other OSS component executor log levels on cluster nodes with a cluster initialization action that edits or replaces the .../log4j.properties file (see Apache Log4j 2).

Sample /etc/spark/conf/log4j.properties file:

# Set everything to be logged to the console.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Settings to quiet third party logs.
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

# Reduce verbosity for other core classes.
log4j.logger.org.apache.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

# Spark 2.0 specific output.
log4j.logger.org.apache.spark.sql.execution.datasources.FileScanRDD=ERROR
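As a minimal sketch, an initialization action could replace this file with a custom version you stage in Cloud Storage; the bucket, object names, and script name below are hypothetical:

#!/bin/bash
# Hypothetical initialization action (replace-log4j.sh): overwrite the Spark
# log4j configuration on each cluster node with a custom version from Cloud Storage.
gsutil cp gs://my-bucket/custom-log4j.properties /etc/spark/conf/log4j.properties

You would then reference the script when creating the cluster:

gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --initialization-actions gs://my-bucket/replace-log4j.sh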

Accessing job driver output

You can access Cloud Dataproc job driver output using the Cloud Console, the gcloud command-line tool, or Cloud Storage.

gcloud command

When you submit a job with the gcloud dataproc jobs submit command, the job's driver output is displayed on the console. You can "rejoin" driver output at a later time, on a different computer, or in a new window by passing your job's ID to the gcloud dataproc jobs wait command. The job ID is a GUID, such as 5c1754a5-34f7-4553-b667-8a1199cb9cab. Here's an example:

gcloud dataproc jobs wait 5c1754a5-34f7-4553-b667-8a1199cb9cab \
    --project my-project-id --region my-cluster-region
Waiting for job output...
... INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2
... 16:47:45 INFO client.RMProxy: Connecting to ResourceManager at my-test-cluster-m/
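If you don't have the job ID at hand, you can list your project's recent jobs to find it; a quick sketch, assuming the same placeholder region:

gcloud dataproc jobs list --region my-cluster-region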


Console

To view job driver output, go to your project's Dataproc Jobs section, then click the Job ID to view job output.

If the job is running, the job driver output periodically refreshes with new content.

Cloud Storage

Job driver output is stored in Cloud Storage in either the staging bucket or the bucket you specified when you created your cluster. A link to job driver output in Cloud Storage is provided in the Job.driverOutputResourceUri field returned by:

  • a jobs.get API request.
  • a gcloud dataproc jobs describe job-id command.
    $ gcloud dataproc jobs describe spark-pi
    driverOutputResourceUri: gs://dataproc-nnn/jobs/spark-pi/driveroutput
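Once you have that URI, you can also read the output directly from Cloud Storage with gsutil; a sketch using the bucket shown above, assuming the driver output has been written as one or more files under that prefix:

gsutil cat 'gs://dataproc-nnn/jobs/spark-pi/driveroutput*'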