When you submit a Dataproc job, Dataproc automatically gathers the job output, and makes it available to you. This means you can quickly review job output without having to maintain a connection to the cluster while your jobs run or look through complicated log files.
There are two types of Spark logs: Spark driver logs and Spark executor logs.
Spark driver logs contain job output; Spark executor logs contain job executable
or launcher output, such as a
spark-submit "Submitted application xxx" message, and
can be helpful for debugging job failures.
The Dataproc job driver, which is distinct from the Spark driver,
is a launcher for many job types. When launching Spark jobs, it runs as a
wrapper on the underlying
spark-submit executable, which launches the Spark
driver. The Spark driver runs the job on the Dataproc cluster in Spark
client or cluster mode:
- Client mode: the Spark driver runs the job in the spark-submit process, and Spark logs are sent to the Dataproc job driver.
- Cluster mode: the Spark driver runs the job in a YARN container. Spark driver logs are not available to the Dataproc job driver.
Dataproc and Spark job properties overview
| Property | Values | Default | Description |
| dataproc:dataproc.logging.stackdriver.job.driver.enable | true or false | false | Must be set at cluster creation time. When true, job driver output is in Logging, associated with the job resource; when false, job driver output is not in Logging. |
| dataproc:dataproc.logging.stackdriver.job.yarn.container.enable | true or false | false | Must be set at cluster creation time. When true, YARN container logs are associated with the job resource; when false, YARN container logs are associated with the cluster resource. |
| spark:spark.submit.deployMode | client or cluster | client | Controls Spark client or cluster mode. |

Note: The following cluster property settings are also required to enable job driver logs in Logging, and are set by default when a cluster is created: dataproc:dataproc.logging.stackdriver.enable=true and dataproc:jobs.file-backed-output.enable=true.
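For example, job driver logs can be routed to Logging by setting the first property at cluster creation time; the cluster name and region below are illustrative placeholders:

```shell
# Create a cluster whose job driver output is also sent to Cloud Logging.
# "my-cluster" and "us-central1" are example values; substitute your own.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='dataproc:dataproc.logging.stackdriver.job.driver.enable=true'
```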
Spark jobs submitted using the Dataproc jobs API
The tables in this section list the effect of different property settings on the
destination of Dataproc job driver output and Spark logs when jobs are submitted
through the Dataproc
jobs API, which includes job submission through the
Google Cloud console, gcloud CLI, and Cloud Client Libraries.
The listed Dataproc and Spark properties
can be set with the
--properties flag when a cluster is created, and will apply
to all Spark jobs run on the cluster; Spark properties can also be set with the
--properties flag (without the "spark:" prefix) when a job is
submitted to the Dataproc
jobs API, and will apply only to the job.
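As an illustration, the deploy mode property from the table above can be set cluster-wide at creation time (with the "spark:" prefix) or per-job at submission time (without it); the cluster name, region, and jar path below are illustrative:

```shell
# Cluster-wide: applies to all Spark jobs run on the cluster.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='spark:spark.submit.deployMode=cluster'

# Per-job: applies only to this job. Note the property is passed
# without the "spark:" prefix when submitting through the jobs API.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties='spark.submit.deployMode=cluster' \
    -- 1000
```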
Dataproc job driver output
The following table lists the effect of different property settings on the destination of Dataproc job driver output.

| dataproc:dataproc.logging.stackdriver.job.driver.enable | Job driver output |
| false (default) | Streamed to the client; in Cloud Storage in the Dataproc-generated driverOutputResourceUri; not in Logging |
| true | Streamed to the client; in Cloud Storage in the Dataproc-generated driverOutputResourceUri; in Logging: dataproc.job.driver under the job resource |
Spark driver logs
The following table lists the effect of different property settings on the destination of Spark driver logs.

| spark:spark.submit.deployMode | dataproc:dataproc.logging.stackdriver.job.driver.enable | dataproc:dataproc.logging.stackdriver.job.yarn.container.enable | Spark driver logs |
| client | false (default) | true or false | Streamed to the client; in Cloud Storage; not in Logging |
| client | true | true or false | Streamed to the client; in Cloud Storage; in Logging: dataproc.job.driver under the job resource |
Spark executor logs
The following table lists the effect of different property settings on the destination of Spark executor logs.

| dataproc:dataproc.logging.stackdriver.job.yarn.container.enable | Spark executor logs |
| false (default) | In Logging: yarn-userlogs under the cluster resource |
| true | In Logging: dataproc.job.yarn.container under the job resource |
Spark jobs submitted without using the Dataproc jobs API
This section lists the effect of different property settings on the
destination of Spark job logs when jobs are submitted
without using the Dataproc
jobs API, for example when submitting
a job directly on a cluster node using
spark-submit or when using a Jupyter
or Zeppelin notebook. These jobs do not have Dataproc job IDs or drivers.
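For example, a job run directly on a cluster node with spark-submit (the examples jar path shown is the usual location on Dataproc images, but may vary):

```shell
# Run SparkPi directly on a cluster node. This bypasses the Dataproc
# jobs API, so the job gets no Dataproc job ID and no job driver.
spark-submit \
    --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark/examples/jars/spark-examples.jar 1000
```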
Spark driver logs
The following table lists the effect of different property settings on the
destination of Spark driver logs for jobs not submitted through the Dataproc jobs API.

| spark:spark.submit.deployMode | Spark driver logs |
| client | Not in Logging |
| cluster | In Logging: yarn-userlogs under the cluster resource |
Spark executor logs
When Spark jobs are not submitted through the Dataproc jobs API, executor logs are in Logging: yarn-userlogs under the cluster resource.
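A sketch of reading these entries with the gcloud CLI; the resource type and log name in the filter are assumptions and may differ by image version:

```shell
# Read recent YARN container (executor) log entries from Cloud Logging.
# 'cloud_dataproc_cluster' and 'yarn-userlogs' are assumed identifiers;
# "my-project-id" is a placeholder.
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND log_name:"yarn-userlogs"' \
    --project=my-project-id \
    --limit=10
```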
View job output
You can access Dataproc job output in the Google Cloud console, the gcloud CLI, Cloud Storage, or Logging.
To view job output, go to your project's Dataproc Jobs section, then click on the Job ID to view job output.
If the job is running, job output periodically refreshes with new content.
When you submit a job with the
gcloud dataproc jobs submit
command, job output is displayed on the console. You can "rejoin"
output at a later time, on a different computer, or in
a new window by passing your job's ID to the
gcloud dataproc jobs wait
command. The job ID is a GUID, such as
5c1754a5-34f7-4553-b667-8a1199cb9cab. Here's an example:
gcloud dataproc jobs wait 5c1754a5-34f7-4553-b667-8a1199cb9cab \
    --project my-project-id --region my-cluster-region
Waiting for job output...
...
INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2
...
16:47:45 INFO client.RMProxy: Connecting to ResourceManager at my-test-cluster-m/
...
Job output is stored in Cloud Storage in either the staging bucket or the bucket you specified when you created your cluster. A link to job output in Cloud Storage is provided in the Job.driverOutputResourceUri field returned by:
- a jobs.get API request.
- a gcloud dataproc jobs describe job-id command.
$ gcloud dataproc jobs describe spark-pi
...
driverOutputResourceUri: gs://dataproc-nnn/jobs/spark-pi/driveroutput
...
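To fetch the stored output itself, the URI can be extracted and passed to gsutil; a sketch, assuming a job named spark-pi (the region value is a placeholder):

```shell
# Extract just the driverOutputResourceUri field from the job description.
OUTPUT_URI=$(gcloud dataproc jobs describe spark-pi \
    --region=us-central1 \
    --format='value(driverOutputResourceUri)')

# Driver output is stored as one or more objects under this prefix;
# concatenate them to stdout.
gsutil cat "${OUTPUT_URI}*"
```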