You can submit, monitor, and control jobs on Cloud Dataproc clusters using the gcloud command-line tool, the Google Cloud Platform Console, or the Cloud Dataproc REST API. When you use one of these mechanisms to submit a job, Cloud Dataproc automatically gathers the driver (console) output from the job and makes it available to you. This means you can quickly review driver output without maintaining a connection to the cluster while your jobs run or sifting through complicated log files.
By default, Cloud Dataproc uses a log level of INFO for driver programs. You can adjust this setting when submitting a job from the command line with the --driver-log-levels option, which takes a comma-separated list of package=level pairs; the special root package controls the root logger level. For example:

gcloud dataproc jobs submit hadoop ... \
    --driver-log-levels root=FATAL,com.example=INFO
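If you build gcloud invocations from a script, the flag value can be assembled from a mapping of packages to levels. This is a minimal sketch; the helper name is my own, not part of any Dataproc library:

```python
def format_driver_log_levels(levels):
    """Build the value for gcloud's --driver-log-levels flag from a
    {package: level} mapping, e.g. {"root": "FATAL"} -> "root=FATAL"."""
    return ",".join(f"{package}={level}" for package, level in levels.items())

# Reproduces the flag value from the example above.
print(format_driver_log_levels({"root": "FATAL", "com.example": "INFO"}))
# → root=FATAL,com.example=INFO
```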
Logging can be set at a more granular level for each job. For example, to assist in debugging issues when reading files from Cloud Storage, you can submit a job with the --driver-log-levels option, specifying the DEBUG log level for the Cloud Storage connector package as follows:

gcloud dataproc jobs submit hadoop ... \
    --driver-log-levels com.google.cloud.hadoop.gcsio=DEBUG
Accessing job driver output
You can access Cloud Dataproc job driver output using the GCP Console, the gcloud command-line tool, or Cloud Storage.
When you submit a job with the gcloud dataproc jobs submit command, the job's driver output is displayed on the console. You can "rejoin" driver output at a later time, on a different computer, or in a new window by passing your job's ID to the gcloud dataproc jobs wait command. The Job ID is a GUID, such as 5c1754a5-34f7-4553-b667-8a1199cb9cab. Here's an example:
gcloud dataproc jobs wait 5c1754a5-34f7-4553-b667-8a1199cb9cab
Waiting for job output...
... INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.2-hadoop2
... 16:47:45 INFO client.RMProxy: Connecting to ResourceManager at my-test-cluster-m/ ...
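From a script, the same rejoin can be done by building the wait command and running it with subprocess. This is a sketch assuming gcloud is installed and authenticated; the function names are my own:

```python
import subprocess

def build_wait_command(job_id):
    """Argument vector for `gcloud dataproc jobs wait <job-id>`."""
    return ["gcloud", "dataproc", "jobs", "wait", job_id]

def rejoin_driver_output(job_id):
    # Streams the job's driver output to this terminal until the job finishes.
    subprocess.run(build_wait_command(job_id), check=True)

# Example (requires an authenticated gcloud and a real job ID):
# rejoin_driver_output("5c1754a5-34f7-4553-b667-8a1199cb9cab")
```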
The GCP Console lets you view a job's driver output in real time. To view job output, go to your project's Cloud Dataproc Jobs section, then click the Job ID.
If the job is running, the output periodically refreshes with new content.
When you create a Cloud Dataproc cluster, you can specify a Cloud Storage bucket to use with your cluster. Job driver output is saved in this bucket.
Cloud Dataproc uses a defined folder structure for Cloud Storage buckets attached to clusters, and more than one cluster can share the same Cloud Storage bucket. The folder structure used for saving job driver output in Cloud Storage is:
cloud-storage-bucket-name
  - google-cloud-dataproc-metainfo
    - list of cluster IDs
      - list of job IDs
        - list of output logs for a job
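Following the folder levels listed above, the Cloud Storage prefix for a given job's driver output can be derived from the bucket name, cluster ID, and job ID. This is an assumption based on the structure shown here; the exact layout can vary between Dataproc versions, so verify against your own bucket:

```python
def driver_output_prefix(bucket, cluster_id, job_id):
    """Cloud Storage prefix for a job's driver output, assuming the
    bucket layout described above."""
    return f"gs://{bucket}/google-cloud-dataproc-metainfo/{cluster_id}/{job_id}/"

# Hypothetical bucket and cluster names for illustration.
print(driver_output_prefix("my-bucket", "my-cluster-id",
                           "5c1754a5-34f7-4553-b667-8a1199cb9cab"))
# → gs://my-bucket/google-cloud-dataproc-metainfo/my-cluster-id/5c1754a5-34f7-4553-b667-8a1199cb9cab/
```

You can then list the output logs under this prefix with a tool such as gsutil.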
You can navigate to a cluster and job in the GCP Console to view the driver output for that job in Cloud Storage.