Dataproc logs

Dataproc job and cluster logs can be viewed, searched, filtered, and archived in Cloud Logging.

Job driver logging levels

Dataproc uses a default logging level of INFO for job driver programs. You can change this setting for one or more packages with the gcloud dataproc jobs submit command, which lets you submit a job and specify job driver logging levels with the --driver-log-levels flag.

The root package controls the root logger level. For example:

gcloud dataproc jobs submit hadoop ...\
  --driver-log-levels root=FATAL,com.example=INFO

Logging levels can also be set at a more granular level for specific packages in a job. For example, to assist in debugging issues when reading files from Cloud Storage, you can submit a job as follows:

gcloud dataproc jobs submit hadoop ...\
  --driver-log-levels com.google.cloud.hadoop.gcsio=DEBUG

Component executor logging levels

You can set Spark, Hadoop, Flink, and other Dataproc component executor logging levels when you create a cluster.
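
As an illustrative sketch only (the property key below is an assumption, and the supported keys vary by image version and component), a logging level can be supplied as a cluster property at creation time:

# Sketch: set a Spark logging level through a cluster property at creation time.
# The spark: prefix maps keys into spark-defaults.conf; the spark.log.level key
# assumes a Spark version that supports it and is shown for illustration only.
gcloud dataproc clusters create example-cluster \
    --region=region \
    --properties='spark:spark.log.level=DEBUG'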

Dataproc job driver logs in Logging

See Dataproc job output and logs for information on enabling Dataproc job driver logs in Logging.
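
As a sketch of what enabling them looks like (see the linked page for the authoritative property names), job driver logs are turned on with a cluster property at cluster creation time:

# Sketch: enable Dataproc job driver logs in Logging when creating a cluster.
# The property below is believed to be the cluster property documented for this
# purpose; verify it against the linked page before relying on it.
gcloud dataproc clusters create example-cluster \
    --region=region \
    --properties='dataproc:dataproc.logging.stackdriver.job.driver.enable=true'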

Access job logs in Logging

You can access Dataproc job logs using the Logs Explorer, the gcloud logging command, or the Logging API.

Console

Dataproc job driver and YARN container logs are listed under the Cloud Dataproc Job resource.

Example: Job driver log after running a Logs Explorer query with the following selections:

  • Resource: Cloud Dataproc Job
  • Log name: dataproc.job.driver

Example: YARN container log after running a Logs Explorer query with the following selections:

  • Resource: Cloud Dataproc Job
  • Log name: dataproc.job.yarn.container
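
Equivalently, you can run the query directly in the Logs Explorer query pane; project-id, cluster-region, and my-job-id are placeholders, and swapping the log name for dataproc.job.yarn.container returns the YARN container logs instead:

resource.type="cloud_dataproc_job"
resource.labels.region="cluster-region"
resource.labels.job_id="my-job-id"
log_name="projects/project-id/logs/dataproc.job.driver"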

gcloud

You can read job log entries using the gcloud logging read command. The filter expression must be enclosed in quotes ("..."). The following command uses job resource labels to filter the returned log entries.

gcloud logging read \
    "resource.type=cloud_dataproc_job \
    resource.labels.region=cluster-region \
    resource.labels.job_id=my-job-id"

Sample output (partial):

jsonPayload:
  class: org.apache.hadoop.hdfs.StateChange
  filename: hadoop-hdfs-namenode-test-dataproc-resize-cluster-20190410-38an-m-0.log
  ...
logName: projects/project-id/logs/hadoop-hdfs-namenode
---
jsonPayload:
  class: SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager
  filename: cluster-name-dataproc-resize-cluster-20190410-38an-m-0.log
  ...
logName: projects/project-id/logs/hadoop-hdfs-namenode

REST API

You can use the Logging REST API to list log entries (see entries.list).
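
As a minimal sketch, the same kind of filter used in the gcloud example can be posted to the entries.list method with an HTTP client (project-id and my-job-id are placeholders):

# Sketch: list Dataproc job log entries through the Logging API entries.list method.
# project-id and my-job-id are placeholders.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "resourceNames": ["projects/project-id"],
          "filter": "resource.type=cloud_dataproc_job resource.labels.job_id=my-job-id",
          "orderBy": "timestamp desc",
          "pageSize": 10
        }' \
    https://logging.googleapis.com/v2/entries:list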

Dataproc cluster logs in Logging

Dataproc exports the following Apache Hadoop, Spark, Hive, Zookeeper, and other Dataproc cluster logs to Cloud Logging.

Log Type             Log Name                         Description
Master daemon logs   hadoop-hdfs                      Journal node
                     hadoop-hdfs-namenode             HDFS namenode
                     hadoop-hdfs-secondarynamenode    HDFS secondary namenode
                     hadoop-hdfs-zkfc                 Zookeeper failover controller
                     hadoop-yarn-resourcemanager      YARN resource manager
                     hadoop-yarn-timelineserver       YARN timeline server
                     hive-metastore                   Hive metastore
                     hive-server2                     Hive server2
                     mapred-mapred-historyserver      Mapreduce job history server
                     zookeeper                        Zookeeper server
Worker daemon logs   hadoop-hdfs-datanode             HDFS datanode
                     hadoop-yarn-nodemanager          YARN nodemanager
System logs          autoscaler                       Dataproc autoscaler log
                     google.dataproc.agent            Dataproc agent log
                     google.dataproc.startup          Dataproc startup script log + initialization action log

Access cluster logs in Cloud Logging

You can access Dataproc cluster logs using the Logs Explorer, the gcloud logging command, or the Logging API.

Console

Make the following query selections to view cluster logs in the Logs Explorer:

  • Resource: Cloud Dataproc Cluster
  • Log name: a log name from the table above
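
Equivalently, a query can be entered directly in the Logs Explorer query pane; hadoop-yarn-resourcemanager is used as an example log name here, and cluster-region, cluster-name, and project-id are placeholders:

resource.type="cloud_dataproc_cluster"
resource.labels.region="cluster-region"
resource.labels.cluster_name="cluster-name"
log_name="projects/project-id/logs/hadoop-yarn-resourcemanager"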

gcloud

You can read cluster log entries using the gcloud logging read command. The filter expression must be enclosed in quotes ("..."). The following command uses cluster resource labels to filter the returned log entries.

gcloud logging read \
    "resource.type=cloud_dataproc_cluster \
    resource.labels.region=cluster-region \
    resource.labels.cluster_name=cluster-name \
    resource.labels.cluster_uuid=cluster-uuid"

Sample output (partial):

jsonPayload:
  class: org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService
  filename: hadoop-yarn-resourcemanager-cluster-name-m.log
  ...
logName: projects/project-id/logs/hadoop-yarn-resourcemanager
---
jsonPayload:
  class: org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService
  filename: hadoop-yarn-resourcemanager-component-gateway-cluster-m.log
  ...
logName: projects/project-id/logs/hadoop-yarn-resourcemanager

REST API

You can use the Logging REST API to list log entries (see entries.list).

Permissions

To write logs to Logging, the Dataproc VM service account must have the logging.logWriter IAM role. The default Dataproc service account has this role. If you use a custom service account, you must assign this role to it.
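
For example, if the cluster runs as a custom VM service account, the role can be granted with a project-level IAM binding (the project ID and service account email below are placeholders):

# Grant the Logs Writer role to a custom Dataproc VM service account.
# project-id and the service account email are placeholders.
gcloud projects add-iam-policy-binding project-id \
    --member="serviceAccount:custom-sa@project-id.iam.gserviceaccount.com" \
    --role="roles/logging.logWriter"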

Protecting the logs

By default, logs in Logging are encrypted at rest. You can enable customer-managed encryption keys (CMEK) to encrypt the logs. For more information on CMEK support, see Manage the keys that protect Log Router data and Manage the keys that protect Logging storage data.

What's next