Diagnose Dataproc clusters

Log and configuration information is useful for troubleshooting a cluster or job. Unfortunately, there are many log and configuration files, and gathering each one for investigation can be time consuming. To address this problem, Cloud Dataproc clusters support a special diagnose command through the Cloud SDK. This command gathers and archives important system, Spark/Hadoop, and Cloud Dataproc logs, and then uploads the archive to the Cloud Storage bucket attached to your cluster.

Using the diagnose command

You can use the Cloud SDK diagnose command on your Cloud Dataproc clusters (see Dataproc and Cloud SDK).

Once the Cloud SDK is installed and configured, you can run the gcloud dataproc clusters diagnose command on your cluster as shown below. Replace cluster-name with the name of your cluster and region with your cluster's region, for example, --region=us-central1.

gcloud dataproc clusters diagnose cluster-name \
    --region=region \
    ... other args ...
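
For example, to run diagnostics on a hypothetical cluster named my-cluster in the us-central1 region:

gcloud dataproc clusters diagnose my-cluster \
    --region=us-central1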

The command outputs the name and location of the archive that contains your data.

...
Saving archive to cloud
Copying file:///tmp/tmp.FgWEq3f2DJ/diagnostic.tar ...
Uploading   ...23db9-762e-4593-8a5a-f4abd75527e6/diagnostic.tar ...
Diagnostic results saved in:
gs://bucket-name/.../cluster-uuid/.../job-id/diagnostic.tar
    ...
In this example, bucket-name is the Cloud Storage bucket attached to your cluster, cluster-uuid is the unique ID (UUID) of your cluster, and job-id is the UUID belonging to the system task that ran the diagnose command.
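
If you want to inspect the archive locally, you can copy it out of Cloud Storage and unpack it. The gs:// path below is the placeholder path printed above; substitute the actual path from your own output:

gsutil cp gs://bucket-name/.../cluster-uuid/.../job-id/diagnostic.tar .
tar -xf diagnostic.tar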

When you create a Cloud Dataproc cluster, Cloud Dataproc automatically creates a Cloud Storage bucket and attaches it to your cluster. The diagnose command outputs the archive file to this bucket. To determine the name of the bucket created by Cloud Dataproc, use the Cloud SDK clusters describe command. The bucket associated with your cluster is listed next to configurationBucket.

gcloud dataproc clusters describe cluster-name \
    --region=region
...
  clusterName: cluster-name
  clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec6e0eac
  configuration:
  configurationBucket: dataproc-edc9d85f-...-us
  ...
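
If you only want the bucket name, you can have gcloud print just that field. The field path below is taken from the output shown above; newer gcloud releases may report it as config.configBucket instead:

gcloud dataproc clusters describe cluster-name \
    --region=region \
    --format="value(configuration.configurationBucket)"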

Sharing the data gathered by diagnose

You can share the archive generated by the diagnose command in two ways:

  1. Download the file from Cloud Storage, then share the downloaded archive.
  2. Change the permissions on the archive to allow other Google Cloud Platform users or projects to access the file.

For example, the following command grants read permission on the diagnose archive to test-project:

gsutil -m acl ch -g test-project:R path-to-archive
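
To grant read permission to an individual user instead, use the -u flag (the email address below is only an example):

gsutil -m acl ch -u user@example.com:R path-to-archive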

Items included in diagnose command output

The diagnose command includes the following configuration files, logs, and outputs from your cluster in an archive file. The archive file is placed in the Cloud Storage bucket associated with your Cloud Dataproc cluster, as discussed above.
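
If you have downloaded the archive as shown earlier, you can list its contents without extracting it to see which of the files described below were captured:

tar -tf diagnostic.tar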

Daemons and services information

Command executed                                     Location in archive
yarn node -list -all                                 /system/yarn-nodes.log
hdfs dfsadmin -report -live -decommissioning         /system/hdfs-nodes.log
hdfs dfs -du -h                                      /system/hdfs-du.log
service --status-all                                 /system/service.log
systemctl --type service                             /system/systemd-services.log
curl "http://${HOSTNAME}:8088/jmx"                   /metrics/resource_manager_jmx
curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps"    /metrics/yarn_app_info
curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes"   /metrics/yarn_node_info
curl "http://${HOSTNAME}:9870/jmx"                   /metrics/namenode_jmx

JVM information

Command executed                     Location in archive
jstack -l "${DATAPROC_AGENT_PID}"    jstack/agent_${DATAPROC_AGENT_PID}.jstack
jstack -l "${PRESTO_PID}"            jstack/agent_${PRESTO_PID}.jstack
jstack -l "${JOB_DRIVER_PID}"        jstack/driver_${JOB_DRIVER_PID}.jstack
jinfo "${DATAPROC_AGENT_PID}"        jinfo/agent_${DATAPROC_AGENT_PID}.jstack
jinfo "${PRESTO_PID}"                jinfo/agent_${PRESTO_PID}.jstack
jinfo "${JOB_DRIVER_PID}"            jinfo/agent_${JOB_DRIVER_PID}.jstack

Linux system information

Command executed            Location in archive
df -h                       /system/df.log
ps aux                      /system/ps.log
free -m                     /system/free.log
netstat -anp                /system/netstat.log
sysctl -a                   /system/sysctl.log
uptime                      /system/uptime.log
cat /proc/sys/fs/file-nr    /system/fs-file-nr.log
ping -c 1                   /system/cluster-ping.log

Log files

Item(s) included and their location in the archive:
All logs in /var/log with the following prefixes in their filename:
cloud-sql-proxy
dataproc
druid
gcdp
gcs
google
hadoop
hdfs
hive
knox
presto
spark
syslog
yarn
zookeeper
These files are placed in the archive logs folder and keep their original filenames.

Cloud Dataproc node startup logs for each node (master and worker) in your cluster are placed in the archive node_startup folder, which contains separate sub-folders for each machine in the cluster.

Component gateway logs, collected with journalctl -u google-dataproc-component-gateway, are placed in /logs/google-dataproc-component-gateway.log.
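
If you are only interested in one group of logs, you can extract just that part of the archive. The member pattern below follows the layout described above and assumes GNU tar; adjust it if the archive unpacks with a different prefix:

tar -xf diagnostic.tar --wildcards "*node_startup*"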

Configuration files

Item(s) included                             Location in archive
VM metadata                                  /conf/dataproc/metadata
Environment variables in /etc/environment    /conf/dataproc/environment
Dataproc properties                          /conf/dataproc/dataproc.properties
All files in /etc/google-dataproc/           /conf/dataproc/
All files in /etc/hadoop/conf/               /conf/hadoop/
All files in /etc/hive/conf/                 /conf/hive/
All files in /etc/hive-hcatalog/conf/        /conf/hive-hcatalog/
All files in /etc/knox/conf/                 /conf/knox/
All files in /etc/pig/conf/                  /conf/pig/
All files in /etc/presto/conf/               /conf/presto/
All files in /etc/spark/conf/                /conf/spark/
All files in /etc/tez/conf/                  /conf/tez/
All files in /etc/zookeeper/conf/            /conf/zookeeper/
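
Once the archive is extracted, these paths make it easy to check an individual setting. For example, to search the captured Spark configuration for a property (the property name and relative path here are only illustrative):

grep -r "spark.executor.memory" conf/spark/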