Diagnose Dataproc clusters

Looking at log and configuration information can be useful when troubleshooting a cluster or job. Unfortunately, there are many log and configuration files, and gathering each one for investigation can be time-consuming. To address this problem, Dataproc clusters support a special diagnose command through the Google Cloud CLI. This command gathers and archives important system, Spark/Hadoop, and Dataproc logs, and then uploads the archive to the Cloud Storage bucket attached to your cluster.

Using the Google Cloud CLI diagnose command

You can use the Google Cloud CLI diagnose command on your Dataproc clusters (see Dataproc and Google Cloud CLI).

Once the gcloud CLI is installed and configured, you can run the gcloud dataproc clusters diagnose command on your cluster as shown below. Replace cluster-name with the name of your cluster and region with your cluster's region, for example, --region=us-central1.

gcloud dataproc clusters diagnose cluster-name \
    --region=region \
    ... other args ...

The command outputs the Cloud Storage location of the archive file that contains the data (see Items included in diagnose command output). See Sharing the data gathered by diagnose for information on accessing and copying the archive file.

Running the diagnostic script from the master node (optional)

The Google Cloud CLI diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. To avoid this issue, you can SSH into the master node, download the diagnostic script, then run the script locally on the master node:

gcloud compute ssh hostname
gsutil cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh

The diagnostic tarball is saved in a local temporary directory. If you want, you can follow the instructions in the command output to upload it to a Cloud Storage bucket and share it with Google Support.

Sharing the data gathered by diagnose

You can share the archive generated by the diagnose command in two ways:

  1. Download the file from Cloud Storage, then share the downloaded archive.
  2. Change the permissions on the archive to allow other Google Cloud Platform users or projects to access the file.
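For the first option, the copy step is a single gsutil command. The bucket and object names below are hypothetical; substitute the archive location printed by the diagnose command:

```shell
# Download the diagnostic archive from Cloud Storage (hypothetical path),
# then share the local copy through your usual channels.
gsutil cp gs://my-bucket/diagnostic.tar.gz .
```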

For example, the following command grants read permission on the diagnose archive to users in test-project:

gsutil -m acl ch -g test-project:R path-to-archive

Items included in diagnose command output

The diagnose command includes the following configuration files, logs, and outputs from your cluster in an archive file. The archive file is placed in the Cloud Storage bucket associated with your Dataproc cluster, as discussed above.

Diagnostic summary

The diagnostic script automatically analyzes the collected data and generates a summary.txt file at the root of the diagnostic tarball. The summary provides a high-level overview of cluster status, covering YARN, HDFS, disk, and networking, and includes warnings to alert you to potential problems.
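For example, assuming you have copied the archive locally as diagnostic.tar.gz (a hypothetical name; use the path reported by the diagnose command), you can extract it and read the summary:

```shell
# Extract the diagnostic tarball and print the high-level summary.
mkdir -p diagnostic-output
tar -xf diagnostic.tar.gz -C diagnostic-output
cat diagnostic-output/summary.txt
```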

Daemons and services information

| Command executed | Location in archive |
| --- | --- |
| yarn node -list -all | /system/yarn-nodes.log |
| hdfs dfsadmin -report -live -decommissioning | /system/hdfs-nodes.log |
| hdfs dfs -du -h | /system/hdfs-du.log |
| service --status-all | /system/service.log |
| systemctl --type service | /system/systemd-services.log |
| curl "http://${HOSTNAME}:8088/jmx" | /metrics/resource_manager_jmx |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps" | /metrics/yarn_app_info |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes" | /metrics/yarn_node_info |
| curl "http://${HOSTNAME}:9870/jmx" | /metrics/namenode_jmx |

JVM information

| Command executed | Location in archive |
| --- | --- |
| jstack -l "${DATAPROC_AGENT_PID}" | jstack/agent_${DATAPROC_AGENT_PID}.jstack |
| jstack -l "${PRESTO_PID}" | jstack/agent_${PRESTO_PID}.jstack |
| jstack -l "${JOB_DRIVER_PID}" | jstack/driver_${JOB_DRIVER_PID}.jstack |
| jinfo "${DATAPROC_AGENT_PID}" | jinfo/agent_${DATAPROC_AGENT_PID}.jstack |
| jinfo "${PRESTO_PID}" | jinfo/agent_${PRESTO_PID}.jstack |
| jinfo "${JOB_DRIVER_PID}" | jinfo/agent_${JOB_DRIVER_PID}.jstack |

Linux system information

| Command executed | Location in archive |
| --- | --- |
| df -h | /system/df.log |
| ps aux | /system/ps.log |
| free -m | /system/free.log |
| netstat -anp | /system/netstat.log |
| sysctl -a | /system/sysctl.log |
| uptime | /system/uptime.log |
| cat /proc/sys/fs/file-nr | /system/fs-file-nr.log |
| ping -c 1 | /system/cluster-ping.log |

Log files

| Item(s) included | Location in archive |
| --- | --- |
| All logs in /var/log with the following prefixes in their filename: cloud-sql-proxy, dataproc, druid, gcdp, gcs, google, hadoop, hdfs, hive, knox, presto, spark, syslog, yarn, zookeeper | Files are placed in the archive logs folder, and keep their original filenames. |
| Dataproc node startup logs for each node (master and worker) in your cluster. | Files are placed in the archive node_startup folder, which contains separate sub-folders for each machine in the cluster. |
| Component gateway logs from journalctl -u google-dataproc-component-gateway | /logs/google-dataproc-component-gateway.log |
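To check which of these logs were actually collected on your cluster before extracting anything, you can list the archive contents (the archive name below is hypothetical):

```shell
# List the files captured in the diagnostic tarball without extracting it.
tar -tf diagnostic.tar.gz
```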

Configuration files

| Item(s) included | Location in archive |
| --- | --- |
| VM metadata | /conf/dataproc/metadata |
| Environment variables in /etc/environment | /conf/dataproc/environment |
| Dataproc properties | /conf/dataproc/dataproc.properties |
| All files in /etc/google-dataproc/ | /conf/dataproc/ |
| All files in /etc/hadoop/conf/ | /conf/hadoop/ |
| All files in /etc/hive/conf/ | /conf/hive/ |
| All files in /etc/hive-hcatalog/conf/ | /conf/hive-hcatalog/ |
| All files in /etc/knox/conf/ | /conf/knox/ |
| All files in /etc/pig/conf/ | /conf/pig/ |
| All files in /etc/presto/conf/ | /conf/presto/ |
| All files in /etc/spark/conf/ | /conf/spark/ |
| All files in /etc/tez/conf/ | /conf/tez/ |
| All files in /etc/zookeeper/conf/ | /conf/zookeeper/ |