Diagnose Dataproc clusters

Log and configuration information can help you troubleshoot a cluster or job. However, there are many log and configuration files, and gathering each one for investigation is time-consuming. To address this problem, Dataproc clusters support a special diagnose command through the Cloud SDK. This command gathers and archives important system, Spark/Hadoop, and Dataproc logs, and then uploads the archive to the Cloud Storage bucket attached to your cluster.

Using the Cloud SDK diagnose command

You can use the Cloud SDK diagnose command on your Dataproc clusters (see Dataproc and Cloud SDK).

Once the Cloud SDK is installed and configured, you can run the gcloud dataproc clusters diagnose command on your cluster as shown below. Replace cluster-name with the name of your cluster and region with your cluster's region, for example, --region=us-central1.

gcloud dataproc clusters diagnose cluster-name \
    --region=region \
    ... other args ...

The command outputs the name and location of the archive that contains your data.

Saving archive to cloud
Copying file:///tmp/tmp.FgWEq3f2DJ/diagnostic.tar ...
Uploading   ...23db9-762e-4593-8a5a-f4abd75527e6/diagnostic.tar ...
Diagnostic results saved in:

In the saved location, bucket-name is the Cloud Storage bucket attached to your cluster, cluster-uuid is the unique ID (UUID) of your cluster, and job-id is the UUID of the system task that ran the diagnose command.
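
If you want to capture the archive location in a script, a small filter over the command output may help. The following is a minimal sketch that assumes the location is printed as a gs:// URI; extract_archive_uri is a hypothetical helper name and is not part of the Cloud SDK.

```shell
# Print the first gs:// URI found in the text read from stdin.
# Assumption: the diagnose command prints the archive location as a gs:// URI.
extract_archive_uri() {
  grep -o 'gs://[^[:space:]]*' | head -n 1
}
```

You could pipe the output of gcloud dataproc clusters diagnose into this function. Depending on where the Cloud SDK writes its status messages, you may need to merge stderr into stdout (2>&1) first.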

When you create a Dataproc cluster, Dataproc creates a Cloud Storage bucket and attaches it to your cluster. The diagnose command outputs the archive file to this bucket. To determine the name of the bucket created by Dataproc, use the Cloud SDK clusters describe command. The bucket associated with your cluster is listed next to configurationBucket.

gcloud dataproc clusters describe cluster-name \
    --region=region

  clusterName: cluster-name
  clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec6e0eac
  configurationBucket: dataproc-edc9d85f-...-us
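
To read the bucket name in a script rather than by eye, you can filter the describe output. The sketch below assumes the field appears on its own line as configurationBucket: value, as in the sample output above; get_config_bucket is a hypothetical helper name.

```shell
# Print the value of the configurationBucket field from the
# `gcloud dataproc clusters describe` output read on stdin.
# Assumption: the field is printed as "configurationBucket: <bucket>".
get_config_bucket() {
  awk '$1 == "configurationBucket:" { print $2; exit }'
}
```

For example, piping the describe command into get_config_bucket would print only the bucket name, which you can then use to build the gs:// path to the archive.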

Running the diagnostic script from the master node (optional)

The Cloud SDK diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. To avoid this issue, you can SSH into the master node, download the diagnostic script, then run the script locally on the master node:

gcloud compute ssh hostname
gsutil cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh

The diagnostic tarball is saved in a local temporary directory. You can then follow the instructions in the command output to upload it to a Cloud Storage bucket and share it with Google Support.

Sharing the data gathered by diagnose

You can share the archive generated by the diagnose command in two ways:

  1. Download the file from Cloud Storage, then share the downloaded archive.
  2. Change the permissions on the archive to allow other Google Cloud Platform users or projects to access the file.

For example, the following command grants read permission on the diagnose archive to test-project:

gsutil -m acl ch -g test-project:R path-to-archive

Items included in diagnose command output

The diagnose command includes the following configuration files, logs, and outputs from your cluster in an archive file. The archive file is placed in the Cloud Storage bucket associated with your Dataproc cluster, as discussed above.

Daemons and services information

Command executed                                      Location in archive
yarn node -list -all                                  /system/yarn-nodes.log
hdfs dfsadmin -report -live -decommissioning          /system/hdfs-nodes.log
hdfs dfs -du -h                                       /system/hdfs-du.log
service --status-all                                  /system/service.log
systemctl --type service                              /system/systemd-services.log
curl "http://${HOSTNAME}:8088/jmx"                    /metrics/resource_manager_jmx
curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps"     /metrics/yarn_app_info
curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes"    /metrics/yarn_node_info
curl "http://${HOSTNAME}:9870/jmx"                    /metrics/namenode_jmx

JVM information

Command executed                    Location in archive
jstack -l "${DATAPROC_AGENT_PID}"   jstack/agent_${DATAPROC_AGENT_PID}.jstack
jstack -l "${PRESTO_PID}"           jstack/agent_${PRESTO_PID}.jstack
jstack -l "${JOB_DRIVER_PID}"       jstack/driver_${JOB_DRIVER_PID}.jstack
jinfo "${DATAPROC_AGENT_PID}"       jinfo/agent_${DATAPROC_AGENT_PID}.jstack
jinfo "${PRESTO_PID}"               jinfo/agent_${PRESTO_PID}.jstack
jinfo "${JOB_DRIVER_PID}"           jinfo/agent_${JOB_DRIVER_PID}.jstack

Linux system information

Command executed            Location in archive
df -h                       /system/df.log
ps aux                      /system/ps.log
free -m                     /system/free.log
netstat -anp                /system/netstat.log
sysctl -a                   /system/sysctl.log
uptime                      /system/uptime.log
cat /proc/sys/fs/file-nr    /system/fs-file-nr.log
ping -c 1                   /system/cluster-ping.log

Log files

Item(s) included: All logs in /var/log with the following prefixes in their filename:
Location in archive: Files are placed in the archive logs folder, and keep their original filenames.

Item(s) included: Dataproc node startup logs for each node (master and worker) in your cluster.
Location in archive: Files are placed in the archive node_startup folder, which contains separate sub-folders for each machine in the cluster.

Item(s) included: Component gateway logs from journalctl -u google-dataproc-component-gateway
Location in archive: /logs/google-dataproc-component-gateway.log

Configuration files

Item(s) included                            Location in archive
VM metadata                                 /conf/dataproc/metadata
Environment variables in /etc/environment   /conf/dataproc/environment
Dataproc properties                         /conf/dataproc/dataproc.properties
All files in /etc/google-dataproc/          /conf/dataproc/
All files in /etc/hadoop/conf/              /conf/hadoop/
All files in /etc/hive/conf/                /conf/hive/
All files in /etc/hive-hcatalog/conf/       /conf/hive-hcatalog/
All files in /etc/knox/conf/                /conf/knox/
All files in /etc/pig/conf/                 /conf/pig/
All files in /etc/presto/conf/              /conf/presto/
All files in /etc/spark/conf/               /conf/spark/
All files in /etc/tez/conf/                 /conf/tez/
All files in /etc/zookeeper/conf/           /conf/zookeeper/
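
After you download the archive, you may want to check which of the items above it contains before you unpack it. The following is a minimal sketch that lists the entries under one top-level folder. It assumes the entry names in the tar file mirror the locations listed in the tables, possibly stored with a leading "./" or "/"; list_archive_folder is a hypothetical helper name.

```shell
# List the archive entries under one top-level folder (for example: system, conf).
# $1: path to the downloaded diagnostic.tar
# $2: folder name without slashes
# Assumption: entry names match the "Location in archive" paths, optionally
# prefixed with "./" or "/".
list_archive_folder() {
  archive_path="$1"
  folder_name="$2"
  tar -tf "$archive_path" | grep -E "^(\./|/)?${folder_name}/"
}
```

For example, list_archive_folder diagnostic.tar conf would print the collected configuration files, and tar -xf diagnostic.tar would then extract the whole archive into the current directory.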