Diagnose Dataproc clusters

"Dataproc Cluster Command Diagnostic Summary | Google Cloud"

You can run the gcloud dataproc clusters diagnose command to collect system, Spark, Hadoop, and Dataproc logs, cluster configuration files, and other information that you can examine or share with Google support to help you troubleshoot a Dataproc cluster or job. The command uploads the diagnostic data and summary to the Dataproc staging bucket in Cloud Storage.

Run the Google Cloud CLI diagnose cluster command

Run the gcloud dataproc clusters diagnose command to create the diagnostic archive file and output its location.

gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    OPTIONAL FLAGS ...

Notes:

  • CLUSTER_NAME: The name of the cluster to diagnose.
  • REGION: The cluster's region, for example, us-central1.
  • OPTIONAL FLAGS:

    • --job-ids: You can use this flag to collect job driver, Spark event, YARN application, and Spark Lense output logs, in addition to the default log files, for a specified comma-separated list of job IDs. For MapReduce jobs, only YARN application logs are collected. YARN log aggregation must be enabled for the collection of YARN application logs.

    • --yarn-application-ids: You can use this flag to collect job driver, Spark event, YARN application, and Spark Lense output logs, in addition to the default log files, for a specified comma-separated list of YARN application IDs. YARN log aggregation must be enabled for the collection of YARN application logs.

    • --start-time with --end-time: Use both flags to specify a time range, in %Y-%m-%dT%H:%M:%S.%fZ format, for the collection of diagnostic data. Specifying a time range also enables the collection of Dataproc autoscaling logs during the time range (by default, Dataproc autoscaling logs are not collected in the diagnostic data). A combined example that uses these optional flags follows this list.

    • --tarball-access=GOOGLE_DATAPROC_DIAGNOSE: Use this flag to submit the diagnostic tar file to the Google Cloud support team or to provide the team with access to it. Also provide the following information to the Google Cloud support team:

      • Cloud Storage path of the diagnostic tar file, or
      • Cluster configuration bucket, cluster UUID, and operation ID of the diagnose command
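
As a sketch, the following diagnose run collects logs for two jobs over a one-hour window and gives the Google Cloud support team access to the resulting tar file. The cluster name, region, job IDs, and timestamps are placeholder values; substitute your own.

gcloud dataproc clusters diagnose my-cluster \
    --region=us-central1 \
    --job-ids=job-1234,job-5678 \
    --start-time=2024-01-15T08:00:00.000000Z \
    --end-time=2024-01-15T09:00:00.000000Z \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE

The command output includes the Cloud Storage location of the diagnostic archive.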

Run the diagnostic script from the cluster master node (if needed)

The gcloud dataproc clusters diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. As an alternative to running the diagnose command, you can connect to the cluster master node using SSH, download the diagnostic script, and then run the script locally on the master node.

# Connect to the cluster master node over SSH.
gcloud compute ssh HOSTNAME
# Download the diagnostic script, then run it as root on the master node.
gcloud storage cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh

The diagnostic archive tar file is saved in a local directory. The command output lists the location of the tar file with instructions on how to upload the tar file to a Cloud Storage bucket.
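
For example, if the script reports that it saved the archive to /tmp/diagnostic-archive.tar.gz (a placeholder path; use the path printed in the script output), you could upload it to a bucket you control with gcloud storage:

gcloud storage cp /tmp/diagnostic-archive.tar.gz gs://BUCKET_NAME/diagnostics/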

How to share diagnostic data

To share the archive with another user, grant that user read access to the archive object in Cloud Storage.

Example: The following command adds read permissions to the archive for user jane@gmail.com:

gcloud storage objects update PATH_TO_ARCHIVE --add-acl-grant=entity=user-jane@gmail.com,role=roles/storage.legacyObjectReader
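
Alternatively, if the archive is intended for the Google Cloud support team, you can run the diagnose command with --tarball-access=GOOGLE_DATAPROC_DIAGNOSE, as described earlier, instead of granting object permissions yourself.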

Diagnostic summary and archive contents

The diagnose command outputs a diagnostic summary and an archive tar file that contains cluster configuration files, logs, and other files and information. The archive tar file is written to the Dataproc staging bucket in Cloud Storage.

Diagnostic summary: The diagnostic script analyzes collected data and generates a summary.txt file at the root of the diagnostic archive. The summary provides an overview of cluster status, including YARN, HDFS, disk, and networking status, and includes warnings to alert you to potential problems.
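
As a sketch, after the diagnose command prints the Cloud Storage path of the archive, you could download the archive, extract it, and read the summary. The archive filename below is illustrative; use the path from the command output.

# PATH_TO_ARCHIVE is the Cloud Storage path printed by the diagnose command.
gcloud storage cp PATH_TO_ARCHIVE .
# Extract the archive and view the generated summary; adjust the path if the
# archive extracts into a subdirectory.
tar -xf diagnostic.tar
cat summary.txt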

Archive tar file: The following sections list the files and information contained in the diagnostic archive tar file.

Daemons and services information

Command executed                                     Location in archive
yarn node -list -all                                 /system/yarn-nodes.log
hdfs dfsadmin -report -live -decommissioning         /system/hdfs-nodes.log
hdfs dfs -du -h                                      /system/hdfs-du.log
service --status-all                                 /system/service.log
systemctl --type service                             /system/systemd-services.log
curl "http://${HOSTNAME}:8088/jmx"                   /metrics/resource_manager_jmx
curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps"    /metrics/yarn_app_info
curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes"   /metrics/yarn_node_info
curl "http://${HOSTNAME}:9870/jmx"                   /metrics/namenode_jmx

JVM information

Command executed                     Location in archive
jstack -l "${DATAPROC_AGENT_PID}"    jstack/agent_${DATAPROC_AGENT_PID}.jstack
jstack -l "${PRESTO_PID}"            jstack/agent_${PRESTO_PID}.jstack
jstack -l "${JOB_DRIVER_PID}"        jstack/driver_${JOB_DRIVER_PID}.jstack
jinfo "${DATAPROC_AGENT_PID}"        jinfo/agent_${DATAPROC_AGENT_PID}.jstack
jinfo "${PRESTO_PID}"                jinfo/agent_${PRESTO_PID}.jstack
jinfo "${JOB_DRIVER_PID}"            jinfo/agent_${JOB_DRIVER_PID}.jstack

Linux system information

Command executed            Location in archive
df -h                       /system/df.log
ps aux                      /system/ps.log
free -m                     /system/free.log
netstat -anp                /system/netstat.log
sysctl -a                   /system/sysctl.log
uptime                      /system/uptime.log
cat /proc/sys/fs/file-nr    /system/fs-file-nr.log
ping -c 1                   /system/cluster-ping.log

Log files

The following items are included in the archive:

  • All logs in /var/log with filenames that begin with one of the following prefixes: cloud-sql-proxy, dataproc, druid, gcdp, google, hadoop, hdfs, hive, knox, presto, spark, syslog, yarn, and zookeeper. These files are placed in the archive logs folder and keep their original filenames.
  • Dataproc node startup logs for each node (master and worker) in your cluster. These files are placed in the archive node_startup folder, which contains separate sub-folders for each machine in the cluster.
  • Component gateway logs collected with journalctl -u google-dataproc-component-gateway, placed in /logs/google-dataproc-component-gateway.log.

Configuration files

Item(s) included                              Location in archive
VM metadata                                   /conf/dataproc/metadata
Environment variables in /etc/environment     /conf/dataproc/environment
Dataproc properties                           /conf/dataproc/dataproc.properties
All files in /etc/google-dataproc/            /conf/dataproc/
All files in /etc/hadoop/conf/                /conf/hadoop/
All files in /etc/hive/conf/                  /conf/hive/
All files in /etc/hive-hcatalog/conf/         /conf/hive-hcatalog/
All files in /etc/knox/conf/                  /conf/knox/
All files in /etc/pig/conf/                   /conf/pig/
All files in /etc/presto/conf/                /conf/presto/
All files in /etc/spark/conf/                 /conf/spark/
All files in /etc/tez/conf/                   /conf/tez/
All files in /etc/zookeeper/conf/             /conf/zookeeper/