Diagnose Dataproc clusters

Looking at log and configuration information can be useful when troubleshooting a cluster or job. Unfortunately, there are many log and configuration files, and gathering each one for investigation can be time-consuming. To address this problem, Dataproc clusters support a special diagnose command through the Google Cloud CLI. This command gathers and archives important system, Spark/Hadoop, and Dataproc logs, and then uploads the archive to the Cloud Storage bucket attached to your cluster.

Using the Google Cloud CLI diagnose command

You can use the Google Cloud CLI diagnose command on your Dataproc clusters (see Dataproc and Google Cloud CLI).

Once the gcloud CLI is installed and configured, you can run the gcloud dataproc clusters diagnose command on your cluster as shown below. Replace cluster-name with the name of your cluster and region with your cluster's region, for example, --region=us-central1.

gcloud dataproc clusters diagnose cluster-name \
    --region=region \
    ... other args ...

The command outputs the Cloud Storage location of the archive file that contains the data (see Items included in diagnose command output). See Sharing the data gathered by diagnose for information on accessing and copying the archive file.

Running the diagnostic script from the master node (optional)

The Google Cloud CLI diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. To avoid this issue, you can SSH into the master node, download the diagnostic script, then run the script locally on the master node:

gcloud compute ssh hostname
gsutil cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh

The diagnostic tarball is saved in a local temporary directory. If you want, you can follow the instructions in the command output to upload it to a Cloud Storage bucket and share it with Google Support.

Sharing the data gathered by diagnose

You can share the archive generated by the diagnose command in two ways:

  1. Download the file from Cloud Storage, then share the downloaded archive.
  2. Change the permissions on the archive to allow other Google Cloud Platform users or projects to access the file.
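For the first option, the copy step is a single gsutil command. The bucket and object names below are hypothetical; substitute the archive location printed by the diagnose command:

```shell
# Download the diagnostic archive from Cloud Storage (hypothetical path),
# then share the local copy through your usual channels.
gsutil cp gs://my-bucket/diagnostic.tar.gz .
```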

For example, the following command grants read permission on the diagnose archive to users in test-project:

gsutil -m acl ch -g test-project:R path-to-archive

Items included in diagnose command output

The diagnose command includes the following configuration files, logs, and outputs from your cluster in an archive file. The archive file is placed in the Cloud Storage bucket associated with your Dataproc cluster, as discussed above.

Diagnostic summary

The diagnostic script automatically analyzes the collected data and generates a summary.txt file at the root of the diagnostic tarball. The summary provides a high-level overview of cluster status, covering YARN, HDFS, disk, and networking, and includes warnings to alert you to potential problems.
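For example, assuming you have copied the archive locally as diagnostic.tar.gz (a hypothetical name; use the path reported by the diagnose command), you can extract it and read the summary:

```shell
# Extract the diagnostic tarball and print the high-level summary.
mkdir -p diagnostic-output
tar -xf diagnostic.tar.gz -C diagnostic-output
cat diagnostic-output/summary.txt
```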

Daemons and services information

| Command executed | Location in archive |
| --- | --- |
| yarn node -list -all | /system/yarn-nodes.log |
| hdfs dfsadmin -report -live -decommissioning | /system/hdfs-nodes.log |
| hdfs dfs -du -h | /system/hdfs-du.log |
| service --status-all | /system/service.log |
| systemctl --type service | /system/systemd-services.log |
| curl "http://${HOSTNAME}:8088/jmx" | /metrics/resource_manager_jmx |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps" | /metrics/yarn_app_info |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes" | /metrics/yarn_node_info |
| curl "http://${HOSTNAME}:9870/jmx" | /metrics/namenode_jmx |

JVM information

| Command executed | Location in archive |
| --- | --- |
| jstack -l "${DATAPROC_AGENT_PID}" | jstack/agent_${DATAPROC_AGENT_PID}.jstack |
| jstack -l "${PRESTO_PID}" | jstack/agent_${PRESTO_PID}.jstack |
| jstack -l "${JOB_DRIVER_PID}" | jstack/driver_${JOB_DRIVER_PID}.jstack |
| jinfo "${DATAPROC_AGENT_PID}" | jinfo/agent_${DATAPROC_AGENT_PID}.jstack |
| jinfo "${PRESTO_PID}" | jinfo/agent_${PRESTO_PID}.jstack |
| jinfo "${JOB_DRIVER_PID}" | jinfo/agent_${JOB_DRIVER_PID}.jstack |

Linux system information

| Command executed | Location in archive |
| --- | --- |
| df -h | /system/df.log |
| ps aux | /system/ps.log |
| free -m | /system/free.log |
| netstat -anp | /system/netstat.log |
| sysctl -a | /system/sysctl.log |
| uptime | /system/uptime.log |
| cat /proc/sys/fs/file-nr | /system/fs-file-nr.log |
| ping -c 1 | /system/cluster-ping.log |

Log files

| Item(s) included | Location in archive |
| --- | --- |
| All logs in /var/log with the following prefixes in their filename: cloud-sql-proxy, dataproc, druid, gcdp, gcs, google, hadoop, hdfs, hive, knox, presto, spark, syslog, yarn, zookeeper | Files are placed in the archive logs folder, and keep their original filenames. |
| Dataproc node startup logs for each node (master and worker) in your cluster. | Files are placed in the archive node_startup folder, which contains separate sub-folders for each machine in the cluster. |
| Component gateway logs from journalctl -u google-dataproc-component-gateway | /logs/google-dataproc-component-gateway.log |
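To check which of these logs were actually collected on your cluster before extracting anything, you can list the archive contents (the archive name below is hypothetical):

```shell
# List the files captured in the diagnostic tarball without extracting it.
tar -tf diagnostic.tar.gz
```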

Configuration files

| Item(s) included | Location in archive |
| --- | --- |
| VM metadata | /conf/dataproc/metadata |
| Environment variables in /etc/environment | /conf/dataproc/environment |
| Dataproc properties | /conf/dataproc/dataproc.properties |
| All files in /etc/google-dataproc/ | /conf/dataproc/ |
| All files in /etc/hadoop/conf/ | /conf/hadoop/ |
| All files in /etc/hive/conf/ | /conf/hive/ |
| All files in /etc/hive-hcatalog/conf/ | /conf/hive-hcatalog/ |
| All files in /etc/knox/conf/ | /conf/knox/ |
| All files in /etc/pig/conf/ | /conf/pig/ |
| All files in /etc/presto/conf/ | /conf/presto/ |
| All files in /etc/spark/conf/ | /conf/spark/ |
| All files in /etc/tez/conf/ | /conf/tez/ |
| All files in /etc/zookeeper/conf/ | /conf/zookeeper/ |