You can run the gcloud dataproc clusters diagnose
command to collect system, Spark, Hadoop, and Dataproc logs, cluster configuration files,
and other information that you can examine or share with Google support
to help you troubleshoot a Dataproc cluster or job.
The command uploads the diagnostic data and summary
to the Dataproc staging bucket
in Cloud Storage.
Run the Google Cloud CLI diagnose cluster command
Run the gcloud dataproc clusters diagnose
command to create the diagnostic archive file and output its location.

gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    OPTIONAL_FLAGS ...
Notes:
- CLUSTER_NAME: The name of the cluster to diagnose.
- REGION: The cluster's region, for example, us-central1.
- OPTIONAL_FLAGS:
  - --job-ids: Use this flag to collect job driver, Spark event, YARN application, and Spark Lense output logs, in addition to the default log files, for a specified comma-separated list of job IDs. For MapReduce jobs, only YARN application logs are collected. YARN log aggregation must be enabled for the collection of YARN application logs.
  - --yarn-application-ids: Use this flag to collect job driver, Spark event, YARN application, and Spark Lense output logs, in addition to the default log files, for a specified comma-separated list of YARN application IDs. YARN log aggregation must be enabled for the collection of YARN application logs.
  - --start-time with --end-time: Use both flags to specify a time range, in %Y-%m-%dT%H:%M:%S.%fZ format, for the collection of diagnostic data. Specifying a time range also enables the collection of Dataproc autoscaling logs during the time range (by default, Dataproc autoscaling logs are not collected in the diagnostic data).
  - --tarball-access=GOOGLE_DATAPROC_DIAGNOSE: Use this flag to submit or provide access to the diagnostic tar file to the Google Cloud support team. Also provide the Google Cloud support team with one of the following:
    - The Cloud Storage path of the diagnostic tar file, or
    - The cluster configuration bucket, cluster UUID, and operation ID of the diagnose command
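Putting the flags together, the command below sketches a diagnose run scoped to a one-hour window; the cluster name, region, and timestamps are placeholder values:

```shell
# Collect diagnostic data for a one-hour window on a placeholder cluster.
# Timestamps must use the %Y-%m-%dT%H:%M:%S.%fZ format described above.
START_TIME="2024-01-15T08:00:00.000Z"
END_TIME="2024-01-15T09:00:00.000Z"

gcloud dataproc clusters diagnose my-cluster \
    --region=us-central1 \
    --start-time="${START_TIME}" \
    --end-time="${END_TIME}" \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE
```

Because --tarball-access=GOOGLE_DATAPROC_DIAGNOSE is set, the resulting archive is shared with the Google Cloud support team; omit the flag to keep the archive private to your project.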
Run the diagnostic script from the cluster master node (if needed)
The gcloud dataproc clusters diagnose
command can fail or time out if a cluster
is in an error state and cannot accept diagnose tasks from
the Dataproc server. As an alternative to running the
diagnose command, you can
connect to the cluster master node using SSH,
download the diagnostic script, and then run the script locally on the master node.
gcloud compute ssh HOSTNAME
gcloud storage cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh
The diagnostic archive tar file is saved in a local directory. The command output lists the location of the tar file with instructions on how to upload the tar file to a Cloud Storage bucket.
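If you ran the script on the master node, the archive exists only on that VM until you copy it out. A minimal sketch, assuming a placeholder archive path and bucket name:

```shell
# The archive path is taken from the diagnostic script's output; both
# the path and the bucket below are placeholder values.
ARCHIVE_PATH="/tmp/diagnostic-output/diagnostic.tar.gz"
gcloud storage cp "${ARCHIVE_PATH}" gs://MY_BUCKET/diagnostics/
```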
How to share diagnostic data
To share the archive:
- Download the archive from Cloud Storage, then share the downloaded archive, or
- Change the permissions on the archive to allow other Google Cloud users or projects to access the file.
Example: The following command adds read permission to the archive
for the user jane@gmail.com:

gcloud storage objects update PATH_TO_ARCHIVE --add-acl-grant=entity=user-jane@gmail.com,role=roles/storage.legacyObjectReader
Diagnostic summary and archive contents
The diagnose
command outputs a diagnostic summary and an archive tar file that contains
cluster configuration files, logs, and other files and information. The archive
tar file is written to the Dataproc
staging bucket
in Cloud Storage.
Diagnostic summary: The diagnostic script analyzes collected data, and generates a
summary.txt
at the root of the diagnostic archive. The summary provides an
overview of cluster status, including YARN, HDFS, disk, and networking
status, and includes warnings to alert you to potential problems.
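After downloading the archive, you can read the summary without unpacking everything else. A minimal sketch, assuming a gzip-compressed archive; a mock archive stands in for a real diagnostic tar file, and the filename is a placeholder:

```shell
# Build a mock archive with summary.txt at its root, standing in for a
# real diagnostic archive downloaded from Cloud Storage.
echo "Cluster status: YARN OK, HDFS OK" > summary.txt
tar -czf diagnostic.tar.gz summary.txt && rm summary.txt

# Extract only summary.txt from the archive root and print it; the same
# two commands work on a real archive.
tar -xzf diagnostic.tar.gz summary.txt
cat summary.txt
```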
Archive tar file: The following sections list the files and information contained in the diagnostic archive tar file.
Daemons and services information
| Command executed | Location in archive |
|---|---|
| yarn node -list -all | /system/yarn-nodes.log |
| hdfs dfsadmin -report -live -decommissioning | /system/hdfs-nodes.log |
| hdfs dfs -du -h | /system/hdfs-du.log |
| service --status-all | /system/service.log |
| systemctl --type service | /system/systemd-services.log |
| curl "http://${HOSTNAME}:8088/jmx" | /metrics/resource_manager_jmx |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps" | /metrics/yarn_app_info |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes" | /metrics/yarn_node_info |
| curl "http://${HOSTNAME}:9870/jmx" | /metrics/namenode_jmx |
JVM information
| Command executed | Location in archive |
|---|---|
| jstack -l "${DATAPROC_AGENT_PID}" | jstack/agent_${DATAPROC_AGENT_PID}.jstack |
| jstack -l "${PRESTO_PID}" | jstack/agent_${PRESTO_PID}.jstack |
| jstack -l "${JOB_DRIVER_PID}" | jstack/driver_${JOB_DRIVER_PID}.jstack |
| jinfo "${DATAPROC_AGENT_PID}" | jinfo/agent_${DATAPROC_AGENT_PID}.jstack |
| jinfo "${PRESTO_PID}" | jinfo/agent_${PRESTO_PID}.jstack |
| jinfo "${JOB_DRIVER_PID}" | jinfo/agent_${JOB_DRIVER_PID}.jstack |
Linux system information
| Command executed | Location in archive |
|---|---|
| df -h | /system/df.log |
| ps aux | /system/ps.log |
| free -m | /system/free.log |
| netstat -anp | /system/netstat.log |
| sysctl -a | /system/sysctl.log |
| uptime | /system/uptime.log |
| cat /proc/sys/fs/file-nr | /system/fs-file-nr.log |
| ping -c 1 | /system/cluster-ping.log |
Log files
| Item(s) included | Location in archive |
|---|---|
| All logs in /var/log with the following prefixes in their filename: cloud-sql-proxy, dataproc, druid, gcdp, google, hadoop, hdfs, hive, knox, presto, spark, syslog, yarn, zookeeper | Files are placed in the archive logs folder, and keep their original filenames. |
| Dataproc node startup logs for each node (master and worker) in your cluster. | Files are placed in the archive node_startup folder, which contains separate sub-folders for each machine in the cluster. |
| Component gateway logs from journalctl -u google-dataproc-component-gateway | /logs/google-dataproc-component-gateway.log |
Configuration files
| Item(s) included | Location in archive |
|---|---|
| VM metadata | /conf/dataproc/metadata |
| Environment variables in /etc/environment | /conf/dataproc/environment |
| Dataproc properties | /conf/dataproc/dataproc.properties |
| All files in /etc/google-dataproc/ | /conf/dataproc/ |
| All files in /etc/hadoop/conf/ | /conf/hadoop/ |
| All files in /etc/hive/conf/ | /conf/hive/ |
| All files in /etc/hive-hcatalog/conf/ | /conf/hive-hcatalog/ |
| All files in /etc/knox/conf/ | /conf/knox/ |
| All files in /etc/pig/conf/ | /conf/pig/ |
| All files in /etc/presto/conf/ | /conf/presto/ |
| All files in /etc/spark/conf/ | /conf/spark/ |
| All files in /etc/tez/conf/ | /conf/tez/ |
| All files in /etc/zookeeper/conf/ | /conf/zookeeper/ |