Dataproc collects the following cluster diagnostic data to help you troubleshoot cluster and job issues:
- Checkpoint data: When enabled, Dataproc collects and updates diagnostic data throughout the lifecycle of a cluster.
- Snapshot data: You can collect a snapshot of cluster diagnostic data from a running cluster.
Checkpoint data
When the checkpoint data feature is enabled, Dataproc collects diagnostic data during cluster creation, cluster update, and Dataproc Jobs API operations. Dataproc saves the data in the cluster temp bucket in Cloud Storage, which has a TTL retention period of 90 days. The data is deleted at the end of the retention period.
Enable data collection properties: You can include the following optional cluster properties when you create a cluster. They affect the collection of checkpoint diagnostic data on the created cluster only.

- Enable data collection: Setting the dataproc:diagnostic.capture.enabled=true property enables the collection of checkpoint diagnostic data on the cluster.
- Share diagnostic data: Setting the dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE property shares the collected checkpoint diagnostic data with Google Cloud support.
- After cluster creation, you can share the diagnostic data with Google Cloud support by giving read access to the data to the service account used by Google Cloud support, as follows:

gsutil -m acl ch -r -u \
    cloud-diagnose@cloud-dataproc.iam.gserviceaccount.com:R \
    gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID
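For reference, a cluster-creation command that sets both properties might look like the following (example-cluster and us-central1 are placeholder values; substitute your own cluster name and region):

gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties=dataproc:diagnostic.capture.enabled=true,dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE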
Diagnostic data
The diagnostic data consists of the following data written to gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID/ in Cloud Storage. This location is referred to as the diagnostic data folder in this section.
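To browse what has been collected, you can list the contents of the diagnostic data folder, for example (TEMP_BUCKET and CLUSTER_UUID are placeholders for your cluster's values):

gsutil ls -r gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID/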
Cluster node detail logs: Dataproc runs the following commands to collect and write YARN and HDFS information to the following locations in the diagnostic data folder in Cloud Storage.

| Command executed | Location in diagnostic folder |
|---|---|
| yarn node -list -all | .../nodes/timestamp/yarn-nodes.log |
| hdfs dfsadmin -report -live -decommissioning | .../nodes/timestamp/hdfs-nodes.log |
Job details: Dataproc saves MapReduce job information and Spark job logs for jobs submitted using the Dataproc Jobs API. This job data is collected for each MapReduce and Spark job submitted.

- MapReduce job.xml: A file containing job configuration settings, saved at .../jobs/JOB_UUID/mapreduce/job.xml.
- Spark event logs: Job execution details useful for debugging, saved at .../jobs/JOB_UUID/spark/application-id.
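For example, to download a job's Spark event log for local inspection, you could copy it from the diagnostic data folder (every path segment below is a placeholder for your own bucket, cluster UUID, job UUID, and application ID):

gsutil cp gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID/jobs/JOB_UUID/spark/application-id .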
Linux system information: Dataproc runs the following commands to collect and save system information in the following locations in the diagnostic data folder in Cloud Storage.

| Command | Location in diagnostics folder |
|---|---|
| sysctl -a | .../system/sysctl.log |
| cat /proc/sys/fs/file-nr | .../system/fs-file-nr.log |
| ping -c 1 | .../system/cluster-ping.log |
| cp /etc/hosts | .../system/hosts_entries.log |
| cp /etc/resolv.conf | .../system/resolv.conf |
Configuration files: Dataproc saves the following configuration files in the following locations in the diagnostic data folder in Cloud Storage.

| Item(s) included | Location in diagnostics folder |
|---|---|
| Dataproc properties | .../configs/dataproc/dataproc.properties |
| All files in `/etc/google-dataproc/` | .../configs/dataproc/ |
| All files in `/etc/hadoop/conf/` | .../configs/hadoop/ |
| All files in `/etc/hive/conf/` | .../configs/hive/ |
| All files in `/etc/hive-hcatalog/conf/` | .../configs/hive-hcatalog/ |
| All files in `/etc/knox/conf/` | .../configs/knox/ |
| All files in `/etc/pig/conf/` | .../configs/pig/ |
| All files in `/etc/presto/conf/` | .../configs/presto/ |
| All files in `/etc/spark/conf/` | .../configs/spark/ |
| All files in `/etc/tez/conf/` | .../configs/tez/ |
| All files in `/etc/zookeeper/conf/` | .../configs/zookeeper/ |
Snapshot data
You can run the following gcloud dataproc clusters diagnose command to collect a snapshot of diagnostic data from a running cluster. The data is written as an archive (tar) file to the Dataproc staging bucket in Cloud Storage.

gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE

Notes:

- CLUSTER_NAME: The name of the cluster to diagnose.
- REGION: The cluster's region, for example, us-central1.

Optional flags:
You can use either or both of the following flags to collect specific job driver, Spark event, YARN application, and Sparklens output logs:

- --job-ids: A comma-separated list of job IDs.
- --yarn-application-ids: A comma-separated list of YARN application IDs.

Note: YARN log aggregation must be enabled (yarn.log-aggregation-enable=true) for the collection of YARN application logs. For MapReduce jobs, only YARN application logs are collected.

You can also use the following flags:

- --start-time with --end-time: Use both flags to specify a time range, in %Y-%m-%dT%H:%M:%S.%fZ format, for the collection of diagnostic data. Specifying a time range also enables the collection of Dataproc autoscaling logs during the time range (by default, Dataproc autoscaling logs are not collected in the diagnostic snapshot data).
- --tarball-access=GOOGLE_DATAPROC_DIAGNOSE: Use this flag to submit or provide access to the diagnostic tar file to Google Cloud support. Also provide the following information to Google Cloud support:
  - the Cloud Storage path of the diagnostic tar file, or
  - the cluster configuration bucket, cluster UUID, and operation ID of the diagnose command
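For example, a snapshot command that limits collection to specific jobs and a time range, and shares the result with Google Cloud support, might look like the following (the cluster name, region, job IDs, and timestamps are placeholder values):

gcloud dataproc clusters diagnose example-cluster \
    --region=us-central1 \
    --job-ids=job-1234,job-5678 \
    --start-time=2024-06-01T00:00:00.000000Z \
    --end-time=2024-06-02T00:00:00.000000Z \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE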
If needed, run the diagnostic script
The gcloud dataproc clusters diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. As an alternative to running the diagnose command, you can use SSH to connect to the cluster master node, download the diagnostic script, and then run the script locally on the master node.
gcloud compute ssh HOSTNAME
gsutil cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh
The diagnostic archive tar file is saved in a local directory. The command output lists the location of the tar file and provides instructions for uploading it to a Cloud Storage bucket.
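For example, an upload command could look like the following (the local file name and bucket are placeholders; use the actual path and instructions printed by the script):

gsutil cp diagnostic-output.tar.gz gs://BUCKET_NAME/diagnostics/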
Diagnostic snapshot data
Cluster snapshot data includes a diagnostic summary and several archive sections.
Diagnostic summary: The archive file includes a summary.txt file at the root of the archive. It provides an overview of cluster status, including YARN, HDFS, disk, and networking status, and includes warnings to alert you to potential problems.
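To review the summary locally, you might download and extract the archive and then open summary.txt, for example (the archive file name is a placeholder for the downloaded tar file):

tar -xf ARCHIVE_NAME.tar
cat summary.txt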
Archive sections: The archive file includes the following information, written to the following locations within the archive.
Daemons and services information
| Command executed | Location in archive |
|---|---|
| yarn node -list -all | /system/yarn-nodes.log |
| hdfs dfsadmin -report -live -decommissioning | /system/hdfs-nodes.log |
| hdfs dfs -du -h | /system/hdfs-du.log |
| service --status-all | /system/service.log |
| systemctl --type service | /system/systemd-services.log |
| curl "http://${HOSTNAME}:8088/jmx" | /metrics/resource_manager_jmx |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps" | /metrics/yarn_app_info |
| curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes" | /metrics/yarn_node_info |
| curl "http://${HOSTNAME}:9870/jmx" | /metrics/namenode_jmx |
JVM information
| Command executed | Location in archive |
|---|---|
| jstack -l "${DATAPROC_AGENT_PID}" | jstack/agent${DATAPROC_AGENT_PID}.jstack |
| jstack -l "${PRESTO_PID}" | jstack/agent${PRESTO_PID}.jstack |
| jstack -l "${JOB_DRIVER_PID}" | jstack/driver${JOB_DRIVER_PID}.jstack |
| jinfo "${DATAPROC_AGENT_PID}" | jinfo/agent${DATAPROC_AGENT_PID}.jstack |
| jinfo "${PRESTO_PID}" | jinfo/agent${PRESTO_PID}.jstack |
| jinfo "${JOB_DRIVER_PID}" | jinfo/agent${JOB_DRIVER_PID}.jstack |
Linux system information
| Command executed | Location in archive |
|---|---|
| df -h | /system/df.log |
| ps aux | /system/ps.log |
| free -m | /system/free.log |
| netstat -anp | /system/netstat.log |
| sysctl -a | /system/sysctl.log |
| uptime | /system/uptime.log |
| cat /proc/sys/fs/file-nr | /system/fs-file-nr.log |
| ping -c 1 | /system/cluster-ping.log |
Log files

| Item included | Location in archive |
|---|---|
| All logs in /var/log with the following prefixes in their filename: cloud-sql-proxy, dataproc, druid, gcdp, google, hadoop, hdfs, hive, knox, presto, spark, syslog, yarn, zookeeper | Files are placed in the archive logs folder and keep their original filenames. |
| Dataproc node startup logs for each node (master and worker) in your cluster | Files are placed in the archive node_startup folder, which contains separate subfolders for each machine in the cluster. |
| Component gateway logs from journalctl -u google-dataproc-component-gateway | /logs/google-dataproc-component-gateway.log |
Configuration files
| Item(s) included | Location in archive |
|---|---|
| VM metadata | /conf/dataproc/metadata |
| Environment variables in /etc/environment | /conf/dataproc/environment |
| Dataproc properties | /conf/dataproc/dataproc.properties |
| All files in /etc/google-dataproc/ | /conf/dataproc/ |
| All files in /etc/hadoop/conf/ | /conf/hadoop/ |
| All files in /etc/hive/conf/ | /conf/hive/ |
| All files in /etc/hive-hcatalog/conf/ | /conf/hive-hcatalog/ |
| All files in /etc/knox/conf/ | /conf/knox/ |
| All files in /etc/pig/conf/ | /conf/pig/ |
| All files in /etc/presto/conf/ | /conf/presto/ |
| All files in /etc/spark/conf/ | /conf/spark/ |
| All files in /etc/tez/conf/ | /conf/tez/ |
| All files in /etc/zookeeper/conf/ | /conf/zookeeper/ |
Share the archive file
You can share the archive file with Google Cloud support or users to obtain help to troubleshoot cluster or job issues.
To share the archive file:
- Copy the archive file from Cloud Storage, and then share the downloaded archive, or
- Change the permissions on the archive to allow other Google Cloud users or projects to access the file.

Example: The following command adds read permissions to the archive for the test-project project:

gsutil -m acl ch -g test-project:R PATH_TO_ARCHIVE
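To share a downloaded copy of the archive instead, you can first copy it locally (PATH_TO_ARCHIVE is a placeholder for the archive's Cloud Storage path):

gsutil cp PATH_TO_ARCHIVE .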