You can collect the following diagnostic data to help you troubleshoot cluster and job issues:
- Checkpoint diagnostic data: If enabled at cluster creation, Dataproc collects checkpoint diagnostic data during the cluster lifecycle.
- Snapshot diagnostic data: You can collect a one-time snapshot of cluster diagnostic data.
Checkpoint diagnostic data collection
If enabled at cluster creation, Dataproc collects checkpoint diagnostic data during cluster creation, cluster update, and Dataproc job operations.
Checkpoint data location
Dataproc saves checkpoint diagnostic data in the cluster temp bucket in Cloud Storage, in the following folder (the checkpoint diagnostic data folder):
gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID
The default cluster temp bucket has a TTL retention period of 90 days. The data is deleted at the end of the retention period. You can specify your own cluster temp bucket with a different TTL period when you create a cluster.
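For example, the following sketch shows one way to create a cluster that uses a temp bucket you own; CLUSTER_NAME, REGION, and TEMP_BUCKET are placeholders for your own values, and the bucket's retention policy is whatever you configured when you created the bucket:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --temp-bucket=TEMP_BUCKET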
Checkpoint diagnostic data
Cluster node detail logs: Dataproc runs the following commands to collect and write YARN and HDFS information to the following locations in the checkpoint diagnostic data folder in Cloud Storage.
| Command executed | Location in diagnostic folder |
|---|---|
| `yarn node -list -all` | `.../nodes/timestamp/yarn-nodes.log` |
| `hdfs dfsadmin -report -live -decommissioning` | `.../nodes/timestamp/hdfs-nodes.log` |
Job details: Dataproc saves MapReduce job information and Spark job logs for jobs submitted through the Dataproc Jobs API. This job data is collected for each MapReduce and Spark job submitted (see the example after this list).
- MapReduce `job.xml`: A file containing job configuration settings, saved at `.../jobs/JOB_UUID/mapreduce/job.xml`.
- Spark event logs: Job execution details useful for debugging, saved at `.../jobs/JOB_UUID/spark/application-id`.
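As an illustration, a Spark job submitted through the Dataproc Jobs API with the gcloud CLI, such as the sketch below, produces Spark event logs that are saved to the checkpoint diagnostic data folder when checkpoint collection is enabled. CLUSTER_NAME and REGION are placeholders; the example class and jar are the Spark examples commonly shipped on Dataproc images:
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000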
Linux system information: Dataproc runs the following commands to collect and save system information in the following locations in the checkpoint diagnostic data folder in Cloud Storage.
| Command | Location in diagnostic folder |
|---|---|
| `sysctl -a` | `.../system/sysctl.log` |
| `cat /proc/sys/fs/file-nr` | `.../system/fs-file-nr.log` |
| `ping -c 1` | `.../system/cluster-ping.log` |
| `cp /etc/hosts` | `.../system/hosts_entries.log` |
| `cp /etc/resolv.conf` | `.../system/resolv.conf` |
Configuration files: Dataproc saves the following configuration files in the following locations in the checkpoint diagnostic data folder in Cloud Storage.
| Item(s) included | Location in diagnostic folder |
|---|---|
| Dataproc properties | `.../configs/dataproc/dataproc.properties` |
| All files in `/etc/google-dataproc/` | `.../configs/dataproc/` |
| All files in `/etc/hadoop/conf/` | `.../configs/hadoop/` |
| All files in `/etc/hive/conf/` | `.../configs/hive/` |
| All files in `/etc/hive-hcatalog/conf/` | `.../configs/hive-hcatalog/` |
| All files in `/etc/knox/conf/` | `.../configs/knox/` |
| All files in `/etc/pig/conf/` | `.../configs/pig/` |
| All files in `/etc/presto/conf/` | `.../configs/presto/` |
| All files in `/etc/spark/conf/` | `.../configs/spark/` |
| All files in `/etc/tez/conf/` | `.../configs/tez/` |
| All files in `/etc/zookeeper/conf/` | `.../configs/zookeeper/` |
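To browse the collected checkpoint diagnostic data, you can list the checkpoint diagnostic data folder in the cluster temp bucket; for example, a minimal sketch (TEMP_BUCKET and CLUSTER_UUID are placeholders for your cluster's values):
gcloud storage ls --recursive \
    gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID/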
Enable and share checkpoint diagnostic data
You can enable the collection of cluster checkpoint diagnostic data when you create a cluster, and you can share the collected data with Google Cloud support:
Enable checkpoint data collection
Add the `dataproc:diagnostic.capture.enabled=true` cluster property when you create a Dataproc cluster to enable the collection of checkpoint diagnostic data on the cluster.
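For example, the following sketch creates a cluster with checkpoint data collection enabled; CLUSTER_NAME and REGION are placeholders:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties=dataproc:diagnostic.capture.enabled=true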
Share checkpoint data
If you enable uniform bucket-level access on the cluster temp bucket, all of the temp bucket contents are shared with Google Cloud support, including the collected checkpoint diagnostic data and ephemeral data such as MapReduce job history files, aggregated YARN logs, and Spark event logs.
If object-level access control is enabled on the cluster temp bucket, only collected checkpoint diagnostic data (data in the checkpoint diagnostic data folder in Cloud Storage) is shared.
You can enable sharing checkpoint diagnostic data with Google Cloud support when you create a cluster or after cluster creation:
Enable sharing at cluster creation: Add the following property when you create a cluster:
dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE
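As a sketch, you can set this property together with the enable property when you create the cluster; CLUSTER_NAME and REGION are placeholders:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties="dataproc:diagnostic.capture.enabled=true,dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE"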
Enable sharing after cluster creation: You can share checkpoint data with Google Cloud support after cluster creation in the following ways:
Provide read access to the checkpoint data to the service account used by Google Cloud support.
If uniform bucket-level access is enabled on the temp bucket, add an IAM policy binding:
gcloud storage buckets add-iam-policy-binding gs://TEMP_BUCKET \
    --member=serviceAccount:cloud-diagnose@cloud-dataproc.iam.gserviceaccount.com \
    --role=roles/storage.objectViewer
If object-level access control is enabled on the temp bucket, run the following command to share checkpoint diagnostic data:
gcloud storage objects update \
    gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID \
    --add-acl-grant=entity=user-cloud-diagnose@cloud-dataproc.iam.gserviceaccount.com,role=READER \
    --recursive
Snapshot diagnostic data collection
You can run the following gcloud dataproc clusters diagnose command to collect a snapshot of diagnostic data from a running cluster. The data is written as an archive (tar) file to the Dataproc staging bucket in Cloud Storage.
gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE
Notes:
- CLUSTER_NAME: The name of the cluster to diagnose.
- REGION: The cluster's region, for example, us-central1.
- --tarball-access=GOOGLE_DATAPROC_DIAGNOSE: This flag gives Google Cloud support access to the diagnostic tar file. Provide Google Cloud support with the Cloud Storage path of the diagnostic tar file.
- Additional flags (see the example after these notes):
  - --start-time with --end-time: Use both flags to specify a time range in %Y-%m-%dT%H:%M:%S.%fZ format for the collection of diagnostic data. Specifying a time range also enables the collection of Dataproc autoscaling logs during the time range (by default, Dataproc autoscaling logs are not collected in the diagnostic snapshot data).
  - You can use either or both of the following flags to collect specific job driver, Spark event, YARN application, and Sparklens output logs:
    - --job-ids: A comma-separated list of job IDs.
    - --yarn-application-ids: A comma-separated list of YARN application IDs.
      - YARN log aggregation must be enabled (yarn.log-aggregation-enable=true) for the collection of YARN application logs.
      - For MapReduce jobs, only YARN application logs are collected.
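For example, the following sketch collects a diagnostic snapshot for a one-hour window and for a specific job; CLUSTER_NAME, REGION, the timestamps, and JOB_ID are placeholders for your own values:
gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE \
    --start-time=2024-01-15T00:00:00.000Z \
    --end-time=2024-01-15T01:00:00.000Z \
    --job-ids=JOB_ID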
Optional: Run the diagnostic script
The gcloud dataproc clusters diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. As an alternative to running the diagnose command, you can use SSH to connect to the cluster and then run the diagnostic script locally on the master node.
gcloud compute ssh HOSTNAME
gcloud storage cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh
The diagnostic archive tar file is saved in a local directory. The command output lists the location of the tar file with instructions on how to upload the tar file to a Cloud Storage bucket.
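For example, assuming the script saved the archive at /tmp/diagnostic.tar (use the actual path printed in the script output), you could upload it to a bucket you own with a command like:
gcloud storage cp /tmp/diagnostic.tar gs://BUCKET_NAME/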
Diagnostic snapshot data
Cluster snapshot data includes a diagnostic summary and several archive sections.
Diagnostic summary: The archive includes a summary.txt file at its root. It provides an overview of cluster status, including YARN, HDFS, disk, and networking status, and includes warnings to alert you to potential problems.
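For example, after downloading the archive (TARBALL_PATH and ARCHIVE_FILE are placeholders; use the Cloud Storage path and file name reported by the diagnose command), you can extract it and read the summary:
gcloud storage cp TARBALL_PATH .
tar -xf ARCHIVE_FILE
cat summary.txt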
Archive sections: The archive file includes the following information that is written to the following archive file locations.
Daemons and services information
| Command executed | Location in archive |
|---|---|
| `yarn node -list -all` | `/system/yarn-nodes.log` |
| `hdfs dfsadmin -report -live -decommissioning` | `/system/hdfs-nodes.log` |
| `hdfs dfs -du -h` | `/system/hdfs-du.log` |
| `service --status-all` | `/system/service.log` |
| `systemctl --type service` | `/system/systemd-services.log` |
| `curl "http://${HOSTNAME}:8088/jmx"` | `/metrics/resource_manager_jmx` |
| `curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps"` | `/metrics/yarn_app_info` |
| `curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes"` | `/metrics/yarn_node_info` |
| `curl "http://${HOSTNAME}:9870/jmx"` | `/metrics/namenode_jmx` |
JVM information
| Command executed | Location in archive |
|---|---|
| `jstack -l "${DATAPROC_AGENT_PID}"` | `jstack/agent${DATAPROC_AGENT_PID}.jstack` |
| `jstack -l "${PRESTO_PID}"` | `jstack/agent${PRESTO_PID}.jstack` |
| `jstack -l "${JOB_DRIVER_PID}"` | `jstack/driver${JOB_DRIVER_PID}.jstack` |
| `jinfo "${DATAPROC_AGENT_PID}"` | `jinfo/agent${DATAPROC_AGENT_PID}.jstack` |
| `jinfo "${PRESTO_PID}"` | `jinfo/agent${PRESTO_PID}.jstack` |
| `jinfo "${JOB_DRIVER_PID}"` | `jinfo/agent${JOB_DRIVER_PID}.jstack` |
Linux system information
| Command executed | Location in archive |
|---|---|
| `df -h` | `/system/df.log` |
| `ps aux` | `/system/ps.log` |
| `free -m` | `/system/free.log` |
| `netstat -anp` | `/system/netstat.log` |
| `sysctl -a` | `/system/sysctl.log` |
| `uptime` | `/system/uptime.log` |
| `cat /proc/sys/fs/file-nr` | `/system/fs-file-nr.log` |
| `ping -c 1` | `/system/cluster-ping.log` |
Log files
| Item(s) included | Location in archive |
|---|---|
| All logs in `/var/log` with the following prefixes in their filename: `cloud-sql-proxy`, `dataproc`, `druid`, `gcdp`, `google`, `hadoop`, `hdfs`, `hive`, `knox`, `presto`, `spark`, `syslog`, `yarn`, `zookeeper` | Files are placed in the archive `logs` folder, and keep their original filenames. |
| Dataproc node startup logs for each node (master and worker) in your cluster | Files are placed in the archive `node_startup` folder, which contains separate subfolders for each machine in the cluster. |
| Component gateway logs from `journalctl -u google-dataproc-component-gateway` | `/logs/google-dataproc-component-gateway.log` |
Configuration files
| Item(s) included | Location in archive |
|---|---|
| VM metadata | `/conf/dataproc/metadata` |
| Environment variables in `/etc/environment` | `/conf/dataproc/environment` |
| Dataproc properties | `/conf/dataproc/dataproc.properties` |
| All files in `/etc/google-dataproc/` | `/conf/dataproc/` |
| All files in `/etc/hadoop/conf/` | `/conf/hadoop/` |
| All files in `/etc/hive/conf/` | `/conf/hive/` |
| All files in `/etc/hive-hcatalog/conf/` | `/conf/hive-hcatalog/` |
| All files in `/etc/knox/conf/` | `/conf/knox/` |
| All files in `/etc/pig/conf/` | `/conf/pig/` |
| All files in `/etc/presto/conf/` | `/conf/presto/` |
| All files in `/etc/spark/conf/` | `/conf/spark/` |
| All files in `/etc/tez/conf/` | `/conf/tez/` |
| All files in `/etc/zookeeper/conf/` | `/conf/zookeeper/` |
Share the archive file
You can share the archive file with Google Cloud support or with other users to get help troubleshooting cluster or job issues.
To share the archive file:
- Copy the archive file from Cloud Storage, and then share the downloaded archive, or
- Change the permissions on the archive to allow other Google Cloud users or projects to access the file.
Example: The following command gives read permission on the archive to owners of the project test-project:
gcloud storage objects update PATH_TO_ARCHIVE \
    --add-acl-grant=entity=project-owners-test-project,role=READER
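Similarly, as a sketch, you could grant read access to a specific user instead of project owners; the email address below is a placeholder:
gcloud storage objects update PATH_TO_ARCHIVE \
    --add-acl-grant=entity=user-USER_EMAIL,role=READER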