View Dataproc cluster diagnostic data

Dataproc collects the following cluster diagnostic data to help you troubleshoot cluster and job issues:

Checkpoint data: When enabled, Dataproc collects and updates diagnostic data throughout the lifecycle of a cluster.
Snapshot data: You can collect a snapshot of cluster diagnostic data from a running cluster.

Checkpoint data

When the checkpoint data feature is enabled, Dataproc collects diagnostic data during cluster creation, cluster update, and Dataproc Jobs API operations. Dataproc saves the data in the cluster temp bucket in Cloud Storage, which has a TTL retention period of 90 days. The data is deleted at the end of the retention period.

Enable data collection properties: You can include the following optional cluster properties when you create a cluster. They affect the collection of checkpoint diagnostic data on the created cluster only.

Enable data collection: Setting the dataproc:diagnostic.capture.enabled=true property enables the collection of checkpoint diagnostic data on the cluster.
Share diagnostic data: Setting the dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE property shares collected checkpoint diagnostic data with Google Cloud support.
- After cluster creation, you can share the diagnostic data with Google Cloud support by granting read access to the data to the service account that Google Cloud support uses, as follows:
```
gcloud storage objects update \
    gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID \
    --add-acl-grant=entity=user-cloud-diagnose@cloud-dataproc.iam.gserviceaccount.com,role=READER \
    --recursive
```
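
The two checkpoint properties described above can be passed together when the cluster is created. The following bash sketch builds the property string and prints a create command instead of running it. The helper name `diag_properties` is hypothetical, and `CLUSTER_NAME` and `REGION` are placeholders; it assumes the properties are supplied through the `--properties` flag of `gcloud dataproc clusters create`.

```shell
# Sketch: build the --properties value for checkpoint diagnostic data.
# Pass "share" as the first argument to also share the data with
# Google Cloud support.
diag_properties() {
  local props="dataproc:diagnostic.capture.enabled=true"
  if [ "${1:-}" = "share" ]; then
    props="${props},dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE"
  fi
  printf '%s\n' "$props"
}

# Print (do not run) a create command; CLUSTER_NAME and REGION are placeholders.
printf 'gcloud dataproc clusters create %s --region=%s --properties=%s\n' \
  "CLUSTER_NAME" "REGION" "$(diag_properties share)"
```

Because the properties affect only the cluster being created, they must be set again for each new cluster that should collect checkpoint data.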

Diagnostic data

The diagnostic data consists of the following data written to gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID/ in Cloud Storage. This location is referred to as the diagnostic data folder in this section.
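
As a rough illustration of how the diagnostic data folder path is assembled, the following bash sketch builds it from a temp bucket name and a cluster UUID, then prints a listing command without running it. The helper name `diag_folder` and the sample values are hypothetical. The `describe` command in the comment assumes the cluster resource exposes the `config.tempBucket` and `clusterUuid` fields.

```shell
# Sketch: print the diagnostic data folder for a temp bucket and cluster UUID.
diag_folder() {
  local temp_bucket="$1" cluster_uuid="$2"
  printf 'gs://%s/google-cloud-dataproc-diagnostic/%s/\n' \
    "$temp_bucket" "$cluster_uuid"
}

# On a real cluster, the two values could be looked up with (assumed field names):
#   gcloud dataproc clusters describe CLUSTER_NAME --region=REGION \
#       --format='value(config.tempBucket,clusterUuid)'
# The values below are made up for illustration.
folder="$(diag_folder my-temp-bucket 1234abcd)"

# Print (do not run) a command that lists everything under the folder.
echo "gcloud storage ls --recursive ${folder}"
```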

Cluster node detail logs: Dataproc runs the following commands to collect YARN and HDFS information and write it to the following locations in the diagnostic data folder in Cloud Storage.

Command executed	Location in diagnostic folder
`yarn node -list -all`	`.../nodes/timestamp/yarn-nodes.log`
`hdfs dfsadmin -report -live -decommissioning`	`.../nodes/timestamp/hdfs-nodes.log`

Job details: Dataproc saves MapReduce job information and Spark job logs for jobs that use the Dataproc Jobs API. This job data is collected for each MapReduce and Spark job submitted.
- MapReduce job.xml: A file containing job configuration settings, saved at .../jobs/JOB_UUID/mapreduce/job.xml.
- Spark event logs: Job execution details useful for debugging, saved at .../jobs/JOB_UUID/spark/application-id.

Linux system information: Dataproc runs the following commands to collect and save system information in the following locations in the diagnostic data folder in Cloud Storage.

Command	Location in diagnostics folder
`sysctl -a`	`.../system/sysctl.log`
`cat /proc/sys/fs/file-nr`	`.../system/fs-file-nr.log`
`ping -c 1`	`.../system/cluster-ping.log`
`cp /etc/hosts`	`.../system/hosts_entries.log`
`cp /etc/resolv.conf`	`.../system/resolv.conf`

Configuration files: Dataproc saves the following configuration files in the following locations in the diagnostic data folder in Cloud Storage.

Item(s) included	Location in diagnostics folder
Dataproc properties	`.../configs/dataproc/dataproc.properties`
All files in `/etc/google-dataproc/`	`.../configs/dataproc/`
All files in `/etc/hadoop/conf/`	`.../configs/hadoop/`
All files in `/etc/hive/conf/`	`.../configs/hive/`
All files in `/etc/hive-hcatalog/conf/`	`.../configs/hive-hcatalog/`
All files in `/etc/knox/conf/`	`.../configs/knox/`
All files in `/etc/pig/conf/`	`.../configs/pig/`
All files in `/etc/presto/conf/`	`.../configs/presto/`
All files in `/etc/spark/conf/`	`.../configs/spark/`
All files in `/etc/tez/conf/`	`.../configs/tez/`
All files in `/etc/zookeeper/conf/`	`.../configs/zookeeper/`

Snapshot data

You can run the following gcloud dataproc clusters diagnose command to collect a snapshot of diagnostic data from a running cluster. The data is written as an archive (tar) file to the Dataproc staging bucket in Cloud Storage.

gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE

Notes:

CLUSTER_NAME: The name of the cluster to diagnose.
REGION: The cluster's region, for example, us-central1.
--tarball-access=GOOGLE_DATAPROC_DIAGNOSE: This flag gives Google Cloud support access to the diagnostic tar file. Provide Google Cloud support with the Cloud Storage path of the diagnostic tar file.

Note: As an alternative to providing the tar file to support, you can provide the cluster UUID, operation ID of the diagnose command, and the Cloud Storage location of the cluster configuration bucket.
Additional flags:
- --start-time with --end-time: Use both flags to specify a time range in %Y-%m-%dT%H:%M:%S.%fZ format for the collection of diagnostic data. Specifying a time range also enables the collection of Dataproc autoscaling logs during the time range (by default, Dataproc autoscaling logs are not collected in the diagnostic snapshot data).
- You can use either or both of the following flags to collect specific job driver, Spark event, YARN application, and Sparklens output logs:
  - --job-ids: A comma-separated list of job IDs
  - --yarn-application-ids: A comma-separated list of YARN application IDs
    - YARN log aggregation must be enabled (yarn.log-aggregation-enable=true) for the collection of YARN application logs.
    - For MapReduce jobs, only YARN application logs are collected.
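
The time-range flags described above expect timestamps such as `2024-01-31T08:00:00.000000Z`. The following bash sketch checks two timestamps against that shape and prints a diagnose command instead of running it. The helper names and sample values are hypothetical, and accepting one to six fractional digits for `%f` is an assumption rather than documented behavior.

```shell
# Sketch: succeed if $1 looks like %Y-%m-%dT%H:%M:%S.%fZ.
# Assumption: %f is accepted with one to six fractional digits.
is_diag_timestamp() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{1,6}Z$'
}

# Print (do not run) a diagnose command for a time range.
# Both --start-time and --end-time are always emitted together.
diagnose_range_cmd() {
  local cluster="$1" region="$2" start="$3" end="$4"
  if ! is_diag_timestamp "$start" || ! is_diag_timestamp "$end"; then
    echo "timestamps must look like 2024-01-31T08:00:00.000000Z" >&2
    return 1
  fi
  printf 'gcloud dataproc clusters diagnose %s --region=%s --start-time=%s --end-time=%s\n' \
    "$cluster" "$region" "$start" "$end"
}

# Hypothetical cluster name and one-hour window.
diagnose_range_cmd my-cluster us-central1 \
  2024-01-31T08:00:00.000000Z 2024-01-31T09:00:00.000000Z
```

Checking the format locally avoids waiting on a diagnose operation that would be rejected for a malformed timestamp.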

Optional: Run the diagnostic script

The gcloud dataproc clusters diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. As an alternative to running the diagnose command, you can use SSH to connect to the cluster master node, download the diagnostic script, and then run the script locally on the master node.

gcloud compute ssh HOSTNAME
gcloud storage cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh

The diagnostic archive tar file is saved in a local directory. The command output lists the location of the tar file with instructions on how to upload the tar file to a Cloud Storage bucket.

Diagnostic snapshot data

Cluster snapshot data includes a diagnostic summary and several archive sections.

Diagnostic summary: The archive file includes a summary.txt file at the root of the archive. It provides an overview of cluster status, including YARN, HDFS, disk, and networking status, and includes warnings to alert you to potential problems.
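
To read the summary without unpacking the whole archive, something like the following bash sketch may help. The helper name `show_summary` is hypothetical. It assumes the archive was already copied to the local machine and that `summary.txt` is stored at the archive root, with or without a leading `./` in the member name.

```shell
# Sketch: print summary.txt from a local diagnostic archive to stdout.
show_summary() {
  local archive="$1" member
  # Find the root-level summary.txt; member names may start with "./".
  member="$(tar -tf "$archive" | grep -E '^(\./)?summary\.txt$' | head -n 1)"
  if [ -z "$member" ]; then
    echo "summary.txt not found in ${archive}" >&2
    return 1
  fi
  # -O writes the extracted member to stdout instead of to disk.
  tar -xOf "$archive" "$member"
}
```

For example, `show_summary diagnostic.tar | less` pages through the summary, where `diagnostic.tar` stands in for the downloaded archive file.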

Archive sections: The archive file includes the following information that is written to the following archive file locations.

Daemons and services information

Command executed	Location in archive
`yarn node -list -all`	`/system/yarn-nodes.log`
`hdfs dfsadmin -report -live -decommissioning`	`/system/hdfs-nodes.log`
`hdfs dfs -du -h`	`/system/hdfs-du.log`
`service --status-all`	`/system/service.log`
`systemctl --type service`	`/system/systemd-services.log`
`curl "http://${HOSTNAME}:8088/jmx"`	`/metrics/resource_manager_jmx`
`curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps"`	`/metrics/yarn_app_info`
`curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes"`	`/metrics/yarn_node_info`
`curl "http://${HOSTNAME}:9870/jmx"`	`/metrics/namenode_jmx`

JVM information

Command executed	Location in archive
`jstack -l "${DATAPROC_AGENT_PID}"`	`jstack/agent${DATAPROC_AGENT_PID}.jstack`
`jstack -l "${PRESTO_PID}"`	`jstack/agent${PRESTO_PID}.jstack`
`jstack -l "${JOB_DRIVER_PID}"`	`jstack/driver${JOB_DRIVER_PID}.jstack`
`jinfo "${DATAPROC_AGENT_PID}"`	`jinfo/agent${DATAPROC_AGENT_PID}.jstack`
`jinfo "${PRESTO_PID}"`	`jinfo/agent${PRESTO_PID}.jstack`
`jinfo "${JOB_DRIVER_PID}"`	`jinfo/agent${JOB_DRIVER_PID}.jstack`

Linux system information

Command executed	Location in archive
`df -h`	`/system/df.log`
`ps aux`	`/system/ps.log`
`free -m`	`/system/free.log`
`netstat -anp`	`/system/netstat.log`
`sysctl -a`	`/system/sysctl.log`
`uptime`	`/system/uptime.log`
`cat /proc/sys/fs/file-nr`	`/system/fs-file-nr.log`
`ping -c 1`	`/system/cluster-ping.log`

Log files

Item included	Location in archive
All logs in `/var/log` with the following prefixes in their filename: `cloud-sql-proxy` `dataproc` `druid` `gcdp` `google` `hadoop` `hdfs` `hive` `knox` `presto` `spark` `syslog` `yarn` `zookeeper`	Files are placed in the archive `logs` folder, and keep their original filenames.
Dataproc node startup logs for each node (master and worker) in your cluster.	Files are placed in the archive `node_startup` folder, which contains separate subfolders for each machine in the cluster.
Component gateway logs from `journalctl -u google-dataproc-component-gateway`	`/logs/google-dataproc-component-gateway.log`

Configuration files

Item(s) included	Location in archive
VM metadata	`/conf/dataproc/metadata`
Environment variables in `/etc/environment`	`/conf/dataproc/environment`
Dataproc properties	`/conf/dataproc/dataproc.properties`
All files in `/etc/google-dataproc/`	`/conf/dataproc/`
All files in `/etc/hadoop/conf/`	`/conf/hadoop/`
All files in `/etc/hive/conf/`	`/conf/hive/`
All files in `/etc/hive-hcatalog/conf/`	`/conf/hive-hcatalog/`
All files in `/etc/knox/conf/`	`/conf/knox/`
All files in `/etc/pig/conf/`	`/conf/pig/`
All files in `/etc/presto/conf/`	`/conf/presto/`
All files in `/etc/spark/conf/`	`/conf/spark/`
All files in `/etc/tez/conf/`	`/conf/tez/`
All files in `/etc/zookeeper/conf/`	`/conf/zookeeper/`
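
To get a quick view of which of the sections listed above a given archive actually contains, a bash sketch along these lines can list its top-level entries. The helper name `archive_sections` is hypothetical, and the sketch assumes a locally downloaded archive whose member names may or may not carry a leading `./` or `/`.

```shell
# Sketch: list the distinct top-level entries (for example system, logs, conf)
# in a local diagnostic archive.
archive_sections() {
  tar -tf "$1" |
    sed -e 's#^\./##' -e 's#^/##' |  # normalize leading "./" or "/"
    cut -d/ -f1 |                    # keep only the first path component
    grep -v '^$' |                   # drop the bare root entry, if any
    LC_ALL=C sort -u
}
```

Individual files can then be inspected with `tar -xOf ARCHIVE MEMBER`, using a member name exactly as it appears in the `tar -tf` listing.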

Share the archive file

You can share the archive file with Google Cloud support or with other users to get help troubleshooting cluster or job issues.

To share the archive file:

Copy the archive file from Cloud Storage, and then share the downloaded archive, or
Change the permissions on the archive to allow other Google Cloud users or projects to access the file.

Example: The following command gives read permissions to the archive to owners of the project test-project:
```
gcloud storage objects update PATH_TO_ARCHIVE --add-acl-grant=entity=project-owners-test-project,role=READER
```