View Dataproc cluster diagnostic data

Dataproc collects the following cluster diagnostic data to help you troubleshoot cluster and job issues:

Checkpoint data

When the checkpoint data feature is enabled, Dataproc collects diagnostics data during cluster creation, cluster update, and Dataproc Jobs API operations. Dataproc saves the data in the cluster temp bucket in Cloud Storage, which has a TTL retention period of 90 days. The data is deleted at the end of the retention period.

Enable data collection properties: You can include the following optional cluster properties when you create a cluster. They affect the collection of checkpoint diagnostic data on the created cluster only.

  • Enable data collection: Setting the dataproc:diagnostic.capture.enabled=true property enables the collection of checkpoint diagnostic data on the cluster.
  • Share diagnostic data: Setting the dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE property shares collected checkpoint diagnostic data with Google Cloud support.
    • After cluster creation, you can share the diagnostic data with Google Cloud support by granting read access to the service account that Google Cloud support uses, as follows:
      gsutil -m acl ch -r -u \
          cloud-diagnose@cloud-dataproc.iam.gserviceaccount.com:R \
          gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID
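
As a rough sketch of how the two properties described above might be passed at cluster creation, the following helper assembles a gcloud dataproc clusters create call with the --properties flag. The function name and its arguments are illustrative only and are not part of Dataproc; add whatever other creation flags your cluster needs.

```shell
# Hypothetical helper: create a cluster with checkpoint diagnostic data enabled.
# Usage: create_cluster_with_diagnostics CLUSTER_NAME REGION [share]
# Pass "share" as the third argument to also share the collected data
# with Google Cloud support.
create_cluster_with_diagnostics() {
  local cluster="$1" region="$2" share="${3:-}"
  local props="dataproc:diagnostic.capture.enabled=true"
  if [ "$share" = "share" ]; then
    props="${props},dataproc:diagnostic.capture.access=GOOGLE_DATAPROC_DIAGNOSE"
  fi
  gcloud dataproc clusters create "$cluster" \
      --region="$region" \
      --properties="$props"
}
```

For example, create_cluster_with_diagnostics my-cluster us-central1 share would enable collection and sharing in one step. Because the properties apply only at creation time, an existing cluster needs the gsutil command shown earlier instead.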
      

Diagnostic data

The diagnostic data consists of the following data written to gs://TEMP_BUCKET/google-cloud-dataproc-diagnostic/CLUSTER_UUID/ in Cloud Storage. This location is referred to as the diagnostic data folder in this section.

  • Cluster node detail logs: Dataproc runs the following commands to collect YARN and HDFS information, and writes the output to the following locations in the diagnostic data folder in Cloud Storage.

    Command executed                               Location in diagnostic folder
    yarn node -list -all                           .../nodes/timestamp/yarn-nodes.log
    hdfs dfsadmin -report -live -decommissioning   .../nodes/timestamp/hdfs-nodes.log

  • Job details: Dataproc saves MapReduce job information and Spark job logs for jobs submitted through the Dataproc Jobs API. This data is collected for each MapReduce and Spark job submitted.

    • MapReduce job.xml: A file containing job configuration settings, saved at .../jobs/JOB_UUID/mapreduce/job.xml.
    • Spark event logs: Job execution details useful for debugging, saved at .../jobs/JOB_UUID/spark/application-id.
  • Linux system information: Dataproc runs the following commands to collect and save system information in the following locations in the diagnostic data folder in Cloud Storage.

    Command                    Location in diagnostics folder
    sysctl -a                  .../system/sysctl.log
    cat /proc/sys/fs/file-nr   .../system/fs-file-nr.log
    ping -c 1                  .../system/cluster-ping.log
    cp /etc/hosts              .../system/hosts_entries.log
    cp /etc/resolv.conf        .../system/resolv.conf
  • Configuration files: Dataproc saves the following configuration files in the following locations in the diagnostic data folder in Cloud Storage.

    Item(s) included                        Location in diagnostics folder
    Dataproc properties                     .../configs/dataproc/dataproc.properties
    All files in /etc/google-dataproc/      .../configs/dataproc/
    All files in /etc/hadoop/conf/          .../configs/hadoop/
    All files in /etc/hive/conf/            .../configs/hive/
    All files in /etc/hive-hcatalog/conf/   .../configs/hive-hcatalog/
    All files in /etc/knox/conf/            .../configs/knox/
    All files in /etc/pig/conf/             .../configs/pig/
    All files in /etc/presto/conf/          .../configs/presto/
    All files in /etc/spark/conf/           .../configs/spark/
    All files in /etc/tez/conf/             .../configs/tez/
    All files in /etc/zookeeper/conf/       .../configs/zookeeper/
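
To browse the collected checkpoint data, you first need the temp bucket name and the cluster UUID. The following is a minimal sketch, assuming both values can be read from the cluster resource fields config.tempBucket and clusterUuid with gcloud dataproc clusters describe. The two function names are hypothetical.

```shell
# Hypothetical helper: print the diagnostic data folder for a cluster.
# Usage: diagnostic_folder TEMP_BUCKET CLUSTER_UUID
diagnostic_folder() {
  printf 'gs://%s/google-cloud-dataproc-diagnostic/%s/\n' "$1" "$2"
}

# Hypothetical helper: look up the temp bucket and UUID of a cluster,
# then recursively list the checkpoint data collected so far.
# Usage: list_checkpoint_data CLUSTER_NAME REGION
list_checkpoint_data() {
  local bucket uuid
  bucket="$(gcloud dataproc clusters describe "$1" --region="$2" \
      --format='value(config.tempBucket)')"
  uuid="$(gcloud dataproc clusters describe "$1" --region="$2" \
      --format='value(clusterUuid)')"
  gsutil ls -r "$(diagnostic_folder "$bucket" "$uuid")"
}
```

The listing should show the nodes, jobs, system, and configs subfolders described above, provided checkpoint data collection was enabled when the cluster was created.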

Snapshot data

You can run the following gcloud dataproc clusters diagnose command to collect a snapshot of diagnostic data from a running cluster. The data is written as an archive (tar) file to the Dataproc staging bucket in Cloud Storage.

gcloud dataproc clusters diagnose CLUSTER_NAME \
    --region=REGION \
    --tarball-access=GOOGLE_DATAPROC_DIAGNOSE

Notes:

  • CLUSTER_NAME: The name of the cluster to diagnose.
  • REGION: The cluster's region, for example, us-central1.
  • OPTIONAL FLAGS:

    • You can use either or both of the following flags to collect specific job driver, Spark event, YARN application, and Sparklens output logs:

      • --job-ids: A comma-separated list of job IDs.
      • --yarn-application-ids: A comma-separated list of YARN application IDs.

      Notes:

      • YARN log aggregation must be enabled (yarn.log-aggregation-enable=true) for the collection of YARN application logs.
      • For MapReduce jobs, only YARN application logs are collected.

    • --start-time with --end-time: Use both flags to specify a time range in %Y-%m-%dT%H:%M:%S.%fZ format for the collection of diagnostic data. Specifying a time range also enables the collection of Dataproc autoscaling logs during the time range (by default, Dataproc autoscaling logs are not collected in the diagnostic snapshot data).

    • --tarball-access=GOOGLE_DATAPROC_DIAGNOSE: Use this flag to submit or provide access to the diagnostic tar file to Google Cloud support. Also provide one of the following to Google Cloud support:

      • Cloud Storage path of the diagnostic tar file or
      • Cluster configuration bucket, cluster UUID, and operation ID of the diagnose command
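
The time-range flags expect timestamps in the %Y-%m-%dT%H:%M:%S.%fZ layout, which is easy to get wrong by hand. The following sketch shows one way to generate them and pass them to the diagnose command. It assumes GNU date (for -d @EPOCH and the %6N microsecond field), and both function names are hypothetical.

```shell
# Hypothetical helper: format a Unix epoch time (in seconds) as a UTC
# timestamp in the %Y-%m-%dT%H:%M:%S.%fZ layout. Requires GNU date.
diag_timestamp() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%S.%6NZ
}

# Hypothetical helper: collect a snapshot covering the last N hours.
# Usage: diagnose_last_hours CLUSTER_NAME REGION HOURS
diagnose_last_hours() {
  local now start
  now="$(date +%s)"
  start=$((now - $3 * 3600))
  gcloud dataproc clusters diagnose "$1" \
      --region="$2" \
      --start-time="$(diag_timestamp "$start")" \
      --end-time="$(diag_timestamp "$now")"
}
```

For example, diagnose_last_hours my-cluster us-central1 2 requests diagnostic data, including autoscaling logs, for the past two hours. You can append --job-ids or --yarn-application-ids to the gcloud call to narrow the collection further.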

If needed, run the diagnostic script

The gcloud dataproc clusters diagnose command can fail or time out if a cluster is in an error state and cannot accept diagnose tasks from the Dataproc server. As an alternative to running the diagnose command, you can use SSH to connect to the cluster master node, download the diagnostic script, and then run the script locally on the master node.

gcloud compute ssh HOSTNAME
gsutil cp gs://dataproc-diagnostic-scripts/diagnostic-script.sh .
sudo bash diagnostic-script.sh

The diagnostic archive tar file is saved in a local directory. The command output lists the location of the tar file with instructions on how to upload the tar file to a Cloud Storage bucket.

Diagnostic snapshot data

Cluster snapshot data includes a diagnostic summary and several archive sections.

Diagnostic summary: The archive file includes a summary.txt file at the root of the archive. It provides an overview of cluster status, including YARN, HDFS, disk, and networking status, and includes warnings to alert you to potential problems.
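
Once the archive has been copied to a local machine (for example, with gsutil cp), the summary can be read without opening the other sections. The following is a minimal sketch. The function name and the default output directory are hypothetical, and it assumes summary.txt sits at the archive root as described above.

```shell
# Hypothetical helper: extract a diagnostic archive into a directory
# and print the summary file found at the archive root.
# Usage: show_diagnostic_summary ARCHIVE_FILE [OUTPUT_DIR]
show_diagnostic_summary() {
  local archive="$1" outdir="${2:-diagnostic-snapshot}"
  mkdir -p "$outdir" || return 1
  # GNU tar and bsdtar detect compression automatically when reading.
  tar -xf "$archive" -C "$outdir" || return 1
  cat "$outdir/summary.txt"
}
```

After extraction, the remaining archive sections are available under the output directory for closer inspection.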

Archive sections: The archive file includes the following information, written to the following locations in the archive.

  • Daemons and services information

    Command executed                                     Location in archive
    yarn node -list -all                                 /system/yarn-nodes.log
    hdfs dfsadmin -report -live -decommissioning         /system/hdfs-nodes.log
    hdfs dfs -du -h                                      /system/hdfs-du.log
    service --status-all                                 /system/service.log
    systemctl --type service                             /system/systemd-services.log
    curl "http://${HOSTNAME}:8088/jmx"                   /metrics/resource_manager_jmx
    curl "http://${HOSTNAME}:8088/ws/v1/cluster/apps"    /metrics/yarn_app_info
    curl "http://${HOSTNAME}:8088/ws/v1/cluster/nodes"   /metrics/yarn_node_info
    curl "http://${HOSTNAME}:9870/jmx"                   /metrics/namenode_jmx

  • JVM information

    Command executed                    Location in archive
    jstack -l "${DATAPROC_AGENT_PID}"   jstack/agent_${DATAPROC_AGENT_PID}.jstack
    jstack -l "${PRESTO_PID}"           jstack/agent_${PRESTO_PID}.jstack
    jstack -l "${JOB_DRIVER_PID}"       jstack/driver_${JOB_DRIVER_PID}.jstack
    jinfo "${DATAPROC_AGENT_PID}"       jinfo/agent_${DATAPROC_AGENT_PID}.jstack
    jinfo "${PRESTO_PID}"               jinfo/agent_${PRESTO_PID}.jstack
    jinfo "${JOB_DRIVER_PID}"           jinfo/agent_${JOB_DRIVER_PID}.jstack

  • Linux system information

    Command executed           Location in archive
    df -h                      /system/df.log
    ps aux                     /system/ps.log
    free -m                    /system/free.log
    netstat -anp               /system/netstat.log
    sysctl -a                  /system/sysctl.log
    uptime                     /system/uptime.log
    cat /proc/sys/fs/file-nr   /system/fs-file-nr.log
    ping -c 1                  /system/cluster-ping.log

  • Log files

    • All logs in /var/log with the following prefixes in their filename: cloud-sql-proxy, dataproc, druid, gcdp, google, hadoop, hdfs, hive, knox, presto, spark, syslog, yarn, zookeeper. Files are placed in the archive logs folder, and keep their original filenames.
    • Dataproc node startup logs for each node (master and worker) in your cluster. Files are placed in the archive node_startup folder, which contains separate subfolders for each machine in the cluster.
    • Component gateway logs from journalctl -u google-dataproc-component-gateway. Saved at /logs/google-dataproc-component-gateway.log.

  • Configuration files

    Item(s) included                            Location in archive
    VM metadata                                 /conf/dataproc/metadata
    Environment variables in /etc/environment   /conf/dataproc/environment
    Dataproc properties                         /conf/dataproc/dataproc.properties
    All files in /etc/google-dataproc/          /conf/dataproc/
    All files in /etc/hadoop/conf/              /conf/hadoop/
    All files in /etc/hive/conf/                /conf/hive/
    All files in /etc/hive-hcatalog/conf/       /conf/hive-hcatalog/
    All files in /etc/knox/conf/                /conf/knox/
    All files in /etc/pig/conf/                 /conf/pig/
    All files in /etc/presto/conf/              /conf/presto/
    All files in /etc/spark/conf/               /conf/spark/
    All files in /etc/tez/conf/                 /conf/tez/
    All files in /etc/zookeeper/conf/           /conf/zookeeper/

Share the archive file

You can share the archive file with Google Cloud support or users to obtain help to troubleshoot cluster or job issues.

To share the archive file:

  • Copy the archive file from Cloud Storage, and then share the downloaded archive, or
  • Change the permissions on the archive to allow other Google Cloud users or projects to access the file.

    Example: The following command adds read permissions to the archive in a test-project:

    gsutil -m acl ch -g test-project:R PATH_TO_ARCHIVE