Troubleshoot GPU clusters

This document explains how to troubleshoot issues with your GPU clusters using the Cluster Health Scanner (CHS) tool.

The CHS tool checks the health of your GPU clusters, running tests to verify that the clusters are ready to run your workloads. You can use CHS to perform proactive health checks, or as a diagnostic tool when you encounter problems with a workload. In addition to checking the configuration of your cluster, you can perform the following tests:

  • NCCL check: validates the network communication between GPUs using the NVIDIA Collective Communications Library (NCCL).
  • GPU check: utilizes NVIDIA's Data Center GPU Manager (DCGM) tool to check the health of individual GPUs.
  • Neper check: uses the Neper tool to assess network performance within the cluster.
  • Straggler detection: runs a network traffic pattern between nodes that closely resembles the patterns seen during pipeline parallelism in LLM training workloads.
  • Tinymax check: uses MaxText, an open source LLM framework, to assess ML training within the cluster.
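
For example, the GPU check is built on DCGM's diagnostics. If you want to run a comparable diagnostic manually on a single node, the following is a minimal sketch; it assumes that the dcgmi CLI is installed on the node, and the diagnostic level shown is illustrative rather than the exact level that CHS uses:

    # Run a medium-length DCGM diagnostic on the local node.
    # In recent DCGM releases, levels range from 1 (quick) to 4 (extended).
    dcgmi diag -r 2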

You can run CHS checks and tests only on nodes that aren't running any jobs or workloads. If you run a check or test on a busy node, it fails.
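
Before you run a check, you can confirm that the target nodes are idle. The following is a minimal sketch for a Slurm-orchestrated cluster; NODE_NAME is a placeholder for one of your compute nodes. On a GKE-orchestrated cluster, you could instead list the pods scheduled on a node with kubectl:

    # List the idle nodes in the cluster; only these nodes can run CHS checks.
    sinfo --states=idle

    # Confirm that no jobs are running on a specific node.
    squeue --nodelist=NODE_NAME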

The CHS tool is available for GPU clusters that are orchestrated by Google Kubernetes Engine (GKE) or Slurm, regardless of the provisioning model that you used to create the clusters. However, CHS is only available for the following machine types:

  • A4
  • A3 Ultra
  • A3 Mega
  • A3 High

The following sections describe how to install CHS, and then how to use it to perform health checks and check your configuration.

Install CHS

Use the following procedure to install CHS:

  1. Go to the Compute Engine > VM instances page.

  2. Locate the login node. It might have a name with the pattern DEPLOYMENT_NAME-login-001.

  3. From the Connect column of the login node, click SSH.

  4. Use the following command to clone the repository and move to the root directory for the repository:

    git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner && cd cluster-health-scanner
    
  5. Use the following command to install the dependencies for the CHS CLI:

    pip3 install -r cli/requirements.txt
    
  6. Optional: To let the configcheck command fetch configuration values from your cluster without reauthenticating for each machine, use the following command to add your Google Cloud CLI SSH key to your local SSH agent:

    ssh-add ~/.ssh/google_compute_engine
    
  7. Use the following command to add the alias cluster_diag for cluster_diag.py:

    alias cluster_diag="python3 cli/cluster_diag.py"
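
    To confirm that the alias and the dependencies are set up correctly, you can print the tool's help text. This assumes that the cluster_diag CLI supports a standard --help flag:

    # If the installation succeeded, this prints the available subcommands
    # (such as healthscan and configcheck) instead of an error.
    cluster_diag --help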
    

Perform a health check

After you've installed CHS, do the following to check the health of your GPU cluster:

  1. Go to the Compute Engine > VM instances page.

  2. Locate the login node. It might have a name with the pattern DEPLOYMENT_NAME-login-001.

  3. From the Connect column of the login node, click SSH.

  4. Verify that you're in the root directory for the repository.

  5. Use the following command to check the current status of your cluster:

    cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE status
    

    Replace the following:

    • ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
    • GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
      • a4-highgpu-8g
      • a3-ultragpu-8g
      • a3-megagpu-8g
      • a3-highgpu-8g
      • a3-highgpu-4g
      • a3-highgpu-2g
      • a3-highgpu-1g
  6. Use the following command to check the health of individual GPUs within your cluster:

    cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE gpu
    

    Replace the following:

    • ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
    • GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
      • a4-highgpu-8g
      • a3-ultragpu-8g
      • a3-megagpu-8g
      • a3-highgpu-8g
      • a3-highgpu-4g
      • a3-highgpu-2g
      • a3-highgpu-1g
  7. Optional: Use the following template command to run additional checks. Consider adding the --run_only_on_available_nodes flag to skip unavailable nodes:

    cluster_diag -o ORCHESTRATOR healthscan GPU_TYPE CHECK
    

    Replace the following:

    • ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
    • GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
      • a4-highgpu-8g
      • a3-ultragpu-8g
      • a3-megagpu-8g
      • a3-highgpu-8g
      • a3-highgpu-4g
      • a3-highgpu-2g
      • a3-highgpu-1g
    • CHECK: the check that you want to run. Use one of the following options:
      • status
      • nccl
      • gpu
      • straggler
      • neper
      • tinymax
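
For example, on a GKE-orchestrated cluster of a3-megagpu-8g machines, the commands in the preceding procedure might look like the following. The orchestrator and machine type are illustrative, and the flag placement assumes that --run_only_on_available_nodes can follow the check name; substitute your own values:

    # Check the current status of the cluster.
    cluster_diag -o gke healthscan a3-megagpu-8g status

    # Check the health of individual GPUs.
    cluster_diag -o gke healthscan a3-megagpu-8g gpu

    # Run the NCCL check, skipping any nodes that are busy.
    cluster_diag -o gke healthscan a3-megagpu-8g nccl --run_only_on_available_nodes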

Check your configuration

After you've installed CHS, do the following to check the configuration of your cluster:

  1. Verify that you're in the root directory for the repository.
  2. Use the following command to check the configuration of your cluster. By default, this command produces a diff; to skip the diff and just print the configuration, add the --no-diff flag:

    cluster_diag -o ORCHESTRATOR configcheck GPU_TYPE
    

    Replace the following:

    • ORCHESTRATOR: either gke or slurm, depending on which orchestrator you're using.
    • GPU_TYPE: the GPU machine type that you're using, which can be one of the following values:
      • a4-highgpu-8g
      • a3-ultragpu-8g
      • a3-megagpu-8g
      • a3-highgpu-8g
      • a3-highgpu-4g
      • a3-highgpu-2g
      • a3-highgpu-1g
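
For example, on a Slurm-orchestrated cluster of a3-ultragpu-8g machines, a configuration check might look like the following; the orchestrator and machine type are illustrative:

    # Check the cluster's configuration and show a diff.
    cluster_diag -o slurm configcheck a3-ultragpu-8g

    # Print only the configuration values, without the diff.
    cluster_diag -o slurm configcheck a3-ultragpu-8g --no-diff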

The following screenshot shows the result from a successful configuration check:

A successful configuration check result.

What's next