Troubleshoot slow performance

This document explains how to troubleshoot slow performance that you've identified for workloads that run on AI-optimized VMs or clusters.

To learn how to identify slow performance, see Monitor VMs and Slurm clusters.

  1. Identify and address any suspected stragglers for your workload. Complete the following steps:

    1. Check whether you can use straggler detection for your workload. To review the requirements and limitations of straggler detection, see Monitor VMs and Slurm clusters.

      If you can't use straggler detection, then use other options for troubleshooting slow performance.

    2. To check whether any VMs for your workload are suspected stragglers, view straggler detection logs.

      Follow the instructions to view straggler detection logs, and specify the query that returns logs with suspected stragglers for specific VMs. Use the time-range selector in the toolbar to select the period during which you observed the slow performance.

    3. Based on the number of VMs for your workload that are suspected stragglers, proceed as follows:

  2. Use other options for troubleshooting slow performance. If the reported list of suspected straggler VMs is large, or if removing the reported straggler VMs doesn't restore performance, use other options to troubleshoot slow performance, such as the following: