This document explains how to troubleshoot slow performance that you've identified for workloads that run on AI-optimized VMs or clusters.
To learn how to identify slow performance, see Monitor VMs and Slurm clusters.
Identify and address any suspected stragglers for your workload by completing the following steps:
Check if you can use straggler detection for your workload. To review the limitations and requirements for using straggler detection, see Monitor VMs and Slurm clusters.
If you can't use straggler detection, then use other options for troubleshooting slow performance.
To check if any VMs for your workload are suspected stragglers, view straggler detection logs.
Follow the instructions to view straggler detection logs and specify the query for logs with suspected stragglers for specific VMs. Use the time-range selector in the toolbar to select the time range of the slow performance.
Based on the number of VMs for your workload that are suspected stragglers, proceed as follows:
If no VMs are suspected stragglers, then verify if straggler detection is running correctly. To verify if the straggler detection service is running for your project, follow the instructions to view straggler detection logs and specify the query for all straggler detection logs in your project. Then, proceed as follows:
If your project has no straggler detection logs even though VMs have been running for at least 10 minutes, then the straggler detection service is not running for your project. To resolve this, contact Cloud Customer Care or try again later.
Otherwise, if you've verified that straggler detection is running for your project and your workload supports straggler detection, then the slow performance might be caused by a different issue. Use other options for troubleshooting slow performance.
If a small number of VMs in your workload are reported as suspected stragglers, test migrating your workload off the suspected VMs. Then, proceed as follows:
If migration restores performance for your workload, then the suspected VMs might be faulty. For each of these VMs, follow steps to report a faulty host, and set the FAULT_REASON as "STRAGGLER".
If migration doesn't restore performance, then there might be more suspected straggler VMs or the slow performance might be caused by a different issue. You can check if more VMs for your workload are suspected stragglers or use other options for troubleshooting slow performance.
If a large number of VMs in your workload are reported as suspected stragglers, then use other options for troubleshooting slow performance.
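The branching in the preceding steps can be summarized as a small decision helper. The following Python sketch is illustrative only: the function names, the `Action` enum, and the `small_max` cutoff are hypothetical and not part of any Google Cloud API. This document doesn't define what counts as a "small" or "large" number of VMs, so the cutoff is an assumption that you would need to choose for your own workload.

```python
from enum import Enum


class Action(Enum):
    """Next steps named in this document (labels are illustrative)."""
    CONTACT_SUPPORT_OR_RETRY = "contact Cloud Customer Care or try again later"
    MIGRATE_OFF_SUSPECTED_VMS = "test migrating the workload off the suspected VMs"
    REPORT_FAULTY_HOST = "report a faulty host with the fault reason STRAGGLER"
    CHECK_MORE_OR_OTHER_OPTIONS = "check for more stragglers or use other options"
    USE_OTHER_OPTIONS = "use other options for troubleshooting slow performance"


def triage(suspected_vms, detection_logs_present, small_max=2):
    """Choose a next step from the number of suspected straggler VMs.

    suspected_vms: count of VMs reported as suspected stragglers.
    detection_logs_present: whether the project has any straggler detection
        logs, assuming VMs have been running for at least 10 minutes.
    small_max: hypothetical cutoff for "a small number" of VMs; this is an
        assumption, not a value taken from the documentation.
    """
    if suspected_vms < 0:
        raise ValueError("suspected_vms must be non-negative")
    if suspected_vms == 0:
        # No suspects: first confirm that the detection service is running.
        if not detection_logs_present:
            return Action.CONTACT_SUPPORT_OR_RETRY
        return Action.USE_OTHER_OPTIONS
    if suspected_vms <= small_max:
        return Action.MIGRATE_OFF_SUSPECTED_VMS
    # A large number of suspects points to a different kind of issue.
    return Action.USE_OTHER_OPTIONS


def after_migration(performance_restored):
    """Choose the follow-up once the workload is off the suspected VMs."""
    if performance_restored:
        return Action.REPORT_FAULTY_HOST
    return Action.CHECK_MORE_OR_OTHER_OPTIONS
```

For example, `triage(1, True)` returns `Action.MIGRATE_OFF_SUSPECTED_VMS`, and if the migration restores performance, `after_migration(True)` returns `Action.REPORT_FAULTY_HOST`.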
Use other options for troubleshooting slow performance: If the reported list of suspected straggler VMs is large, or if removing the reported straggler VMs doesn't restore performance, try options such as the following:
- Test clusters using cluster health scanner.
- Review other metrics for performance.
- Review other troubleshooting documentation. For example, see Troubleshoot GPU VMs in the Compute Engine documentation.