This page shows you how to resolve issues for VMs running on Compute Engine that have attached GPUs.
If you are trying to create a VM with attached GPUs and are getting errors, review Troubleshooting resource availability errors and Troubleshooting creating and updating VMs.
Troubleshoot GPU VMs by using NVIDIA DCGM
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments.
If you want to use DCGM to troubleshoot issues in your GPU environment, complete the following:
- Ensure that you are using the latest recommended NVIDIA driver for the GPU model that is attached to your VM. To review driver versions, see Recommended NVIDIA driver versions.
- Ensure that you installed the latest version of DCGM. To install the latest version, see DCGM installation. A quick way to verify both versions is sketched after this list.
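For example, you can confirm which driver and DCGM versions are installed by running the following commands on the VM. This is a minimal sketch; the exact output format depends on your driver and DCGM releases.

```bash
# Print the installed NVIDIA driver version for each attached GPU.
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Print the installed DCGM version.
dcgmi --version
```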
Diagnose issues
When you run a `dcgmi` diagnostic command, the issues reported by the diagnostic tool include suggested next steps for resolving them. The following example shows the actionable output from the `dcgmi diag -r memory -j` command.
    {
      ........
      "category":"Hardware",
      "tests":[
        {
          "name":"GPU Memory",
          "results":[
            {
              "gpu_id":"0",
              "info":"GPU 0 Allocated 23376170169 bytes (98.3%)",
              "status":"Fail",
              "warnings":[
                {
                  "warning":"Pending page retirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.",
                  "error_id":83,
                  "error_category":10,
                  "error_severity":6
                }
              ]
            }
      .........
From the preceding output snippet, you can see that GPU 0 has pending page retirements that were caused by a non-recoverable error. The output provides the unique `error_id` and advice on debugging the issue. For this example output, the recommendation is to drain the GPU and reboot the VM. In most cases, following the instructions in this section of the output helps you resolve the issue.
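As a sketch of acting on this particular recommendation, the following commands inspect the page-retirement state reported for GPU 0 and then reboot the VM after you have drained your GPU workloads; adjust the GPU index to match the ID in your own output.

```bash
# Inspect retired pages and any pending retirements on GPU 0.
nvidia-smi -i 0 -q -d PAGE_RETIREMENT

# After draining workloads off the GPU, reboot the VM so that the pending
# page retirements can complete.
sudo reboot
```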
Open a support case
If you are unable to resolve the issues by using the guidance provided by the
output of your dcgmi
diagnostic run, you can open a support case. When you
open a support case, you need to provide the following information:
- The command that was run and the output returned.
- Relevant log files, such as the host engine and diagnostic logs. To gather the required log files, you can run the `gather-dcgm-logs.sh` script (see the sketch after this list). For a default installation on Debian and RPM-based systems, this script is located in `/usr/local/dcgm/scripts`.
- For `dcgmi diag` failures, provide the stats files for the plugins that failed. The stats file uses the following naming convention: `stats_PLUGIN_NAME.json`. For example, if the `pcie` plugin failed, include the file named `stats_pcie.json`.
- NVIDIA system information and driver state. To gather this information, you can run the `nvidia-bug-report.sh` script. Running this script also helps with additional debugging if the problem is caused by other NVIDIA dependencies and not a bug in DCGM itself.
- Details about any recent changes that were made to your environment preceding the failure.
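The following sketch gathers the log files and system information described in this list. The script path assumes a default DCGM installation and might differ in your environment.

```bash
# Host engine and diagnostic logs (default DCGM installation path assumed).
sudo /usr/local/dcgm/scripts/gather-dcgm-logs.sh

# NVIDIA system information and driver state; writes nvidia-bug-report.log.gz
# to the current directory.
sudo nvidia-bug-report.sh
```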
Xid messages
After you create a VM that has attached GPUs, you must install NVIDIA device drivers on your GPU VMs so that your applications can access the GPUs. However, sometimes these drivers return error messages.
An Xid message is an error report from the NVIDIA driver that is printed to the
operating system's kernel log or event log for your Linux VM. These messages are
placed in the /var/log/messages
file.
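To check whether your VM has logged any Xid messages, you can search the kernel log, as in the following sketch. Some distributions write kernel messages to a different file, in which case the kernel ring buffer contains the same entries.

```bash
# Search the kernel log for Xid messages reported by the NVIDIA driver.
sudo grep -i "xid" /var/log/messages

# On distributions that don't use /var/log/messages, check the kernel ring
# buffer instead.
sudo dmesg -T | grep -i "xid"
```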
For more information about Xid messages, including potential causes, see the NVIDIA documentation.
The following sections provide guidance on handling some Xid messages, grouped by the most common types: GPU memory errors, GPU System Processor (GSP) errors, and illegal memory access errors.
GPU memory errors
GPU memory is the memory available on a GPU for temporary storage of data. GPU memory is protected with error correction code (ECC), which detects and corrects single-bit errors (SBE) and detects and reports double-bit errors (DBE).
GPUs released before the NVIDIA A100 support dynamic page retirement. NVIDIA A100 and later GPUs (such as the NVIDIA H100) instead use row remapping for error recovery. ECC is enabled by default, and Google highly recommends keeping it enabled.
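As a sketch, the following commands confirm that ECC is enabled and show the error-recovery state; the ROW_REMAPPER section appears only on GPUs that support row remapping (A100 and later).

```bash
# Confirm that ECC is enabled and review aggregate ECC error counts.
nvidia-smi -q -d ECC

# On A100 and later GPUs, review the row-remapping state.
nvidia-smi -q -d ROW_REMAPPER

# On earlier GPUs, review retired pages instead.
nvidia-smi -q -d PAGE_RETIREMENT
```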
The following are common GPU memory errors and their suggested resolutions.
| Xid error message | Resolution |
|---|---|
| Xid 48: Double Bit ECC | |
| Xid 63: ECC page retirement or row remapping recording event | |
| Xid 64: ECC page retirement or row remapper recording failure<br><br>And the message contains the following information: Xid 64: All reserved rows for bank are remapped | |
| If you get at least two of the following Xid messages together:<br><br>And the message contains the following information: Xid XX: row remap pending | |
| Xid 92: High single-bit ECC error rate | This Xid message is returned after the GPU driver corrects a correctable error, and it shouldn't affect your workloads. This Xid message is informational only. No action is needed. |
| Xid 94: Contained ECC error | |
| Xid 95: Uncontained ECC error | |
GSP errors
A GPU System Processor (GSP) is a microcontroller that runs on GPUs and handles some low-level hardware management functions.
| Xid error message | Resolution |
|---|---|
| Xid 119: GSP RPC timeout | |
| Xid 120: GSP error | |
Illegal memory access errors
The following Xids are returned when applications have illegal memory access issues:
- Xid 13: Graphics Engine Exception
- Xid 31: GPU memory page fault

Illegal memory access errors are typically caused by your workloads trying to access memory that has already been freed or that is out of bounds. Common causes include dereferencing an invalid pointer or accessing an array out of bounds.
To resolve this issue, debug your application. You can use tools such as cuda-memcheck and CUDA-GDB.
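As a minimal sketch, assuming your application binary is named ./my_app (a hypothetical name), you could run it under these tools as follows:

```bash
# Report out-of-bounds and misaligned device memory accesses, plus leaks.
cuda-memcheck --leak-check full ./my_app

# Step through the failing kernel interactively.
cuda-gdb --args ./my_app
```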
In some very rare cases, hardware degradation might cause illegal memory access errors to be returned. To identify whether the issue is with your hardware, use NVIDIA Data Center GPU Manager (DCGM). You can run `dcgmi diag -r 3` or `dcgmi diag -r 4` for different levels of test coverage and duration. If you identify that the issue is with the hardware, file a case with Cloud Customer Care.
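A minimal sketch of such a diagnostic run, with JSON output that you can save and attach to your case:

```bash
# Run the level 3 diagnostic and save the JSON report.
dcgmi diag -r 3 -j > dcgm-diag-r3.json
```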
Other common Xid error messages
| Xid error message | Resolution |
|---|---|
| Xid 74: NVLINK error | |
| Xid 79: GPU has fallen off the bus<br><br>This means the driver is not able to communicate with the GPU. | Reboot the VM. |
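For Xid 74, one way to gather more detail before opening a case is to inspect the NVLink state from inside the VM, as sketched below; persistent link errors can point to a hardware issue.

```bash
# Show the state of each NVLink on every GPU.
nvidia-smi nvlink --status

# Show how the GPUs are connected to each other (NVLink or PCIe paths).
nvidia-smi topo -m
```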
Reset GPUs
Some issues might require you to reset your GPUs. To reset GPUs, complete the following steps:
- For N1, G2, and A2 VMs, reboot the VM by running `sudo reboot`.
- For A3 VMs, run `sudo nvidia-smi --gpu-reset`.
  - For most Linux VMs, the `nvidia-smi` executable is located in the `/var/lib/nvidia/bin` directory.
  - For GKE nodes, the `nvidia-smi` executable is located in the `/home/kubernetes/bin/nvidia` directory.
If errors persist after you reset the GPU, delete and recreate the VM. If the error persists after you delete and recreate the VM, file a case with Cloud Customer Care to move the VM into the repair stage.
What's next
Review GPU machine types.