{
........
"category":"Hardware",
"tests":[
{
"name":"GPU Memory",
"results":[
{
"gpu_id":"0",
"info":"GPU 0 Allocated 23376170169
bytes (98.3%)",
"status":"Fail",
""warnings":[
{
"warning":"Pending page
retirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.",
"error_id":83,
"error_category":10,
"error_severity":6
}
]
}
.........
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-03 (世界標準時間)。"],[[["\u003cp\u003eThis guide provides steps to troubleshoot issues with virtual machines (VMs) on Compute Engine that have attached GPUs, primarily using NVIDIA Data Center GPU Manager (DCGM) and analyzing Xid error messages.\u003c/p\u003e\n"],["\u003cp\u003eUsing \u003ccode\u003edcgmi\u003c/code\u003e diagnostic commands can help identify GPU issues, such as memory problems, and the output provides guidance, like error IDs and severity levels, to address those problems, including steps like draining the GPU or rebooting the VM.\u003c/p\u003e\n"],["\u003cp\u003eIf DCGM diagnostics fail to resolve the issue, users should open a support case, providing the \u003ccode\u003edcgmi\u003c/code\u003e command output, relevant log files, stats files, NVIDIA system information, and details of recent environmental changes.\u003c/p\u003e\n"],["\u003cp\u003eXid messages in the kernel or event logs indicate NVIDIA driver errors, which can be categorized into GPU memory errors, GPU System Processor (GSP) errors, and illegal memory access errors, each having specific resolutions like resetting GPUs, deleting, or recreating the VM.\u003c/p\u003e\n"],["\u003cp\u003eResolving certain GPU errors may require you to reset the GPU by rebooting the VM or using \u003ccode\u003envidia-smi --gpu-reset\u003c/code\u003e command, and if issues continue, a VM delete and recreate may be needed before opening a support case.\u003c/p\u003e\n"]]],[],null,["# Troubleshoot GPU VMs\n\n*** ** * ** ***\n\nThis page shows you how to resolve issues for VMs running on Compute Engine\nthat have attached GPUs.\n\nIf you are trying to create a VM with attached GPUs and are getting errors,\nreview [Troubleshooting resource availability errors](/compute/docs/troubleshooting/troubleshooting-resource-availability) and\n[Troubleshooting creating and updating VMs](/compute/docs/troubleshooting/troubleshooting-vm-creation).\n\nTroubleshoot GPU VMs by using NVIDIA DCGM\n-----------------------------------------\n\nNVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and\nmonitoring NVIDIA data center GPUs in cluster environments.\n\nIf you want to use DCGM to troubleshoot issues in your GPU environment, complete\nthe following:\n\n- Ensure that you are using the latest recommended NVIDIA driver for the GPU model that is attached to your VM. To review driver versions, see [Recommended NVIDIA driver versions](/compute/docs/gpus/install-drivers-gpu#minimum-driver).\n- Ensure that you installed the latest version of DCGM. To install the latest version, see [DCGM installation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html).\n\n### Diagnose issues\n\nWhen you run a `dcgmi` diagnostic command, the issues reported by the diagnostic\ntool include next steps for taking action on the issue. The following example\nshows the actionable output from the `dcgmi diag -r memory -j` command. \n\n```\n{\n ........\n \"category\":\"Hardware\",\n \"tests\":[\n {\n \"name\":\"GPU Memory\",\n \"results\":[\n {\n \"gpu_id\":\"0\",\n \"info\":\"GPU 0 Allocated 23376170169\nbytes (98.3%)\",\n \"status\":\"Fail\",\n \"\"warnings\":[\n {\n \"warning\":\"Pending page\nretirements together with a DBE were detected on GPU 0. Drain the GPU and reset it or reboot the node to resolve this issue.\",\n \"error_id\":83,\n \"error_category\":10,\n \"error_severity\":6\n }\n ]\n }\n .........\n```\n\nFrom the preceding output snippet, you can see that `GPU 0` has pending page\nretirements that are caused by a non-recoverable error.\nThe output provided the unique `error_id` and advice on debugging the issue.\nFor this example output, it is recommended that you drain the GPU and reboot\nthe VM. In most cases, following the instructions in this section of the output\ncan help to resolve the issue.\n| **Pro Tip:** Take note of the error severity.\n| In the example output a value of `\"error_severity\":6`\n| corresponds to a `DCGM_ERROR_RESET` which means that a\n| reset resolves issues with this severity value.\n|\n| For a full list of\n| `error_severity` values, review the `dcgmErrorSeverity_enum`\n| section on the\n| [`dcgm_errors` GitHub file](https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_errors.h).\n\n### Open a support case\n\nIf you are unable to resolve the issues by using the guidance provided by the\noutput of your `dcgmi` diagnostic run, you can open a support case. When you\nopen a support case, you need to provide the following information:\n\n1. The command that was run and the output returned.\n2. Relevant log files such as host engine and diagnostic logs. To gather the\n required log files, you can run the `gather-dcgm-logs.sh` script.\n\n For a default installation on Debian and RPM-based systems, this script is\n located in `/usr/local/dcgm/scripts`.\n3. For `dcgmi diag` failures, provide the stats files for the plugins that failed.\n The stats file uses the following naming convention:\n `stats_`\u003cvar translate=\"no\"\u003ePLUGIN_NAME\u003c/var\u003e`.json`.\n\n For example, if the `pcie` plugin failed, include the file named `stats_pcie.json`.\n4. NVIDIA system information and driver state. To gather this information, you\n can run the `nvidia-bug-report.sh` script. If you are using an instance with\n Blackwell GPUs, follow [Generate NVIDIA Bug Report for Blackwell GPUs](/compute/docs/troubleshooting/generate-nvidia-bug-report-for-blackwell-gpus) to obtain a comprehensive bug report.\n\n Running this script also helps with additional debugging if the problem is\n caused by other NVIDIA dependencies and not a bug in DCGM itself.\n5. Details about any recent changes that were made to your environment\n preceding the failure.\n\nXid messages\n------------\n\nAfter you create a VM that has attached GPUs, you must install NVIDIA device\ndrivers [on your GPU VMs](/compute/docs/gpus/install-drivers-gpu)\nso that your applications can access the GPUs. However, sometimes these drivers\nreturn error messages.\n\nAn Xid message is an error report from the NVIDIA driver that is printed to the\noperating system's kernel log or event log for your Linux VM. These messages are\nplaced in the `/var/log/messages` file.\n\nFor more information about Xid messages including potential causes,\nsee [NVIDIA documentation](https://docs.nvidia.com/deploy/xid-errors/index.html).\n\nThe following section provides guidance on handling some Xid messages grouped\nby the most common types: GPU memory errors, GPU System Processor (GSP) errors,\nand illegal memory access errors.\n\n### GPU memory errors\n\nGPU memory is the memory that is available on a GPU that can be used for\ntemporary storage of data. GPU memory is protected with Error Correction Code,\nECC, which detects and corrects single bit errors (SBE) and detects and reports\nDouble Bit Errors (DBE).\n\nPrior to the release of the NVIDIA A100 GPUs,\n[dynamic page retirement](https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-blacklisting)\nwas supported. For NVIDIA A100 and later GPU releases (such as NVIDIA H100),\n[row remap error](https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping)\nrecovery is introduced. ECC is enabled by default. Google highly recommends\nkeeping ECC enabled.\n\nThe following are common GPU memory errors and their suggested resolutions.\n\n### GSP errors\n\nA [GPU System Processor](https://download.nvidia.com/XFree86/Linux-x86_64/510.39.01/README/gsp.html)\n(GSP) is a microcontroller that runs on GPUs and handles some of the low level\nhardware management functions.\n\n### Illegal memory access errors\n\nThe following Xids are returned when applications have illegal memory access\nissues:\n\n- `Xid 13: `*Graphics Engine Exception*\n- `Xid 31: `*GPU memory page fault*\n\nIllegal memory access errors are typically caused by your workloads trying\nto access memory that is already freed or is out of bounds. This can be caused\nby issues such as the dereferencing of an invalid pointer, or an out bounds array.\n\nTo resolve this issue, you need to debug your application. To debug your\napplication, you can use\n[cuda-memcheck](https://developer.nvidia.com/cuda-memcheck) and\n[CUDA-GDB](https://docs.nvidia.com/cuda/cuda-gdb/index.html).\n\nIn some very rare cases, hardware degradation might cause illegal memory access\nerrors to be returned. To identify if the issue is with your hardware, use\n[NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm).\nYou can run `dcgmi diag -r 3` or `dcgmi diag -r 4` to run different levels of\ntest coverage and duration. If you identify that the issue is with the hardware,\nfile a case with [Cloud Customer Care](/support/docs).\n\n### Other common Xid error messages\n\nReset GPUs\n----------\n\nSome issues might require you to reset your GPUs. To reset GPUs,\ncomplete the following steps:\n\n- For N1, G2, and A2 VMs, reboot the VM by running `sudo reboot`.\n- For A3 and A4 VMs, run `sudo nvidia-smi --gpu-reset`.\n - For most Linux VMs, the `nvidia-smi` executable is located in the `/var/lib/nvidia/bin` directory.\n - For GKE nodes, the `nvidia-smi` executable is located in the `/home/kubernetes/bin/nvidia` directory.\n\nIf errors persist after resetting the GPU, you need to\n[delete](/compute/docs/instances/deleting-instance) and\n[recreate the VM](/compute/docs/gpus/create-vm-with-gpus).\n\nIf the error persists after a delete and recreate, file a case with\n[Cloud Customer Care](/support/docs) to move the VM into the\n[repair stage](/compute/docs/instances/instance-lifecycle).\n\nWhat's next\n-----------\n\nReview [GPU machine types](/compute/docs/gpus)."]]