Troubleshooting NVMe disks


This document lists errors that you might encounter when using disks with the nonvolatile memory express (NVMe) interface.

You can use the NVMe interface for Local SSDs and persistent disks. Only the latest generation VMs, such as M3 and Tau T2A, use the NVMe interface for persistent disks. Confidential VMs also use NVMe persistent disks. All other Compute Engine machine series use the SCSI disk interface for persistent disks.

I/O operation timeout error

If you encounter I/O timeout errors, I/O latency might be exceeding the default timeout for I/O operations submitted to NVMe devices.

Error message:

[1369407.045521] nvme nvme0: I/O 252 QID 2 timeout, aborting
[1369407.050941] nvme nvme0: I/O 253 QID 2 timeout, aborting
[1369407.056354] nvme nvme0: I/O 254 QID 2 timeout, aborting
[1369407.061766] nvme nvme0: I/O 255 QID 2 timeout, aborting
[1369407.067168] nvme nvme0: I/O 256 QID 2 timeout, aborting
[1369407.072583] nvme nvme0: I/O 257 QID 2 timeout, aborting
[1369407.077987] nvme nvme0: I/O 258 QID 2 timeout, aborting
[1369407.083395] nvme nvme0: I/O 259 QID 2 timeout, aborting
[1369407.088802] nvme nvme0: I/O 260 QID 2 timeout, aborting
...
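
To confirm that the kernel log contains messages of this form, you can filter it for the timeout pattern. The following is a minimal sketch, not an official tool: the function name `count_nvme_timeouts` is hypothetical, and the pattern assumes that the messages look like the sample above. It reads log text on standard input and prints each affected controller with its number of timeout messages.

```shell
#!/bin/sh
# Hypothetical helper: count NVMe I/O timeout messages per controller.
# Reads kernel log text on stdin, so it works with dmesg or a saved log file.
count_nvme_timeouts() {
    # Keep only lines such as: "nvme nvme0: I/O 252 QID 2 timeout, aborting"
    grep -E 'nvme nvme[0-9]+: I/O [0-9]+ QID [0-9]+ timeout' |
        # Reduce each matching line to the controller name, for example nvme0.
        sed -E 's|.*nvme (nvme[0-9]+): I/O.*|\1|' |
        # Count occurrences and print "controller count".
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (reading the kernel log might require root privileges):
#   dmesg | count_nvme_timeouts
```

A controller that appears with a high count is the one whose timeout setting to inspect in the resolution steps.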

Resolution:

To resolve this issue, increase the value of the timeout parameter.

  1. View the current value of the timeout parameter.

    1. Determine which NVMe controller is used by the persistent disk or Local SSD.

      ls -l /dev/disk/by-id
      
    2. Display the io_timeout setting for the disk. The Linux kernel reports this block queue attribute in milliseconds.

      cat /sys/class/nvme/CONTROLLER_ID/NAMESPACE/queue/io_timeout
      
      Replace the following:

      • CONTROLLER_ID: the ID of the NVMe disk controller, for example, nvme1
      • NAMESPACE: the namespace of the NVMe disk, for example, nvme1n1

      If only one disk uses NVMe, use the following command:

      cat /sys/class/nvme/nvme0/nvme0n1/queue/io_timeout
      

  2. To increase the timeout parameter for I/O operations submitted to NVMe devices, add the following line to the /lib/udev/rules.d/65-gce-disk-naming.rules file, and then restart the VM:

    KERNEL=="nvme*n*", ENV{DEVTYPE}=="disk", ATTRS{model}=="nvme_card-pd", ATTR{queue/io_timeout}="4294967295"
    
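
After the VM restarts, you can check whether the new value took effect on each disk. The following is a minimal sketch under stated assumptions: the function name `show_nvme_io_timeouts` and its optional directory argument are hypothetical, and the sysfs layout is assumed to match the paths used in step 1. It prints each NVMe namespace together with the raw io_timeout value that the kernel reports.

```shell
#!/bin/sh
# Hypothetical helper: print the io_timeout value of every NVMe namespace.
# The optional first argument overrides the sysfs directory (useful for testing);
# by default the function reads from /sys/class/nvme.
show_nvme_io_timeouts() {
    root="${1:-/sys/class/nvme}"
    for f in "$root"/nvme*/nvme*n*/queue/io_timeout; do
        # Skip the unexpanded glob when no NVMe namespaces exist.
        [ -r "$f" ] || continue
        # Derive the namespace name (for example, nvme0n1) from the path.
        ns=$(basename "$(dirname "$(dirname "$f")")")
        printf '%s %s\n' "$ns" "$(cat "$f")"
    done
}

# Usage:
#   show_nvme_io_timeouts
```

Disks that match the udev rule should report the value set in the rule. Disks that the rule does not match keep the kernel default.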

What's next?