Application resilience

Last reviewed 2024-04-25 UTC

Although Cloud Volumes Service is highly available, planned maintenance events (such as platform updates, service upgrades, and software upgrades) and unplanned component failures in the service can cause brief pauses in input and output (I/O) operations. The length of an I/O pause depends on the service type:

  • CVS-Performance service type: I/O pauses can range from a few seconds up to 30 seconds.

  • CVS service type: I/O pauses can range from a few seconds up to 120 seconds.

I/O pause behavior

Short I/O pauses are handled by the NFS or SMB client software inside your operating system. The client waits and retries without surfacing the issue to the application. These pauses are considered non-disruptive because the application sees only read and write operations with unusually high latency, not I/O errors.

For longer I/O pauses, the behavior depends on the NFS or SMB client of your operating system. Some cases are handled within the client, transparent to the application. In other cases, an I/O error is passed to the application to handle. The following sections discuss protocol-specific details.

NFSv3

All calls to an unavailable, hard-mounted NFSv3 share are blocked in the NFS client and wait until the NFS server responds again. From an application perspective, the I/O operation lags until it returns successfully. During I/O pauses, no I/O operation is ever lost and Cloud Volumes Service ensures data consistency.
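To illustrate how such a pause looks from the application side, the following Python sketch times a single synchronous write. It is only an illustration under stated assumptions: the function names and the 5-second threshold are hypothetical, and a temporary local directory stands in for the NFS mount point.

```python
import os
import tempfile
import time

# Hypothetical threshold: writes slower than this are treated as a suspected I/O pause.
PAUSE_THRESHOLD_SECONDS = 5.0


def timed_write(path: str, payload: bytes) -> float:
    """Write payload to path, force it to storage, and return the elapsed seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        # On a hard-mounted NFS share this call waits until the server responds again.
        os.fsync(f.fileno())
    return time.monotonic() - start


def is_suspected_pause(elapsed_seconds: float,
                       threshold: float = PAUSE_THRESHOLD_SECONDS) -> bool:
    """Return True if the measured latency exceeds the (assumed) threshold."""
    return elapsed_seconds > threshold


if __name__ == "__main__":
    # A temporary local directory stands in for the NFS mount point.
    with tempfile.TemporaryDirectory() as mount_point:
        elapsed = timed_write(os.path.join(mount_point, "probe.bin"), b"x" * 4096)
        print(f"write took {elapsed:.3f}s, suspected pause: {is_suspected_pause(elapsed)}")
```

The write either completes, possibly after a long delay, or keeps waiting. The sketch has no error-handling branch because a hard mount doesn't return an I/O error for an unavailable server.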

Use cluster software applications to monitor I/O operations

You can use cluster software applications such as Pacemaker to monitor the progress of I/O operations on the NFS share. If a configured timeout is exceeded, the software stops the process that is performing the I/O and moves it to a different server. If a process is stopped, its outstanding write operations don't finish. The new process on the other server isn't aware of the missing writes, which can result in data loss and corruption and require application recovery. It is therefore important to configure timeouts in the cluster software that are long enough to avoid unnecessary failovers. At minimum, you should configure timeouts greater than 30 seconds. NetApp recommends the following timeout configurations:

  • CVS-Performance service type: 60 seconds

  • CVS service type: 120 seconds
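As a rough sketch, a configured monitor timeout can be compared against the recommendations above before it is rolled out. The dictionary and function names below are hypothetical and not part of Pacemaker or Cloud Volumes Service. Treating the recommended values as lower bounds is an assumption of this example.

```python
# Recommended cluster monitor timeouts in seconds, taken from the list above.
RECOMMENDED_TIMEOUT_SECONDS = {
    "CVS-Performance": 60,
    "CVS": 120,
}


def meets_recommendation(service_type: str, configured_seconds: int) -> bool:
    """Return True if the configured timeout is at least the recommended value.

    Assumption: the recommendation is treated as a lower bound.
    Raises KeyError for an unknown service type.
    """
    return configured_seconds >= RECOMMENDED_TIMEOUT_SECONDS[service_type]


if __name__ == "__main__":
    for service_type, configured in [("CVS-Performance", 40), ("CVS", 120)]:
        ok = meets_recommendation(service_type, configured)
        print(f"{service_type}: {configured}s configured, meets recommendation: {ok}")
```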

NFSv4.x

The NFSv4 protocol inherits the resilience of NFSv3 and adds better lock-state recovery, which can aid in application failovers. A downside of lock-state recovery is that NFSv4.x servers have to wait for clients to reclaim their locks after a server restart. This wait adds 45 seconds to the restart, during which the server does not respond to read and write operations. NetApp recommends the following timeout configuration:

  • CVS-Performance service type: 105 seconds
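The 105-second figure is consistent with adding the 45-second lock reclaim wait to the 60-second NFSv3 recommendation for CVS-Performance. The source doesn't state this derivation, so the following arithmetic sketch is an interpretation, and the constant and function names are hypothetical.

```python
NFSV3_RECOMMENDED_SECONDS = 60  # CVS-Performance recommendation from the NFSv3 section
LOCK_RECLAIM_WAIT_SECONDS = 45  # additional server wait after a restart, as described above


def nfsv4_timeout(nfsv3_timeout: int = NFSV3_RECOMMENDED_SECONDS,
                  reclaim_wait: int = LOCK_RECLAIM_WAIT_SECONDS) -> int:
    """Return the NFSv3 timeout extended by the lock reclaim wait (assumed derivation)."""
    return nfsv3_timeout + reclaim_wait


if __name__ == "__main__":
    print(nfsv4_timeout())  # 60 + 45 = 105
```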

For more information on NFSv4.x lock state recovery, see the NFSv4.x RFC, section 9.6.2.

SMB

Unlike NFS, SMB uses sessions that can time out. In most cases, I/O pauses in Cloud Volumes Service stay below the session timeout.

Session timeouts

Session timeouts are defined at the client. The default timeout for Windows clients is 60 seconds. You can read the session timeout by running the Get-SmbClientConfiguration cmdlet and change it by running the Set-SmbClientConfiguration cmdlet with the SessionTimeout parameter.

If a session timeout occurs, the SMB session is broken and an I/O error is reported to the application performing the I/O. Applications respond differently to session timeouts. Some reconnect immediately when a user next accesses the SMB share, others need to handle the I/O error first and then reconnect and retry the failed I/O operation, and some fail. Consult your application vendor's documentation to learn how the application handles SMB timeouts and how it can operate resiliently on SMB shares.
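For applications you write yourself, the "handle the I/O error, then retry" pattern can be sketched as follows. This is a minimal illustration, not a recommended implementation: the function name, attempt count, and delay are hypothetical, and it assumes that the failed operation is safe to repeat (for example, a read or an idempotent write) and that the I/O error reaches Python as an OSError.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_io(operation: Callable[[], T],
             attempts: int = 3,
             delay_seconds: float = 2.0,
             sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation and retry it when it raises OSError.

    The last OSError is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise
            # Give the client time to re-establish the session before retrying.
            sleep(delay_seconds)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_read() -> str:
        # Simulates an operation that fails twice and then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise OSError("simulated session timeout")
        return "data"

    print(retry_io(flaky_read, delay_seconds=0.0), "after", calls["count"], "calls")
```

Retrying blindly is only safe for repeatable operations. An append or a partially completed write may need application-level recovery instead.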

Continuously Available (CA) shares are an SMB 3.x feature specifically designed to improve failover resilience for database-like applications. CVS-Performance supports CA shares for Microsoft SQL Server and FSLogix.

Failure recovery improves with every new SMB version. Cloud Volumes Service supports SMB 2.1, 3.0, and 3.1.1. If possible, use the latest supported SMB version. Windows 10, Windows Server 2016, and later versions support SMB 3.1.1, the latest version.

For older clients, see the SMB versions overview.