About maintenance

This page explains how Memorystore for Valkey performs maintenance on instances. It also provides information and configuration recommendations that help your client applications take advantage of Memorystore for Valkey's zero-downtime maintenance design. These recommendations apply to both highly available instances and instances without replicas. However, we strongly recommend the high availability configuration for all production use cases.

Memorystore for Valkey routinely updates instances to ensure that the service is reliable, performant, secure, and up-to-date. These updates are called maintenance. Maintenance is fully managed by the service and is designed to have zero downtime impact.

Maintenance typically falls into the following categories:

  • Memorystore features. To launch some features, Memorystore requires a maintenance update.
  • Operating system patches. We continuously monitor newly identified security vulnerabilities in the operating system. Upon discovery, we patch the operating system to protect you from new risks.
  • Database patches. Maintenance can include a Valkey update to improve instance security, performance, and reliability characteristics beyond what OSS Valkey provides.

Configure your client application

To configure your client application for the best performance and availability possible during maintenance, follow these steps:

  1. Use and configure your third-party client according to the guidance at Client best practices to make sure that scheduled maintenance doesn't impact the client application. Our recommended client configurations can avoid connection resets through periodic inline topology refreshes and background connection rotations. For one way to configure these refreshes, see the sketch after this list.
  2. Test your client application with a series of update operations (such as scaling in or out, or changing the replica count) while running a representative workload on primary and replica nodes, and monitor for client impact. These updates exercise the clients' inline topology refresh logic, the impact of full syncs, new node discovery, and the ability to remove existing nodes. Testing helps make sure that the third-party client is configured correctly to avoid any negative impact on your application.
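
For example, the following sketch shows one way to enable periodic and adaptive topology refreshes with the open-source Lettuce client for Java. The endpoint address, refresh interval, and key are illustrative placeholders rather than recommended values; use the settings described in Client best practices for your client and workload.

    import java.time.Duration;

    import io.lettuce.core.RedisURI;
    import io.lettuce.core.cluster.ClusterClientOptions;
    import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
    import io.lettuce.core.cluster.RedisClusterClient;
    import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

    public class MaintenanceAwareClient {

        public static void main(String[] args) {
            // Discovery endpoint of the instance (placeholder address); the PSC
            // endpoint stays stable across maintenance events.
            RedisURI uri = RedisURI.Builder.redis("10.0.0.3", 6379).build();

            // Refresh the topology periodically and whenever the client observes
            // redirections or persistent connection failures, so that nodes added
            // during maintenance are discovered and removed nodes are forgotten
            // without connection resets.
            ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(60))  // illustrative interval
                .enableAllAdaptiveRefreshTriggers()
                .build();

            RedisClusterClient client = RedisClusterClient.create(uri);
            client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

            try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
                connection.sync().set("greeting", "hello");  // placeholder key
                System.out.println(connection.sync().get("greeting"));
            } finally {
                client.shutdown();
            }
        }
    }

Other cluster-aware clients expose similar settings; the important property is that the client refreshes its topology both periodically and in response to redirections or connection failures.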

Scheduled maintenance

Memorystore for Valkey uses a gradual deployment strategy and a create-before-destroy lifecycle to avoid any downtime impact from scheduled maintenance on your Valkey instances. Zero-downtime maintenance is achieved by using the request redirection capabilities of the OSS Valkey cluster protocol in conjunction with the following Memorystore mechanisms:

  1. Coordinated failover without any loss of data
  2. Graceful node removal to enable clients to catch up to the node topology updates without any availability impact
  3. Stable PSC endpoints that aren't affected by maintenance. For more information about PSC endpoints, see Instance endpoints.
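
To illustrate how the cluster protocol's request redirection looks from the client's point of view, the following sketch reuses the Lettuce client from the previous example to issue commands through the instance's stable discovery endpoint and then log the topology the client currently knows about. The endpoint address, key, and maxRedirects value are illustrative assumptions.

    import io.lettuce.core.RedisURI;
    import io.lettuce.core.cluster.ClusterClientOptions;
    import io.lettuce.core.cluster.RedisClusterClient;
    import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
    import io.lettuce.core.cluster.models.partitions.RedisClusterNode;

    public class TopologyInspection {

        public static void main(String[] args) {
            // The PSC discovery endpoint stays stable across maintenance events
            // (placeholder address for illustration).
            RedisClusterClient client = RedisClusterClient.create(
                RedisURI.Builder.redis("10.0.0.3", 6379).build());

            // MOVED and ASK redirections returned by Valkey nodes are followed
            // transparently, up to the configured limit.
            client.setOptions(ClusterClientOptions.builder()
                .maxRedirects(5)  // illustrative value
                .build());

            try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
                // Commands are routed by key slot; a node that no longer serves the
                // slot replies with a redirection that the client follows.
                connection.sync().set("user:42", "active");

                // Log the node topology the client currently knows about.
                for (RedisClusterNode node : client.getPartitions()) {
                    System.out.printf("%s %s %s%n",
                        node.getNodeId(), node.getUri(), node.getRole());
                }
            } finally {
                client.shutdown();
            }
        }
    }

Logging the topology while you run the update tests described earlier can help confirm that new nodes are discovered and that removed nodes disappear after a refresh.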

The service behavior described in the following sections applies to scheduled maintenance only. For information about the impact of unplanned events such as hardware failures, see Client behavior during an unplanned failover.

Default maintenance windows

By default, Memorystore updates your instance during the following windows, according to your instance's time zone:

  • Weekday window (Monday to Friday): 10 PM to 6 AM

  • Weekend window: Friday, 10 PM to Monday, 6 AM

Gradual deployments strategy

Memorystore for Valkey deployments are performed with progressively increasing scope, and at a rate that allows failures to be detected early enough to mitigate impact and establish confidence in stability. Bake times (the time during which an update is applied and monitored before it's considered successful and the rollout proceeds) are built in across the Memorystore fleet of instances at the service level. Bake times are also built in within each instance, across the zones of a region (multiple fault domains), to reduce the scope of any impact.

For instances configured for high availability, at most one fault domain (zone) is updated at any given time, which ensures that each instance shard, including both the primary and its replicas, remains highly available throughout the update. Furthermore, only a few Valkey nodes are updated at any given time. Updates use a create-before-destroy lifecycle mechanism to maximize instance stability. This strategy provides the most benefit when updating an instance with many shards, because applying updates to only a small part of the overall keyspace at any given time maximizes data availability.

Create-before-destroy lifecycle strategy

A Valkey instance has multiple shards. Each shard has one primary node and zero or more replica nodes. Memorystore uses the following process to update any existing primary or replica Valkey node in a shard:

  1. Memorystore for Valkey first adds a completely new replica with the latest software update to the shard. Memorystore creates an entirely new node, instead of updating an existing node, to ensure your provisioned capacity is retained in the event of an unexpected bootstrap failure.
  2. If the node to be updated is a primary node, Memorystore first converts it to a replica by using a coordinated failover before removing it.
  3. Next, Memorystore removes the replica that runs the earlier software.
  4. The process is repeated for each node in the instance.

The create-before-destroy strategy helps retain the provisioned capacity of the instance, unlike a typical rolling deployment, which updates nodes in place and results in an availability outage (and sometimes data loss) for the client application. For shards without replicas, Memorystore for Valkey still provisions a new replica first, coordinates the failover, and then removes the existing primary node of the shard.
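
The following sketch is a purely illustrative model of this order of operations, written in Java for consistency with the other examples on this page. The Node type and updateNode method are invented for the sketch and don't correspond to any Memorystore or Valkey API.

    import java.util.ArrayList;
    import java.util.List;

    // Purely illustrative model of the order of operations; the Node type and
    // updateNode method are invented for this sketch and don't correspond to
    // any Memorystore or Valkey API.
    public class CreateBeforeDestroySketch {

        record Node(String name, String version, boolean primary) {}

        static void updateNode(List<Node> shard, Node outgoing, String newVersion) {
            // Step 1: create a brand-new replica on the new software first, so
            // the shard never drops below its provisioned capacity.
            Node incoming = new Node(outgoing.name() + "-new", newVersion, false);
            shard.add(incoming);

            // Step 2: if the outgoing node is a primary, run a coordinated
            // failover so the new node takes over the primary role and the
            // outgoing node becomes a replica.
            if (outgoing.primary()) {
                shard.set(shard.indexOf(incoming), new Node(incoming.name(), newVersion, true));
                shard.set(shard.indexOf(outgoing), new Node(outgoing.name(), outgoing.version(), false));
            }

            // Step 3: gracefully remove the node on the earlier software, after
            // clients have had the drain period to refresh their topology.
            shard.removeIf(node -> node.name().equals(outgoing.name()));
        }

        public static void main(String[] args) {
            List<Node> shard = new ArrayList<>(List.of(
                new Node("primary-a", "v1", true),
                new Node("replica-a", "v1", false)));

            // Update every node that still runs the earlier software, one at a time.
            for (Node outgoing : List.copyOf(shard)) {
                updateNode(shard, outgoing, "v2");
            }
            shard.forEach(System.out::println);
        }
    }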

Step 1: Add Valkey replica

The first step of the create-before-destroy mechanism is to add a replica node that runs the latest software, using the OSS Valkey full sync mechanism to copy the data from the primary node to the new replica. The primary does this by forking a child process and using diskless replication to bootstrap the replica.

To take the best advantage of the instance's horizontal scale architecture, provision a higher number of shards to reduce the keyspace size within each node. A smaller dataset per node reduces the fork latency impact of a full sync operation and speeds up copying data across nodes.

Step 2: Coordinated primary failover

If the Valkey node that needs to be updated is a primary node, Memorystore first executes a coordinated failover to the newly added replica node, and then proceeds with the node removal. During the coordinated failover, the client and the Valkey nodes work together and use the following strategies to avoid downtime for the application:

  1. Incoming client requests are temporarily blocked on the primary node, which provides a window to ensure that the existing replica is fully synced with the primary.
  2. The replica completes the election process to take over the primary role.
  3. The previous primary node, now a replica, unblocks the held requests and redirects them to the newly elected primary by using the OSS Valkey cluster protocol. Any new requests sent to this node, which is now a replica, continue to be redirected to the new primary node.
  4. Your cluster-aware Valkey client refreshes its in-memory topology, learns the address of the new primary endpoint, and no longer requires redirections.

Coordinated failovers typically take tens of milliseconds. However, in-flight data that is pending flush to the replicas, and the total size of your instance, can increase failover latency. Instance size affects how quickly the primary nodes converge on electing the new primary, which in turn affects failover duration.
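
Because requests are briefly held and then redirected during a coordinated failover, it can help to pair the topology refresh configuration shown earlier with a bounded command timeout and an application-level retry. The following Lettuce sketch shows one way to do this; the endpoint, key, timeout, retry count, and backoff values are illustrative assumptions rather than recommendations.

    import java.time.Duration;

    import io.lettuce.core.RedisCommandTimeoutException;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.TimeoutOptions;
    import io.lettuce.core.cluster.ClusterClientOptions;
    import io.lettuce.core.cluster.RedisClusterClient;
    import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
    import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

    public class FailoverTolerantWrites {

        static void configure(RedisClusterClient client) {
            // Bound each command so a request held during the short failover
            // window fails fast instead of hanging (illustrative timeout).
            client.setOptions(ClusterClientOptions.builder()
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(2)))
                .build());
        }

        static void setWithRetry(RedisAdvancedClusterCommands<String, String> commands,
                                 String key, String value) throws InterruptedException {
            int attempts = 0;
            while (true) {
                try {
                    commands.set(key, value);
                    return;
                } catch (RedisCommandTimeoutException e) {
                    // Coordinated failovers typically complete within tens of
                    // milliseconds; a couple of short retries rides them out.
                    if (++attempts >= 3) {
                        throw e;
                    }
                    Thread.sleep(100L * attempts);  // illustrative backoff
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            RedisClusterClient client = RedisClusterClient.create(
                RedisURI.Builder.redis("10.0.0.3", 6379).build());  // placeholder endpoint
            configure(client);
            try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
                setWithRetry(connection.sync(), "user:42", "active");  // placeholder key
            } finally {
                client.shutdown();
            }
        }
    }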

Step 3: Remove Valkey replica

The last step of the create-before-destroy mechanism is to remove the replica node that runs the earlier software. An abrupt node removal would impact client applications because clients cache the endpoint information and the instance topology. Memorystore for Valkey removes a replica gracefully so that client applications can refresh their topology before the node is shut down. The advertised topology is adjusted ahead of time so that clients learn about the new replica and forget about the node that is about to be removed.

The replica node that runs the earlier software is kept for a drain period, typically on the order of minutes, during which it redirects incoming read requests to the primary node of its shard. The drain period lets the third-party client refresh the node topology and learn about the new replica endpoints. If a client tries to reach a removed node after the drain period, the attempt fails, which in turn triggers a node topology refresh on that client so it learns about the replica change. Subsequent topology refreshes no longer include the removed replica node.
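
If your application reads from replicas, a cluster-aware client configured as in the earlier sketches rides out the drain period: reads sent to the draining replica are redirected to the shard's primary, as described above, and the next topology refresh drops the removed node. The following Lettuce sketch shows a replica-preferred read path; the endpoint and key are placeholders.

    import io.lettuce.core.ReadFrom;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.cluster.RedisClusterClient;
    import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

    public class ReplicaReads {

        public static void main(String[] args) {
            RedisClusterClient client = RedisClusterClient.create(
                RedisURI.Builder.redis("10.0.0.3", 6379).build());  // placeholder endpoint

            try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
                // Prefer replicas for reads and fall back to the primary when no
                // replica is available; during a drain period, redirected reads
                // land on the shard's primary until the topology is refreshed.
                connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);

                System.out.println(connection.sync().get("user:42"));  // placeholder key
            } finally {
                client.shutdown();
            }
        }
    }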