About maintenance

This page explains how Memorystore for Redis Cluster performs maintenance on instances. It also provides configuration recommendations that help your client applications take advantage of Memorystore for Redis Cluster's zero-downtime maintenance design. These recommendations apply to both highly available clusters and clusters without replicas. However, we strongly recommend the high availability configuration for all production use cases.

Memorystore for Redis Cluster routinely updates instances to ensure that the service is reliable, performant, secure, and up to date. These updates are called maintenance. Maintenance is fully managed by the service and is designed to cause zero downtime.

Maintenance typically falls into the following categories:

  • Memorystore features. To launch some features, Memorystore requires a maintenance update.
  • Operating system patches. We continuously monitor newly identified security vulnerabilities in the operating system. Upon discovery, we patch the operating system to protect you from new risks.
  • Database patches. Maintenance can include a Redis update to improve instance security, performance, and reliability characteristics beyond what OSS Redis provides.

Configure your client application

To configure your client application for the best performance and availability possible during maintenance, follow these steps:

  1. Use and configure your OSS Redis cluster client according to the guidance at Redis client best practices to make sure that scheduled maintenance doesn't impact your client application. Our recommended client configurations avoid connection resets through periodic inline topology refreshes and background connection rotations. A minimal configuration sketch follows this list.
  2. Test your client application with a series of update operations (such as scaling in or out, or changing replica counts) while running a representative workload on primary and replica nodes, and monitor for client impact. These updates exercise the client's inline topology refresh logic, the impact of full syncs, the discovery of new nodes, and the removal of existing nodes. Testing helps make sure that your OSS Redis cluster client is configured correctly to avoid any negative impact on your application.
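For example, the following minimal sketch configures redis-py's RedisCluster client along these lines. The client library choice, endpoint address, and option values are assumptions for illustration, not Memorystore requirements:

```python
from redis.cluster import RedisCluster

# Bootstrap from the cluster's discovery endpoint (hypothetical address).
client = RedisCluster(
    host="10.0.0.3",                 # discovery endpoint IP
    port=6379,                       # discovery endpoint port
    read_from_replicas=True,         # spread reads across replica nodes
    socket_connect_timeout=2,        # fail fast if a node went away
    socket_timeout=2,
    reinitialize_steps=1,            # re-fetch the slot map promptly after MOVED replies
    cluster_error_retry_attempts=3,  # retry transient cluster errors before failing
)

client.set("greeting", "hello")
print(client.get("greeting"))
```

A low reinitialize_steps value makes the client re-fetch the cluster topology soon after it sees a redirection, which shortens the window during which requests are sent to moved nodes.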

Cluster endpoints

A cluster discovery endpoint is a combination of an IP address and port number. The discovery endpoint is used to retrieve your instance's cluster topology, both to bootstrap OSS Redis cluster clients and to keep them updated in steady state. The resulting cluster topology provides the Redis node endpoints (IP and port combinations) that the Redis cluster client caches in memory.
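For example, you can inspect the topology that a cluster client caches by querying the discovery endpoint directly. The following is a minimal sketch that assumes redis-py and a hypothetical endpoint address; CLUSTER SHARDS is the OSS Redis topology command in Redis 7 and later:

```python
from redis import Redis

# Connect to the discovery endpoint (hypothetical address) and print the
# shard layout a cluster client would cache: slot ranges plus the node
# endpoints (IP and port) serving each shard.
r = Redis(host="10.0.0.3", port=6379)
for shard in r.execute_command("CLUSTER SHARDS"):
    print(shard)
```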

Cluster endpoints have the following behavior:

  1. Your cluster's discovery endpoint remains unchanged throughout the lifecycle of the cluster instance, including during maintenance and any other action you take, such as scaling in or out or changing replica counts.

  2. Redis node endpoints can change and can be recycled as nodes are added and removed over time. Ideally, use a Redis cluster client that handles these changes automatically through topology refreshes and redirections. For examples of Redis cluster clients, see Client library code samples. Your application shouldn't assume that node endpoints remain unchanged for a given cluster.

Scheduled maintenance

Memorystore for Redis Cluster uses a gradual deployment and a create-before-destroy lifecycle strategy to avoid any downtime impact from scheduled maintenance on your Redis clusters. Zero-downtime maintenance is achieved by using the request redirection capabilities of the OSS Redis cluster protocol (illustrated after the following list) in conjunction with the following Memorystore mechanisms:

  1. Coordinated failover without any loss of data
  2. Graceful node removal to enable clients to catch up to the cluster topology updates without any availability impact
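The redirection itself is standard OSS Redis cluster behavior: a node that no longer serves a key's hash slot answers with a MOVED reply that points at the node that does, and the client follows it. The following sketch shows that flow with redis-py and hypothetical addresses; a cluster-aware client such as redis-py's RedisCluster follows redirects and refreshes its topology automatically:

```python
from redis import Redis
from redis.exceptions import MovedError

# Write through a node that, after maintenance, no longer owns the key's
# hash slot (addresses are hypothetical).
node = Redis(host="10.0.0.4", port=6379)
try:
    node.set("user:42", "alice")
except MovedError as moved:
    # The node answers "MOVED <slot> <host>:<port>". Follow the redirect
    # to the new owner; a cluster client would also refresh its topology.
    Redis(host=moved.host, port=moved.port).set("user:42", "alice")
```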

The service behavior described in the following sections applies to scheduled maintenance only. For information about the impact of unplanned events such as hardware failures, see Client behavior during an unplanned failover.

Gradual deployments strategy

Memorystore for Redis Cluster deployments are performed with progressively increasing scope, at a rate that allows failures to be detected early enough to mitigate impact and establish confidence in stability. Bake times (the period during which an update is applied and monitored before it's considered successful) are built in across the Memorystore fleet of clusters at the service scale. Bake times are also built in within cluster instances, across the zones of a region (multiple fault domains), to reduce the scope of any impact.

For a cluster instance configured for high availability, at most one fault domain (zone) is updated at any given time, which makes sure that each cluster shard, including both the primary and its replicas, remains highly available throughout the update. Furthermore, only a few Redis nodes are updated at any given time. Updates use a create-before-destroy lifecycle mechanism to maximize cluster stability. This strategy provides the most benefit when updating a cluster with many shards, because applying updates to only a small part of the overall keyspace at any given time maximizes data availability.

Create-before-destroy lifecycle strategy

A Redis cluster has multiple shards. Each shard has one primary node and zero or more replica nodes. Memorystore uses the following process to update any existing primary or replica Redis node in a shard:

  1. Memorystore for Redis Cluster first adds a completely new replica with the latest software update to the shard. Memorystore creates an entirely new node, instead of updating an existing node, to make sure that your provisioned capacity is retained in the event of an unexpected bootstrap failure.
  2. If the node to be updated is a primary node, Memorystore first demotes it to a replica by using a coordinated failover before removing it.
  3. Next, Memorystore removes the replica that runs the earlier software.
  4. Memorystore repeats this process for each node in the cluster.

The create-before-destroy strategy helps retain the provisioned capacity of the cluster, unlike a typical rolling deployment, which updates nodes in place and causes an availability outage (and sometimes data loss) for the client application. For shards without replicas, Memorystore for Redis Cluster still provisions a new replica first, coordinates the failover, and then removes the existing primary node of the shard.
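Memorystore performs all of these steps internally; you don't run them yourself. Purely as a conceptual illustration, the lifecycle maps roughly onto standard OSS Redis cluster commands, shown here with redis-py against hypothetical node addresses and placeholder node IDs:

```python
from redis import Redis

# Conceptual mapping only: Memorystore automates this lifecycle for you.
new_replica = Redis(host="10.0.0.9", port=6379)  # node on the new software
other_node = Redis(host="10.0.0.5", port=6379)   # any surviving cluster node

# Step 1: join the new node to the cluster and attach it to the shard.
new_replica.execute_command("CLUSTER MEET", "10.0.0.4", 6379)
new_replica.execute_command("CLUSTER REPLICATE", "<primary-node-id>")

# Step 2: a coordinated failover promotes the new replica without data loss.
new_replica.execute_command("CLUSTER FAILOVER")

# Step 3: after the drain period, the remaining nodes forget the old node.
other_node.execute_command("CLUSTER FORGET", "<old-node-id>")
```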

Step 1: Add Redis replica

The first step of the create-before-destroy mechanism is to add a replica node with the latest software, using the OSS Redis full sync mechanism to copy the data from the primary node to the new replica. Redis does this by forking a child process and using diskless replication to bootstrap the replica.
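When testing, you can watch a full sync complete from the client side by using standard Redis replication statistics. A minimal sketch, assuming redis-py and a hypothetical replica node endpoint:

```python
from redis import Redis

# Poll a replica's replication stats (hypothetical address) to see when
# its full sync completes and steady-state replication begins.
replica = Redis(host="10.0.0.9", port=6379)
info = replica.info("replication")
if info.get("master_sync_in_progress") == 0 and info.get("master_link_status") == "up":
    print("full sync finished; replica is in steady state")
```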

You can best take advantage of the cluster's horizontal scale architecture by provisioning a higher number of shards, which reduces the keyspace size within each node. Having a smaller dataset per node reduces the fork latency impact of a full sync operation, and it also speeds up copying data across nodes.

Step 2: Coordinated primary failover

If the Redis node that needs to be updated is a primary node, Memorystore first executes a coordinated failover to the newly added replica node, and then proceeds with the node removal. During the coordinated failover, the client and the Redis nodes work together and use the following strategies to avoid downtime for the application:

  1. Incoming client requests are temporarily blocked on the primary node, which provides a window to make sure that the existing replica is fully synced with the primary.
  2. The replica completes the election process to take over the primary role.
  3. The previous primary node, now a replica, unblocks the existing requests and redirects them to the newly elected primary using the OSS Redis cluster protocol. Any new requests sent to the previous replica node continue to be redirected to the new primary node.
  4. Your cluster-aware Redis client refreshes its in-memory topology, learns the address of the new primary endpoint, and no longer requires redirections.

Coordinated failovers typically take tens of milliseconds. However, in-flight data that is still pending replication to replicas, as well as your total cluster size, can increase failover latency. Cluster size affects how quickly primary nodes converge on electing the new primary.
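To measure what your application actually observes during a coordinated failover, you can run a lightweight write probe while testing update operations. A minimal sketch, assuming redis-py and a hypothetical discovery endpoint:

```python
import time

from redis.cluster import RedisCluster

# Continuously write a probe key and log pauses or errors so you can
# measure the client-observed impact of a coordinated failover.
client = RedisCluster(host="10.0.0.3", port=6379)
while True:
    start = time.monotonic()
    try:
        client.set("probe", "1")
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > 100:
            print(f"slow write: {elapsed_ms:.0f} ms")
    except Exception as exc:
        print(f"error during failover window: {exc!r}")
    time.sleep(0.05)
```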

Step 3: Remove Redis replica

The last step of the create-before-destroy mechanism is to remove the replica node that runs the earlier software. An abrupt node removal would impact client applications because clients cache the endpoint information and the cluster topology. Memorystore for Redis Cluster therefore removes a Redis replica gracefully, which lets client applications refresh their topology before the node is shut down. The advertised topology lets clients learn about the new replica and also forget the node to be removed ahead of time.

The replica node that runs the earlier software is kept around for a drain period, typically on the order of minutes, during which it redirects incoming read requests to the primary node of its shard. The drain period allows the OSS Redis cluster client to refresh the cluster topology and learn about the new replica endpoints. If the client tries to reach a removed node after the drain period, the attempt fails, which in turn triggers a cluster topology refresh on the client so that it learns about the replica change. Subsequent refreshes of the cluster topology no longer include the removed replica node.
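During such a test, you can also confirm that your client's cached topology converges after the drain period, for example by listing the nodes that the client currently knows about. A sketch that assumes redis-py's RedisCluster and a hypothetical discovery endpoint:

```python
from redis.cluster import RedisCluster

# List the nodes in the client's cached topology; after the drain period
# the removed replica should no longer appear.
client = RedisCluster(host="10.0.0.3", port=6379)
for node in client.get_nodes():
    print(node.host, node.port, node.server_type)
```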