About maintenance

This page explains how Memorystore for Redis Cluster performs maintenance on instances. It also provides configuration recommendations that help your client applications take advantage of Memorystore for Redis Cluster's zero-downtime maintenance design. These recommendations apply to both highly available clusters and clusters without replicas. However, we strongly recommend the high availability configuration for all production use cases.

Memorystore for Redis Cluster routinely updates instances to ensure that the service is reliable, performant, secure, and up to date. These updates are called maintenance. Maintenance is fully managed by the service and is designed to cause zero downtime.

Maintenance typically falls into the following categories:

  • Memorystore features. To launch some features, Memorystore requires a maintenance update.
  • Operating system patches. We continuously monitor newly identified security vulnerabilities in the operating system. Upon discovery, we patch the operating system to protect you from new risks.
  • Database patches. Maintenance can include a Redis update to improve instance security, performance, and reliability characteristics beyond what OSS Redis provides.

Configure your client application

To configure your client application for the best performance and availability possible during maintenance, follow these steps:

  1. Use and configure your OSS Redis cluster client according to the guidance at Redis client best practices to make sure that scheduled maintenance doesn't impact your client application. Our recommended client configurations avoid connection resets through periodic inline topology refreshes and background connection rotations. A minimal configuration sketch follows this list.
  2. Test your client application with a series of update operations (such as scaling in or out, or changing replica counts) while running a representative workload on primary and replica nodes, and monitor for client impact. These updates exercise the client's inline topology refresh logic, the impact of full syncs, the discovery of new nodes, and the removal of existing nodes. Testing helps make sure that your OSS Redis cluster client is configured correctly to avoid any negative impact on your application.
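For example, the following minimal sketch configures redis-py's RedisCluster client along these lines. The client library choice, endpoint address, and option values are assumptions for illustration, not Memorystore requirements:

```python
from redis.cluster import RedisCluster

# Bootstrap from the cluster's discovery endpoint (hypothetical address).
client = RedisCluster(
    host="10.0.0.3",                 # discovery endpoint IP
    port=6379,                       # discovery endpoint port
    read_from_replicas=True,         # spread reads across replica nodes
    socket_connect_timeout=2,        # fail fast if a node went away
    socket_timeout=2,
    reinitialize_steps=1,            # re-fetch the slot map promptly after MOVED replies
    cluster_error_retry_attempts=3,  # retry transient cluster errors before failing
)

client.set("greeting", "hello")
print(client.get("greeting"))
```

A low reinitialize_steps value makes the client re-fetch the cluster topology soon after it sees a redirection, which shortens the window during which requests are sent to moved nodes.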

Cluster endpoints

A cluster discovery endpoint is a combination of an IP address and port number. The discovery endpoint is used to retrieve your instance's cluster topology, both to bootstrap OSS Redis cluster clients and to keep them updated in steady state. The resulting cluster topology provides the Redis node endpoints (IP and port combinations) that the Redis cluster client caches in memory.
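For example, you can inspect the topology that a cluster client caches by querying the discovery endpoint directly. The following is a minimal sketch that assumes redis-py and a hypothetical endpoint address; CLUSTER SHARDS is the OSS Redis topology command in Redis 7 and later:

```python
from redis import Redis

# Connect to the discovery endpoint (hypothetical address) and print the
# shard layout a cluster client would cache: slot ranges plus the node
# endpoints (IP and port) serving each shard.
r = Redis(host="10.0.0.3", port=6379)
for shard in r.execute_command("CLUSTER SHARDS"):
    print(shard)
```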

Cluster endpoints have the following behavior:

  1. Your cluster's discovery endpoint remains unchanged throughout the lifecycle of the cluster instance, including during maintenance and any other action you take, such as scaling in or out or changing replica counts.

  2. Redis node endpoints can change and can be recycled as nodes are added and removed over time. Ideally, use a Redis cluster client that handles these changes automatically through topology refreshes and redirections. For examples of Redis cluster clients, see Client library code samples. Your application shouldn't assume that node endpoints remain unchanged for a given cluster.

Scheduled maintenance

Memorystore for Redis Cluster uses a gradual deployment and a create-before-destroy lifecycle strategy to avoid any downtime impact from scheduled maintenance on your Redis clusters. Zero-downtime maintenance is achieved by using the request redirection capabilities of the OSS Redis cluster protocol (illustrated after the following list) in conjunction with the following Memorystore mechanisms:

  1. Coordinated failover without any loss of data
  2. Graceful node removal to enable clients to catch up to the cluster topology updates without any availability impact
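The redirection itself is standard OSS Redis cluster behavior: a node that no longer serves a key's hash slot answers with a MOVED reply that points at the node that does, and the client follows it. The following sketch shows that flow with redis-py and hypothetical addresses; a cluster-aware client such as redis-py's RedisCluster follows redirects and refreshes its topology automatically:

```python
from redis import Redis
from redis.exceptions import MovedError

# Write through a node that, after maintenance, no longer owns the key's
# hash slot (addresses are hypothetical).
node = Redis(host="10.0.0.4", port=6379)
try:
    node.set("user:42", "alice")
except MovedError as moved:
    # The node answers "MOVED <slot> <host>:<port>". Follow the redirect
    # to the new owner; a cluster client would also refresh its topology.
    Redis(host=moved.host, port=moved.port).set("user:42", "alice")
```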

The service behavior described in the following sections applies to scheduled maintenance only. For information about the impact of unplanned events such as hardware failures, see Client behavior during an unplanned failover.

Gradual deployments strategy

Memorystore for Redis Cluster deployments are performed with progressively increasing scope, at a rate that allows failures to be detected early enough to mitigate impact and establish confidence in stability. Bake times (the period during which an update is applied and monitored before it's considered successful) are built in across the Memorystore fleet of clusters at the service scale. Bake times are also built in within cluster instances, across the zones of a region (multiple fault domains), to reduce the scope of any impact.

For a cluster instance configured for high availability, at most one fault domain (zone) is updated at any given time, which makes sure that each cluster shard, including both the primary and its replicas, remains highly available throughout the update. Furthermore, only a few Redis nodes are updated at any given time. Updates use a create-before-destroy lifecycle mechanism to maximize cluster stability. This strategy provides the most benefit when updating a cluster with many shards, because applying updates to only a small part of the overall keyspace at any given time maximizes data availability.

Create-before-destroy lifecycle strategy

A Redis cluster has multiple shards. Each shard has one primary node and zero or more replica nodes. Memorystore uses the following process to update any existing primary or replica Redis node in a shard:

  1. Memorystore for Redis Cluster first adds a completely new replica with the latest software update to the shard. Memorystore creates an entirely new node, instead of updating an existing node, to make sure that your provisioned capacity is retained in the event of an unexpected bootstrap failure.
  2. If the node to be updated is a primary node, Memorystore first demotes it to a replica by using a coordinated failover before removing it.
  3. Next, Memorystore removes the replica that runs the earlier software.
  4. Memorystore repeats this process for each node in the cluster.

The create-before-destroy strategy helps retain the provisioned capacity of the cluster, unlike a typical rolling deployment, which updates nodes in place and causes an availability outage (and sometimes data loss) for the client application. For shards without replicas, Memorystore for Redis Cluster still provisions a new replica first, coordinates the failover, and then removes the existing primary node of the shard.
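Memorystore performs all of these steps internally; you don't run them yourself. Purely as a conceptual illustration, the lifecycle maps roughly onto standard OSS Redis cluster commands, shown here with redis-py against hypothetical node addresses and placeholder node IDs:

```python
from redis import Redis

# Conceptual mapping only: Memorystore automates this lifecycle for you.
new_replica = Redis(host="10.0.0.9", port=6379)  # node on the new software
other_node = Redis(host="10.0.0.5", port=6379)   # any surviving cluster node

# Step 1: join the new node to the cluster and attach it to the shard.
new_replica.execute_command("CLUSTER MEET", "10.0.0.4", 6379)
new_replica.execute_command("CLUSTER REPLICATE", "<primary-node-id>")

# Step 2: a coordinated failover promotes the new replica without data loss.
new_replica.execute_command("CLUSTER FAILOVER")

# Step 3: after the drain period, the remaining nodes forget the old node.
other_node.execute_command("CLUSTER FORGET", "<old-node-id>")
```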

Step 1: Add Redis replica

The first step of the create-before-destroy mechanism is to add a replica node with the latest software, using the OSS Redis full sync mechanism to copy the data from the primary node to the new replica. Redis does this by forking a child process and using diskless replication to bootstrap the replica.
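When testing, you can watch a full sync complete from the client side by using standard Redis replication statistics. A minimal sketch, assuming redis-py and a hypothetical replica node endpoint:

```python
from redis import Redis

# Poll a replica's replication stats (hypothetical address) to see when
# its full sync completes and steady-state replication begins.
replica = Redis(host="10.0.0.9", port=6379)
info = replica.info("replication")
if info.get("master_sync_in_progress") == 0 and info.get("master_link_status") == "up":
    print("full sync finished; replica is in steady state")
```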

You can best take advantage of the cluster's horizontal scale architecture by provisioning a higher number of shards, which reduces the keyspace size within each node. Having a smaller dataset per node reduces the fork latency impact of a full sync operation, and it also speeds up copying data across nodes.

Step 2: Coordinated primary failover

If the Redis node that needs to be updated is a primary node, Memorystore first executes a coordinated failover to the newly added replica node, and then proceeds with the node removal. During the coordinated failover, the client and the Redis nodes work together and use the following strategies to avoid downtime for the application:

  1. Incoming client requests are temporarily blocked on the primary node, which provides a window to make sure that the existing replica is fully synced with the primary.
  2. The replica completes the election process to take over the primary role.
  3. The previous primary node, now a replica, unblocks the existing requests and redirects them to the newly elected primary using the OSS Redis cluster protocol. Any new requests sent to the previous replica node continue to be redirected to the new primary node.
  4. Your cluster-aware Redis client refreshes its in-memory topology, learns the address of the new primary endpoint, and no longer requires redirections.

Coordinated failovers typically take tens of milliseconds. However, in-flight data that is still pending replication to replicas, as well as your total cluster size, can increase failover latency. Cluster size affects how quickly primary nodes converge on electing the new primary.
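To measure what your application actually observes during a coordinated failover, you can run a lightweight write probe while testing update operations. A minimal sketch, assuming redis-py and a hypothetical discovery endpoint:

```python
import time

from redis.cluster import RedisCluster

# Continuously write a probe key and log pauses or errors so you can
# measure the client-observed impact of a coordinated failover.
client = RedisCluster(host="10.0.0.3", port=6379)
while True:
    start = time.monotonic()
    try:
        client.set("probe", "1")
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > 100:
            print(f"slow write: {elapsed_ms:.0f} ms")
    except Exception as exc:
        print(f"error during failover window: {exc!r}")
    time.sleep(0.05)
```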

Step 3: Remove Redis replica

The last step of the create-before-destroy mechanism is to remove the replica node that runs the earlier software. An abrupt node removal would impact client applications because clients cache the endpoint information and the cluster topology. Memorystore for Redis Cluster therefore removes a Redis replica gracefully, which lets client applications refresh their topology before the node is shut down. The advertised topology lets clients learn about the new replica and also forget the node to be removed ahead of time.

The replica node that runs the earlier software is kept around for a drain period, typically on the order of minutes, during which it redirects incoming read requests to the primary node of its shard. The drain period allows the OSS Redis cluster client to refresh the cluster topology and learn about the new replica endpoints. If the client tries to reach a removed node after the drain period, the attempt fails, which in turn triggers a cluster topology refresh on the client so that it learns about the replica change. Subsequent refreshes of the cluster topology no longer include the removed replica node.
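During such a test, you can also confirm that your client's cached topology converges after the drain period, for example by listing the nodes that the client currently knows about. A sketch that assumes redis-py's RedisCluster and a hypothetical discovery endpoint:

```python
from redis.cluster import RedisCluster

# List the nodes in the client's cached topology; after the drain period
# the removed replica should no longer appear.
client = RedisCluster(host="10.0.0.3", port=6379)
for node in client.get_nodes():
    print(node.host, node.port, node.server_type)
```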