This guide provides an overview of how to plan for and manage high availability and disaster recovery for SAP HANA systems that are deployed on Google Cloud Platform (GCP) by following the SAP HANA on GCP deployment guide. It is not intended to replace the standard SAP documentation.
High availability for SAP HANA on GCP
You can obtain high availability for SAP HANA on GCP by using a combination of GCP and SAP features that handle failures at the infrastructure or software levels. The following tables describe SAP and GCP features that are used to provide high availability.
Compute Engine live migration
Compute Engine monitors the state of the underlying infrastructure and automatically migrates your instance away from an infrastructure maintenance event. No user intervention is required.
Live migration moves your instance to another host in the same zone. The instance is not moved to a different zone or region.
Compute Engine keeps your instance running during the migration if possible. In the case of major outages, there might be a slight delay between when the instance goes down and when it is available.
In multi-node systems, shared volumes, such as the `/hana/shared` volume used in the deployment guide, are persistent disks attached to the VM that hosts the master node, and are NFS-mounted to the worker nodes. The NFS volume is inaccessible for up to a few seconds in the event of the master node's live migration. When the master node has restarted, the NFS volume functions again on all nodes, and normal operation resumes automatically.
A recovered instance is identical to the original instance, including the instance ID, private IP address, and all instance metadata and storage. By default, standard instances are set to live migrate. We recommend not changing this setting.
For more information, see Live migrate.
Compute Engine automatic restart
If your instance is set to terminate when there is a maintenance event, or if your instance crashes because of an underlying hardware issue, you can set up Compute Engine to automatically restart the instance. By default, instances are set to automatically restart. We recommend not changing this setting.
For more information, see Automatic restart.
SAP HANA Service Auto-Restart
SAP HANA Service Auto-Restart is a fault recovery solution provided by SAP.
SAP HANA has many configured services running all the time for various activities. When any of these services is disabled due to a software failure or human error, the SAP HANA service auto-restart watchdog function restarts it automatically. When the service is restarted, it loads all the necessary data back into memory and resumes its operation.
For more information, see Service Auto-Restart.
SAP HANA Backups
SAP HANA backups create copies of data from your database that can be used to reconstruct the database to a point in time.
SAP HANA Storage Replication
SAP HANA storage replication provides storage-level disaster recovery support through certain hardware partners. SAP HANA storage replication isn't supported on GCP. You can consider using Compute Engine persistent disk snapshots instead.
For more information about using persistent disk snapshots to back up SAP HANA systems on GCP, see the SAP HANA operations guide.
SAP HANA Host Auto-Failover
SAP HANA host auto-failover is a local fault recovery solution that can be used in addition to, or as an alternative to, system replication. One or more standby hosts are added to an SAP HANA system and configured to work in standby mode. When a primary (worker) host fails, a standby host automatically takes its place.
SAP HANA host auto-failover isn't supported on GCP. Compute Engine live migration serves the same purpose in an SAP HANA system on GCP.
SAP HANA System Replication
SAP HANA system replication allows you to configure one or more systems to take over for your primary system in high-availability or disaster recovery scenarios. You can tune replication to meet your needs in terms of performance and failover time.
For more information from SAP about SAP HANA system replication, see System Replication.
Recovering from instance restart
In the case of VM instance restart due to maintenance or other issues, Compute Engine automatic restart and SAP HANA service auto-restart work together to automatically restart the instance and application without your intervention. No client redirection is needed.
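The automatic recovery described here depends on the instance keeping its default availability policy. As a minimal sketch, the check below assumes you already hold an instance resource as a dictionary, in the shape returned by the Compute Engine API, where the `scheduling` object carries the `onHostMaintenance` and `automaticRestart` fields. The function name `check_availability_policy` is hypothetical and is not part of any Google or SAP tool.

```python
from typing import Dict, List


def check_availability_policy(instance: Dict) -> List[str]:
    """Return warnings for scheduling settings that differ from the recommended defaults."""
    scheduling = instance.get("scheduling", {})
    warnings = []
    # "MIGRATE" means the instance is live migrated during host maintenance.
    if scheduling.get("onHostMaintenance") != "MIGRATE":
        warnings.append("onHostMaintenance is not MIGRATE; live migration is disabled")
    # automaticRestart must be true for Compute Engine to restart a crashed instance.
    if scheduling.get("automaticRestart") is not True:
        warnings.append("automaticRestart is not enabled")
    return warnings
```

An empty list means that both defaults are still in place. Any other result lists the settings to review.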
Deployment Manager support for SAP HANA HA clusters
If Compute Engine live migration, automatic restart, and the very high monthly uptime percentage of Compute Engine VMs are not enough to satisfy your availability requirements, you can deploy a high-availability Linux cluster on GCP for SAP HANA.
You can use the Deployment Manager to automate the deployment of a high-availability SUSE Linux Enterprise Server (SLES) cluster for a single-node, scale-up SAP HANA system.
The Deployment Manager deploys a performance-optimized, high-availability Linux cluster that includes:
- Automatic failover
- Automatic restart
- Synchronous replication
- Memory preload
- The Pacemaker high-availability cluster resource manager
- A GCP fencing mechanism
For more information, see the SAP HANA High-Availability Cluster Deployment Guide.
More information about SAP HANA high availability features
For more information from SAP about SAP HANA high availability features, refer to the following documents:
- SAP HANA – High Availability
- FAQ: High Availability for SAP HANA
- How To Perform System Replication for SAP HANA 1.0
- How To Perform System Replication for SAP HANA 2.0
- Network Recommendations for SAP HANA 2.0 System Replication
- Network Recommendations for SAP HANA 2.1 System Replication
To prepare for disasters, you can use SAP HANA system replication to a secondary SAP HANA system, take backups of SAP HANA to enable recovery, or use both.
For mission-critical workloads that require fast recovery times, use SAP HANA system replication to minimize downtime. Using backups to recover a system costs less but takes longer, because you must create a new system and then restore the backups into it to recover to the desired point in time.
In either case, you must use network-based redirection to redirect client applications that use the SAP HANA system to the IP address of the replacement system once it is available. For more information, see the SAP HANA Administration Guide.
Starting with SAP HANA SPS09, you can use the Python-based API included with SAP HANA to create your own high-availability/disaster-recovery (HA/DR) provider and integrate it with the SAP HANA System Replication takeover process. A provider can automate tasks such as redirecting database client connections from the primary system to the secondary system after a takeover. For more information, see Implementing a HA/DR Provider.
Note that any restrictions defined by SAP, including distance limitations for synchronous replication, are also in effect on GCP.
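The following is a hedged sketch of what such a provider might look like, not SAP's reference implementation. It assumes the `HADRBase` class from the `hdb_ha_dr.client` module that ships with SAP HANA, and a `postTakeover` hook that receives a return code in which 0 indicates a successful takeover. The `redirect_clients` function, the class name, and the company name are hypothetical placeholders. The fallback base class exists only so that the sketch can be exercised outside an SAP HANA installation.

```python
import logging

try:
    # Available only inside an SAP HANA installation.
    from hdb_ha_dr.client import HADRBase
except ImportError:
    # Stand-in base class (assumption) so the sketch runs outside SAP HANA.
    class HADRBase:
        def __init__(self, *args, **kwargs):
            self.tracer = logging.getLogger("ha_dr_sketch")


def redirect_clients():
    """Hypothetical placeholder: update DNS or a virtual IP to point at this host."""
    return "clients redirected"


class RedirectAfterTakeover(HADRBase):
    """Sketch of a provider that redirects clients once a takeover completes."""

    def about(self):
        # Metadata that SAP HANA reads when it loads the provider.
        return {
            "provider_company": "Example Corp",  # hypothetical value
            "provider_name": "RedirectAfterTakeover",
            "provider_description": "Redirect clients after takeover",
            "provider_version": "1.0",
        }

    def postTakeover(self, rc, **kwargs):
        # rc is the takeover return code; 0 is assumed to mean success.
        if rc != 0:
            self.tracer.info("takeover returned %s; clients not redirected" % rc)
            return rc
        self.tracer.info(redirect_clients())
        return 0
```

Keeping the redirection logic in a separate function makes it possible to test that logic without triggering a real takeover.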
Disaster recovery using SAP HANA System Replication
To maximize infrastructure resource utilization and to cost-optimize your DR solution, you can use the secondary system for non-production use cases, such as a development or QA system. In this case, the secondary system isn't preloaded with data, so failover takes longer than it would if the secondary system were preloaded with data and kept in sync with the primary system.
SAP HANA 2.0 SPS00 includes support for the Active/Active (read enabled) configuration mode, which enables SAP HANA system replication to support read access on the secondary system. For more information, see Active/Active (Read Enabled).
Both synchronous and asynchronous replication are supported when using SAP HANA system replication with GCP.
If possible, we recommend using synchronous replication, where SQL transactions are not committed on the primary database instance until they are committed on the standby instance. This keeps the standby instance 100% in sync and ensures a zero recovery point objective. Synchronous replication can be used between instances that reside in any zones within the same region.
If the standby system is in a different region than the primary system, use asynchronous replication, in which the primary instance commits without waiting for the standby instance to acknowledge the data. The tradeoff is a recovery point objective greater than zero: you might lose small amounts of data if a disaster happens.
For all replication scenarios, you must manually perform a takeover on the standby system to start disaster recovery. You must also manually redirect any applications that use the SAP HANA database so that they target the standby instance that took over.
Choose the SAP HANA System Replication option that best fits your business needs, such as your recovery time objective (RTO) and recovery point objective (RPO). For more information, see Replication Modes for SAP HANA System Replication.
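The guidance above reduces to a simple rule: same region, synchronous; different regions, asynchronous. The sketch below illustrates that rule under the assumption that a Compute Engine zone name is its region name plus a final suffix, as in `us-central1-a`. The function names are hypothetical. The returned strings match the `sync` and `async` replication mode names used by SAP HANA system replication.

```python
def region_of(zone: str) -> str:
    """Derive the region from a zone name by dropping the final '-<letter>' suffix."""
    return zone.rsplit("-", 1)[0]


def choose_replication_mode(primary_zone: str, standby_zone: str) -> str:
    """Pick a replication mode from the locations of the primary and standby systems."""
    if region_of(primary_zone) == region_of(standby_zone):
        # Same region: synchronous replication gives a zero recovery point objective.
        return "sync"
    # Different regions: asynchronous replication, accepting possible small data loss.
    return "async"
```

For example, `choose_replication_mode("us-central1-a", "us-central1-b")` selects synchronous replication, while a standby in `europe-west1-b` selects asynchronous replication.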
SAP HANA System Replication with preload
In this scenario, your SAP HANA system is replicated to a dedicated standby system. The SAP HANA database is replicated to a Compute Engine VM that has a unique hostname and its own persistent disks attached. All of the SAP HANA data is loaded into memory on the standby system. If you have to fail over, the failover takes only around 90 seconds because all of the data is preloaded.
For more information about SAP HANA System Replication with preload, see the System Replication section in SAP HANA – High Availability.
SAP HANA System Replication without preload
In this scenario, your SAP HANA system is replicated to a dedicated standby system. The SAP HANA database is replicated to a Compute Engine VM that has a unique hostname and its own persistent disks attached. The SAP HANA data is not loaded into memory on the standby system. If you have to fail over, the failover time can take from minutes to hours, depending on the size of your dataset.
When you don't preload the data, the memory requirements for the Compute Engine VM that hosts the standby SAP HANA database are much smaller. The VM needs only 64 GB of memory or the amount of memory consumed by the rowstore on the target host, whichever is larger. You can get the rowstore memory footprint by running the following query:
SELECT ROUND(SUM(USED_FIXED_PART_SIZE + USED_VARIABLE_PART_SIZE)/1024/1024) AS "Row Tables MB" FROM M_RS_TABLES;
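As a worked example of the sizing rule, the sketch below converts the "Row Tables MB" value returned by the query into gigabytes and applies the 64 GB floor. The function and constant names are hypothetical, and the calculation covers only the rule stated in this guide, not any additional headroom that SAP might recommend.

```python
MIN_STANDBY_MEMORY_GB = 64  # floor stated in this guide


def standby_memory_gb(row_tables_mb: float) -> float:
    """Return the minimum memory, in GB, for a standby VM that does not preload data."""
    row_store_gb = row_tables_mb / 1024
    # Whichever is larger: the 64 GB floor or the rowstore footprint.
    return max(MIN_STANDBY_MEMORY_GB, row_store_gb)
```

A rowstore of 10,240 MB (10 GB) falls under the floor, so the result is 64. A rowstore of 204,800 MB yields 200.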
The reduced memory requirement gives you cost-saving options when choosing a Compute Engine machine type.
You can use a machine type that has low memory specifications for hosting the SAP HANA database in the standby system to lower your running cost. A low-memory VM isn't supported for SAP HANA in a production system, but you can use this lower-cost VM to perform a takeover in a disaster-recovery scenario, and then change the machine type afterwards to one with a supported amount of memory. Changing the machine type requires that you stop the VM, so there is additional downtime before the SAP HANA system is available.
Alternatively, you can use a high-memory machine type for hosting the SAP HANA database in the standby system, and share the VM with development or test systems to improve your return on investment. You can set the global allocation limit for the SAP HANA database to 64 GB by following the instructions at Change the Global Memory Allocation Limit, leaving the rest of the memory for other systems to use. When the standby system is needed, shut down dev and test operations, perform a takeover, and then remove the global allocation limit.
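One way to set and later remove the limit is with an SAP HANA `ALTER SYSTEM ALTER CONFIGURATION` statement. The sketch below only builds the statement text; it does not connect to a database. It assumes that `global_allocation_limit` lives in the `memorymanager` section of `global.ini` and is expressed in MB. The function name is hypothetical, and you should confirm the statement against the SAP instructions before running it.

```python
from typing import Optional

_PREFIX = "ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') "
_KEY = "('memorymanager', 'global_allocation_limit')"


def allocation_limit_statement(limit_gb: Optional[float] = None) -> str:
    """Build the SQL that sets the limit, or removes it when limit_gb is None."""
    if limit_gb is None:
        # After the takeover: remove the limit so SAP HANA can use all memory.
        return _PREFIX + "UNSET " + _KEY + " WITH RECONFIGURE"
    limit_mb = int(limit_gb * 1024)  # the parameter is assumed to be in MB
    return _PREFIX + "SET " + _KEY + " = '%d' WITH RECONFIGURE" % limit_mb
```

Calling `allocation_limit_statement(64)` produces the statement for the shared standby VM, and calling it with no argument produces the statement that removes the limit.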
You can use either synchronous or asynchronous replication without preload. However, synchronous replication requires that the source and target instances be in the same GCP region.
You can use an HA/DR provider to address issues such as shutting down the development or test systems on the secondary node. To learn more about the HA/DR provider implementation, see Implementing a HA/DR Provider.
Triggering a takeover
To invoke disaster recovery, trigger the SAP HANA System Replication Takeover procedure in your standby system. SAP OSS Note 2063657 provides guidelines to help you decide whether takeover is the best option.
To trigger the takeover, follow the standard SAP HANA takeover process. For more detailed information about this procedure, see How To Perform System Replication for SAP HANA 1.0 or How To Perform System Replication for SAP HANA 2.0.
In cases of data issues or software failure, you might not be notified automatically that a takeover is needed. Consider creating a custom solution that sends alerts by using Stackdriver or SAP HANA monitoring tools.
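If you script the takeover as part of a runbook, a thin wrapper with a dry-run default can help prevent accidental execution. The sketch below assumes the `hdbnsutil -sr_takeover` command from the SAP HANA system replication tooling, run on the standby host as the SAP HANA administration user. The wrapper function itself is hypothetical, and the SAP procedure remains the authoritative reference.

```python
import subprocess
from typing import List

TAKEOVER_COMMAND = ["hdbnsutil", "-sr_takeover"]


def trigger_takeover(dry_run: bool = True) -> List[str]:
    """Return the takeover command; execute it only when dry_run is False."""
    command = list(TAKEOVER_COMMAND)
    if not dry_run:
        # Raises CalledProcessError if the command exits with a non-zero status.
        subprocess.run(command, check=True)
    return command
```

The dry-run default means that the function must be called with `dry_run=False` before anything changes on the standby system.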
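As one possible starting point for a custom alert, the sketch below inspects rows that you have already fetched from the SAP HANA `M_SERVICE_REPLICATION` monitoring view and reports any service whose `REPLICATION_STATUS` is not `ACTIVE`. How the rows are fetched and how the alerts are delivered are left open. The function name and the row format (dictionaries keyed by column name) are assumptions.

```python
from typing import Dict, Iterable, List

HEALTHY_STATUS = "ACTIVE"


def replication_alerts(rows: Iterable[Dict[str, object]]) -> List[str]:
    """Return one alert message per service whose replication is not ACTIVE."""
    alerts = []
    for row in rows:
        # A missing status is treated as UNKNOWN so that it also raises an alert.
        status = row.get("REPLICATION_STATUS", "UNKNOWN")
        if status != HEALTHY_STATUS:
            alerts.append(
                "%s:%s replication status is %s"
                % (row.get("HOST"), row.get("PORT"), status)
            )
    return alerts
```

This version treats every status other than `ACTIVE` as alert-worthy, including transient states during initial synchronization. You might relax that rule if it proves too noisy.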
Disaster recovery using SAP HANA backups
In cases where a longer recovery time objective is acceptable and your recovery point objective is greater than 15 minutes, you can recover from disaster by restoring from backup. To ensure successful recovery when using backups, make frequent copies of your backup files, especially log backups, to a Cloud Storage bucket, or some other long-term storage location that exists outside of the region where your SAP HANA system runs. We recommend documenting the infrastructure of your primary system and creating scripts that allow you to quickly create a replacement system to restore your backups to.
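The choice between the two disaster recovery approaches in this guide can be summarized as a small decision rule. The sketch below encodes only the criteria stated here: backups fit when a longer recovery time is acceptable and the recovery point objective is greater than 15 minutes. Otherwise, system replication is the better fit. The function name and return values are hypothetical, and real decisions also involve cost and operational factors that the sketch ignores.

```python
BACKUP_MIN_RPO_MINUTES = 15  # threshold stated in this guide


def choose_dr_strategy(rpo_minutes: float, long_rto_acceptable: bool) -> str:
    """Suggest a disaster recovery approach from the RPO and RTO requirements."""
    if long_rto_acceptable and rpo_minutes > BACKUP_MIN_RPO_MINUTES:
        # Lower cost, but recovery requires building a new system and restoring.
        return "backups"
    # Faster recovery through a standby system that is kept up to date.
    return "system replication"
```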
For more information, see the SAP HANA operations guide.
- For more information about high-availability and disaster recovery for SAP HANA on GCP, see the SAP HANA Operations Guide.
- To deploy a high-availability Linux cluster for SAP HANA, see SAP HANA High Availability Cluster on SLES Deployment Guide.