SAP HANA high-availability planning guide

This guide provides an overview of the options, recommendations, and general concepts that you need to know before you deploy a high-availability (HA) SAP HANA system on Google Cloud.

This guide assumes that you already have an understanding of the concepts and practices that are generally required to implement an SAP HANA high-availability system. Therefore, the guide focuses primarily on what you need to know to implement such a system on Google Cloud.

If you need to know more about the general concepts and practices that are required to implement an SAP HANA HA system, see:

This planning guide focuses solely on HA for SAP HANA and does not cover HA for application systems. For information about HA for SAP NetWeaver, see the High-availability planning guide for SAP NetWeaver on Google Cloud.

This guide does not replace any documentation that is provided by SAP.

High availability options for SAP HANA on Google Cloud

You can use a combination of Google Cloud and SAP features in the design of a high-availability configuration for SAP HANA that can handle failures at both the infrastructure and software levels. The following table describes the SAP and Google Cloud features that are used to provide high availability.

Feature Description
Compute Engine live migration

Compute Engine monitors the state of the underlying infrastructure and automatically migrates your instance away from an infrastructure maintenance event. No user intervention is required.

Compute Engine keeps your instance running during the migration if possible. In the case of major outages, there might be a slight delay between when the instance goes down and when it is available.

In multi-host systems, shared volumes, such as the `/hana/shared` volume used in the deployment guide, are persistent disks attached to the VM that hosts the master host, and are NFS-mounted to the worker hosts. The NFS volume is inaccessible for up to a few seconds in the event of the master host's live migration. When the master host has restarted, the NFS volume functions again on all hosts, and normal operation resumes automatically.

A recovered instance is identical to the original instance, including the instance ID, private IP address, and all instance metadata and storage. By default, standard instances are set to live migrate. We recommend not changing this setting.

For more information, see Live migrate.

Compute Engine automatic restart

If your instance is set to terminate when there is a maintenance event, or if your instance crashes because of an underlying hardware issue, you can set up Compute Engine to automatically restart the instance.

By default, instances are set to automatically restart. We recommend not changing this setting.
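
To confirm or restore these scheduling settings on an existing VM, you can use gcloud commands similar to the following sketch. The instance name and zone are placeholders for your own values.

# Minimal sketch: check the current scheduling options of a host VM.
gcloud compute instances describe example-hana-vm \
    --zone=us-central1-a \
    --format="value(scheduling.onHostMaintenance, scheduling.automaticRestart)"

# Restore the recommended defaults (live migration and automatic restart).
gcloud compute instances set-scheduling example-hana-vm \
    --zone=us-central1-a \
    --maintenance-policy=MIGRATE \
    --restart-on-failure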

SAP HANA Service Auto-Restart

SAP HANA Service Auto-Restart is a fault recovery solution provided by SAP.

SAP HANA has many configured services running all the time for various activities. When any of these services is disabled due to a software failure or human error, the SAP HANA service auto-restart watchdog function restarts it automatically. When the service is restarted, it loads all the necessary data back into memory and resumes its operation.
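
To observe this behavior, you can list the SAP HANA services and their states with sapcontrol, as in the following sketch. The instance number is a placeholder; run the command as the <sid>adm operating system user.

# Minimal sketch: list the SAP HANA services and their current states.
# "00" is a placeholder for your instance number.
sapcontrol -nr 00 -function GetProcessList
# If a service such as the indexserver is killed, rerunning the command shows
# the watchdog restarting it and the service returning to the GREEN state.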

SAP HANA Backups

SAP HANA backups create copies of data from your database that can be used to reconstruct the database to a point in time.

For more information about using SAP HANA backups on Google Cloud, see the SAP HANA operations guide.
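
As an illustration, the following sketch triggers a full data backup of a tenant database with hdbsql. The instance number, database name, user, and backup file prefix are placeholders for your own values.

# Minimal sketch: run a full data backup of a tenant database.
# hdbsql prompts for the password of the SYSTEM user.
hdbsql -i 00 -d HA1 -u SYSTEM \
    "BACKUP DATA USING FILE ('/hanabackup/data/FULL_BACKUP')"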

SAP HANA Storage Replication

SAP HANA storage replication provides storage-level disaster recovery support through certain hardware partners. SAP HANA storage replication isn't supported on Google Cloud. You can consider using Compute Engine persistent disk snapshots instead.

For more information about using persistent disk snapshots to back up SAP HANA systems on Google Cloud, see the SAP HANA operations guide.

SAP HANA Host Auto-Failover

SAP HANA host auto-failover is a local fault recovery solution that requires one or more standby SAP HANA hosts in a scale-out system. If one of the main hosts fails, host auto-failover automatically brings the standby host online and restarts the failed host as a standby host.

For more information, see:

SAP HANA System Replication

SAP HANA system replication allows you to configure one or more systems to take over for your primary system in high-availability or disaster recovery scenarios. You can tune replication to meet your needs in terms of performance and failover time.
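
The following sketch shows how system replication is typically enabled and registered with hdbnsutil. The site names, host name, and instance number are placeholders, and option names can vary between SAP HANA versions; the cluster configuration guides contain the exact commands for your deployment.

# Minimal sketch: enable system replication on the primary site, then register
# the secondary site. Run the commands as the <sid>adm user.

# On the primary host:
hdbnsutil -sr_enable --name=SITE_A

# On the secondary host, after stopping SAP HANA there:
hdbnsutil -sr_register --remoteHost=hana-primary --remoteInstance=00 \
    --replicationMode=sync --operationMode=logreplay --name=SITE_B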

OS-native HA clusters for SAP HANA on Google Cloud

Linux operating system clustering provides application and guest awareness of your application state and automates recovery actions in case of failure.

Although the high-availability cluster principles that apply in non-cloud environments generally apply on Google Cloud, there are differences in how some things, such as fencing and virtual IPs, are implemented.

You can use either Red Hat or SUSE high-availability Linux distributions for your HA cluster for SAP HANA on Google Cloud.

Cluster resource agents

Both Red Hat and SUSE provide resource agents for Google Cloud with their high-availability implementations of the Pacemaker cluster software. The resource agents for Google Cloud manage STONITH fencing, VIPs that are implemented with either routes or alias IPs, and storage actions.

To deliver updates that are not yet included in the base OS resource agents, Google Cloud periodically provides companion resource agents for HA clusters for SAP. When these companion resource agents are required, the Google Cloud deployment procedures include a step for downloading them.

Fencing

Fencing, in the context of Compute Engine OS clustering on Google Cloud, takes the form of STONITH, which gives each member of a two-node cluster the ability to restart the other node.

The resource agents that both Red Hat and SUSE provide manage STONITH fencing on Google Cloud.
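
As an illustration only, the following crm shell sketch defines one STONITH resource per node with the fence_gce agent and restricts each resource so that a node cannot fence itself. The resource names, project, zones, and the agent itself are assumptions; the agent and its parameters depend on your Linux distribution and on the resource agents that you install.

# Minimal sketch (crm shell syntax): one STONITH resource per cluster node.
primitive STONITH-hana-ha-vm-1 stonith:fence_gce \
 params port="hana-ha-vm-1" zone="us-central1-a" project="example-project" \
 op monitor interval="300s" timeout="120s"
primitive STONITH-hana-ha-vm-2 stonith:fence_gce \
 params port="hana-ha-vm-2" zone="us-central1-b" project="example-project" \
 op monitor interval="300s" timeout="120s"
# Prevent each STONITH resource from running on the node that it fences.
location LOC_STONITH_vm1 STONITH-hana-ha-vm-1 -inf: hana-ha-vm-1
location LOC_STONITH_vm2 STONITH-hana-ha-vm-2 -inf: hana-ha-vm-2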

Virtual IP address

High-availability clusters for SAP on Google Cloud use a virtual, or floating, IP address (VIP) to redirect network traffic from one host to another in the event of a failover.

Typical non-cloud deployments use a gratuitous Address Resolution Protocol (ARP) request to announce the movement and reallocation of a VIP to a new MAC address.

On Google Cloud, instead of using gratuitous ARP requests, you use one of several different methods to move and reallocate a VIP in an HA cluster. The recommended method is to use an internal TCP/UDP load balancer, but, depending on your needs, you can also use a route-based or an alias-IP-based VIP implementation.

For more information about VIP implementation on Google Cloud, see Virtual IP implementation on Google Cloud.

Storage and replication

An SAP HANA HA cluster configuration uses synchronous SAP HANA System Replication to keep the primary and secondary SAP HANA databases in sync. The standard OS-provided resource agents for SAP HANA manage System Replication during a failover, starting and stopping the replication, and switching which instances are serving as the active and the standby instances in the replication process.

If you need shared file storage, NFS- or SMB-based filers can provide the required functionality.

For a high-availability shared storage solution, you can use a third-party file-sharing solution, such as NetApp Cloud Volumes. Google Cloud provides an NFS file server solution, Filestore, but Filestore does not currently provide a file server that is highly available across zones.

Compute Engine regional persistent disks offer synchronously replicated block storage across zones. Although regional persistent disks are not supported for database storage in SAP HA systems, you can use them with NFS file servers.

For more information about storage options on Google Cloud, see:

Configuration settings for HA clusters on Google Cloud

Google Cloud recommends changing the default values of certain cluster configuration parameters to values that are better suited for SAP systems in the Google Cloud environment. If you use the Deployment Manager templates that are provided by Google Cloud, the recommended values are set for you.

Consider the recommended values as a starting point for tuning the Corosync settings in your HA cluster. You need to confirm that the sensitivity of failure detection and failover triggering are appropriate for your systems and workloads in the Google Cloud environment.

Corosync configuration parameter values

In the HA cluster configuration guides for SAP HANA, Google Cloud recommends values for several parameters in the totem section of the corosync.conf configuration file that are different from the default values that are set by Corosync or your Linux distribution.

The following table shows the totem parameters that Google Cloud recommends different values for, along with the recommended values and the impact of changing the value. For the default values of the parameters, which can differ between Linux distributions, see the documentation for your Linux distribution.

Parameter Recommended value Impact of the change
join 60 (ms) Increases how long the node waits for `join` messages in the membership protocol.
max_messages 20 Increases the maximum number of messages that may be sent by the node after receiving the token.
token 20000 (ms)

Increases how long the node waits for a totem protocol token before the node declares a token loss, assumes a node failure, and starts taking action.

Increasing the value of the `token` parameter makes the cluster more tolerant of momentary infrastructure events, such as a live migration, but can make the cluster take longer to detect and recover from a node failure.

The value of the `token` parameter also determines the default value of the `consensus` parameter, which controls how long a node waits for consensus to be achieved before it attempts to re-establish configuration membership. When `consensus` is not specified, Corosync sets its value at 1.2 times the value of the `token` parameter.

token_retransmits_before_loss_const 10 Increases the number of token retransmits that the node attempts before it concludes that the recipient node has failed and takes action.
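
Applied to the totem section of /etc/corosync/corosync.conf, the recommended values look like the following sketch. Other totem settings, such as the transport and interface definitions, depend on your Linux distribution and are omitted here.

# Excerpt of /etc/corosync/corosync.conf with the recommended values.
totem {
  version: 2
  token: 20000
  token_retransmits_before_loss_const: 10
  join: 60
  max_messages: 20
}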

For more information about configuring the corosync.conf file, see the configuration guide for your Linux distribution:

Time out and interval settings

When you define a cluster resource, you set interval and timeout values, in seconds, for various resource operations (op). For example:

primitive rsc_SAPHanaTopology_HA1_HDB00 ocf:suse:SAPHanaTopology \
 operations $id="rsc_sap2_HA1_HDB00-operations" \
 op monitor interval="10" timeout="600" \
 op start interval="0" timeout="600" \
 op stop interval="0" timeout="300" \
 params SID="HA1" InstanceNumber="00"

clone cln_SAPHanaTopology_HA1_HDB00 rsc_SAPHanaTopology_HA1_HDB00 \
 meta is-managed="true" clone-node-max="1" target-role="Started" interleave="true"

The timeout values affect each of the resource operations differently, as explained in the following table.

Resource operation Timeout action
monitor If the timeout is exceeded, the monitoring status typically reports as failed, and the associated resource is considered in a failed state. The cluster attempts recovery options, which can include a failover. The cluster does not retry a failed monitoring operation.
start If a resource fails to start before its timeout is reached, the cluster attempts to restart the resource. The behavior is dictated by the on-fail action that is associated with a resource.
stop If a resource does not respond to a stop operation before the timeout is reached, this triggers a fencing (STONITH) event.

Along with other cluster configuration settings, the interval and timeout settings of the cluster resources affect how quickly the cluster software detects a failure and triggers a failover.

The timeout and interval values that are suggested by Google Cloud in the cluster configuration guides for SAP HANA account for Compute Engine Live Migration maintenance events.

Regardless of which timeout and interval values you use, you need to evaluate the values when you test your cluster, particularly during live migration testing, because the length of live migration events can vary slightly depending on the machine type you are using and other factors, such as system utilization.

Testing your HA cluster on Google Cloud

After your cluster is configured and the cluster and SAP HANA systems are deployed in your test environment, you need to test the cluster to confirm that the HA system is configured correctly and functioning as expected.

To confirm failover is working as expected, simulate various failure scenarios with the following actions:

  • Shut down the VM
  • Create a kernel panic
  • Shut down the application
  • Interrupt the network between the instances

Also, simulate a Compute Engine live migration event on the primary host to confirm that it does not trigger a failover. You can simulate a live migration event by using the Cloud SDK command gcloud compute instances simulate-maintenance-event, as shown in the examples that follow.
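
The following sketch shows, for illustration, how some of these scenarios can be simulated. The instance name and zone are placeholders, and the kernel panic command must be run as root on the host that you are testing.

# Shut down the primary VM from outside the guest:
gcloud compute instances stop hana-ha-vm-1 --zone=us-central1-a

# Create a kernel panic on the primary host (run as root on that host;
# requires the kernel.sysrq setting to be enabled):
echo c > /proc/sysrq-trigger

# Simulate a live migration event, which should not trigger a failover:
gcloud compute instances simulate-maintenance-event hana-ha-vm-1 \
    --zone=us-central1-a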

Logging and monitoring

Resource agents can include logging capabilities that propagate logs to Google Cloud's operations suite for analysis. Each resource agent includes configuration information that identifies any logging options. In the case of bash implementations, the logging option is gcloud logging.

You can also install the Cloud Logging agent to capture log output from operating system processes and correlate resource utilization with system events. The Logging agent captures default system logs, which include log data from Pacemaker and the clustering services. For more information, see About the Logging agent.
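
For example, you can query the captured log entries with the gcloud CLI, as in the following sketch. The filter shown is illustrative; adjust it to the log names and payloads in your environment.

# Minimal sketch: read recent Pacemaker-related log entries from Cloud Logging.
gcloud logging read \
    'resource.type="gce_instance" AND textPayload:"pacemaker"' \
    --limit=20 --freshness=1d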

For information about using Cloud Monitoring to configure service checks that monitor the availability of service endpoints, see Managing uptime checks.

Service accounts and HA clusters

The actions that the cluster software can take in the Google Cloud environment are secured by the permissions that are granted to the service account of each host VM. For high-security environments, you can limit the permissions in the service accounts of your host VMs to conform to the principle of least privilege.

When limiting the service account permissions, keep in mind that your system might interact with Google Cloud services, such as Cloud Storage, so you might need to include permissions for those service interactions in the service account of the host VM.

For the most restrictive permissions, create a custom role with the minimum required permissions. For information about custom roles, see Creating and managing custom roles. You can further restrict permissions by limiting them to only specific instances of a resource, such as the VM instances in your HA cluster, by adding conditions in the role bindings of a resource's IAM policy.

The minimum permissions that your systems need depends on the Google Cloud resources that your systems access and the actions that your systems perform. Consequently, determining the minimum required permissions for the host VMs in your HA cluster might require you to investigate exactly which resources the systems on the host VM access and the actions that those systems perform with those resources.

As a starting point, the following list shows some HA cluster resources and the associated permissions that they require; a sample custom role command follows the list:

  • STONITH fencing
    • compute.instances.list
    • compute.instances.get
    • compute.instances.reset
    • compute.instances.stop
    • compute.instances.start
    • logging.logEntries.create
    • compute.zones.list
  • VIP implemented by using an alias IP
    • compute.instances.list
    • compute.instances.get
    • compute.zones.list
    • logging.logEntries.create
    • compute.instances.updateNetworkInterface
    • compute.zoneOperations.get
  • VIP implemented by using static routes
    • compute.instances.list
    • compute.instances.get
    • compute.zones.list
    • logging.logEntries.create
    • compute.routes.get
    • compute.routes.create
    • compute.routes.delete
    • compute.routes.update
    • compute.routes.list
    • compute.networks.updatePolicy
    • compute.networks.get
    • compute.globalOperations.get
  • VIP implemented by using an internal load balancer
    • No specific permissions are required. The load balancer operates on health check statuses, so the cluster does not need to interact with or change resources in Google Cloud.
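
As an illustration, the following sketch creates a custom role that contains the STONITH fencing permissions from the preceding list and grants it to the service account of a host VM. The role ID, project, and service account name are placeholders.

# Minimal sketch: create a custom role with the STONITH fencing permissions.
gcloud iam roles create sapHaFencing --project=example-project \
    --title="SAP HA fencing" \
    --permissions=compute.instances.list,compute.instances.get,compute.instances.reset,compute.instances.stop,compute.instances.start,compute.zones.list,logging.logEntries.create

# Grant the custom role to the service account of the host VMs.
gcloud projects add-iam-policy-binding example-project \
    --member="serviceAccount:sap-ha-sa@example-project.iam.gserviceaccount.com" \
    --role="projects/example-project/roles/sapHaFencing"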

Virtual IP implementation on Google Cloud

A high-availability cluster uses a floating or virtual IP address (VIP) to move its workload from one cluster node to another in the event of an unexpected failure or for scheduled maintenance. The IP address of the VIP doesn't change, so client applications are unaware that the work is being served by a different node.

A VIP is also referred to as a floating IP address.

On Google Cloud, VIPs are implemented slightly differently than they are in on-premises installations, in that when a failover occurs, gratuitous ARP requests cannot be used to announce the change. Instead, you can implement a VIP address for an SAP HA cluster by using one of the following methods:

Internal TCP/UDP Load Balancing VIP implementations

A load balancer typically distributes user traffic across multiple instances of your applications, both to distribute the workload across multiple active systems and to protect against a processing slowdown or failure on any one instance.

The Internal TCP/UDP Load Balancing service also provides failover support that you can use with Compute Engine health checks to detect failures, trigger failover, and reroute traffic to a new primary SAP system in an OS-native HA cluster.

Internal TCP/UDP Load Balancing failover support is the recommended VIP implementation for a variety of reasons, including:

  • Load balancing on Compute Engine offers a 99.99% availability SLA.
  • Load balancing supports multi-zone high-availability clusters, protecting against zone failures with predictable cross-zone failover times.
  • Using load balancing reduces the time required to detect and trigger a failover, usually within seconds of the failure. Overall failover times are dependent on the failover times of each of the components in the HA system, which can include the hosts, database systems, application systems, and more.
  • Using load balancing simplifies cluster configuration and reduces dependencies.
  • Unlike a VIP implementation that uses routes, with load balancing, you can use IP ranges from your own VPC network, allowing you to reserve and configure them as needed.
  • Load balancing can easily be used to reroute traffic to a secondary system for planned maintenance outages.

When you create a health check for a load balancer implementation of a VIP, you specify the host port that the health check probes to determine the health of the host. For an SAP HA cluster, specify a target host port that is in the private range, 49152-65535, to avoid clashing with other services. On the host VM, configure the target port with a secondary helper service, such as the socat utility or HAProxy.
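
The following sketch illustrates this pattern with a TCP health check and a socat listener. The port number (60000), health check name, and socat invocation are assumptions for illustration; use the port and helper service that your cluster configuration guide specifies.

# Minimal sketch: a TCP health check on a port in the private range.
gcloud compute health-checks create tcp hana-health-check \
    --port=60000 \
    --check-interval=10s --timeout=10s \
    --healthy-threshold=2 --unhealthy-threshold=2

# On the active host, a helper service answers the probe on that port:
socat -U TCP-LISTEN:60000,backlog=10,fork /dev/null &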

For database clusters in which the secondary, standby system remains online, the health check and helper service enable the load balancer to direct traffic to the online system that is currently serving as the primary system in the cluster.

Using the helper service and port redirection, you can trigger a failover for planned software maintenance on your SAP systems.

For more information about the failover support of Internal TCP/UDP Load Balancing, see Configuring failover for Internal TCP/UDP Load Balancing.
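
For illustration, the following sketch outlines the load-balancer resources for such a VIP implementation, assuming that unmanaged instance groups containing the primary and secondary cluster VMs already exist. All names, the region, the subnet, and the VIP address are placeholders.

# Minimal sketch: a regional internal backend service with a failover backend.
gcloud compute backend-services create hana-backend-service \
    --load-balancing-scheme=internal --protocol=TCP \
    --region=us-central1 --health-checks=hana-health-check \
    --failover-ratio=1

gcloud compute backend-services add-backend hana-backend-service \
    --region=us-central1 \
    --instance-group=hana-primary-ig --instance-group-zone=us-central1-a

gcloud compute backend-services add-backend hana-backend-service \
    --region=us-central1 \
    --instance-group=hana-secondary-ig --instance-group-zone=us-central1-b \
    --failover

# The forwarding rule holds the VIP address for the cluster.
gcloud compute forwarding-rules create hana-vip-rule \
    --load-balancing-scheme=internal --address=10.0.0.100 \
    --region=us-central1 --subnet=example-subnet \
    --backend-service=hana-backend-service --ports=ALL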

To deploy an HA cluster with a load-balancer VIP implementation, see:

Static route VIP implementations

The static route implementation also provides protection against zone failures, but requires you to use a VIP outside of the IP ranges of your existing VPC subnets where the VMs reside. Consequently, you also need to make sure that the VIP does not conflict with any external IP addresses in your extended network.

Static route implementations can also introduce complexity when used with shared VPC configurations, which are intended to segregate network configuration to a host project.

If you use a static route implementation for your VIP, consult with your network administrator to determine a suitable IP address for a static route implementation.

Alias IP VIP implementations

Alias IP VIP implementations are not recommended for multi-zone HA deployments because, if a zone fails, the reallocation of the alias IP to a node in a different zone can be delayed. Implement your VIP by using Internal TCP/UDP Load Balancing with failover support instead.

If you are deploying all nodes of your SAP HA cluster in the same zone, you can use an alias IP to implement a VIP for the HA cluster.

If you have existing multi-zone SAP HA clusters that use an alias IP implementation for the VIP, you can migrate to an Internal TCP/UDP Load Balancing implementation without changing your VIP address. Both alias IP and Internal TCP/UDP Load Balancing use IP ranges from your VPC network.

While alias IP addresses are not recommended for VIP implementations in multi-zone HA clusters, they have other use cases in SAP deployments. For example, they can be used to provide a logical host name and IP assignments for flexible SAP deployments, such as those managed by SAP Landscape Management.

General best practices for VIPs on Google Cloud

For more information about VIPs on Google Cloud, see Best Practices for Floating IP Addresses.

SAP HANA host auto-failover on Google Cloud

Google Cloud supports SAP HANA host auto-failover, the local fault-recovery solution provided by SAP HANA. The host auto-failover solution uses one or more standby hosts that are kept in reserve to take over work from the master host or a worker host in the event of a host failure. The standby hosts do not contain any data or process any work.

The /hana/data and /hana/log volumes are mounted on the master and worker hosts only. When a takeover occurs, the host auto-failover solution uses the SAP HANA Storage Connector API and the Compute Engine gceStorageClient plugin to manage the switching of these disks from the failed host to the standby host. The configuration parameters for the gceStorageClient plugin, including whether fencing is enabled or disabled, are stored in the storage section of the SAP HANA global.ini file.
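
The following sketch shows the general shape of that storage section. The ha_provider entry selects the storage connector plugin; the remaining key shown here is an assumption for illustration only, so use the exact parameters from the deployment guide for your system.

# Illustrative excerpt of the [storage] section in global.ini.
[storage]
ha_provider = gceStorageClient
# Assumed example key; the actual fencing and partition parameters come from
# the deployment guide for your system.
partition_*_*__fencing = disabled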

The /hana/shared and /hanabackup volumes are stored on an NFS server, which is managed by the master host and mounted on all hosts, including the standby hosts.

After a failover completes, the failed host is restarted as a standby host.

SAP supports up to three standby hosts in scale-out systems on Google Cloud. The standby hosts do not count against the maximum of 16 active hosts that SAP supports in scale-out systems on Google Cloud.

Currently, Google Cloud supports SAP HANA host auto-failover on only the SUSE Linux Enterprise Server (SLES) for SAP public images that are available from Compute Engine in the sles-12-sp3-sap and sles-12-sp2-sap image families. To see the public images that are available from Compute Engine, see Images.

The following diagram shows a multi-host architecture on Google Cloud that includes support for SAP HANA host auto-failover. In the diagram, worker host 2 fails and the standby host takes over. The gceStorageClient plugin works with the SAP HANA Storage Connector API (not shown) to detach the disks that contain the /hana/data and /hana/log volumes from the failed worker and to remount them on the standby host, which then becomes worker host 2 while the failed host becomes the standby host.

Diagram: architecture of a scale-out SAP HANA system that includes support for host auto-failover.

Deployment options for SAP HANA high availability configurations

Google Cloud provides Deployment Manager templates that you can use to automate the deployment of SAP HANA HA systems or you can deploy and configure your SAP HANA HA systems manually.

The Deployment Manager templates that Google Cloud provides include a template.yaml configuration file that you complete. Deployment Manager reads the configuration file and deploys an SAP HANA system for you that is fully supported by SAP and that adheres to the best practices of both SAP and Google Cloud.

Automated deployment of Linux high-availability clusters for SAP HANA

For SAP HANA, Deployment Manager deploys a performance-optimized, high-availability Linux cluster that includes:

  • Automatic failover
  • Automatic restart
  • Synchronous replication
  • Memory preload
  • The Pacemaker high-availability cluster resource manager
  • A Google Cloud fencing mechanism
  • A VM with the required persistent disks for each SAP HANA instance
  • An SAP HANA instance on each VM

For more information, see the SAP HANA High-Availability Cluster Deployment Guide.

Automated deployment of SAP HANA scale-out systems with SAP HANA host auto-failover

Google Cloud also provides Deployment Manager templates for deploying SAP HANA scale-out systems that include standby hosts for SAP HANA host auto-failover.

Manual deployment of SAP HANA high-availability clusters

When you configure an HA cluster manually, first deploy the VMs and SAP HANA instances by using the Deployment Manager template that Google Cloud provides, to ensure that your SAP HANA systems meet SAP supportability requirements and best practices.

For instructions for deploying and manually configuring an HA cluster on Google Cloud for SAP HANA, see:

What's next

Both Google Cloud and SAP provide more information about high availability.

More information from Google Cloud about high availability

For more information about high availability for SAP HANA on Google Cloud, see:

For general information about protecting systems on Google Cloud against various failure scenarios, see Designing robust systems.

More information from SAP about SAP HANA high availability features

For more information from SAP about SAP HANA high availability features, refer to the following documents: