SAP HANA failover not happening due to SFAIL

Problem

The Pacemaker SAPHana resource agent calls the HANA script systemReplicationStatus.py with a hard-coded timeout of 5 seconds. If the script does not return the HANA replication status within 5 seconds, Pacemaker marks the HANA replication status as SFAIL.

2021-07-04 16:50:53 nodename SAPHana(SAPHana_SID_HDB00): INFO: ACT site=nodename, setting SFAIL for secondary (2) - srRc=124
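
The behavior behind this log entry can be sketched in a few lines of Python. This is a simplified illustration, not the resource agent's actual code: the function names run_with_timeout and sync_state_from_rc are hypothetical, and the mapping "anything other than ACTIVE becomes SFAIL" is an assumption made for brevity. The sketch relies on two facts: the GNU timeout command exits with 124 when it terminates a command for exceeding its time limit, and systemReplicationStatus.py exits with 15 when replication is ACTIVE.

```python
import subprocess
import sys

TIMEOUT_RC = 124    # exit code GNU `timeout` reports when the time limit is hit
SR_ACTIVE_RC = 15   # systemReplicationStatus.py exit code for ACTIVE


def run_with_timeout(cmd, seconds):
    """Run cmd and return its exit code, or 124 if it exceeds the time limit."""
    try:
        completed = subprocess.run(
            cmd,
            timeout=seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; mimic `timeout`'s code.
        return TIMEOUT_RC


def sync_state_from_rc(sr_rc):
    """Simplified assumption: only ACTIVE (15) counts as in sync."""
    return "SOK" if sr_rc == SR_ACTIVE_RC else "SFAIL"
```

For example, calling run_with_timeout with a command that sleeps longer than the limit returns 124, which sync_state_from_rc maps to SFAIL. This is the same outcome as a slow systemReplicationStatus.py call, even though replication itself may be healthy.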

If the primary HANA instance fails or the primary host goes down while the secondary is marked SFAIL, the promote operation fails, so HANA cannot fail over.

2021-06-22 12:42:20 nodename SAPHana(rsc_SAPHana_sid_HDB22): ERROR: ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED

Environment

Pacemaker HA clusters running on RHEL or SLES without the srHook enabled

Solution

Workaround

Check whether the VM was under high load or experiencing other issues at the time SFAIL was reported. If needed, open a ticket with SAP support, which owns systemReplicationStatus.py, to understand why the script did not return within 5 seconds.
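
To correlate SFAIL reports with load on the VM, it can help to extract the timestamps of the SFAIL messages from the cluster log first. The following is a minimal sketch that assumes log lines shaped like the sample in the Problem section. The function name find_sfail_events and the returned field names are hypothetical, and treating srRc=124 as "timed out" is an interpretation based on the exit code of the timeout command.

```python
import re
from datetime import datetime

# Pattern derived from the sample log line in this article (an assumption:
# other resource agent versions may format the message differently).
SFAIL_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<node>\S+) SAPHana\((?P<resource>[^)]+)\): "
    r".*setting SFAIL for secondary \(\d+\) - srRc=(?P<rc>\d+)"
)


def find_sfail_events(lines):
    """Return one dict per SFAIL log line: time, node, resource, srRc."""
    events = []
    for line in lines:
        match = SFAIL_LINE.match(line.strip())
        if match is None:
            continue  # not an SFAIL message
        sr_rc = int(match.group("rc"))
        events.append({
            "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
            "node": match.group("node"),
            "resource": match.group("resource"),
            "sr_rc": sr_rc,
            "timed_out": sr_rc == 124,  # 124 = time limit exceeded
        })
    return events
```

The resulting timestamps can then be compared against CPU, memory, and I/O metrics for the same VM around those times.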

Solution

Implement the HANA srHook Python script as detailed in the GCP HANA HA operation guide.

Cause

Pacemaker relies on the HANA system replication status to determine whether a secondary site is eligible for takeover. If the cluster cannot confirm that the secondary node is fully in sync, it excludes the node as a "Possible Takeover Node" and marks its replication state as SFAIL.
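
The eligibility rule described above can be illustrated with a small model. This is a sketch under stated assumptions, not the cluster's real decision logic: the function takeover_candidates and the site names are hypothetical, and the per-site states are reduced to the three values PRIM, SOK, and SFAIL.

```python
def takeover_candidates(sync_state_by_site):
    """Return the sites that may be promoted: only those reporting SOK.

    sync_state_by_site maps a site name to "PRIM", "SOK", or "SFAIL"
    (a simplified model of the cluster's view of each site).
    """
    return sorted(
        site for site, state in sync_state_by_site.items() if state == "SOK"
    )


# Healthy cluster: the secondary is in sync and can take over.
healthy = {"site_a": "PRIM", "site_b": "SOK"}

# Status check timed out: the secondary is marked SFAIL, so nothing
# is left to promote if the primary fails.
after_timeout = {"site_a": "PRIM", "site_b": "SFAIL"}
```

In this model, takeover_candidates(healthy) yields the secondary site, while takeover_candidates(after_timeout) yields an empty list, which corresponds to the failed promotion shown in the Problem section.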