SAP HANA failover not happening due to SFAIL

Problem

The Pacemaker HANA resource agent calls the HANA script systemReplicationStatus.py with a hard-coded timeout of 5 seconds. If the script does not return the HANA replication status within those 5 seconds, Pacemaker marks the HANA replication status as SFAIL.

2021-07-04 16:50:53 nodename SAPHana(SAPHana_SID_HDB00): INFO: ACT site=nodename, setting SFAIL for secondary (2) - srRc=124
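
The srRc=124 in the message above is consistent with the exit status that the GNU timeout command reports when a time limit is exceeded. The following is a minimal Python sketch of that behaviour, not the resource agent's actual code: the helper name run_with_timeout is hypothetical, and a sleeping child process stands in for systemReplicationStatus.py.

```python
import subprocess
import sys
import time

# Exit status GNU `timeout` reports when the time limit is hit.
TIMEOUT_RC = 124


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return (rc, elapsed_seconds).

    rc is 124 if the command did not finish within timeout_s,
    otherwise it is the command's own exit status.
    """
    start = time.monotonic()
    try:
        rc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        ).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        rc = TIMEOUT_RC
    return rc, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in for a slow status script: a child process that sleeps 2 s.
    slow = [sys.executable, "-c", "import time; time.sleep(2)"]
    print(run_with_timeout(slow, timeout_s=0.5))
```

Running the real status script through a helper like this (the actual command line is site-specific and not shown here) and recording the elapsed time can show how close the script normally comes to the 5-second limit.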


If the primary HANA instance fails or the primary host goes down while the replication status is SFAIL, the promote operation on the secondary fails, so HANA cannot fail over.

2021-06-22 12:42:20 nodename SAPHana(rsc_SAPHana_sid_HDB22): ERROR: ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED

Environment

  • Pacemaker HA clusters running on RHEL or SLES without srHook enabled

Solution

Workaround

  • Check whether the VM was under high load or experiencing other issues at the time SFAIL was reported. If needed, open a ticket with SAP support, which owns systemReplicationStatus.py, to understand why the script did not return within 5 seconds.
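
To correlate SFAIL events with load on the VM, it can help to extract their timestamps from the cluster log. The sketch below assumes log lines shaped like the SFAIL message shown under Problem; the function name sfail_events and the regular expression are illustrative and may need adjusting for other log formats.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2021-07-04 16:50:53 nodename SAPHana(...): INFO: ACT site=nodename,
#   setting SFAIL for secondary (2) - srRc=124
SFAIL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<node>\S+) "
    r".*setting SFAIL for secondary.*srRc=(?P<rc>\d+)"
)


def sfail_events(lines):
    """Return a list of (timestamp, node, srRc) for every SFAIL line."""
    events = []
    for line in lines:
        match = SFAIL_RE.search(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, match.group("node"), int(match.group("rc"))))
    return events
```

Events with srRc=124 point at the 5-second timeout. Their timestamps can then be compared with whatever load or performance data is available for the VM.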

Cause

Pacemaker relies on the HANA system replication status to determine whether a secondary site is eligible for takeover. If the cluster cannot confirm that the secondary node is fully in sync, it excludes the node as a "Possible Takeover Node" and marks the replication state as SFAIL.
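
The decision described here can be sketched in a few lines of Python. This is a simplified illustration, not the resource agent's implementation. It assumes the return codes of systemReplicationStatus.py are 10 (NoHSR), 11 (Error), 12 (Unknown), 13 (Initializing), 14 (Syncing) and 15 (Active), and that only 15 counts as in sync. The function names are hypothetical.

```python
# Assumed return codes of systemReplicationStatus.py.
SR_STATUS = {
    10: "NoHSR",
    11: "Error",
    12: "Unknown",
    13: "Initializing",
    14: "Syncing",
    15: "Active",
}


def sync_status(sr_rc):
    """Map a status-script return code to the cluster's sync state.

    Simplification: only 15 (Active) yields SOK. Everything else,
    including 124 from a timed-out call, yields SFAIL.
    """
    return "SOK" if sr_rc == 15 else "SFAIL"


def is_takeover_candidate(sr_rc):
    """A secondary is only eligible for promotion when it is SOK."""
    return sync_status(sr_rc) == "SOK"
```

Under these assumptions a timed-out call is indistinguishable from a genuinely out-of-sync secondary: both end in SFAIL, so a healthy secondary can be refused promotion simply because the status script was slow.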