Problem
The Pacemaker HANA resource agent calls the HANA script systemReplicationStatus.py with a hard-coded timeout of 5 seconds. If the script does not return the HANA replication status within 5 seconds, Pacemaker marks the HANA replication status as SFAIL.
2021-07-04 16:50:53 nodename SAPHana(SAPHana_SID_HDB00): INFO: ACT site=nodename, setting SFAIL for secondary (2) - srRc=124
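The behaviour described above can be approximated with a minimal Python sketch. It is not the resource agent's actual code. It assumes that srRc=124 in the log line is the exit status GNU `timeout` reports when the time limit is reached, and that systemReplicationStatus.py exits with 15 when replication is fully active. The names `run_with_timeout` and `sync_state` are hypothetical helpers introduced only for illustration.

```python
import subprocess

TIMEOUT_RC = 124   # exit status GNU `timeout` uses when the limit is hit
SR_ACTIVE_RC = 15  # assumption: script exit code for fully active replication


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return its exit code, or 124 if it does not finish in time."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, like `timeout` would.
        return TIMEOUT_RC


def sync_state(sr_rc):
    """Map a status return code to the sync state used in this sketch."""
    # Anything other than "active" is treated as SFAIL, including a timeout.
    return "SOK" if sr_rc == SR_ACTIVE_RC else "SFAIL"
```

Under these assumptions, a status check that is merely slow is indistinguishable from a genuine replication failure: both end in SFAIL.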
If the primary HANA instance fails or the primary host goes down while the secondary is marked SFAIL, the promote operation fails because of the SFAIL status, so HANA cannot fail over.
2021-06-22 12:42:20 nodename SAPHana(rsc_SAPHana_sid_HDB22): ERROR: ACT: HANA SYNC STATUS IS NOT 'SOK' SO THIS HANA SITE COULD NOT BE PROMOTED
Environment
- Pacemaker HA clusters running on RHEL or SLES without the srHook enabled
Solution
Workaround
- Check whether the VM was under high load or experiencing issues at the time SFAIL was reported. If needed, open a ticket with SAP support, which owns systemReplicationStatus.py, to understand why the script did not return within 5 seconds.
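To correlate SFAIL events with load on the VM, the timestamps first need to be pulled out of the cluster log. The sketch below is one way to do that in Python. It assumes the log lines follow the format of the sample shown in the Problem section; the function name `find_sfail_events` is hypothetical, and the log file location depends on the distribution and cluster configuration.

```python
import re

# Pattern modelled on the sample SFAIL line in this article (assumed format).
SFAIL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<node>\S+) "
    r"SAPHana\((?P<rsc>[^)]+)\): INFO: .*setting SFAIL.*srRc=(?P<rc>\d+)"
)


def find_sfail_events(lines):
    """Return (timestamp, node, srRc) for every SFAIL line found."""
    events = []
    for line in lines:
        match = SFAIL_RE.match(line)
        if match:
            events.append(
                (match.group("ts"), match.group("node"), int(match.group("rc")))
            )
    return events
```

The returned timestamps can then be compared against the VM's CPU, memory, and I/O metrics for the same period.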
Solution
- Implement the HANA srHook Python script as detailed in the GCP HANA HA operation guide.
Cause
Pacemaker relies on the HANA system replication status to determine whether a secondary site is eligible for takeover. If the cluster cannot confirm that the secondary node is in full sync, it excludes that node as a "Possible Takeover Node" and marks the replication state as SFAIL.