Troubleshooting high-availability configurations for SAP

In high-availability configurations for SAP on Google Cloud, the root cause of issues might lie in the clustering software, the SAP software, the Google Cloud infrastructure, or some combination of these.

Analyze Pacemaker logs in Cloud Logging

You can start troubleshooting high-availability configurations for SAP on Google Cloud by analyzing the Pacemaker logs in Cloud Logging.
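
For example, if a logging agent is installed on the cluster VMs, you can query Pacemaker-related log entries by using the gcloud CLI. The following command is a minimal sketch; adjust the filter to match your instance names or log names:

gcloud logging read 'resource.type="gce_instance" AND "pacemaker"' --limit=20 --freshness=1d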

Failed node in a Linux cluster doesn't restart properly after a failover

If your Linux high-availability cluster uses the fence_gce fence agent and a fenced VM fails to rejoin the cluster after a failover, you might need to delay the start of the Corosync software when fenced VMs restart.

Issue

During a failover, the fence_gce agent fences the failed Compute Engine VM, which reboots and rejoins the cluster before Pacemaker registers the fence action as complete. Because the fence action is not registered as complete, the rebooted VM shuts down its Pacemaker and Corosync services and leaves the cluster.

Diagnosis

To confirm that this is your issue:

  • Make sure that your cluster is using the fence_gce agent:

    RHEL

    pcs config

    SLES

    crm config show

    In the output, confirm that the fence agent definition includes fence_gce, as shown in the following examples:

    RHEL

    Stonith Devices:
    Resource: STONITH-example-ha-vm1 (class=stonith type=fence_gce)
    Attributes: port=example-ha-vm1 project=example-project-123456 zone=us-central1-a
    Operations: monitor interval=300s timeout=120s (STONITH-example-ha-vm1-monitor-interval-60s)
    Resource: STONITH-example-ha-vm2 (class=stonith type=fence_gce)
    Attributes: port=example-ha-vm2 project=example-project-123456 zone=us-central1-c
    Operations: monitor interval=300s timeout=120s (STONITH-example-ha-vm2-monitor-interval-60s)
    

    SLES

    primitive fence-example-ha-vm1 stonith:fence_gce \
     op monitor interval=300s timeout=120s \
     op start interval=0 timeout=60s \
     params port=example-ha-vm1 zone=us-central1-a project=example-project-123456
    primitive fence-example-ha-vm2 stonith:fence_gce \
     op monitor interval=300s timeout=120s \
     op start interval=0 timeout=60s \
     params port=example-ha-vm2 zone=us-central1-c project=example-project-123456
  • Check the system log for the following messages:

    DATESTAMP> node2 stonith-ng[1106]:  notice: Operation reboot of node2 by node1 for stonith_admin.1366@node1.c3382af8: OK
    DATESTAMP> node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
    DATESTAMP> node2 stonith-ng[1106]: warning: Can't create a sane reply
    DATESTAMP> node2 crmd[1110]:    crit: We were allegedly just fenced by node1 for node1!
    DATESTAMP> node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
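
    To search the system log for these messages on the rebooted node, a command similar to the following might help. This is only a sketch; depending on your operating system, the messages might instead be in /var/log/messages:

    journalctl -u pacemaker -u corosync | grep -E 'allegedly just fenced|Shutting cluster down'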

Solution

Configure the operating system on both cluster nodes to delay the start of Corosync so that the fence action has time to register as complete with Pacemaker on the new primary node. Also, set the Pacemaker reboot timeout value to account for the delay.

To configure a delayed start of Corosync:

  1. Put the cluster in maintenance mode:

    RHEL

    pcs property set maintenance-mode=true

    SLES

    crm configure property maintenance-mode="true"
  2. On each cluster node as root, set a start delay for Corosync:

    1. Create a systemd drop-in file:

      systemctl edit corosync.service
    2. Add the following lines to the file:

      [Service]
      ExecStartPre=/bin/sleep 60
    3. Save the file and exit the editor.

    4. Reload the systemd manager configuration:

      systemctl daemon-reload
  3. On either cluster node as root, verify that the Pacemaker timeout value for reboots is set for both fence agents:

    1. Check the pcmk_reboot_timeout value:

      crm_resource --resource FENCE_AGENT_NAME --get-parameter=pcmk_reboot_timeout

      Replace FENCE_AGENT_NAME with the name of the fence agent.

    2. If the pcmk_reboot_timeout parameter is not found or is set to a value that is smaller than 300, set the value on both fence agents:

      crm_resource --resource FENCE_AGENT_NAME --set-parameter=pcmk_reboot_timeout --parameter-value=300

      Replace FENCE_AGENT_NAME with the name of the fence agent.

      The pcmk_reboot_timeout value should be greater than the sum of:

      • The Corosync token timeout
      • The Corosync consensus timeout, which by default is 1.2 times the token timeout
      • The length of time it takes a reboot operation to complete, including any delay attribute.

      On Google Cloud, 300 seconds is sufficient for most clusters. A command that you can use to check the current Corosync timeout values is shown after this procedure.

    3. Confirm the new pcmk_reboot_timeout value:

      crm_resource --resource FENCE_AGENT_NAME --get-parameter=pcmk_reboot_timeout

      Replace FENCE_AGENT_NAME with the name of the fence agent.

  4. Take the cluster out of maintenance mode:

    RHEL

    pcs property set maintenance-mode=false

    SLES

    crm configure property maintenance-mode="false"
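
After you take the cluster out of maintenance mode, you can verify the new settings on each node. The following commands are a minimal sketch; STONITH-example-ha-vm1 is the example fence agent name used earlier in this section, so substitute the name of your own fence agent:

# Show the corosync.service unit file together with the drop-in that adds the start delay
systemctl cat corosync.service

# Show the current Corosync token and consensus timeouts, in milliseconds
corosync-cmapctl | grep -E 'totem\.token|totem\.consensus'

# Confirm the Pacemaker reboot timeout that is set on a fence agent
crm_resource --resource STONITH-example-ha-vm1 --get-parameter=pcmk_reboot_timeout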

Unintentional node affinity that favors a particular node

When you manually move resources in a high-availability cluster using the cluster commands, you find that an automatic affinity or client preference is set to favor a particular node.

Issue

In your Linux Pacemaker high-availability cluster for SAP HANA or SAP NetWeaver, resources such as the SAP HANA system or the SAP NetWeaver central services run on only one particular cluster node and do not fail over as expected during a node failure event.

Consequently, you might experience issues such as:

  • When you trigger the SAP NetWeaver ASCS service failover by issuing a Pacemaker command to move a resource to a cluster node, the resource does not start and shows the Stopped status.

  • When you issue the standby command to one cluster node to force all resources to move to the other node, the resources do not start.

Diagnosis

  • Check your Pacemaker logs for a message indicating that a particular resource cannot run anywhere. For example:

    2021-05-24 21:39:58 node_1 pacemaker-schedulerd (native_color) info: Resource NW1-ASCS01 cannot run anywhere
  • Check your Pacemaker location constraint configuration to identify any constraints that might be preventing the resources from running on a certain cluster node.

    To check the Pacemaker location constraint configuration, follow these steps:

    1. Display the location constraints:

      cibadmin --query --scope constraints | grep rsc_location
    2. Verify the location constraints:

      • Explicit location constraint: You find location constraints with score INFINITY (prefer the node) or -INFINITY (avoid the node). For example:

        <rsc_location id="loc-constraint" rsc="NW1-ASCS01" score="INFINITY" node="nw-ha-1"/>

        Apart from the fence agents, there must not be any location constraint with a score of INFINITY or -INFINITY. In all HA clusters, fence agents are defined in a location constraint with a score of -INFINITY to prevent them from running on the node that is the fencing target.

      • Implicit location constraint: When you issue the Pacemaker command to move a resource to a cluster node or to ban a resource from running on a cluster node, an implicit location constraint whose constraint ID has the prefix cli-ban or cli-prefer is added to the cluster configuration. For example:

        <rsc_location id="cli-prefer-NW1-ASCS01" rsc="NW1-ASCS01" role="Started" node="nw-ha-2" score="INFINITY"/>

Solution
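
The following steps sketch one common way to resolve this issue. The resource name NW1-ASCS01 and the constraint ID loc-constraint come from the preceding examples; replace them with your own values.

  • To remove an implicit location constraint that was created by a move or ban command (constraint ID prefix cli-ban or cli-prefer), clear it by resource name:

    RHEL

    pcs resource clear NW1-ASCS01

    SLES

    crm resource clear NW1-ASCS01

  • To remove an explicit location constraint that you no longer need, delete it by its constraint ID:

    RHEL

    pcs constraint remove loc-constraint

    SLES

    crm configure delete loc-constraint

After you remove the constraint, rerun the cibadmin query from the diagnosis steps to confirm that the constraint is gone, and then verify that the resources fail over as expected.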

Fence agent experienced an operational error

The fence agent has reported an error in the cluster status.

Issue

In your Linux Pacemaker high-availability cluster for SAP HANA or SAP NetWeaver, the fence agent has reported an error in the cluster status. For example:

Failed Resource Actions:
   STONITH-ha-node-01_monitor_300000 on ha-node-02 'unknown error' (1): call=153, status=Timed Out, exitreason='',  last-rc-change='Mon Dec 21 23:40:47 2023', queued=0ms, exec=60003ms

Diagnosis

The fence agent deployed in your SAP HANA or SAP NetWeaver high-availability cluster regularly accesses the Compute Engine API server to check the status of the fence target instance. If there is a temporary delay in the API response or a network interruption, then the fence agent monitoring operation might fail or time out.

To check the fence agent status, run the following command:

RHEL

pcs status

SLES

crm status

If the fence agent status is stopped, then use one of the solution options to resolve the error.

An operational error might cause the fence agent to show as stopped in the cluster status, but Pacemaker still calls the fence agent when a fencing event occurs.
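
To look for the underlying timeout on the node that reported the error, a log search similar to the following might help. This is only a sketch; the exact message text varies by fence agent and operating system version:

journalctl -u pacemaker | grep -iE 'fence_gce|timed out'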

Solution

If the fence agent status is stopped, then do one of the following:

  • To manually reset the failcount and restart the fence agent, run the following command:

    RHEL

    pcs resource cleanup FENCE_AGENT_NAME

    SLES

    crm resource cleanup FENCE_AGENT_NAME

    Replace FENCE_AGENT_NAME with the name of the fence agent.

  • To automatically remove the fence agent operational error, configure the failure-timeout parameter.

    The failure-timeout parameter resets the failcount after the specified duration and clears any operational errors. Applying this parameter doesn't require you to restart the cluster or put the cluster in maintenance mode.

    To configure the failure-timeout parameter, run the following command:

    crm_resource --meta --resource FENCE_AGENT_NAME --set-parameter failure-timeout --parameter-value DURATION

    Replace the following:

    • FENCE_AGENT_NAME: the name of the fence agent.
    • DURATION: the duration following the last operational failure after which the failcount is reset and the fence agent is restarted.
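
    For example, to reset the failcount 10 minutes after the last failure, you might run a command similar to the following. The agent name STONITH-ha-node-01 comes from the earlier example output, and the 600-second duration is only an example:

    crm_resource --meta --resource STONITH-ha-node-01 --set-parameter failure-timeout --parameter-value 600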

Resource agent is stopped

A resource agent has failed to start and remains in the Stopped status.

Issue

In your Linux Pacemaker high-availability cluster for SAP HANA or SAP NetWeaver, a resource agent has reported an error in the cluster status. For example:

Failed Resource Actions:
   rsc_SAPHana_DV0_HDB00_start_0 on ha-node-02 'error' (1): call=91, status='complete', last-rc-change='Wed Oct 18 18:00:31 2023', queued=0ms, exec=19010ms

Diagnosis

If a running resource agent fails, then Pacemaker attempts to stop the agent and start it again. If the start operation fails for any reason, then Pacemaker sets the resource failcount to INFINITY and attempts to start the agent on another node. If the resource agent fails to start on any node, then the resource agent remains in the Stopped status.

To check the resource agent status, run the following command:

RHEL

pcs status

SLES

crm status

For SAP HANA, the following example shows the resource agent in the Stopped status on the node hana-b:

Full List of Resources:
  * STONITH-hana-a        (stonith:external/gcpstonith):   Started hana-b
  * STONITH-hana-b        (stonith:external/gcpstonith):   Started hana-a
  * Resource Group: g-primary:
    * rsc_vip_int-primary       (ocf::heartbeat:IPaddr2):        Started hana-a
    * rsc_vip_hc-primary        (ocf::heartbeat:anything):       Started hana-a
  * Clone Set: cln_SAPHanaTopology_DV0_HDB00 [rsc_SAPHanaTopology_DV0_HDB00]:
    * Started: [ hana-a hana-b ]
  * Clone Set: msl_SAPHana_DV0_HDB00 [rsc_SAPHana_DV0_HDB00] (promotable):
    * Masters: [ hana-a ]
    * Stopped: [ hana-b ]
  * STONITH-scaleup-majority    (stonith:external/gcpstonith):   Started hana-b
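
To confirm that the failcount for the stopped resource has reached INFINITY, you can display fail counts in a one-time snapshot. The following commands are a sketch that reuses the resource and node names from the preceding example output:

crm_mon -1 --failcounts

To query the failcount of a single resource on a specific node, you can also run:

crm_failcount --query --resource rsc_SAPHana_DV0_HDB00 --node hana-b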

Solution

If a resource agent is in the Stopped status, then do the following:

  1. Manually start the resource agent by resetting the failcount:

    RHEL

    pcs resource cleanup RESOURCE_AGENT_NAME

    SLES

    crm resource cleanup RESOURCE_AGENT_NAME

    Replace RESOURCE_AGENT_NAME with the name of the resource agent, for example rsc_SAPHana_DV0_HDB00.

  2. Ensure that the resource agent reaches the Started status:

    crm_mon

    If the resource agent still fails to start, then gather the relevant diagnostic information and contact support.
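
    crm_mon runs as an interactive console by default. For a one-time snapshot that also lists inactive resources, a command like the following might be convenient:

    crm_mon --one-shot --inactive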