Troubleshooting high-availability configurations for SAP

In high-availability configurations for SAP on Google Cloud, the root cause of issues might lie in the clustering software, the SAP software, the Google Cloud infrastructure, or some combination of these.

Analyze Pacemaker logs in Cloud Logging

You can start troubleshooting high-availability configurations for SAP on Google Cloud by analyzing the Pacemaker logs in Cloud Logging.
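
For example, if a logging agent is installed on the cluster VMs, you can query Pacemaker-related log entries by using the gcloud CLI. The following command is a minimal sketch; adjust the filter to match your instance names or log names:

gcloud logging read 'resource.type="gce_instance" AND "pacemaker"' --limit=20 --freshness=1d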

Failed node in a Linux cluster doesn't restart properly after a failover

If your Linux high-availability cluster uses the fence_gce fence agent and a fenced VM fails to rejoin the cluster after a failover, you might need to delay the start of the Corosync software when fenced VMs restart.

Issue

During a failover, the fence_gce agent fences the failed Compute Engine VM, which reboots and rejoins the cluster before Pacemaker registers the fence action as complete. Because the fence action is not registered as complete, the rebooted VM shuts down its Pacemaker and Corosync services and leaves the cluster.

Diagnosis

To confirm that this is your issue:

  • Make sure that your cluster is using the fence_gce agent:

    RHEL

    pcs config

    SLES

    crm config show

    In the output, confirm that the fence agent definition includes fence_gce, as shown in the following examples:

    RHEL

    Stonith Devices:
    Resource: STONITH-example-ha-vm1 (class=stonith type=fence_gce)
    Attributes: port=example-ha-vm1 project=example-project-123456 zone=us-central1-a
    Operations: monitor interval=300s timeout=120s (STONITH-example-ha-vm1-monitor-interval-60s)
    Resource: STONITH-example-ha-vm2 (class=stonith type=fence_gce)
    Attributes: port=example-ha-vm2 project=example-project-123456 zone=us-central1-c
    Operations: monitor interval=300s timeout=120s (STONITH-example-ha-vm2-monitor-interval-60s)
    

    SLES

    primitive fence-example-ha-vm1 stonith:fence_gce \
     op monitor interval=300s timeout=120s \
     op start interval=0 timeout=60s \
     params port=example-ha-vm1 zone=us-central1-a project=example-project-123456
    primitive fence-example-ha-vm2 stonith:fence_gce \
     op monitor interval=300s timeout=120s \
     op start interval=0 timeout=60s \
     params port=example-ha-vm2 zone=us-central1-c project=example-project-123456
  • Check the system log for the following messages:

    DATESTAMP> node2 stonith-ng[1106]:  notice: Operation reboot of node2 by node1 for stonith_admin.1366@node1.c3382af8: OK
    DATESTAMP> node2 stonith-ng[1106]:   error: stonith_construct_reply: Triggered assert at commands.c:2343 : request != NULL
    DATESTAMP> node2 stonith-ng[1106]: warning: Can't create a sane reply
    DATESTAMP> node2 crmd[1110]:    crit: We were allegedly just fenced by node1 for node1!
    DATESTAMP> node2 pacemakerd[1055]: warning: Shutting cluster down because crmd[1110] had fatal failure
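
    To search the system log for these messages on the rebooted node, a command similar to the following might help. This is only a sketch; depending on your operating system, the messages might instead be in /var/log/messages:

    journalctl -u pacemaker -u corosync | grep -E 'allegedly just fenced|Shutting cluster down'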

Solution

Configure the operating system on both cluster nodes to delay the start of Corosync so that the fence action has time to register as complete with Pacemaker on the new primary node. Also, set the Pacemaker reboot timeout value to account for the delay.

To configure a delayed start of Corosync:

  1. Put the cluster in maintenance mode:

    RHEL

    pcs property set maintenance-mode=true

    SLES

    crm configure property maintenance-mode="true"
  2. On each cluster node as root, set a start delay for Corosync:

    1. Create a systemd drop-in file:

      systemctl edit corosync.service
    2. Add the following lines to the file:

      [Service]
      ExecStartPre=/bin/sleep 60
    3. Save the file and exit the editor.

    4. Reload the systemd manager configuration:

      systemctl daemon-reload
  3. On either cluster node as root, verify that the Pacemaker timeout value for reboots is set for both fence agents:

    1. Check the pcmk_reboot_timeout value:

      crm_resource --resource FENCE_AGENT_NAME --get-parameter=pcmk_reboot_timeout

      Replace FENCE_AGENT_NAME with the name of the fence agent.

    2. If the pcmk_reboot_timeout parameter is not found or is set to a value that is smaller than 300, set the value on both fence agents:

      crm_resource --resource FENCE_AGENT_NAME --set-parameter=pcmk_reboot_timeout --parameter-value=300

      Replace FENCE_AGENT_NAME with the name of the fence agent.

      The pcmk_reboot_timeout value should be greater than the sum of:

      • The Corosync token timeout
      • The Corosync consensus timeout, which by default is 1.2 times the token timeout
      • The length of time it takes a reboot operation to complete, including any delay attribute.

      On Google Cloud, 300 seconds is sufficient for most clusters. A command that you can use to check the current Corosync timeout values is shown after this procedure.

    3. Confirm the new pcmk_reboot_timeout value:

      crm_resource --resource FENCE_AGENT_NAME --get-parameter=pcmk_reboot_timeout

      Replace FENCE_AGENT_NAME with the name of the fence agent.

  4. Take the cluster out of maintenance mode:

    RHEL

    pcs property set maintenance-mode=false

    SLES

    crm configure property maintenance-mode="false"
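
After you take the cluster out of maintenance mode, you can verify the new settings on each node. The following commands are a minimal sketch; STONITH-example-ha-vm1 is the example fence agent name used earlier in this section, so substitute the name of your own fence agent:

# Show the corosync.service unit file together with the drop-in that adds the start delay
systemctl cat corosync.service

# Show the current Corosync token and consensus timeouts, in milliseconds
corosync-cmapctl | grep -E 'totem\.token|totem\.consensus'

# Confirm the Pacemaker reboot timeout that is set on a fence agent
crm_resource --resource STONITH-example-ha-vm1 --get-parameter=pcmk_reboot_timeout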

Unintentional node affinity that favors a particular node

When you manually move resources in a high-availability cluster using the cluster commands, you find that an automatic affinity or client preference is set to favor a particular node.

Issue

In your Linux Pacemaker high-availability cluster for SAP HANA or SAP NetWeaver, resources such as the SAP HANA system or the SAP NetWeaver central services run on only one particular cluster node and do not fail over as expected during a node failure event.

Consequently, you might experience issues such as:

  • When you trigger the SAP NetWeaver ASCS service failover by issuing a Pacemaker command to move a resource to a cluster node, the resource does not start and shows the Stopped status.

  • When you issue the standby command to one cluster node to force all resources to move to the other node, the resources do not start.

Diagnosis

  • Check your Pacemaker logs for a message indicating that a particular resource cannot run anywhere. For example:

    2021-05-24 21:39:58 node_1 pacemaker-schedulerd (native_color) info: Resource NW1-ASCS01 cannot run anywhere
  • Check your Pacemaker location constraint configuration to identify any constraints that might be preventing the resources from running on a certain cluster node.

    To check the Pacemaker location constraint configuration, follow these steps:

    1. Display the location constraints:

      cibadmin --query --scope constraints | grep rsc_location
    2. Verify the location constraints:

      • Explicit location constraint: You find location constraints with score INFINITY (prefer the node) or -INFINITY (avoid the node). For example:

        <rsc_location id="loc-constraint" rsc="NW1-ASCS01" score="INFINITY" node="nw-ha-1"/>

        Apart from the fence agents, there must not be any location constraint with a score of INFINITY or -INFINITY. In all HA clusters, fence agents are defined in a location constraint with a score of -INFINITY to prevent them from running on the node that is the fencing target.

      • Implicit location constraint: When you issue the Pacemaker command to move a resource to a cluster node or to ban a resource from running on a cluster node, an implicit location constraint whose constraint ID has the prefix cli-ban or cli-prefer is added to the cluster configuration. For example:

        <rsc_location id="cli-prefer-NW1-ASCS01" rsc="NW1-ASCS01" role="Started" node="nw-ha-2" score="INFINITY"/>

Solution
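
The following steps sketch one common way to resolve this issue. The resource name NW1-ASCS01 and the constraint ID loc-constraint come from the preceding examples; replace them with your own values.

  • To remove an implicit location constraint that was created by a move or ban command (constraint ID prefix cli-ban or cli-prefer), clear it by resource name:

    RHEL

    pcs resource clear NW1-ASCS01

    SLES

    crm resource clear NW1-ASCS01

  • To remove an explicit location constraint that you no longer need, delete it by its constraint ID:

    RHEL

    pcs constraint remove loc-constraint

    SLES

    crm configure delete loc-constraint

After you remove the constraint, rerun the cibadmin query from the diagnosis steps to confirm that the constraint is gone, and then verify that the resources fail over as expected.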

Fence agent experienced an operational error

The fence agent has reported an error in the cluster status.

Issue

In your Linux Pacemaker high-availability cluster for SAP HANA or SAP NetWeaver, the fence agent has reported an error in the cluster status. For example:

Failed Resource Actions:
   STONITH-ha-node-01_monitor_300000 on ha-node-02 'unknown error' (1): call=153, status=Timed Out, exitreason='',  last-rc-change='Mon Dec 21 23:40:47 2023', queued=0ms, exec=60003ms

Diagnosis

The fence agent deployed in your SAP HANA or SAP NetWeaver high-availability cluster regularly accesses the Compute Engine API server to check the status of the fence target instance. If there is a temporary delay in the API response or a network interruption, then the fence agent monitoring operation might fail or time out.

To check the fence agent status, run the following command:

RHEL

pcs status

SLES

crm status

If the fence agent status is stopped, then use one of the solution options to resolve the error.

An operational error might cause the fence agent to show as stopped in the cluster status, but Pacemaker still calls the fence agent when a fencing event occurs.
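
To look for the underlying timeout on the node that reported the error, a log search similar to the following might help. This is only a sketch; the exact message text varies by fence agent and operating system version:

journalctl -u pacemaker | grep -iE 'fence_gce|timed out'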

Solution

If the fence agent status is stopped, then do one of the following:

  • To manually reset the failcount and restart the fence agent, run the following command:

    RHEL

    pcs resource cleanup FENCE_AGENT_NAME

    SLES

    crm resource cleanup FENCE_AGENT_NAME

    Replace FENCE_AGENT_NAME with the name of the fence agent.

  • To automatically remove the fence agent operational error, configure the failure-timeout parameter.

    The failure-timeout parameter resets the failcount after the specified duration and clears any operational errors. Applying this parameter doesn't require you to restart the cluster or put the cluster in maintenance mode.

    To configure the failure-timeout parameter, run the following command:

    crm_resource --meta --resource FENCE_AGENT_NAME --set-parameter failure-timeout --parameter-value DURATION

    Replace the following:

    • FENCE_AGENT_NAME: the name of the fence agent.
    • DURATION: the duration following the last operational failure after which the failcount is reset and the fence agent is restarted.
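
    For example, to reset the failcount 10 minutes after the last failure, you might run a command similar to the following. The agent name STONITH-ha-node-01 comes from the earlier example output, and the 600-second duration is only an example:

    crm_resource --meta --resource STONITH-ha-node-01 --set-parameter failure-timeout --parameter-value 600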

Resource agent is stopped

A resource agent has failed to start and remains in the Stopped status.

Issue

In your Linux Pacemaker high-availability cluster for SAP HANA or SAP NetWeaver, a resource agent has reported an error in the cluster status. For example:

Failed Resource Actions:
   rsc_SAPHana_DV0_HDB00_start_0 on ha-node-02 'error' (1): call=91, status='complete', last-rc-change='Wed Oct 18 18:00:31 2023', queued=0ms, exec=19010ms

Diagnosis

If a running resource agent fails, then Pacemaker attempts to stop the agent and start it again. If the start operation fails for any reason, then Pacemaker sets the resource failcount to INFINITY and attempts to start the agent on another node. If the resource agent fails to start on any node, then the resource agent remains in the Stopped status.

To check the resource agent status, run the following command:

RHEL

pcs status

SLES

crm status

For SAP HANA, the following example shows the resource agent in the Stopped status on the node hana-b:

Full List of Resources:
  * STONITH-hana-a        (stonith:external/gcpstonith):   Started hana-b
  * STONITH-hana-b        (stonith:external/gcpstonith):   Started hana-a
  * Resource Group: g-primary:
    * rsc_vip_int-primary       (ocf::heartbeat:IPaddr2):        Started hana-a
    * rsc_vip_hc-primary        (ocf::heartbeat:anything):       Started hana-a
  * Clone Set: cln_SAPHanaTopology_DV0_HDB00 [rsc_SAPHanaTopology_DV0_HDB00]:
    * Started: [ hana-a hana-b ]
  * Clone Set: msl_SAPHana_DV0_HDB00 [rsc_SAPHana_DV0_HDB00] (promotable):
    * Masters: [ hana-a ]
    * Stopped: [ hana-b ]
  * STONITH-scaleup-majority    (stonith:external/gcpstonith):   Started hana-b
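
To confirm that the failcount for the stopped resource has reached INFINITY, you can display fail counts in a one-time snapshot. The following commands are a sketch that reuses the resource and node names from the preceding example output:

crm_mon -1 --failcounts

To query the failcount of a single resource on a specific node, you can also run:

crm_failcount --query --resource rsc_SAPHana_DV0_HDB00 --node hana-b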

Solution

If a resource agent is in the Stopped status, then do the following:

  1. Manually start the resource agent by resetting the failcount:

    RHEL

    pcs resource cleanup RESOURCE_AGENT_NAME

    SLES

    crm resource cleanup RESOURCE_AGENT_NAME

    Replace RESOURCE_AGENT_NAME with the name of the resource agent, for example rsc_SAPHana_DV0_HDB00.

  2. Ensure that the resource agent reaches the Started status:

    crm_mon

    If the resource agent still fails to start, then gather the relevant diagnostic information and contact support.
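
    crm_mon runs as an interactive console by default. For a one-time snapshot that also lists inactive resources, a command like the following might be convenient:

    crm_mon --one-shot --inactive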