Manage host errors for SAP workloads on Google Cloud
This document describes how you can detect host hardware errors on Google Cloud
and protect your SAP workloads from them.
Google Cloud is not impervious to hardware failures. While multiple layers
of redundancy exist, hardware errors can occur, resulting in the termination of
your Compute Engine instances.
Host hardware errors can have multiple causes because server hardware and
its associated components have many parts that can fail.
Memory-optimized machine types in particular have a large number of memory
modules, which increases the likelihood that a hardware failure is memory
related. Memory-related failures are of two types:
Correctable memory errors: These errors can be corrected by
built-in hardware and software mechanisms, such as
Error correction code (ECC) memory.
Because these mechanisms handle the errors transparently, such errors have
no impact on the Compute Engine instance that is running on the host.
Uncorrectable memory errors: These errors cannot be
corrected. They are rare, random, and unpredictable. Any attempt to access
the affected memory area results in a signal to the OS, which results in
either the termination of the Compute Engine instance or a
Machine check exception (MCE)
that is passed on to the instance. When an application on the
Compute Engine instance attempts to read data from the affected
memory area, the application receives this signal and
terminates. When the OS in a Compute Engine instance receives this
signal, by default the OS prevents the affected memory pages from being
re-allocated, which avoids further use of those pages. When an
uncorrectable memory error occurs, termination of your application is
unavoidable.
Compute Engine VMs have additional safeguards, such
as live migration,
that can be combined with application architecture strategies to limit the
impact of some of these events.
Detect host errors
To detect host errors on M2, M3, or M4 machine types,
configure log-based alerting policies
that use the predefined
Compute Engine queries
from the Cloud Logging query library.
A host memory alert indicates a type of hardware error that is
associated with memory modules. Such errors can result from permanent
component failures over time, or from transient events, caused by
high-energy particles or cosmic rays, that prevent a memory page from being
safely retrieved.
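As a rough illustration of how host memory alerts might be matched in Cloud Logging, the following Python sketch assembles a Logging query string for the `compute.instances.host_event_notify` event named in this document. The helper name `build_host_memory_alert_filter` is hypothetical, and the field that carries the event name (`protoPayload.methodName`) is an assumption, not something this document confirms. Compare the result against the predefined Compute Engine query before you use it in an alerting policy.

```python
from typing import Optional, Sequence

# Event name taken from this document.
HOST_MEMORY_ALERT_METHOD = "compute.instances.host_event_notify"


def build_host_memory_alert_filter(
    instance_ids: Optional[Sequence[str]] = None,
) -> str:
    """Return a Cloud Logging query string for host memory alert events.

    Assumption: the event name is recorded in `protoPayload.methodName` of a
    `gce_instance` log entry. Verify this against the predefined query.
    """
    clauses = [
        'resource.type="gce_instance"',
        f'protoPayload.methodName="{HOST_MEMORY_ALERT_METHOD}"',
    ]
    if instance_ids:
        # Optionally restrict the query to specific instance IDs.
        ids = " OR ".join(f'"{instance_id}"' for instance_id in instance_ids)
        clauses.append(f"resource.labels.instance_id=({ids})")
    # In the Logging query language, clauses on separate lines are ANDed.
    return "\n".join(clauses)


if __name__ == "__main__":
    print(build_host_memory_alert_filter(["1234567890"]))
```

The function only builds a string. Creating the log-based alerting policy itself is left to the console or tooling that you already use.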
Protect your SAP workloads from host errors
To protect your SAP workloads from host errors, we recommend the following:
Make sure that
automatic restart
is set for your Compute Engine instances.
Compute Engine enables this option for all instances by default. We
recommend that you don't turn this off.
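To protect your SAP HANA and SAP NetWeaver workloads from single-instance
failures, deploy them with a high availability (HA) configuration.
For more information, see the following guides:
SAP HANA high availability planning guide
High availability planning guide for SAP NetWeaver on Google Cloud
To protect your SAP HANA workloads from being affected by the termination of
any SAP HANA process, implement the
SAP HANA HA/DR provider hooks
and enable the
SAP HANA Fast Restart option.
For information about how to do both, see the deployment guide for your SAP
HANA scenario in
All SAP HANA guides.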
On X4 memory-optimized bare metal machine types, your instance is
automatically restarted on a healthy host as soon as an uncorrectable memory
error is detected, which protects your SAP HANA workloads. This provides
the quickest path to restoring full operational capability for your instance.
To protect your SAP HANA workloads from memory errors as surfaced by
Compute Engine Host Memory Alert
(compute.instances.host_event_notify) events for M2, M3, or M4
machine types, respond according to whether the VM can handle the error:
If the uncorrectable error cannot be handled by the VM, then the VM is
automatically restarted due to the
automatic restart
policy. In an HA cluster, the secondary node automatically takes over. No
further action is required.
If the uncorrectable memory error can be handled by the VM and does not
result in a VM crash, then do the following:
If the affected instance is the current primary node in your HA cluster,
then initiate a manual failover to the secondary node in your cluster.
Stop the affected instance to release the virtual memory pages that were
affected by the host error event.
While Compute Engine automatically migrates the affected VMs to
a healthy host during these events, some memory pages can remain
inaccessible. If your SAP HANA workload attempts to read the affected
memory pages for the first time after the memory error occurs, then
your workload fails and terminates. By stopping the instance, you
release the affected virtual memory pages that might remain from the
initial hardware error.
Start the affected instance.
If you're unable to stop and start the affected VM right away, then the
applications running on it might continue to operate until they attempt to
read the affected memory pages, which can take a number of hours. Restart
the affected VM at your earliest convenience to release any affected memory
pages.
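The response to a host memory alert on M2, M3, or M4 machine types can be sketched as a small planning function. The following Python example is a minimal sketch under stated assumptions, not an official runbook: the function names, instance name, and zone are hypothetical, and the manual failover step is left as a placeholder because it depends on your cluster software. The code only builds `gcloud compute instances` command lines (`describe`, `stop`, `start`) and does not run them.

```python
import shlex
from typing import List


def gcloud_instance_command(action: str, instance: str, zone: str) -> List[str]:
    """Build a `gcloud compute instances <action>` argument list (not executed)."""
    return ["gcloud", "compute", "instances", action, instance, f"--zone={zone}"]


def automatic_restart_check_command(instance: str, zone: str) -> List[str]:
    """Build a command that prints the instance's automaticRestart setting."""
    return gcloud_instance_command("describe", instance, zone) + [
        "--format=value(scheduling.automaticRestart)"
    ]


def plan_host_memory_alert_response(
    vm_crashed: bool, is_primary: bool, instance: str, zone: str
) -> List[str]:
    """Return the ordered steps to take after a host memory alert.

    vm_crashed: True if the VM could not handle the error and was restarted
                by the automatic restart policy.
    is_primary: True if the instance is the current primary HA node.
    """
    if vm_crashed:
        # Automatic restart and HA takeover handle this case; nothing to do.
        return []
    steps: List[str] = []
    if is_primary:
        # Placeholder: the failover procedure depends on your HA cluster setup.
        steps.append("MANUAL: fail over to the secondary node in the HA cluster")
    # Stop, then start, to release the affected virtual memory pages.
    steps.append(shlex.join(gcloud_instance_command("stop", instance, zone)))
    steps.append(shlex.join(gcloud_instance_command("start", instance, zone)))
    return steps


if __name__ == "__main__":
    # Hypothetical instance name and zone, for illustration only.
    for step in plan_host_memory_alert_response(
        vm_crashed=False, is_primary=True, instance="hana-node-1", zone="us-central1-a"
    ):
        print(step)
```

Returning the steps as data instead of running them keeps the sketch safe to execute anywhere and leaves the decision to stop a production SAP HANA node with an operator.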
Last updated 2025-09-04 UTC.