Google Cloud is not impervious to hardware failures. While multiple layers of redundancy exist, hardware errors can occur, resulting in the termination of your Compute Engine instances.
Host hardware errors can have multiple causes because server hardware and its associated components have many parts that can fail. Memory-optimized machine types in particular have a large number of memory modules, which increases the likelihood that a hardware failure is memory related. Memory-related failures are of two types:
Correctable memory errors: These errors can be corrected by built-in hardware and software mechanisms, such as error correction code (ECC) memory. They are handled transparently and have no impact on the Compute Engine instance that is running on the host.
Uncorrectable memory errors: These errors cannot be corrected. They are rare, random, and unpredictable. Any attempt to access the affected memory area results in a signal to the OS, which results either in the termination of the Compute Engine instance or in a machine check exception (MCE) that is passed on to the instance. When an application on the Compute Engine instance attempts to read data from the affected memory area, the application consumes this signal and terminates. When the OS in a Compute Engine instance receives this signal, by default the OS prevents the affected memory pages from being re-allocated to avoid further use.
Detect host errors
To detect host errors, configure log-based alerting policies that use the following predefined Compute Engine queries:
Query/filter name | Description |
---|---|
Compute Engine Host Error (compute.instances.hostError) | A host error indicates that a hardware error occurred that resulted in the Compute Engine instance needing to be terminated. |
Compute Engine Host Memory Alert (compute.instances.hostEventNotify) | A host memory alert indicates a type of hardware error that is associated with memory modules. Such errors can result from permanent component failures over time, or transient events caused by high energy particles or cosmic rays that prevent a memory page from being safely retrieved. |
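As a rough illustration, the two queries in the table can be written as a single Cloud Logging filter string. The sketch below only builds the filter text. It assumes that these events are logged for `gce_instance` resources with the event name in `protoPayload.methodName`, and the helper name `build_host_event_filter` is hypothetical. This page writes the memory alert name both as `hostEventNotify` and as `host_event_notify`, so confirm in Logs Explorer which form appears in your logs before you use the filter in an alerting policy.

```python
# Event names as listed in the table above.
HOST_ERROR = "compute.instances.hostError"
HOST_MEMORY_ALERT = "compute.instances.hostEventNotify"


def build_host_event_filter(method_names):
    """Return a Cloud Logging filter that matches the given event names.

    Assumption: the events are logged for `gce_instance` resources and carry
    the event name in `protoPayload.methodName`. Verify against your own logs.
    """
    if not method_names:
        raise ValueError("at least one method name is required")
    # Quote each name and combine them into one grouped OR expression.
    methods = " OR ".join(f'"{name}"' for name in method_names)
    clauses = [
        'resource.type="gce_instance"',
        f"protoPayload.methodName=({methods})",
    ]
    return " AND ".join(clauses)


print(build_host_event_filter([HOST_ERROR, HOST_MEMORY_ALERT]))
```

You can paste the resulting string into Logs Explorer to check that it matches the expected entries, and then reuse it as the condition of a log-based alerting policy.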
Protect your SAP workloads from host errors
To protect your SAP workloads from host errors, we recommend the following:
Make sure that automatic restart is set for your Compute Engine instances.
Compute Engine enables this option for all instances by default. We recommend that you don't turn this off.
To protect your SAP HANA and SAP NetWeaver workloads from single-instance failures, deploy them with a high availability (HA) configuration.
For more information, see the following guides:
To protect your SAP HANA workloads from being affected by the termination of any SAP HANA process, implement the SAP HANA HA/DR provider hooks and enable the SAP HANA Fast Restart option.
For information about how to do this, see the deployment guide for your SAP HANA scenario in All SAP HANA guides.
To protect your SAP HANA workloads from memory errors as surfaced by Compute Engine Host Memory Alert (compute.instances.host_event_notify) events for M2, M3, or M4 machine types, do the following:
If the uncorrectable error cannot be handled by the VM, then the VM is automatically restarted due to the automatic restart policy. In an HA cluster, the secondary node automatically takes over. No further action is required.
If the uncorrectable memory error can be handled by the VM and does not result in a VM crash, then do the following:
If the affected instance is the current primary node in your HA cluster, then initiate a manual failover to the secondary node in your cluster.
Stop the affected instance to release the virtual memory pages that were affected by the host error event.
While Compute Engine automatically migrates the affected VMs to a healthy host during these events, some memory pages can remain inaccessible. If your SAP HANA workload attempts to read the affected memory pages for the first time after the memory error occurs, then your workload fails and terminates. By stopping the instance, you release the affected virtual memory pages that might remain from the initial hardware error.
Start the affected instance.
If you're unable to stop and start the affected VM immediately, then the applications running on it might continue to operate until they attempt to read the affected memory pages, which can take a number of hours. Restart the affected VM at your earliest convenience to release any affected memory pages.
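The response to a host memory alert described above can be summarized as a small decision procedure. The following is a minimal sketch for runbook purposes only: the function name `plan_memory_alert_response`, its parameters, and the action strings are hypothetical, and the function only returns the ordered steps as text. It does not call any Google Cloud or SAP API.

```python
def plan_memory_alert_response(vm_crashed, is_primary_node, can_stop_now=True):
    """Return the ordered operator actions after a host memory alert.

    vm_crashed: True if the VM could not handle the uncorrectable error.
    is_primary_node: True if the VM is the current primary in the HA cluster.
    can_stop_now: False if a stop and start must be deferred.
    """
    if vm_crashed:
        # Automatic restart brings the VM back, and in an HA cluster the
        # secondary node takes over, so no manual action is required.
        return []

    actions = []
    if is_primary_node:
        # Move the workload off the affected VM before stopping it.
        actions.append("fail over manually to the secondary node")
    if can_stop_now:
        # Stopping releases the affected virtual memory pages.
        actions.append("stop the affected instance")
        actions.append("start the affected instance")
    else:
        # The workload may keep running until it reads an affected page.
        actions.append("restart the affected instance at the earliest convenience")
    return actions


for step in plan_memory_alert_response(vm_crashed=False, is_primary_node=True):
    print(step)
```

Keeping the failover step ahead of the stop step reflects the order given above: the primary role moves to the secondary node first, so stopping the affected instance does not interrupt the SAP HANA workload.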