Report faulty host

This document outlines how to report a faulty host machine that is running your artificial intelligence (AI), machine learning (ML) and high performance computing (HPC) workloads. A host, also known as a node, is the physical machine on which your virtual machine (VM) instances are running.

This document is for Slurm and other VM-based clusters. For Google Kubernetes Engine clusters, see Report a faulty host in the Manage AI-optimized GKE clusters document.

Before you begin

Select the tab for how you plan to use the samples on this page:
gcloud
REST

To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.
For more information, see Authenticate for using REST in the Google Cloud authentication documentation.

Overview

To report a faulty host, the VM instance must meet the following requirements:

It must be in a RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine won't be marked as faulty.
The VM must be running on an A4 or A3 Ultra GPU machine type.
The VM must use the reservation-bound provisioning model.

If you want to report a faulty host for A4 or A3 Ultra VMs that were created using other provisioning models, contact your Google Cloud account team.
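
The requirements above can be checked before you send a report. The following is a minimal sketch, not an official validation routine. It assumes you have already fetched the instance resource as a dictionary (for example, from instances.get). The machine-type prefixes `a4-` and `a3-ultragpu-` and the `RESERVATION_BOUND` value in `scheduling.provisioningModel` are assumptions about how these requirements appear in the resource; verify them against your own VMs. The function name is hypothetical.

```python
# Assumed machine-type name prefixes for the A4 and A3 Ultra series.
ELIGIBLE_MACHINE_PREFIXES = ("a4-", "a3-ultragpu-")


def check_report_eligibility(instance: dict) -> list:
    """Return the unmet requirements; an empty list means the VM looks eligible."""
    problems = []

    if instance.get("status") != "RUNNING":
        problems.append("VM must be in the RUNNING state")

    # machineType is usually a URL; keep only the last path segment.
    machine_type = instance.get("machineType", "").rsplit("/", 1)[-1]
    if not machine_type.startswith(ELIGIBLE_MACHINE_PREFIXES):
        problems.append("VM must use an A4 or A3 Ultra machine type")

    # Assumed field and value for the reservation-bound provisioning model.
    model = instance.get("scheduling", {}).get("provisioningModel")
    if model != "RESERVATION_BOUND":
        problems.append("VM must use the reservation-bound provisioning model")

    return problems
```

If the returned list isn't empty, the report request is likely to be rejected, so it's worth resolving those items or contacting your account team first.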

Limitations

Google makes best-effort attempts to fulfill all requests to report faulty hosts. However, due to capacity constraints or rate limits, your request might not always be fulfilled.

How it works

When you report a host as faulty, the following takes place:

The report faulty host operation starts on the VM instance. At this stage, the VM stays in a RUNNING state. This operation takes 10 to 12 minutes. For more details, see Review operations.
The VM instance shuts down.
Depending on the setting for the automatically restart (automaticRestart) host maintenance option, one of the following takes place:
- If the VM isn't configured to automatically restart, then the VM stays shut down.
- If the VM is configured to automatically restart, then the VM is restarted as follows:
  - If healthy hosts are available in your reserved block of capacity, the VM instance goes into the RUNNING state on a new host from your reserved block of capacity. In parallel, Compute Engine also attempts to update your reserved block of capacity by replacing your faulty host with a new host.
  - If your reserved block of capacity is depleted, the VM instance stays in the REPAIRING state until you obtain more capacity. For a VM instance in the REPAIRING state, you can either leave it in that state or shut down the VM. If the VM is powered on again, it might return a stockout error because no host machine is available.
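
The branching described above can be summarized as a small decision function. This is an illustrative sketch only: the function name and its two boolean inputs are hypothetical, and the returned strings are the VM states named in this document, not values computed by Compute Engine.

```python
def expected_vm_state(automatic_restart: bool, healthy_host_available: bool) -> str:
    """Return the VM state this document describes after the host is reported."""
    if not automatic_restart:
        # Without automatic restart, the VM stays shut down.
        return "TERMINATED"
    if healthy_host_available:
        # The VM restarts on a new host from the reserved block of capacity.
        return "RUNNING"
    # The reserved block is depleted, so the VM waits for capacity.
    return "REPAIRING"
```

For example, `expected_vm_state(True, False)` returns `"REPAIRING"`, which corresponds to a depleted reserved block of capacity.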

Report a faulty host

To report a faulty host, complete the following steps:

Ensure that you have thoroughly investigated your environment to identify the root cause of your issues. Keep in mind that this process terminates and migrates VMs, which disrupts your workloads.
Take note of the physical host on which the VM is running. To do this, see Review the physical host information.
Back up Local SSD data. When the VM instance shuts down and is moved to a new host machine, the Local SSD data is deleted. To back up your Local SSD data, see Local SSD data backup.
Report the faulty host. To report a faulty host, select one of the following options:

Note: You can't cancel a reportHostAsFaulty request. If an operation is stuck, restart the VM to clean it up.
gcloud
To report a faulty host, use the gcloud compute instances report-host-as-faulty command with the following flags.
```
gcloud compute instances report-host-as-faulty VM_NAME \
  --zone=ZONE \
  --fault-reasons=behavior=FAULT_REASON,description=DESCRIPTION \
  --disruption-schedule=IMMEDIATE \
  --async
```
Replace the following:
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM instance is located.
- FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason.
  
  If specifying multiple issues, use a comma-separated list: --fault-reasons=behavior=FAULT_REASON_1,FAULT_REASON_2
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster and you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
  - SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
- Optional: DESCRIPTION: additional details on the failure, such as XID information or suspected performance problems.
- DISRUPTION_SCHEDULE: specifies when to replace the host. Only the value IMMEDIATE is supported.
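
If you script this step, it can help to validate the fault reason before invoking the CLI. The following sketch assembles the same command shown above as an argument list suitable for `subprocess.run`. It uses only the flags from this page; the function name is hypothetical, and the handling of a single fault reason is a simplification.

```python
import shlex

# Fault reason values listed in this document.
VALID_FAULT_REASONS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_command(vm_name, zone, fault_reason, description=None):
    """Build the gcloud argument list for reporting a faulty host."""
    if fault_reason not in VALID_FAULT_REASONS:
        raise ValueError(f"unsupported fault reason: {fault_reason}")

    reason = f"behavior={fault_reason}"
    if description:
        reason += f",description={description}"

    return [
        "gcloud", "compute", "instances", "report-host-as-faulty", vm_name,
        f"--zone={zone}",
        f"--fault-reasons={reason}",
        "--disruption-schedule=IMMEDIATE",
        "--async",
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(build_report_command("my-vm", "us-central1-a", "PERFORMANCE")))
```

Printing the command before running it gives you a chance to confirm the VM name and zone, which matters because the request can't be canceled.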
REST
To report a faulty host, make a POST request to the instances.reportHostAsFaulty method.
```
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
{
 "faultReasons":[
     {
       "behavior":"FAULT_REASON",
       "description":"DESCRIPTION"
     }
 ],
 "disruptionSchedule":"IMMEDIATE"
}
```
Replace the following:
- PROJECT_ID: your project ID.
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
- FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason:
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster and you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
  - SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
- Optional: DESCRIPTION: additional details on the failure, such as XID information or suspected performance problems.
- DISRUPTION_SCHEDULE: specifies when to replace the host. Only the value IMMEDIATE is supported.
The output resembles the following:
```
Http Status 200 Created
Header: Location:"/instances/VM_NAME"
```
When making a request, you can report multiple issues at a time as follows:
```
POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
{
 "faultReasons":[
     {
       "behavior":"FAULT_REASON",
       "description":"DESCRIPTION"
     },
     {
       "behavior":"FAULT_REASON",
       "description":"DESCRIPTION"
     }
 ],
 "disruptionSchedule":"IMMEDIATE"
}
```
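
The request URL and body shown above can also be assembled programmatically. The following is a minimal sketch that builds them from a list of (behavior, description) pairs. It doesn't send the request or handle authentication; the function name is hypothetical, and the list of valid behaviors is taken from this page.

```python
import json

# Fault reason values listed in this document.
VALID_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_request(project_id, zone, vm_name, reasons):
    """Return (url, json_body) for a reportHostAsFaulty request.

    reasons is a list of (behavior, description) tuples.
    """
    for behavior, _ in reasons:
        if behavior not in VALID_BEHAVIORS:
            raise ValueError(f"unsupported behavior: {behavior}")

    url = (
        "https://compute.googleapis.com/compute/v1/"
        f"projects/{project_id}/zones/{zone}/instances/{vm_name}/reportHostAsFaulty"
    )
    body = {
        "faultReasons": [
            {"behavior": behavior, "description": description}
            for behavior, description in reasons
        ],
        # Only IMMEDIATE is supported, according to this document.
        "disruptionSchedule": "IMMEDIATE",
    }
    return url, json.dumps(body)
```

You could then pass the returned URL and body to the HTTP client of your choice, along with an access token obtained as described in Before you begin.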

Review operations

After you report a faulty host, the following sequence of operations takes place during the VM shutdown and restart.

The VM shutdown

A reportHostAsFaulty operation is created.
The reportHostAsFaulty operation creates a sequence of sub-operations that work to mark the underlying host machine as faulty. When all these sub-operations are complete, the reportHostAsFaulty operation goes into RUNNING mode.
Once the reportHostAsFaulty operation goes into RUNNING mode, an upcomingMaintenance operation is then created to log the upcoming maintenance event.
Next, an "instance terminated during maintenance" operation is created as the VM is terminated.
After this step, the reportHostAsFaulty operation completes.

It takes about 10 to 12 minutes for all these operations to take place. Throughout this time, the VM is in the RUNNING state.

Caution: Don't re-run the report faulty host command on the same VM while a reportHostAsFaulty operation is in progress. Doing so causes the operation to fail.

The VM restart

Based on the VM's maintenance configuration for automaticRestart, one of the following occurs:
1. If automaticRestart is set to true, the Automatically restart an instance operation is created as Compute Engine attempts to restart the VM on another host machine.
2. If automaticRestart is set to false, the VM stays in the TERMINATED state. You can manually restart the VM. Compute Engine provisions the VM on a healthy machine within the same block.
To confirm that the VM has moved to a different host, review the physicalHost value for the VM instance. To do this, see Review the physical host information.

To review operations, you can use one of the following options.

Console (VM operations)

In the Google Cloud console, go to the Operations page.

Go to Operations
Locate the VM that you reported.
If the VM is powered down and the host is reported, then the Status column shows Done for the VM. This page doesn't track whether the VM is restarted on a new host.
To see if the VM restarted, go to the VM instances page.

Go to VM instances

Console (Cloud Logs)

In the Google Cloud console, go to the Logs Explorer page.

Go to Logs Explorer

If you use the search bar to find this page, select the result whose subheading is Logging. Your most recent logs are displayed in the Query results pane.
In the toolbar, ensure that Show query is enabled.

Copy and paste the following query into the query box:

resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.(reportHostAsFaulty|terminateOnHostMaintenance|upcomingMaintenance|automaticRestart)"

Click Run query. The results of the query are displayed in the Query results pane.
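
If you run this query often, you can generate it from a list of method names. The following sketch reproduces the query above and optionally narrows it to one VM. The function name is hypothetical, and the extra `resource.labels.instance_id` clause is an assumption about how you might want to scope the results; it isn't part of the original query.

```python
# Method name suffixes taken from the query on this page.
OPERATION_METHODS = [
    "reportHostAsFaulty",
    "terminateOnHostMaintenance",
    "upcomingMaintenance",
    "automaticRestart",
]


def build_log_filter(instance_id=None):
    """Build a Logs Explorer filter for the operations that follow a report."""
    methods = "|".join(OPERATION_METHODS)
    clauses = [
        'resource.type="gce_instance"',
        f'protoPayload.methodName=~"compute\\.instances\\.({methods})"',
    ]
    if instance_id is not None:
        # Optional narrowing to a single VM by its numeric instance ID.
        clauses.append(f'resource.labels.instance_id="{instance_id}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_log_filter())
```

Calling `build_log_filter()` with no arguments yields the same query string shown above, ready to paste into the query box.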

gcloud

To track the status and details of the report faulty host operation, use the gcloud compute operations describe command:

gcloud compute operations describe OPERATION_NAME \
    --zone=ZONE

Replace the following:

OPERATION_NAME: the name of the operation returned when you made the reportHostAsFaulty request.
ZONE: the zone where you ran the request.
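
Because the operation takes roughly 10 to 12 minutes, you might prefer to poll its status rather than check it manually. The following is a sketch of a polling loop, not an official client. It assumes you supply `fetch_status`, a hypothetical callable that returns the operation's status string (for example, by parsing the output of the describe command above), and that the status eventually becomes `DONE`. The 15-minute timeout and 30-second interval are arbitrary assumptions.

```python
import time


def wait_for_operation(fetch_status, timeout_s=900, poll_interval_s=30, sleep=time.sleep):
    """Poll until the operation reports DONE or the timeout elapses.

    Returns True if DONE was observed, False if the timeout was reached.
    """
    waited = 0
    while True:
        if fetch_status() == "DONE":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s
```

A `DONE` status only means that the operation finished. Inspect the operation's details for errors before assuming that the host was marked as faulty. Avoid re-sending the report while this loop is still waiting, because a second request on the same VM causes the operation to fail.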

Review the physical host information

You can check the physical host for a VM by running the following command.

gcloud

After installing the Google Cloud CLI, initialize it by running the following command:
```
gcloud init
```
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
Use the gcloud beta compute instances describe command to view the physical host that a VM is running on.
```
gcloud beta compute instances describe VM_NAME \
  --zone=ZONE \
  --format="yaml(resourceStatus.physicalHost)"
```
Replace the following:
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.

REST

Use the instances.get method to view the physical host that a VM is running on.

GET https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME

In the output, review the value for the "physicalHost" field.

Replace the following:

PROJECT_ID: your project ID.
VM_NAME: the name of the VM instance.
ZONE: the zone where the VM is located.
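
To confirm that a VM moved after you reported its host, you can compare the physical host recorded before the report with the one returned afterward. The following sketch assumes that you saved both responses as JSON text and that the value sits at `resourceStatus.physicalHost`, as the gcloud format string on this page suggests. The function names are hypothetical.

```python
import json


def physical_host(instance_json):
    """Extract resourceStatus.physicalHost from an instance resource, or None."""
    instance = json.loads(instance_json)
    return instance.get("resourceStatus", {}).get("physicalHost")


def moved_to_new_host(before_json, after_json):
    """Return True only if both responses name a host and the hosts differ."""
    before = physical_host(before_json)
    after = physical_host(after_json)
    return before is not None and after is not None and before != after
```

Returning False when either value is missing is a deliberate choice: a VM that is shut down or waiting for capacity might not report a physical host, and that shouldn't be read as a successful move.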

What's next?

If you encounter issues when working with this API, see Troubleshoot faulty host API.
