This document outlines how to report a faulty A3 accelerator-optimized host that is running your artificial intelligence (AI), machine learning (ML) and high performance computing (HPC) workloads. A host, also known as a node, is the physical machine on which your virtual machine (VM) instances are running.
Before you begin
- To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.
Install the Google Cloud CLI, then initialize it by running the following command:
gcloud init
For more information, see Authenticate for using REST in the Google Cloud authentication documentation.
- Request access to the faulty host API. To request access, complete the API enablement request form.
Overview
To report a faulty host, the VM instance must meet the following requirements:
- It must be in a RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine won't be marked as faulty.
- The VM must be a part of a reserved block of capacity that was created for you by a technical account manager (TAM).
- The VM must be running on an A3 High, A3 Mega, or A3 Ultra machine type.
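The three requirements above can be expressed as a simple pre-check before calling the API. The following is a minimal sketch, not part of the API: the function name `can_report_faulty_host` is hypothetical, and the machine type name prefixes are an assumption about how the A3 High, A3 Mega, and A3 Ultra series are named.

```python
# Assumption: machine type name prefixes for the A3 High, A3 Mega,
# and A3 Ultra series. Adjust if your machine types are named differently.
SUPPORTED_PREFIXES = ("a3-highgpu-", "a3-megagpu-", "a3-ultragpu-")


def can_report_faulty_host(status: str, machine_type: str, in_reserved_block: bool) -> bool:
    """Return True when a VM meets all three requirements listed above."""
    # machine_type may be a bare name or a URL ending in ".../machineTypes/NAME".
    name = machine_type.rsplit("/", 1)[-1]
    return (
        status == "RUNNING"
        and in_reserved_block
        and name.startswith(SUPPORTED_PREFIXES)
    )
```

A check like this only mirrors the documented requirements; the API itself remains the authority on whether a report is accepted.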
Limitations
You might be rate-limited on calls to this API based on an evaluation of the health of your blocks.
How it works
When you report a host as faulty, the following takes place:
- The report faulty host operation starts on the VM instance. At this stage the VM stays in a RUNNING state. This operation takes 10-12 minutes. For more details, see Review operations.
- The VM instance shuts down.
- Depending on the setting for the automatically restart (automaticRestart) host maintenance option, one of the following takes place:
  - If the VM isn't configured to automatically restart, then the VM stays shut down.
  - If the VM is configured to automatically restart, then the VM is restarted as follows:
    - If healthy hosts are available in your reserved block of capacity, the VM instance goes into the RUNNING state on a new host from your reserved block of capacity. In parallel, Compute Engine also attempts to update your reserved block of capacity by replacing your faulty host with a new host.
    - If your reserved block of capacity is depleted, the VM instance stays in the REPAIRING state until you obtain more capacity. For a VM instance in the REPAIRING state, you can either leave it in that state or shut down the VM. If the VM is powered on again, it might return a stockout error because of a lack of a host machine.
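The outcome described above depends on two inputs: the automaticRestart setting and whether the reserved block still has a healthy host. The following sketch summarizes that decision as a small function. The function name `expected_state_after_report` is hypothetical, and the TERMINATED value for a VM that stays shut down is taken from the Review operations section of this page.

```python
def expected_state_after_report(automatic_restart: bool, healthy_host_available: bool) -> str:
    """Return the VM state expected after the report operation completes."""
    if not automatic_restart:
        # The VM shuts down and is not restarted automatically.
        return "TERMINATED"
    if healthy_host_available:
        # The VM restarts on a new host from the reserved block.
        return "RUNNING"
    # The reserved block is depleted; the VM waits for more capacity.
    return "REPAIRING"
```

This is a summary of the documented behavior only, not a way to query the actual state of a VM.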
Report a faulty host
To report a faulty host, complete the following steps:
- Ensure that you have thoroughly investigated your environment to identify the root cause of your issues. Keep in mind that this process terminates and migrates VMs, which disrupts your workloads.
- Ensure that you have requested access to the faulty host API. To request access, complete the API enablement request form.
- Take note of the physical host on which the VM is running. To do this, see Review the physical host information.
- Back up Local SSD data. When the VM instance shuts down and is moved to a new host machine, the Local SSD data is deleted. To back up your Local SSD data, see Local SSD data backup.
- Report the faulty host. To report a faulty host, you must use the instances.reportHostAsFaulty REST API. You can specify one or more values for the issues with your host.

Single issue

POST https://compute.googleapis.com/compute/alpha/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty

{
  "faultReasons": [
    {
      "behavior": "FAULT_REASON",
      "description": "DESCRIPTION"
    }
  ]
}
Replace the following:
- PROJECT_ID: your project ID.
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
- FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason:
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster, you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
  - SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
- Optional: DESCRIPTION: additional details on the failure, such as XID information or suspected performance problems.
The output resembles the following:
Http Status 200 Created Header: Location:"/instances/VM_NAME"
Multiple issues
POST https://compute.googleapis.com/compute/alpha/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty

{
  "faultReasons": [
    {
      "behavior": "FAULT_REASON",
      "description": "DESCRIPTION"
    },
    {
      "behavior": "FAULT_REASON",
      "description": "DESCRIPTION"
    }
  ]
}
Replace the following:
- PROJECT_ID: your project ID.
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
- FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason:
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster, you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
  - SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
- Optional: DESCRIPTION: additional details on the failure, such as XID information or suspected performance problems.
The output resembles the following:
Http Status 200 Created Header: Location:"/instances/VM_NAME"
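The single-issue and multiple-issue requests differ only in the number of entries in faultReasons, so both can be built by one helper. The following is a minimal sketch that only constructs the URL and JSON body and does not send anything. The function name `build_report_request` and the example values are hypothetical, and the URL path assumes the `instances` collection name, as used by the instances.get URL on this page.

```python
import json

# The four fault reason values listed on this page.
FAULT_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_request(project_id, zone, vm_name, fault_reasons):
    """Build the URL and JSON body for a reportHostAsFaulty request.

    fault_reasons is a list of (behavior, description) tuples; pass an
    empty description to omit the optional field.
    """
    if not fault_reasons:
        raise ValueError("at least one fault reason is required")
    reasons = []
    for behavior, description in fault_reasons:
        if behavior not in FAULT_BEHAVIORS:
            raise ValueError(f"unknown fault reason: {behavior}")
        reason = {"behavior": behavior}
        if description:
            reason["description"] = description
        reasons.append(reason)
    # Assumption: alpha API version and "instances" collection in the path.
    url = (
        "https://compute.googleapis.com/compute/alpha"
        f"/projects/{project_id}/zones/{zone}/instances/{vm_name}/reportHostAsFaulty"
    )
    return url, json.dumps({"faultReasons": reasons})
```

For example, `build_report_request("my-project", "us-central1-a", "vm-1", [("PERFORMANCE", "slow all-reduce")])` returns the URL and body for a single-issue report. Sending the request still requires an authenticated HTTP client and access to the faulty host API.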
Review operations
After you report a faulty host, the following sequence of operations takes place during the VM shutdown and restart.
The VM shutdown
- A reportHostAsFaulty operation is created.
- The reportHostAsFaulty operation creates a sequence of sub-operations that work to mark the underlying host machine as faulty. When all these sub-operations are complete, the reportHostAsFaulty operation goes into RUNNING mode.
- Once the reportHostAsFaulty operation goes into RUNNING mode, an upcomingMaintenance operation is then created to log the upcoming maintenance event.
- Then an instance terminated during maintenance operation is created as the VM is terminated. After this step, the reportHostAsFaulty operation completes.

It takes about 10-12 minutes for all these operations to take place. Throughout this time the VM is in the RUNNING state.
The VM restart
- Based on the VM's maintenance configuration for automaticRestart, one of the following occurs:
  - If automaticRestart is set to true, the Automatically restart an instance operation is created as Compute Engine attempts to restart the VM on another host machine.
  - If automaticRestart is set to false, the VM stays in the TERMINATED state. You can manually restart the VM. Compute Engine provisions the VM on a healthy holdback machine.
- To confirm that the VM has moved to a different host, review the physicalHost value for the VM instance. To do this, see Review the physical host information.
To review operations, you can use one of the following options from the Google Cloud console.
VM operations
In the Google Cloud console, go to the Operations page.
Locate the VM that you reported.
If the VM is powered down and the host is reported, then the Status column shows Done for the VM. This page doesn't track whether the VM is restarted on a new host.
To see if the VM restarted, go to the VM instances page.
Cloud Logs
In the Google Cloud console, go to the Logs Explorer page.
If you use the search bar to find this page, select the result whose subheading is Logging. Your most recent logs are displayed in the Query results pane.
In the toolbar, ensure that Show query is enabled.
Copy and paste the following query into the query box:
resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.(reportHostAsFaulty|terminateOnHostMaintenance|upcomingMaintenance|automaticRestart)"
Click Run query. The results of the query are displayed in the Query results pane.
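If you want to narrow or extend the set of operations in the query above, the filter string can be generated rather than edited by hand. The following is a minimal sketch; the function name `build_log_filter` is hypothetical, and it only produces the filter text to paste into the query box.

```python
def build_log_filter(methods):
    """Build a Logs Explorer filter that matches the given instance methods."""
    if not methods:
        raise ValueError("at least one method name is required")
    # Join the method names into a regular expression alternation.
    alternatives = "|".join(methods)
    return (
        'resource.type="gce_instance" AND '
        r'protoPayload.methodName=~"compute\.instances\.(' + alternatives + ')"'
    )
```

Calling `build_log_filter` with the four method names from the query above reproduces that query exactly; passing only `["reportHostAsFaulty"]` limits the results to the report operation itself.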
Review the physical host information
You can check the physical host for a VM by running the following command.
gcloud
Install the Google Cloud CLI, then initialize it by running the following command:
gcloud init
Use the gcloud compute instances describe command to view the physical host that a VM is running on.

gcloud beta compute instances describe VM_NAME \
    --zone=ZONE \
    --format="yaml(resourceStatus.physicalHost)"
Replace the following:
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
REST
Use the instances.get method to view the physical host that a VM is running on.
GET https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME
In the output, review the value for the "physicalHost" field.
Replace the following:
- PROJECT_ID: your project ID.
- VM_NAME: the name of the VM instance.
- ZONE: the zone where the VM is located.
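To confirm that a VM moved to a different host, you can compare the physicalHost value recorded before the report with the value returned afterwards. The following is a minimal sketch that works on already-parsed instances.get responses. The function names are hypothetical, and the field path `resourceStatus.physicalHost` is taken from the gcloud format expression on this page.

```python
from typing import Optional


def physical_host(instance: dict) -> Optional[str]:
    """Return resourceStatus.physicalHost from a parsed instance resource."""
    return instance.get("resourceStatus", {}).get("physicalHost")


def moved_to_new_host(before: dict, after: dict) -> bool:
    """Return True when both responses name a host and the hosts differ."""
    old, new = physical_host(before), physical_host(after)
    return old is not None and new is not None and old != new
```

If either response lacks the field, for example because the VM is not currently placed on a host, `moved_to_new_host` returns False rather than guessing.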
What's next?
- If you encounter issues when working with this API, see Troubleshoot faulty host API.