Report faulty host

This document outlines how to report a faulty A3 accelerator-optimized host that is running your artificial intelligence (AI), machine learning (ML), and high performance computing (HPC) workloads. A host, also known as a node, is the physical machine on which your virtual machine (VM) instances run.

Before you begin

  • To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.

      Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init

    For more information, see Authenticate for using REST in the Google Cloud authentication documentation.

  • Request access to the faulty host API. To request access, complete the API enablement request form.

Overview

To report a faulty host, the VM instance must meet the following requirements:

  • It must be in a RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine won't be marked as faulty.
  • The VM must be a part of a reserved block of capacity that was created for you by a technical account manager (TAM).
  • The VM must be running on an A3 High, A3 Mega, or A3 Ultra machine type.
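Two of the requirements above can be checked from an instances.get response before you send a report. The following is a minimal sketch, not an official client: the function name `report_blockers` is hypothetical, and the machine-type prefixes are an assumption about how A3 High, A3 Mega, and A3 Ultra machine types are named. It doesn't check the reservation requirement.

```python
# Assumed machine-type name prefixes for A3 High, A3 Mega, and A3 Ultra.
SUPPORTED_PREFIXES = ("a3-highgpu-", "a3-megagpu-", "a3-ultragpu-")


def report_blockers(instance: dict) -> list:
    """Return reasons why this VM can't be used to report a faulty host.

    `instance` is the parsed JSON body of an instances.get response.
    An empty list means the requirements checked here are met.
    """
    blockers = []
    status = instance.get("status")
    if status != "RUNNING":
        blockers.append(f"VM status is {status}, expected RUNNING")
    # machineType is a URL; its last path segment is the machine type name.
    machine_type = instance.get("machineType", "").rsplit("/", 1)[-1]
    if not machine_type.startswith(SUPPORTED_PREFIXES):
        blockers.append(
            f"machine type {machine_type!r} is not A3 High, A3 Mega, or A3 Ultra"
        )
    return blockers
```

An empty result only means that the state and machine type look right. Whether the VM belongs to a reserved block of capacity still has to be confirmed separately.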

Limitations

You might be rate-limited on calls to this API based on an evaluation of the health of your blocks.

How it works

When you report a host as faulty, the following takes place:

  1. The report faulty host operation starts on the VM instance. At this stage, the VM stays in a RUNNING state. This operation takes 10 to 12 minutes. For more details, see Review operations.
  2. The VM instance shuts down.
  3. Depending on the setting for the automatic restart (automaticRestart) host maintenance option, one of the following takes place:
    • If the VM isn't configured to automatically restart, then the VM stays shut down.
    • If the VM is configured to automatically restart, then the VM is restarted as follows:
      • If healthy hosts are available in your reserved block of capacity, the VM instance goes into the RUNNING state on a new host from your reserved block of capacity. In parallel, Compute Engine also attempts to update your reserved block of capacity by replacing your faulty host with a new host.
      • If your reserved block of capacity is depleted, the VM instance stays in the REPAIRING state until you obtain more capacity. For a VM instance in the REPAIRING state, you can either leave it in that state or shut down the VM. If you power the VM on again, it might return a stockout error because no host machine is available.
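The branching in step 3 can be summarized as a small decision function. This is an illustrative sketch only: the function name and its two boolean inputs are hypothetical, and the returned strings are the VM states named on this page.

```python
def expected_vm_state(automatic_restart: bool, healthy_host_available: bool) -> str:
    """Return the state the VM is expected to settle in after a report."""
    if not automatic_restart:
        # Without automatic restart, the VM stays shut down.
        return "TERMINATED"
    if healthy_host_available:
        # The VM restarts on a new host from the reserved block.
        return "RUNNING"
    # The reserved block is depleted, so the VM waits for capacity.
    return "REPAIRING"
```

For example, `expected_vm_state(True, False)` returns `"REPAIRING"`, which matches the depleted-capacity case described above.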

Report a faulty host

To report a faulty host, complete the following steps:

  1. Ensure that you have thoroughly investigated your environment to identify the root cause of your issues. Keep in mind that this process terminates and migrates VMs, which disrupts your workloads.
  2. Ensure that you have requested access to the faulty host API. To request access, complete the API enablement request form.
  3. Take note of the physical host on which the VM is running. To do this, see Review the physical host information.
  4. Back up Local SSD data. When the VM instance shuts down and is moved to a new host machine, the Local SSD data is deleted. To back up your Local SSD data, see Local SSD data backup.
  5. Report the faulty host. To report a faulty host, you must use the instances.reportHostAsFaulty REST API. You can specify one or more values for the issues with your host.

    Single issue

    POST https://compute.googleapis.com/compute/alpha/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
    {
     "faultReasons":[
         {
           "behavior":"FAULT_REASON",
           "description":"DESCRIPTION"
         }
     ]
    }
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • VM_NAME: the name of the VM instance.
    • ZONE: the zone where the VM is located.
    • FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason:
      • PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster and you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
      • SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
      • UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
      • BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
    • Optional: DESCRIPTION: additional details on the failure, such as XID information or suspected performance problems.

    The output resembles the following:

    Http Status 200 Created
    Header: Location:"/instances/VM_NAME"

    Multiple issues

    POST https://compute.googleapis.com/compute/alpha/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/reportHostAsFaulty
    {
     "faultReasons":[
         {
           "behavior":"FAULT_REASON",
           "description":"DESCRIPTION"
         },
         {
           "behavior":"FAULT_REASON",
           "description":"DESCRIPTION"
         }
     ]
    }
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • VM_NAME: the name of the VM instance.
    • ZONE: the zone where the VM is located.
    • FAULT_REASON: the issue with the host. You can specify one or more of the following values for the fault reason:
      • PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster and you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
      • SILENT_DATA_CORRUPTION: use this value if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
      • UNRECOVERABLE_GPU_ERROR: use this value if you identified an unrecoverable GPU error with an XID for a VM.
      • BEHAVIOR_UNSPECIFIED: use this value if you are not sure what is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
    • Optional: DESCRIPTION: additional details on the failure, such as XID information or suspected performance problems.

    The output resembles the following:

    Http Status 200 Created
    Header: Location:"/instances/VM_NAME"
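The single-issue and multiple-issue requests differ only in how many entries the faultReasons array holds, so one helper can build both. The following is a minimal sketch that uses only the Python standard library. The helper names are hypothetical, the URL assumes the plural `instances` path segment that the instances.get URL on this page uses, and the token is assumed to come from `gcloud auth print-access-token`.

```python
import json
import subprocess
import urllib.request

# Fault reason values listed on this page.
FAULT_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
    "BEHAVIOR_UNSPECIFIED",
}


def build_report_request(project, zone, vm_name, reasons, token):
    """Build, but don't send, a reportHostAsFaulty request.

    `reasons` is a list of (behavior, description) pairs. The description
    is optional; pass None to leave it out of the request body.
    """
    fault_reasons = []
    for behavior, description in reasons:
        if behavior not in FAULT_BEHAVIORS:
            raise ValueError(f"unknown fault behavior: {behavior}")
        entry = {"behavior": behavior}
        if description:
            entry["description"] = description
        fault_reasons.append(entry)
    # Assumed URL shape: plural "instances", as in the instances.get URL.
    url = (
        "https://compute.googleapis.com/compute/alpha/"
        f"projects/{project}/zones/{zone}/instances/{vm_name}/reportHostAsFaulty"
    )
    return urllib.request.Request(
        url,
        data=json.dumps({"faultReasons": fault_reasons}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def gcloud_access_token():
    """Return an access token for the credentials provided to the gcloud CLI."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

To send the report, pass the returned object to `urllib.request.urlopen`. Building the request separately from sending it lets you inspect the URL and body first, which is useful because a successful report shuts down the VM.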

Review operations

After you report a faulty host, the following sequence of operations takes place during the VM shutdown and restart.

The VM shutdown

  1. A reportHostAsFaulty operation is created.
  2. The reportHostAsFaulty operation creates a sequence of sub-operations that work to mark the underlying host machine as faulty. When all these sub-operations are complete, the reportHostAsFaulty operation goes into RUNNING mode.
  3. Once the reportHostAsFaulty operation goes into RUNNING mode, an upcomingMaintenance operation is then created to log the upcoming maintenance event.
  4. Next, an operation named instance terminated during maintenance is created as the VM is terminated.
  5. After this step, the reportHostAsFaulty operation completes.

    It takes about 10 to 12 minutes for all these operations to complete. Throughout this time, the VM is in the RUNNING state.
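Because the shutdown sequence takes several minutes, a script that reports a host usually needs to wait for the operation to finish. The following is a minimal polling sketch. The `fetch_status` callable is hypothetical and supplied by you, and it is assumed to return the operation's status string, with "DONE" marking completion. The default timeout of 15 minutes is an assumption chosen to sit above the 10 to 12 minutes quoted above.

```python
import time


def wait_until_done(fetch_status, timeout_s=15 * 60, interval_s=30, sleep=time.sleep):
    """Poll fetch_status() until it returns "DONE" or the timeout elapses.

    Returns True if the operation finished and False on timeout. The
    `sleep` parameter is injectable so the loop can be exercised in tests
    without waiting.
    """
    waited = 0
    while True:
        if fetch_status() == "DONE":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

The loop tracks elapsed time by summing the sleep intervals, which keeps it simple but ignores the time spent inside `fetch_status` itself.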

The VM restart

  1. Based on the VM's maintenance configuration for automaticRestart, one of the following occurs:
    1. If automaticRestart is set to true, the Automatically restart an instance operation is created as Compute Engine attempts to restart the VM on another host machine.
    2. If automaticRestart is set to false, the VM stays in the TERMINATED state. You can manually restart the VM. Compute Engine provisions the VM on a healthy holdback machine.
  2. To confirm that the VM has moved to a different host, review the physicalHost value for the VM instance. To do this, see Review the physical host information.

To review operations, you can use one of the following options from the Google Cloud console.

VM operations

  1. In the Google Cloud console, go to the Operations page.

    Go to Operations

  2. Locate the VM that you reported.

  3. If the VM is powered down and the host is reported, then the Status column shows Done for the VM. This page doesn't track whether the VM is restarted on a new host.

  4. To see if the VM restarted, go to the VM instances page.

    Go to VM instances

Cloud Logs

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

    If you use the search bar to find this page, select the result whose subheading is Logging. Your most recent logs are displayed in the Query results pane.

  2. In the toolbar, ensure that Show query is enabled.

  3. Copy and paste the following query into the query box:

    resource.type="gce_instance" AND protoPayload.methodName=~"compute\.instances\.(reportHostAsFaulty|terminateOnHostMaintenance|upcomingMaintenance|automaticRestart)"
    
  4. Click Run query. The results of the query are displayed in the Query results pane.
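If you want to run the same query outside the console, the filter can be assembled programmatically. The following sketch rebuilds the filter shown above from a list of method names and prepares a `gcloud logging read` command line. The helper names and the default limit of 50 entries are assumptions for illustration.

```python
# Method name suffixes taken from the query above.
METHODS = (
    "reportHostAsFaulty",
    "terminateOnHostMaintenance",
    "upcomingMaintenance",
    "automaticRestart",
)


def build_log_filter(methods=METHODS):
    """Return a Logging filter that matches the given instance methods."""
    alternatives = "|".join(methods)
    return (
        'resource.type="gce_instance" AND '
        f'protoPayload.methodName=~"compute\\.instances\\.({alternatives})"'
    )


def gcloud_logging_command(log_filter, limit=50):
    """Return an argument list for reading matching log entries as JSON."""
    return ["gcloud", "logging", "read", log_filter, f"--limit={limit}", "--format=json"]
```

Passing the result of `gcloud_logging_command(build_log_filter())` to `subprocess.run` avoids shell quoting problems, because the filter travels as a single argument.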

Review the physical host information

You can check the physical host for a VM by using the gcloud CLI or REST.

gcloud

  1. Install the Google Cloud CLI, then initialize it by running the following command:

    gcloud init

  2. Use the gcloud beta compute instances describe command to view the physical host that a VM is running on.

    gcloud beta compute instances describe VM_NAME \
      --zone=ZONE \
      --format="yaml(resourceStatus.physicalHost)"
    

    Replace the following:

    • VM_NAME: the name of the VM instance.
    • ZONE: the zone where the VM is located.

REST

Use the instances.get method to view the physical host that a VM is running on.

GET https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME

Replace the following:

  • PROJECT_ID: your project ID.
  • VM_NAME: the name of the VM instance.
  • ZONE: the zone where the VM is located.

In the output, review the value of the "physicalHost" field, which is nested under "resourceStatus".
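To confirm that a VM moved to a different host, you can compare the physical host recorded before the report with the value returned after the restart. The following is a minimal sketch that works on parsed instances.get responses. The helper names are hypothetical, and the field path `resourceStatus.physicalHost` is taken from the gcloud format string on this page.

```python
from typing import Optional


def physical_host(instance: dict) -> Optional[str]:
    """Return resourceStatus.physicalHost, or None if it is missing."""
    return instance.get("resourceStatus", {}).get("physicalHost")


def moved_to_new_host(before: dict, after: dict) -> bool:
    """Return True only if both hosts are known and they differ."""
    old, new = physical_host(before), physical_host(after)
    return old is not None and new is not None and old != new
```

A missing value on either side is treated as inconclusive rather than as a move, because a VM that isn't running might not report a physical host.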

What's next?