Troubleshooting VM shutdowns and reboots


This document describes the common causes of unexpected shutdowns and reboots of virtual machine (VM) instances and how to prevent them.

VM shutdowns and reboots can be caused by system events or admin activities. System event shutdowns and reboots are generated by Google systems or your VM's operating system. Admin activity shutdowns and reboots are generated by an API call from a user or service account. All shutdowns and reboots are logged, except for reboots that are initiated from within the VM.

Before you begin

  • If you haven't already, then set up authentication. Authentication is the process by which your identity is verified for access to Google Cloud services and APIs. To run code or samples from a local development environment, you can authenticate to Compute Engine by selecting one of the following options:

    Console

    When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

    gcloud

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
    2. Set a default region and zone.

Diagnosing VM shutdowns and reboots

To diagnose the cause of a VM's spontaneous shutdown or reboot, you must query your VM's logs. To quickly identify the cause of future VM shutdowns or reboots, build a dashboard that contains the logs. After you query the logs, review the method and principalEmail fields to determine what event and which user or service initiated the shutdown or reboot.

Querying Cloud Audit Logs

Query Cloud Audit Logs to display a list of system events and admin activities that might have caused the shutdown or reboot.

Console

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

  2. In the Query field, enter the following query:

    resource.type="gce_instance"
    "VM_NAME"
    logName:("logs/cloudaudit.googleapis.com%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%2Factivity")
    

    Replace VM_NAME with the name of the VM that shut down or rebooted.

  3. If the event you're looking for happened more than an hour ago, set a custom time frame by clicking the clock symbol and entering a custom range.

  4. Click Run query. The results are displayed in the Query results section.

  5. Click the expander arrow next to each result to show detailed information.

  6. See Reviewing Cloud Audit Logs to learn more about the method and principalEmail fields that are associated with shutdowns and reboots, and what you can do to prevent them.

gcloud

  1. View Cloud Audit Logs using the gcloud logging read command:

    gcloud logging read --freshness=TIME 'resource.type="gce_instance" "VM_NAME" logName:("logs/cloudaudit.googleapis.com%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%2Factivity")'
    

    Replace the following:

    • TIME: the amount of time you want to query. For example, 1h queries log entries in the past hour. For information about date and time formats, see gcloud topic datetimes.
    • VM_NAME: the name of the VM that shut down or rebooted.

    The results display.

  2. See Reviewing Cloud Audit Logs to learn more about the method and principalEmail fields that are associated with shutdowns and reboots, and what you can do to prevent them.
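When a VM has many audit log entries, it can help to print only the fields that the review step relies on. The following is a minimal sketch rather than an official procedure. The function names audit_filter and summarize_vm_audit_logs are hypothetical, and the field paths protoPayload.methodName and protoPayload.authenticationInfo.principalEmail assume the standard Cloud Audit Logs entry layout.

```shell
# Build the audit-log filter shown above for one VM (hypothetical helper).
# printf needs "%%" to emit the literal "%" in "%2F".
audit_filter() {
  printf 'resource.type="gce_instance" "%s" logName:("logs/cloudaudit.googleapis.com%%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%%2Factivity")' "$1"
}

# Print one row per entry: timestamp, method, and principal.
# Usage: summarize_vm_audit_logs VM_NAME [TIME]   (TIME defaults to 1h)
summarize_vm_audit_logs() {
  gcloud logging read --freshness="${2:-1h}" "$(audit_filter "$1")" \
    --format='table(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)'
}
```

For example, summarize_vm_audit_logs VM_NAME 6h would list the entries from the past six hours, assuming the gcloud CLI is installed and initialized as described in Before you begin.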

Reviewing Cloud Audit Logs

Review the method and principalEmail fields of the Cloud Audit Logs to determine why your VM was shut down or rebooted.

  1. Review the method fields of the Cloud Audit Logs and compare them with the methods listed in the following table.

    Method Shutdown type Description
    compute.instances.repair.recreateInstance System event

    If your VM belongs to a managed instance group (MIG), the MIG recreates the VM if the VM's state changes from RUNNING and the MIG did not initiate the change in state.

    Changes of instance state that are not initiated by the MIG include:

    compute.instances.hostError System event

    A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine or the data center infrastructure hosting your compute instance that caused your instance to crash. A host error involving a total hardware failure or other hardware issues might prevent the live migration of your instance. If your instance is set to automatically restart, which is the default setting, Compute Engine restarts your instance, typically within three minutes from the time the error was detected. Depending on the issue, the restart might take up to 5.5 minutes.

    Occasionally, a compute instance might become unresponsive before a host error is signaled. You can reduce the amount of time Compute Engine waits to restart or terminate the instance by setting the host error recovery timeout (Preview). For more information, see Set availability policies.

    Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:

    Google also offers managed services such as App Engine and the App Engine flexible environment.

    compute.instances.automaticRestart System event

    This event occurs after a hostError event or a terminateOnHostMaintenance event if your VM's automaticRestart host maintenance policy is set to true. In the logs, a hostError or a terminateOnHostMaintenance log entry precedes this log.

    If you want to change your VM's host maintenance policy, see Updating options for an instance.

    compute.instances.guestTerminate System event Your VM's operating system initiated the shutdown.
    compute.instances.terminateOnHostMaintenance System event

    If you set your VM's onHostMaintenance host maintenance policy to TERMINATE, Compute Engine stops your VM when there is a maintenance event where Google must move your VM to another host.

    If you want to change your VM's onHostMaintenance policy, see Updating options for an instance.

    compute.instances.preempted System event

    Compute Engine preempted your Spot VM or legacy preemptible VM:

    • When Compute Engine preempts a Spot VM, Compute Engine either stops or deletes the Spot VM based on its termination action. Spot VMs do not have a maximum runtime.
    • When Compute Engine preempts a preemptible VM, Compute Engine stops the VM after a maximum runtime of 24 hours. To avoid these limitations, use Spot VMs instead.

    Spot VMs and preemptible VMs are excess Compute Engine capacity, so Compute Engine might preempt them any time that capacity is needed elsewhere. You can help mitigate the effects of preemption by following the best practices. Alternatively, if you require VMs with user-controlled runtimes, create standard VMs instead.

    compute.instances.stop Admin activity

    A user or service account stopped your VM.

    Continue to the next step to identify the user or service account that stopped your VM. For information about restarting your VM, see Restarting a stopped instance.

    compute.instances.delete Admin activity or system event

    A user or service account deleted your VM, or the VM was configured to be automatically deleted.

    Specifically, a log for the compute.instances.delete method might indicate any of the following requests for your VM:

    • Requests from a user or service account to directly delete your VM are indicated only by a compute.instances.delete method from the user or service account.
    • Requests that automatically delete your VM are indicated by a compute.instances.delete method from system@google.com, but the method that explains the cause of automatic deletion might or might not appear in Cloud Audit Logs.

      For example, if a Spot VM is configured to be automatically deleted during preemption and is preempted, you see a compute.instances.delete method from system@google.com, but you might or might not also see a compute.instances.preempted method.

    • Requests to the VM that happened shortly before or after a compute.instances.delete method might or might not appear in Cloud Audit Logs.

      For example, if a VM is stopped due to host maintenance shortly before the VM is deleted, you see a compute.instances.delete method, but you might or might not also see a compute.instances.terminateOnHostMaintenance method.

    Continue to the next step to identify the user or service account that deleted your VM. For information about creating a new VM, see Creating and starting a VM.

    compute.instances.insert Admin activity

    A user or service account created your VM.

    Continue to the next step to identify the user or service account that created your VM. For information about creating a new VM, see Creating and starting a VM.

    compute.instances.reset Admin activity

    A user or service account reset your VM.

    Continue to the next step to identify the user or service account that reset your VM.

  2. Review the principalEmail fields of the Cloud Audit Logs to identify the user or service that initiated the shutdown or reboot. The following table includes common Google-managed services that initiate shutdowns or reboots.

    Email Description
    system@google.com A system event caused the shutdown or reboot.
    project-number@cloudservices.gserviceaccount.com

    A service agent initiated the shutdown.

    To determine which project the service initiated the shutdown from, review the service agent's project-number.

    To determine which Google service made the request, review the protoPayload.requestMetadata.callerSuppliedUserAgent field.

    If a user triggered the shutdown or reboot, their email address appears in the principalEmail field. For example, cloudysanfrancisco@gmail.com.

    Administrators can prevent users from changing the state of project VMs by changing Identity and Access Management permissions on user accounts. For more information, see Granting, changing, and revoking access to resources.
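The two lookups in this section can be sketched as small shell helpers, which might be handy when triaging many log entries. This is an illustrative sketch only: the function names shutdown_type and principal_kind are hypothetical, and matching on the end of the method name (so that a version prefix such as v1. is tolerated) is an assumption about how the method appears in your logs.

```shell
# Map an audit-log method name to the shutdown type from the method table.
shutdown_type() {
  case "$1" in
    *instances.repair.recreateInstance | *instances.hostError | \
    *instances.automaticRestart | *instances.guestTerminate | \
    *instances.terminateOnHostMaintenance | *instances.preempted)
      echo "System event" ;;
    *instances.stop | *instances.insert | *instances.reset)
      echo "Admin activity" ;;
    *instances.delete)
      echo "Admin activity or system event" ;;
    *)
      echo "Not listed" ;;
  esac
}

# Describe who a principalEmail value refers to, following the email table.
principal_kind() {
  case "$1" in
    system@google.com)
      echo "system event" ;;
    *@cloudservices.gserviceaccount.com)
      # The text before "@" is the service agent's project number.
      echo "service agent for project number ${1%%@*}" ;;
    *)
      echo "user or other service account" ;;
  esac
}
```

For example, shutdown_type compute.instances.hostError prints System event, and any method that is not in the table prints Not listed.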

Monitor VM lifecycle events

You can monitor VM lifecycle events (including shutdowns, reboots, and host errors) by building a Cloud Monitoring dashboard.

This dashboard lets you visualize system events and admin activities that are described in further detail in the Reviewing Cloud Audit Logs section of this document.

Figure 1. An example dashboard (VM Lifecycle Dashboard: Stop and Start events) showing the availability of an instance and its lifecycle events, such as a stopped instance.

Create log-based metric

To capture VM lifecycle events, create a user-defined log-based metric. This metric uses Audit Logs to keep count of the number of times a particular VM lifecycle event has occurred.

To get the permissions that you need to create the metric, ask your administrator to grant you the Logs Configuration Writer (roles/logging.configWriter) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a user-defined log-based metric by doing the following:

  1. In the Google Cloud console, go to the Log-based Metrics page.

    Go to Log-based Metrics

  2. Click Create Metric.

  3. In the Metric Type section, do the following:

    • Select Counter.
    • Leave Distribution at the default setting of unselected.

  4. In the Details section, enter the following information:

    • Log-based metric name: vm-lifecycle-events. You must use this exact name for the dashboard to work correctly.
    • Description: Optional. Enter a description for this metric.
    • Units: 1

  5. In the Filter selection section, specify the following:

    • From the Select project or log bucket menu, select Project logs.
    • In the Build filter field, enter:
      resource.type = "gce_instance" AND
      (log_id("cloudaudit.googleapis.com/activity") OR
      log_id("cloudaudit.googleapis.com/system_event")) AND
      operation.first="true"

  6. In the Labels section, click Add label.

  7. Specify the following:

    • Label name: method
    • Label type: STRING
    • Field name: protoPayload.methodName
    • Regular expression:
      (recreateInstance|hostError|automaticRestart|guestTerminate|terminateOnHostMaintenance|preempted|insert|stop|delete|reset|start)

  8. Click Done.

  9. Click Create metric.
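If you prefer the command line, the same metric can probably be created with gcloud logging metrics create and a configuration file. Treat the following as a hedged sketch, not a verified recipe: the function names and the file name vm-lifecycle-events.yaml are hypothetical, the YAML keys assume the LogMetric resource fields (filter, metricDescriptor, labelExtractors), and the filter writes the OR grouping in explicit parentheses. Check the gcloud reference before relying on it.

```shell
# Write a metric configuration that mirrors the console settings above.
# Usage: write_vm_lifecycle_metric_config PATH
write_vm_lifecycle_metric_config() {
  cat > "$1" <<'EOF'
description: Counts VM lifecycle events recorded in Cloud Audit Logs
filter: >-
  resource.type="gce_instance" AND
  (log_id("cloudaudit.googleapis.com/activity") OR
  log_id("cloudaudit.googleapis.com/system_event")) AND
  operation.first="true"
metricDescriptor:
  metricKind: DELTA
  valueType: INT64
  unit: "1"
  labels:
    - key: method
      valueType: STRING
labelExtractors:
  method: 'REGEXP_EXTRACT(protoPayload.methodName, "(recreateInstance|hostError|automaticRestart|guestTerminate|terminateOnHostMaintenance|preempted|insert|stop|delete|reset|start)")'
EOF
}

# Create the metric. The name must stay vm-lifecycle-events for the dashboard.
create_vm_lifecycle_metric() {
  write_vm_lifecycle_metric_config vm-lifecycle-events.yaml &&
    gcloud logging metrics create vm-lifecycle-events \
      --config-from-file=vm-lifecycle-events.yaml
}
```

Keeping the configuration in a file makes the label extractor reviewable and repeatable across projects, which is harder to do with console clicks.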

Use the dashboard

No data appears on the dashboard until a VM experiences a system event or an admin activity. To test that the dashboard works, perform an admin activity, such as a stop and start operation:

  1. Perform a stop and start operation on any existing VM, or create a new VM for testing purposes.
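The stop and start test can be sketched with the gcloud compute instances stop and start commands. The helper names run and cycle_vm and the DRY_RUN variable are hypothetical conveniences added for illustration. Setting DRY_RUN=1 prints the commands instead of running them, so that you can review them first.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Stop and then start a VM to generate two admin activity events.
# Usage: cycle_vm VM_NAME ZONE
cycle_vm() {
  run gcloud compute instances stop "$1" --zone="$2" &&
    run gcloud compute instances start "$1" --zone="$2"
}
```

Because stopping a VM interrupts its workload, consider running this only against a test VM.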

To get the permissions that you need to use the dashboard, ask your administrator to grant you the Monitoring Dashboard Viewer (roles/monitoring.dashboardViewer) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

  1. Open Dashboards in the Google Cloud console.

    Go to Dashboards

  2. From the Dashboard List tab, open the GCE VM Lifecycle Events Monitoring dashboard.

  3. Select the VM from the Name drop-down menu.

  4. Narrow the time series to a relevant timeframe.

    For more ways to filter the dashboard, see Add a temporary filter.

The dashboard contains two charts that display a timeline of system events and admin activities that occur on a VM:

  1. The VM Lifecycle Timeline chart displays the following:

    • The compute.googleapis.com/instance/uptime metric, which indicates whether the VM was running at a given point in time, where 1 is up and 0 is down. Note that this metric reflects availability as a result of user activity and system events, and is not an indication of the Compute Engine SLA.
    • The vm-lifecycle-events log-based metric, which counts the number of lifecycle actions, such as stop or start, that were performed against the VM at a given point in time.
  2. The Events chart shows the same vm-lifecycle-events log-based metric but in a magnified view for easier readability. Note that although the X-axes are aligned, the colors are not synchronized between the two charts.

Investigating mass VM shutdown across projects

Compute Engine might shut down multiple VMs that are connected to a Shared VPC host project, if the Shared VPC host project's billing is inactive or disabled.

To determine if your VMs have been shut down by a mass shutdown request, look for stop operations initiated by cloud-cluster-manager@prod.google.com.
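One way to look for these stop operations is to filter the audit logs by principal. The following is a sketch under stated assumptions: the function names are hypothetical, the filter assumes that these operations are recorded with a method name containing compute.instances.stop, and the principal is assumed to appear in protoPayload.authenticationInfo.principalEmail.

```shell
# Build a filter for stop operations issued by the mass-shutdown principal.
mass_shutdown_filter() {
  printf '%s AND %s AND %s' \
    'resource.type="gce_instance"' \
    'protoPayload.methodName:"compute.instances.stop"' \
    'protoPayload.authenticationInfo.principalEmail="cloud-cluster-manager@prod.google.com"'
}

# List matching entries. Usage: find_mass_shutdown_stops [TIME]  (default 7d)
find_mass_shutdown_stops() {
  gcloud logging read --freshness="${1:-7d}" "$(mass_shutdown_filter)" \
    --format='table(timestamp, protoPayload.resourceName)'
}
```

Several VMs stopped by this principal within a short window would be consistent with a mass shutdown request.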

Starting an affected instance returns an error similar to the following:

Starting instance(s) INSTANCE_NAME...failed.
ERROR: (gcloud.compute.instances.start) The default network interface [nic0] is frozen.

To resolve this issue, do the following:

  1. Identify the Shared VPC used by the VMs, by using the gcloud compute instances describe command:

    gcloud compute instances describe VM_NAME \
       --format="flattened(networkInterfaces[].network)"
    

    The output is similar to the following:

    networkInterfaces[0].network: https://www.googleapis.com/compute/v1/projects/SHARED_VPC_PROJECT/global/networks/FROZEN_NETWORK
    
  2. In the Shared VPC's host project, verify whether billing has been disabled by searching the project's logs with the following query:

    resource.type="project"
    protoPayload.request.@type="type.googleapis.com/google.internal.cloudbilling.billingaccount.v1.DisableResourceBillingRequest"
    protoPayload.response.resourceBillingInfo.billingAccountAssignmentType="DISABLED"
    
  3. If applicable, enable billing on the host project.

To help prevent this issue from recurring, read Secure the link between a project and its billing account.

Investigating VM termination issues with gcpdiag

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

This gcpdiag runbook investigates VM termination issues, examining the following areas:
  • System event-triggered shutdowns and reboots: Identifies terminations initiated by internal Google Cloud systems due to system maintenance events, normal hardware failures, or resource constraints.
  • System admin activities-triggered shutdowns/reboots: Investigates terminations caused by direct actions, such as API calls made by users or service accounts. These actions may include manual shutdowns, restarts, or automated processes impacting VM states.
  • Unofficial RCA text generation: Provides a detailed Root Cause Analysis text, outlining the identified cause of termination, the involved systems or activities, and recommendations to prevent future occurrences where applicable.

Google Cloud console

  1. Complete and then copy the following command.
    gcpdiag runbook gce/vm-termination \
        --parameter project_id=PROJECT_ID \
        --parameter name=VM_NAME \
        --parameter zone=ZONE
  2. Open the Google Cloud console and activate Cloud Shell.
  3. Paste the copied command.
  4. Run the gcpdiag command, which downloads the gcpdiag Docker image and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

  1. Copy and run the following command on your local workstation.
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command.
    ./gcpdiag runbook gce/vm-termination \
        --parameter project_id=PROJECT_ID \
        --parameter name=VM_NAME \
        --parameter zone=ZONE

View available parameters for this runbook.

Replace the following:

  • PROJECT_ID: The ID of the project containing the resource
  • VM_NAME: The name of the target VM within your project.
  • ZONE: The zone in which your target VM is located.

Useful flags:

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.