Troubleshooting VM shutdowns and reboots

This document describes the common causes of unexpected shutdowns and reboots of virtual machine (VM) instances and how to prevent them.

VM shutdowns and reboots can be caused by system events or admin activities. System event shutdowns and reboots are generated by Google systems or your VM's operating system. Admin activity shutdowns and reboots result from an API call made by a user or a service account. All shutdowns and reboots are logged, except for reboots that are initiated from within the VM.

Diagnosing VM shutdowns and reboots

To diagnose the cause of a VM's spontaneous shutdown or reboot, you must query your VM's logs. After you query the logs, review the method and principalEmail fields to determine what event and which user or service initiated the shutdown or reboot.

Querying Cloud Audit Logs

Query Cloud Audit Logs to display a list of system events and admin activities that might have caused the shutdown or reboot.

Console

  1. In the Google Cloud Console, go to the Logs explorer page.

  2. In the Query builder field, enter the following query:

    resource.type="gce_instance"
    "VM_NAME"
    logName:("logs/cloudaudit.googleapis.com%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%2Factivity")
    

    Replace VM_NAME with the name of the VM that shut down or rebooted.

  3. If the shutdown or reboot happened more than an hour ago, set a custom time frame by clicking the clock symbol and entering a custom range.

  4. Click Run query. The results are displayed in the Query results section.

  5. Click the expander arrow next to each result to show detailed information.

    Each result displays a method field and a principalEmail field, which show the methods and users responsible for shutdowns and reboots. Continue to Reviewing Cloud Audit Logs to learn more about the methods that cause shutdowns and reboots and what you can do to prevent them.

gcloud

  1. View system event and admin activity Cloud Audit Logs using the gcloud logging read command:

    gcloud logging read --freshness=TIME 'resource.type="gce_instance" "VM_NAME" logName:("logs/cloudaudit.googleapis.com%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%2Factivity")'
    

    Replace the following:

    • TIME: the amount of time you want to query. For example, 1h queries log entries in the past hour. For information about date and time formats, see gcloud topic datetimes.
    • VM_NAME: the name of the VM that shut down or rebooted.
  2. Review the methodName fields in the system event logs. Continue to Reviewing Cloud Audit Logs to learn more about the methods that cause shutdowns and reboots and what you can do to prevent them.
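
The raw output of gcloud logging read can be long, so it may help to print only the fields you need to review. The following is a minimal sketch, not an official tool: list_vm_audit_events is a hypothetical helper name, the default freshness of 1h is an arbitrary choice, and the sketch assumes that the method and the principal appear under protoPayload.methodName and protoPayload.authenticationInfo.principalEmail in each entry.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gcloud): runs the query shown above for one
# VM and prints only the timestamp, method, and principal of each log entry.
list_vm_audit_events() {
  local vm_name="$1"          # name of the VM that shut down or rebooted
  local freshness="${2:-1h}"  # how far back to search; assumed default of 1h
  local filter

  # Same filter as the command above, with the VM name substituted in.
  filter="resource.type=\"gce_instance\" \"${vm_name}\" logName:(\"logs/cloudaudit.googleapis.com%2Fsystem_event\" OR \"logs/cloudaudit.googleapis.com%2Factivity\")"

  # Field paths below are assumptions about the audit log entry layout.
  gcloud logging read "${filter}" \
    --freshness="${freshness}" \
    --format='table(timestamp, protoPayload.methodName, protoPayload.authenticationInfo.principalEmail)'
}
```

For example, list_vm_audit_events example-vm 6h would search the past six hours for a VM named example-vm, where example-vm is a placeholder.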

Reviewing Cloud Audit Logs

Review the method and principalEmail fields of the Cloud Audit Logs to determine why your VM was shut down or rebooted.

  1. Review the method fields of the Cloud Audit Logs and compare them with the methods listed in the following table.

    Method Shutdown type Description
    compute.instances.repair.recreateInstance System event

    If your VM belongs to a managed instance group (MIG), the MIG recreates the VM if the VM's state changes from RUNNING and the MIG did not initiate the change in state.

    Changes of instance state that are not initiated by the MIG include:

    compute.instances.hostError System event

    A host error means that there was a hardware or software issue on the physical machine hosting your VM that caused your VM to crash. If your VM is set to automatically restart, which is the default setting, Google restarts your VM on a different physical machine, typically within three minutes from the time the error was detected. In cases with certain hardware issues, the attempt to restart your VM might get delayed by 5.5 minutes to 16.5 minutes.

    Certain resources behave differently, such as local SSDs. If there is a host error, Compute Engine makes a best effort to reconnect to the VM and preserve the local SSD data, but if the underlying drive does not recover within 60 minutes, the VM restarts without the local SSD data. While Compute Engine is recovering your VM and local SSD, which can take up to 60 minutes, the host system and the underlying drive are unresponsive. For more information about how local SSD disks behave in the event of host errors, see Local SSD data persistence.

    Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:

    Google also offers managed services such as App Engine and the App Engine flexible environment.

    compute.instances.guestTerminate System event Your VM's operating system initiated the shutdown.
    compute.instances.terminateOnHostMaintenance System event

    If you set your VM's onHostMaintenance maintenance policy to TERMINATE, Compute Engine stops your VM when there is a maintenance event where Google must move your VM to another host.

    If you want to change your VM's onHostMaintenance policy, see Updating options for an instance.

    compute.instances.preempted System event

    Compute Engine stopped your preemptible instance. Compute Engine always stops preemptible instances after they run for 24 hours.

    If you require a VM that runs for longer periods of time, see Creating and starting a VM instance.

    compute.instances.stop Admin activity

    A user or service account stopped your VM.

    Continue to the next step to identify the user or service account that stopped your VM. For information about restarting your VM, see Restarting a stopped instance.

    compute.instances.delete Admin activity

    A user or service account deleted your VM.

    Continue to the next step to identify the user or service account that deleted your VM. For information about creating a new VM, see Creating and starting a VM instance.

    compute.instances.reset Admin activity

    A user or service account reset your VM.

    Continue to the next step to identify the user or service account that reset your VM.

  2. Review the principalEmail fields of the Cloud Audit Logs to identify the user or service that initiated the shutdown or reboot. The following table includes common Google-managed services that initiate shutdowns or reboots.

    Email Description
    system@google.com A system event caused the shutdown or reboot.
    project-number@cloudservices.gserviceaccount.com

    A Google-managed service account initiated the shutdown.

    To determine which project the shutdown was initiated from, review the project number in the service account's email address.

    To determine which Google service made the request, review the protoPayload.requestMetadata.callerSuppliedUserAgent field.

    If a user triggered the shutdown or reboot, their email address appears in the principalEmail field. For example, cloudysanfrancisco@gmail.com.

    Administrators can prevent users from changing the state of project VMs by changing Identity and Access Management permissions on user accounts. For more information, see Granting, changing, and revoking access to resources.
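
If you review many log entries, the two lookups described in the preceding steps can be scripted. The following is a rough sketch under stated assumptions: shutdown_type and principal_kind are hypothetical helper names, the method list is copied from the methods named in this document and is not exhaustive, and the leading wildcard assumes that a logged method name might carry an API version prefix such as v1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a method name to the shutdown type that this
# document lists for it. The leading "*" tolerates an assumed version prefix.
shutdown_type() {
  case "$1" in
    *compute.instances.repair.recreateInstance) echo "System event" ;;
    *compute.instances.hostError)               echo "System event" ;;
    *compute.instances.guestTerminate)          echo "System event" ;;
    *compute.instances.terminateOnHostMaintenance) echo "System event" ;;
    *compute.instances.preempted)               echo "System event" ;;
    *compute.instances.stop)                    echo "Admin activity" ;;
    *compute.instances.delete)                  echo "Admin activity" ;;
    *compute.instances.reset)                   echo "Admin activity" ;;
    *)                                          echo "Not in table" ;;
  esac
}

# Hypothetical helper: classifies a principalEmail value using the two
# addresses described in this document; anything else is treated as a
# user or another service account.
principal_kind() {
  case "$1" in
    system@google.com)                   echo "System event" ;;
    *@cloudservices.gserviceaccount.com) echo "Google-managed service account" ;;
    *)                                   echo "User or other service account" ;;
  esac
}
```

For example, shutdown_type compute.instances.preempted prints System event, and principal_kind cloudysanfrancisco@gmail.com prints User or other service account.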