General troubleshooting

This page describes troubleshooting steps that you might find helpful if you run into problems using Compute Engine instances.

Troubleshooting general issues with instances

If your instance doesn't start up

If you receive a quota error when you try to start an instance, you must request additional CPU quota. For more information, see the VM instances section of the Resource quotas document.

Here are some tips to help troubleshoot your persistent boot disk if it doesn't boot.

  • Your boot disk is full. If your boot disk is completely full and your operating system does not support automatic resizing, you won't be able to SSH into your instance. You must create a new instance and recreate the boot disk. See recovering an inaccessible instance or a full boot disk.

  • Examine your virtual machine instance's serial port output.

    An instance's BIOS, bootloader, and kernel will print their debug messages into the instance's serial port output, providing valuable information about any errors or issues that the instance experienced. If you enable serial port output logging to Cloud Logging, you can access this information even when your instance is not running. See viewing serial port output.

  • Enable interactive access to the serial console.

    You can enable interactive access to an instance's serial console so you can log in and debug boot issues from within the instance, without requiring your instance to be fully booted. For more information, read Interacting with the serial console.

  • Verify that your disk has a valid file system.

    If your file system is corrupted or otherwise invalid, you won't be able to launch your instance. Validate your disk's file system:

    1. Detach the disk in question from any instance it is attached to, if applicable:

      gcloud compute instances delete old-instance --keep-disks boot
    2. Start a new instance with the latest Google-provided image:

      gcloud compute instances create debug-instance
    3. Attach your disk as a non-boot disk but don't mount it. Replace DISK with the name of the disk that won't boot. Note that we also provide a device name so that the disk is easily identifiable on the instance:

      gcloud compute instances attach-disk debug-instance --disk DISK --device-name debug-disk
    4. Connect to the instance:

      gcloud compute ssh debug-instance
    5. Look up the root partition of the disk, which is identified with the part1 notation. In this case, the root partition of the disk is at /dev/sdb1:

      user@debug-instance:~$ ls -l /dev/disk/by-id
      total 0
      lrwxrwxrwx 1 root root  9 Jan 22 17:09 google-debug-disk -> ../../sdb
      lrwxrwxrwx 1 root root 10 Jan 22 17:09 google-debug-disk-part1 -> ../../sdb1
      lrwxrwxrwx 1 root root  9 Jan 22 17:02 google-persistent-disk-0 -> ../../sda
      lrwxrwxrwx 1 root root 10 Jan 22 17:02 google-persistent-disk-0-part1 -> ../../sda1
      lrwxrwxrwx 1 root root  9 Jan 22 17:09 scsi-0Google_PersistentDisk_debug-disk -> ../../sdb
      lrwxrwxrwx 1 root root 10 Jan 22 17:09 scsi-0Google_PersistentDisk_debug-disk-part1 -> ../../sdb1
      lrwxrwxrwx 1 root root  9 Jan 22 17:02 scsi-0Google_PersistentDisk_persistent-disk-0 -> ../../sda
      lrwxrwxrwx 1 root root 10 Jan 22 17:02 scsi-0Google_PersistentDisk_persistent-disk-0-part1 -> ../../sda1
    6. Run a file system check on the root partition:

      user@debug-instance:~$ sudo fsck /dev/sdb1
      fsck from util-linux 2.20.1
      e2fsck 1.42.5 (29-Jul-2012)
      /dev/sdb1: clean, 19829/655360 files, 208111/2621184 blocks
    7. Mount your file system:

      user@debug-instance:~$ sudo mkdir /mydisk
      user@debug-instance:~$ sudo mount /dev/sdb1 /mydisk
    8. Check that the disk has kernel files:

      user@debug-instance:~$ ls /mydisk/boot/vmlinuz-*
  • Verify that the disk has a valid master boot record (MBR).

    Run the following command on the debug instance that has attached the persistent boot disk, such as /dev/sdb:

    $ sudo parted /dev/sdb print

    If the MBR is valid, the output lists the partition table along with information about the file system:

    Disk /dev/sdb: 10.7GB
    Sector size (logical/physical): 512B/4096B
    Partition Table: msdos
    Disk Flags:
    Number  Start   End     Size    Type     File system  Flags
     1      2097kB  10.7GB  10.7GB  primary  ext4         boot
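
As a quick first test before running parted, the following sketch checks whether the first sector of a disk ends with the MBR boot signature (bytes 0x55 0xAA at offset 510). The function name has_mbr_signature is hypothetical and is not part of any Compute Engine tooling. A present signature does not prove that the partition table itself is intact, so treat a positive result as a hint only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the first sector of a disk or disk image
# ends with the MBR boot signature 0x55 0xAA (bytes 510 and 511).
has_mbr_signature() {
  local target="$1" signature
  # Read the two signature bytes and print them as lowercase hex, e.g. "55aa".
  signature="$(dd if="$target" bs=1 skip=510 count=2 2>/dev/null \
    | od -An -tx1 | tr -d ' \n')"
  [ "$signature" = "55aa" ]
}
```

To try it on the debug instance, save the function in a script, add a final line such as has_mbr_signature /dev/sdb && echo "signature present", and run the script with sudo so that it can read the device.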

If you cannot create an instance

Here are some tips to help troubleshoot if your instance is not created.

  • Simultaneous resource mutation or creation operations might cause an error. For example, if you are modifying secondary ranges in a subnetwork and creating a VM at the same time, you might see the following error: The resource 'projects/[PROJECT]/regions/[REGION]/subnetworks/default' is not ready. The solution is to retry the failed operation.

  • If you receive a resource error (such as ZONE_RESOURCE_POOL_EXHAUSTED or ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS) when requesting new resources, it means that the zone cannot currently accommodate your request. This error is due to the availability of Compute Engine resources in the zone, and is not due to your Compute Engine quota.

    Here are some tips to help mitigate:

    • Because this situation is temporary and can change frequently based on fluctuating demand, try your request again later.
    • Try to create the resources in another zone in the region or in another region.
    • Try to change the shape of the VM you are requesting. It's easier to get smaller machine types than larger ones. A change to your request, such as reducing the number of GPUs or using a custom VM with less memory or vCPUs, might let your request proceed.
    • Use Compute Engine reservations to reserve resources within a zone to ensure that the resources you need are available when you need them.
    • If you are trying to create a preemptible instance, remember that preemptible VMs are spare capacity and so might not be obtainable at peak demand periods.

  • If you receive a notFound or does not exist in zone error when requesting new resources, it means that the zone does not offer the resource or machine type that you have requested. See Regions and zones to find out which features are available in each zone.
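
Several of the tips above come down to retrying the request later. A minimal sketch of a retry loop with exponential backoff follows. The function name retry_with_backoff and its argument order (maximum attempts, initial delay in seconds, then the command) are illustrative choices, not part of the gcloud tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds, doubling the wait
# between attempts, and give up after max_attempts tries.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, retry_with_backoff 5 30 gcloud compute instances create example-instance --zone ZONE would try up to five times, waiting 30, 60, 120, and 240 seconds between attempts. The loop retries on any failure, so it is only appropriate for errors that are expected to be temporary.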

If network traffic to or from your instance is being dropped

Compute Engine only allows network traffic that is explicitly permitted by your project's firewall rules to reach your instance. By default, every project comes with a default network that allows certain kinds of connections. If you deny all traffic, you also deny SSH connections and all internal traffic. For more information, see the Firewall rules page.

In addition, you may need to adjust TCP keep-alive settings to work around the default idle connection timeout of 10 minutes. For more information, see Communicating between your instances and the internet.
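
A minimal sketch of how you might check this on a Linux instance: it compares the kernel's TCP keep-alive idle time with the 10-minute timeout mentioned above. The function name and the 600-second constant are illustrative; /proc/sys/net/ipv4/tcp_keepalive_time is the standard Linux location for this setting.

```shell
#!/usr/bin/env bash
# Idle connection timeout from the text above, in seconds (10 minutes).
IDLE_TIMEOUT_SECONDS=600

# Hypothetical helper: succeeds if the keep-alive idle time is shorter than
# the idle timeout. With no argument, it reads the current kernel setting.
keepalive_below_timeout() {
  local keepalive_time="${1:-}"
  if [ -z "$keepalive_time" ]; then
    keepalive_time="$(cat /proc/sys/net/ipv4/tcp_keepalive_time)" || return 2
  fi
  [ "$keepalive_time" -lt "$IDLE_TIMEOUT_SECONDS" ]
}
```

If the check fails, you can lower the value, for example with sudo sysctl -w net.ipv4.tcp_keepalive_time=60 (the value 60 is only an example). Kernel keep-alives apply only to connections whose application enables the SO_KEEPALIVE socket option.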

Troubleshooting firewall rules or routes on an instance

The Google Cloud Console provides network details for each network interface of an instance. You can view all of the firewall rules or routes that apply to an interface, or you can view just the rules and routes that the interface uses. Either view can help you troubleshoot which firewall rules and routes apply to the instance and which ones are actually being used (where priority and processing order override other rules or routes).

For more information, see the troubleshooting information in the Virtual Private Cloud documentation.

Troubleshooting issues with SSH

Under certain conditions, a Linux instance might refuse SSH connections. There are many reasons this could happen, from a full disk to an accidental misconfiguration of sshd. This section describes a number of tips and approaches to troubleshoot and resolve common SSH issues.

Check your firewall rules

Compute Engine provisions each project with a default set of firewall rules that permit SSH traffic. If the default firewall rule that permits SSH connections is removed, you'll be unable to access your instance. List your firewall rules with the gcloud command-line tool and make sure that the default-allow-ssh rule is present:

gcloud compute firewall-rules list

If it is missing, add it back:

gcloud compute firewall-rules create default-allow-ssh --allow tcp:22
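
The presence check can also be scripted. The sketch below reads rule names from standard input, one per line, and succeeds only on an exact match. The helper name has_firewall_rule is hypothetical. It only confirms that a rule with that name exists, not that the rule still allows tcp:22.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the rule name given as $1 appears as a
# complete line on standard input.
has_firewall_rule() {
  local rule_name="$1"
  # -x matches whole lines only, -F treats the name as a literal string.
  grep -qxF -- "$rule_name"
}
```

One way to feed it is gcloud compute firewall-rules list --format='value(name)' | has_firewall_rule default-allow-ssh || echo "default-allow-ssh is missing".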

Debug the issue in the serial console

You can enable read/write access to an instance's serial console so you can log into the console and troubleshoot problems with the instance. This is particularly useful when you cannot log in with SSH or if the instance has no connection to the network. The serial console remains accessible in both these conditions.

To learn how to enable interactive access and connect to an instance's serial console, read Interacting with the serial console.

Test the network

You can use the netcat tool to connect to your instance on port 22, and determine whether the network connection is working. If you connect and see an ssh banner (for example, SSH-2.0-OpenSSH_6.0p1 Debian-4), your network connection is working, and you can rule out firewall problems. First, use the gcloud tool to get the external natIP for your instance:

gcloud compute instances describe example-instance --format='get(networkInterfaces[0].accessConfigs[0].natIP)'

Use the nc command to connect to your instance:

# Check for SSH banner
user@local:~$ nc [EXTERNAL_IP] 22
SSH-2.0-OpenSSH_6.0p1 Debian-4
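
To script the same check, you could test whether the first line received looks like an SSH identification string. The following is a sketch under that assumption. The helper name is_ssh_banner is hypothetical, and it accepts the SSH-2.0- and SSH-1.99- prefixes that SSH protocol 2 servers announce.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the first line from standard input and succeeds
# if it starts like an SSH protocol 2 identification string.
is_ssh_banner() {
  local first_line=""
  IFS= read -r first_line || true
  case "$first_line" in
    SSH-2.0-*|SSH-1.99-*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example: nc -w 5 [EXTERNAL_IP] 22 | is_ssh_banner && echo "SSH port is reachable". The -w 5 timeout is an assumption, and its exact behavior differs between netcat variants.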

Try a new user

The issue that prevents you from logging in might be limited to your account (for example, if the permissions on the ~/.ssh/authorized_keys file on the instance were set incorrectly).

Try using the gcloud tool to log in as a fresh user, specifying another username with the SSH request. The gcloud tool will update the project's metadata to add the new user and allow SSH access.

user@local:~$ gcloud compute ssh [USER]@example-instance

where [USER] is a new username to log in with.
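
Once you are logged in as the new user, or through the serial console, you can check the permissions example mentioned above. With its default StrictModes setting, OpenSSH refuses key files and directories that are writable by group or others. The sketch below assumes GNU stat (stat -c), as found on most Linux images, and the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the path is NOT writable by group or others.
# Returns 2 if the path cannot be inspected.
not_group_or_world_writable() {
  local path="$1" mode
  mode="$(stat -c '%a' "$path")" || return 2
  # Treat the mode as octal and test the group-write and other-write bits.
  [ $(( 0$mode & 022 )) -eq 0 ]
}
```

A possible use, run with sufficient privileges for the affected account: not_group_or_world_writable /home/USER/.ssh/authorized_keys || echo "permissions are too open", where USER is the account that cannot log in.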

Use your disk on a new instance

If the preceding steps don't work and the instance boots from a persistent disk, you can delete the instance while keeping its boot disk, and then attach that disk to a new instance. Replace DISK with your disk name:

gcloud compute instances delete old-instance --keep-disks=boot
gcloud compute instances create new-instance --disk name=DISK,boot=yes,auto-delete=no
gcloud compute ssh new-instance

Inspect an instance without shutting it down

You might have an instance you can't connect to that continues to correctly serve production traffic. In this case, you might want to inspect the disk without interrupting the instance's ability to serve users. First, take a snapshot of the instance's boot disk, then create a new disk from that snapshot, create a temporary instance, and finally attach and mount the new persistent disk to your temporary instance to troubleshoot the disk.

  1. Create a new VPC network to host your cloned instance:

    gcloud compute networks create debug-network
  2. Add a firewall rule to allow SSH connections to the network:

    gcloud compute firewall-rules create debug-network-allow-ssh --network debug-network --allow tcp:22
  3. Create a snapshot of the disk in question, replacing DISK with the disk name:

    gcloud compute disks snapshot DISK --snapshot-names debug-disk-snapshot
  4. Create a new disk with the snapshot you just created:

    gcloud compute disks create example-disk-debugging --source-snapshot debug-disk-snapshot
  5. Create a new debugging instance without an external IP address:

    gcloud compute instances create debugger --network debug-network --no-address
  6. Attach the debugging disk to the instance:

    gcloud compute instances attach-disk debugger --disk example-disk-debugging --device-name example-disk-debugging
  7. Follow the instructions to connect to an instance without an external IP address.

  8. After you are logged in to the debugger instance, troubleshoot the instance. For example, you can look at the instance logs:

    $ sudo su -
    $ mkdir /mnt/myinstance
    $ mount /dev/disk/by-id/scsi-0Google_PersistentDisk_example-disk-debugging /mnt/myinstance
    $ cd /mnt/myinstance/var/log
    # Identify the issue preventing ssh from working
    $ ls
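
When looking through the mounted log directory, a small helper can narrow the search to SSH daemon messages. The sketch below assumes the logs are in auth.log (the Debian and Ubuntu convention) or secure (the Red Hat family convention). Many newer images log to the systemd journal instead, in which case these files might not exist. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last 20 sshd-related lines from the common
# authentication log files found in the given log directory.
show_sshd_log_lines() {
  local log_dir="$1" log_file
  for log_file in "$log_dir/auth.log" "$log_dir/secure"; do
    if [ -f "$log_file" ]; then
      grep -H 'sshd' "$log_file" | tail -n 20 || true
    fi
  done
}
```

With the disk mounted as in the last step, show_sshd_log_lines /mnt/myinstance/var/log prints the matching lines, each prefixed with the file it came from.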

Use a startup script

If none of the earlier suggestions helped, you can create a startup script to collect information right after the instance starts. Follow the instructions for running a startup script.

Afterward, reset your instance by running gcloud compute instances reset so that the new metadata takes effect. Alternatively, you can recreate your instance with a diagnostic startup script:

  1. Run gcloud compute instances delete with the --keep-disks flag.

    gcloud compute instances delete INSTANCE --keep-disks boot
  2. Add a new instance with the same disk and specify your startup script.

    gcloud compute instances create example-instance --disk name=DISK,boot=yes --metadata startup-script-url=URL

As a starting point, you can use the compute-ssh-diagnostic script to collect diagnostics information for most common issues.
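
If you prefer to write your own script, the following minimal sketch shows the general shape. It is not the compute-ssh-diagnostic script mentioned above, and the function name and the selection of checks are illustrative assumptions. It prints disk usage, running sshd processes, and the permissions of authorized_keys files, which correspond to the causes discussed on this page.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic startup script: prints a few facts that commonly
# explain SSH failures. Each check tolerates a missing tool or file.
collect_ssh_diagnostics() {
  echo "== disk usage on / =="
  df -h / || true
  echo "== sshd processes =="
  pgrep -l sshd 2>/dev/null || echo "no sshd process found (or pgrep unavailable)"
  echo "== authorized_keys permissions =="
  ls -l /home/*/.ssh/authorized_keys 2>/dev/null || echo "no authorized_keys files found"
}

collect_ssh_diagnostics
```

Startup script output is normally written to the instance's serial port output, so you can read the results there even if SSH still fails.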