Virtual machine troubleshooting

This page covers troubleshooting the virtual machine (VM) for the Application Operator (AO) in Google Distributed Cloud (GDC) air-gapped appliance.

Recover a full VM boot disk

If a VM runs out of space on the boot disk, for example, when an application fills the boot disk partition with logs, critical capabilities on the VM fail to work. You might not be able to add a new SSH key through the VirtualMachineAccessRequest resource or establish an SSH connection to the VM using existing keys.

This page describes how to create a new VM and attach the full boot disk to it as an additional disk so that you can recover the disk contents. These steps demonstrate the following:

  • Establish an SSH connection to the new VM.
  • Free up space by mounting the disk to recover and deleting unnecessary data.
  • Delete the new VM and reattach the original disk to the original VM.

Before you begin

Before continuing, ensure that you have project-level VM access. Follow the steps given to assign the Project VirtualMachine Admin (project-vm-admin) role.

To use gdcloud command-line interface (CLI) commands, ensure that you have downloaded, installed, and configured the gdcloud CLI. All commands for GDC air-gapped appliance use the gdcloud or kubectl CLI, and require an operating system (OS) environment.

Get the kubeconfig file path

To run commands against the admin cluster, complete the following steps:

  1. Locate the admin cluster name, or ask your Platform Administrator (PA) what the cluster name is.

  2. Sign in and generate the kubeconfig file for the admin cluster if you don't have one.

  3. Use the path to replace ADMIN_KUBECONFIG in these instructions.
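
As a quick sanity check before running the commands on this page, the following sketch confirms that the kubeconfig file is readable. The check_kubeconfig helper and the default path ./admin-kubeconfig.yaml are illustrative assumptions, not part of the gdcloud CLI or kubectl.

```shell
# Hypothetical default path; set ADMIN_KUBECONFIG to the file you generated.
ADMIN_KUBECONFIG="${ADMIN_KUBECONFIG:-./admin-kubeconfig.yaml}"

# check_kubeconfig PATH: prints "ok" if the file is readable, "missing" otherwise.
check_kubeconfig() {
  if [ -r "$1" ]; then
    echo "ok"
  else
    echo "missing"
  fi
}

check_kubeconfig "$ADMIN_KUBECONFIG"
```

If the check prints missing, generate the kubeconfig file again before continuing.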

Recover a VM boot disk that is out of space

To recover a VM boot disk that is out of space, complete the following steps:

  1. Stop the existing VM by following Stop a VM.

  2. Edit the existing VM:

    kubectl --kubeconfig ADMIN_KUBECONFIG edit \
        virtualmachine.virtualmachine.gdc.goog -n PROJECT VM_NAME
    

    Replace the existing VM disk name in the spec field with a new placeholder name:

    ...
    spec:
      disks:
      - boot: true
        virtualMachineDiskRef:
          name: VM_DISK_PLACEHOLDER_NAME
    
  3. Create a new VM with an image operating system (OS) different from the original VM. For example, if the original disk uses the OS ubuntu-2004, create the new VM with rocky-8.

  4. Attach the original disk as an additional disk to the new VM:

    ...
    spec:
      disks:
      - boot: true
        autoDelete: true
        virtualMachineDiskRef:
          name: NEW_VM_DISK_NAME
      - virtualMachineDiskRef:
          name: ORIGINAL_VM_DISK_NAME
    

    Replace the following:

    • NEW_VM_DISK_NAME: the name you give to the new VM disk.
    • ORIGINAL_VM_DISK_NAME: the name of the original VM disk.
  5. After you've created the VM and it is running, establish an SSH connection to the VM by following Connect to a VM.

  6. Create a directory and mount the original disk to a mount point. For example, /mnt/disks/new-disk.

  7. Check which files and directories in the mount directory use the most space:

    cd /mnt/disks/MOUNT_DIR
    du -hs -- * | sort -rh | head -10
    

    Replace MOUNT_DIR with the name of the directory where you mounted the original disk.

    The output is similar to the following:

    18G   home
    1.4G  usr
    331M  var
    56M   boot
    5.8M  etc
    36K   snap
    24K   tmp
    16K   lost+found
    16K   dev
    8.0K  run
    
  8. Check each of the largest files and directories to see how much space its contents use. This example checks the home directory because it uses 18G of space.

    cd home
    du -hs -- * | sort -rh | head -10
    

    The output is similar to the following:

    17G   log_file
    ...
    4.0K  readme.md
    4.0K  main.go
    

    In this example, log_file is a candidate to clear because it consumes 17G of space and is not necessary.

  9. Delete the files you don't need that consume extra space, or back up the files to the new VM boot disk:

    • Move the files you want to keep:

      mv /mnt/disks/MOUNT_DIR/home/FILENAME /home/backup/
      
    • Delete the files consuming extra space:

      rm /mnt/disks/MOUNT_DIR/home/FILENAME
      

      Replace FILENAME with the name of the file you want to move or delete.

  10. Log out of the new VM, then stop it by following Stop a VM.

  11. Edit the new VM to remove the original disk from the spec field:

    kubectl --kubeconfig ADMIN_KUBECONFIG \
        edit virtualmachine.virtualmachine.gdc.goog -n PROJECT NEW_VM_NAME
    

    Remove the virtualMachineDiskRef entry that contains the original VM disk name:

    spec:
      disks:
      - autoDelete: true
        boot: true
        virtualMachineDiskRef:
          name: NEW_VM_DISK_NAME
      - virtualMachineDiskRef: # Remove this entry
          name: ORIGINAL_VM_DISK_NAME # Remove this disk name
    
  12. Edit the original VM and replace the VM_DISK_PLACEHOLDER_NAME value that you set in step 2 with the original disk name:

    ...
    spec:
      disks:
      - boot: true
        virtualMachineDiskRef:
          name: VM_DISK_PLACEHOLDER_NAME # Replace this name with the original VM disk name
    
  13. Start the original VM. If you've cleared enough space, the VM boots successfully.

  14. If you don't need the new VM, delete the VM:

    kubectl --kubeconfig ADMIN_KUBECONFIG \
        delete virtualmachine.virtualmachine.gdc.goog -n PROJECT NEW_VM_NAME
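
Steps 6 and 7 can be sketched as two small shell helpers. This is a minimal sketch, not a definitive procedure: the helper names are hypothetical, and the device name /dev/sdb1 in the usage comment is an assumption, so run lsblk on the new VM to find the device that corresponds to the original disk.

```shell
# mount_recovery_disk DEVICE MOUNT_POINT
# Creates the mount point and mounts the device there. Requires sudo.
mount_recovery_disk() {
  sudo mkdir -p "$2" && sudo mount "$1" "$2"
}

# largest_entries DIR [COUNT]
# Lists the COUNT (default 10) largest top-level entries in DIR.
largest_entries() {
  ( cd "$1" && du -hs -- * | sort -rh | head -n "${2:-10}" )
}

# Example usage (the device name is an assumption; confirm it with lsblk):
#   lsblk
#   mount_recovery_disk /dev/sdb1 /mnt/disks/new-disk
#   largest_entries /mnt/disks/new-disk
```

The * glob skips hidden entries, so large dotfiles do not appear in this listing.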
    

Provision a virtual machine

This section describes how to troubleshoot issues that might occur while provisioning a new virtual machine (VM) in Google Distributed Cloud (GDC) air-gapped appliance.

The Application Operator (AO) must run all commands against the default user cluster.

Unable to create disk

If a PersistentVolumeClaim (PVC) is in a Pending state, review the following alternatives to resolve the state:

  • The storage class does not support creating a PVC with the ReadWriteMany access mode:

    1. Update the spec.dataVolumeTemplate.spec.pvc.storageClassName value of the virtual machine with a storage class that supports a ReadWriteMany access mode and uses a Container Storage Interface (CSI) driver as its storage provisioner.

    2. If no other storage classes on the cluster can provide the ReadWriteMany capability, update the spec.dataVolumeTemplate.spec.pvc.accessMode value to include the ReadWriteOnce access mode.

  • The CSI driver is unable to provision a PersistentVolume:

    1. Check for an error message:

      kubectl describe pvc VM_NAME-boot-dv -n NAMESPACE_NAME
      

      Replace the following variables:

      • VM_NAME: The name of the virtual machine.
      • NAMESPACE_NAME: The name of the namespace.
    2. Configure the driver to resolve the error. To verify that PersistentVolume provisioning works, create a test PVC with a different name than the one specified in the dataVolumeTemplate.spec.pvc field:

      cat <<EOF | kubectl apply -f -
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: test-pvc
        namespace: NAMESPACE_NAME
      spec:
        storageClassName: standard-rwx
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 10Gi
      EOF
      
    3. After you verify that the PersistentVolume is provisioned successfully, delete the test PVC:

      kubectl delete pvc test-pvc -n NAMESPACE_NAME
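
To verify the test PVC before deleting it, one option is to read its phase with a JSONPath query. The pvc_is_bound helper below is a hypothetical wrapper, and it assumes kubectl already points at the right cluster. A PVC whose phase is Bound has a PersistentVolume provisioned for it.

```shell
# pvc_is_bound NAME NAMESPACE
# Succeeds (exit status 0) when the PVC reports the "Bound" phase.
pvc_is_bound() {
  phase=$(kubectl get pvc "$1" -n "$2" -o jsonpath='{.status.phase}')
  [ "$phase" = "Bound" ]
}

# Example usage:
#   pvc_is_bound test-pvc NAMESPACE_NAME && echo "provisioning works"
```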
      

Unable to create a virtual machine

If the virtual machine resource is applied but does not get to a Running state, follow these steps:

  1. Review the virtual machine status:

    kubectl get vm VM_NAME -n NAMESPACE_NAME
    
  2. Check the corresponding Pod status of the virtual machine:

    kubectl get pod -l kubevirt.io/vm=VM_NAME
    

    The output shows the Pod status. The following sections describe how to resolve each possible state.
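
As a rough triage aid, the following sketch reads the Pod status and maps it to the troubleshooting sections on this page. Both helper names are hypothetical, and the sketch assumes the STATUS value is the third column of the kubectl get pod output, which holds for the default column layout.

```shell
# pod_status VM_NAME: prints the STATUS column of the first matching Pod.
pod_status() {
  kubectl get pod -l "kubevirt.io/vm=$1" --no-headers | awk 'NR == 1 { print $3 }'
}

# next_step STATUS: maps a Pod status to the matching section of this page.
next_step() {
  case "$1" in
    ContainerCreating) echo "check volumes, secrets, and cluster resources" ;;
    Error|CrashLoopBackOff) echo "retrieve logs from the compute container" ;;
    Running) echo "retrieve logs from the log container and check the boot order" ;;
    *) echo "no guidance on this page for status: $1" ;;
  esac
}

# Example usage:
#   next_step "$(pod_status VM_NAME)"
```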

The ContainerCreating state

If the Pod is in the ContainerCreating state, follow these steps:

  1. Get additional details about the Pod's state:

    kubectl describe pod -l kubevirt.io/vm=VM_NAME
    
  2. If any volumes are not mounted, ensure that all the volumes specified in the spec.volumes field mount successfully. If a volume is a disk, check the disk status.

  3. If the spec.accessCredentials field specifies a secret to mount an SSH public key, ensure that the secret is created in the same namespace as the virtual machine.

If there are not enough resources on the cluster to create the Pod, follow these steps:

  1. If the cluster does not have enough compute resources to schedule the virtual machine Pod, remove other unwanted Pods to help release resources.

  2. Reduce the spec.domain.resources.requests.cpu and spec.domain.resources.requests.memory values of the virtual machine.
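
To confirm that scheduling is the problem, one option is to list the events the scheduler records when it cannot place a Pod. This is a minimal sketch: the scheduling_failures helper name is hypothetical, and it assumes your role allows reading events in the namespace.

```shell
# scheduling_failures NAMESPACE
# Lists events recorded when the scheduler could not place a Pod.
scheduling_failures() {
  kubectl get events -n "$1" --field-selector reason=FailedScheduling
}

# Example usage:
#   scheduling_failures NAMESPACE_NAME
```

The event message typically states which resource is insufficient.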

The Error or CrashLoopBackOff state

To resolve Pods in the Error or CrashLoopBackOff state, retrieve logs from the compute container of the virtual machine Pod:

    kubectl logs -l kubevirt.io/vm=VM_NAME -c compute
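
When a label selector (-l) is used, kubectl logs returns only the last 10 lines per container unless --tail is set. The following sketch requests more lines. The compute_logs helper name and the default of 200 lines are illustrative assumptions.

```shell
# compute_logs VM_NAME [LINES]
# Prints the last LINES (default 200) log lines from the compute container.
compute_logs() {
  kubectl logs -l "kubevirt.io/vm=$1" -c compute --tail="${2:-200}"
}

# Example usage:
#   compute_logs VM_NAME 500
```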

The Running state and virtual machine failure

If the Pod is in the Running state but the virtual machine itself fails, follow these steps:

  1. View the logs from the log container of the virtual machine Pod:

    kubectl logs -l kubevirt.io/vm=VM_NAME -c log
    
  2. If the log shows errors during virtual machine startup, check that the virtual machine boots from the correct device. Set the spec.domain.devices.disks.bootOrder value of the primary boot disk to 1. Use the following example as a reference:

    ...
    spec:
      domain:
        devices:
          disks:
          - bootOrder: 1
            disk:
              bus: virtio
            name: VM_NAME-boot-dv
    ...
    

To troubleshoot configuration issues with the virtual machine image, create another virtual machine with a different image.