Backing up and restoring clusters

This page shows how to backup and restore the etcd store for a cluster. This page also provides a script that you can use to automatically back up a cluster's etcd store.

You can create a backup file for recovery from foreseen disasters that might damage your cluster's etcd data. Store the backup file in a location that is outside of the cluster and is not dependent on the cluster's operation.

Limitations

  • This procedure does not back up application-specific data.

  • This procedure does not back up your PersistentVolumes.

  • Workloads scheduled after you create a backup aren't restored with that backup.

  • You cannot restore a cluster after a failed upgrade.

  • This procedure is not intended to restore a deleted cluster.

Backing up a user cluster

A user cluster backup is a snapshot of the user cluster's etcd store. The etcd store contains all of the Kubernetes objects and custom objects required to manage cluster state. The snapshot contains the data required to recreate the cluster's components and workloads.

To create a snapshot of the etcd store, perform the following steps:

  1. Get a shell into the kube-etcd container:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it \
       kube-etcd-0 --container kube-etcd --namespace USER_CLUSTER_NAME \
       -- bin/sh
    

    where:

    • ADMIN_CLUSTER_KUBECONFIG is the admin cluster's kubeconfig file.
    • USER_CLUSTER_NAME is the name of the user cluster.
  2. In your shell, in the root directory, create backup named snapshot.db:

    ETCDCTL_API=3 etcdctl \
       --endpoints=https://127.0.0.1:2379 \
       --cacert=/etcd.local.config/certificates/etcdCA.crt \
       --cert=/etcd.local.config/certificates/etcd.crt \
       --key=/etcd.local.config/certificates/etcd.key \
       snapshot save snapshot.db
    
  3. In your shell, enter exit to exit the shell.

  4. Copy snapshot.db from the kube-etcd container to the current directory:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG cp \
       USER_CLUSTER_NAME/kube-etcd-0:snapshot.db \
       --container kube-etcd snapshot.db
    

Restoring a user cluster from a backup (non-HA)

Before you use a backup file to restore your user cluster's etcd store, diagnose your cluster and resolve existing issues. Using a backup to restore a problematic cluster might re-create or exacerbate issues. Contact the GKE on-prem support team for further assistance with restoring your clusters.

The following instructions explain how to use a backup file to restore a user cluster in cases where the cluster's etcd data has become damaged and the user cluster's etcd Pod is crashlooping.

You can restore the etcd data by deploying a utility Pod that overwrites the damaged data with the backup. The admin cluster's API server must be running and the admin cluster's scheduler must be able to schedule new Pods.

  1. Copy the following Pod manifest to a file named etcd-utility.yaml. Replace these placeholders with values:

    • NODE_NAME: the node where the kube-etcd-0 Pod is running.

    • ADMIN_CLUSTER_KUBECONFIG: the admin cluster's kubeconfig file.

    • USER_CLUSTER_NAME: the name of the user cluster.

    • GKE_ON_PREM_VERSION: the version of the cluster where you want to perform the etcd restore (for example, 1.5.0-gke.0).

    apiVersion: v1
    kind: Pod
    metadata:
      name: etcd-utility-0
      namespace: USER_CLUSTER_NAME
    spec:
      containers:
      - command: ["/bin/sh"]
        args: ["-ec", "while :; do echo '.'; sleep 5 ; done"]
        image: gcr.io/gke-on-prem-release/etcd-util:GKE_ON_PREM_VERSION
        name: etcd-utility
        volumeMounts:
        - mountPath: /var/lib/etcd
          name: data
        - mountPath: /etcd.local.config/certificates
          name: etcd-certs
      nodeSelector:
        kubernetes.googleapis.com/cluster-name: USER_CLUSTER_NAME
        kubernetes.io/hostname: NODE_NAME
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      - effect: NoSchedule
        key: node.kubernetes.io/unschedulable
        operator: Exists
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-kube-etcd-0
      - name: etcd-certs
        secret:
          defaultMode: 420
          secretName: kube-etcd-certs
    
  2. Deploy the utility Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
       create -f etcd-utility.yaml --namespace USER_CLUSTER_NAME
    
  3. Copy snapshot.db from the current directory to the root directory of the utility Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG cp snapshot.db \
       USER_CLUSTER_NAME/etcd-utility-0:snapshot.db --container etcd-utility
    
  4. Get a shell into the etcd-utility container:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec it \
       etcd-utility-0 --container etcd-utility --namespace USER_CLUSTER_NAME \
       -- bin/sh
    
  5. In your shell, in the root directory, run the following command to create a new folder that contains the backup:

    ETCDCTL_API=3 etcdctl \
       --endpoints=https://127.0.0.1:2379 \
       --cacert=/etcd.local.config/certificates/etcdCA.crt \
       --cert=/etcd.local.config/certificates/etcd.crt \
       --key=/etcd.local.config/certificates/etcd.key \
       snapshot restore snapshot.db
    
  6. In your shell, delete the old etcd data:

    rm -r var/lib/etcd/*
    
  7. In your shell copy the restored etcd data to its permanent location:

    cp -r default.etcd/* var/lib/etcd/
    
  8. In your shell, enter exit to exit the shell.

  9. Delete the crashing etcd Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
       delete pod kube-etcd-0 --namespace USER_CLUSTER_NAME
    
  10. Verify that the etcd Pod is no longer crashing.

  11. Delete the utility Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
       delete pod etcd-utility-0 --namespace USER_CLUSTER_NAME
  12. Remove etcd-utility.yaml from the current directory:

    rm etcd-utility.yaml
    

Restoring a user cluster from a backup (HA)

This section shows how to restore the etcd data for a high-availability (HA) user cluster.

For an HA user cluster, there are three nodes in the admin cluster that serve as control planes for the user cluster. Each of those nodes runs an etcd Pod that maintains etcd data on a storage volume.

If two of the etcd Pods are healthy, and the data on the associated storage volumes is intact, then there is no need to use a backup file. That is because you still have an etcd quorum.

In the rare case that two of the etcd storage volumes have corrupt data, you need to use a backup file to restore the etcd data.

To do the steps in this section, you must have already created a snapshot.db file as described in Backing up a user cluster.

Listing your etcd Pods and nodes

  1. List the etcd Pods that manage the etcd store for your user cluster. These Pods run in the admin cluster:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --namespace USER_CLUSTER_NAME \
        --output wide | grep kube-etcd
    

    The output shows the etcd Pods and the nodes where the Pods run. The nodes shown in the output are nodes in the admin cluster that serve as control planes for your user cluster:

    NAME              ...   NODE
    kube-etcd-0       ...   node-xxx
    kube-etcd-1       ...   node-yyy
    kube-etcd-2       ...   node-zzz
    
  2. Make a note of the Pod names and the control plane node names for later.

    Notice that each etcd Pod is named kube-etcd appended with a number. This number is called the member number for the Pod. It identifies the Pod as being a particular member of the etcd cluster that holds the object data for your user cluster. This guide uses the placeholder MEMBER_NUMBER to refer to the etcd Pod member number.

    Also notice that each Pod in your etcd cluster runs on its own node.

Preparing to deploy the utility Pods

  1. Save a manifest for the PodDisruptionBudget (PDB) for the user cluster's Kubernetes API server. Then delete the PDB.

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG get pdb --namespace USER_CLUSTER_NAME \
       kube-apiserver-pdb --output yaml > kube-apiserver-pdb.yaml
    
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG delete pdb --namespace USER_CLUSTER_NAME \
       kube-apiserver-pdb
    
  2. Stop the Kubernetes API server and the etcd maintenance Deployment. This ensures that no components will use etcd during restoration:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG --namespace USER_CLUSTER_NAME \
       scale --replicas 0 statefulset kube-apiserver
    
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG --namespace USER_CLUSTER_NAME \
       scale --replicas 0 deployment gke-master-etcd-maintenance
    
  3. Recall the name of the container image for your etcd Pods.

Deploying the utility Pods

Do the steps in this section for each of your etcd Pods.

  1. Recall the name of the etcd Pod and the name of the node where the Pod runs.

  2. Save the following Pod manifest in the current directory in a file named etcd-utility-MEMBER_NUMBER.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: etcd-utility-MEMBER_NUMBER
      namespace: USER_CLUSTER_NAME
    spec:
      containers:
      - command: ["/bin/sh"]
        args: ["-ec", "while :; do echo '.'; sleep 5 ; done"]
        image: gcr.io/gke-on-prem-release/etcd-util:GKE_ON_PREM_VERSION
        name: etcd-utility
        volumeMounts:
        - mountPath: /var/lib/etcd
          name: data
        - mountPath: /etcd.local.config/certificates
          name: etcd-certs
      nodeSelector:
        kubernetes.googleapis.com/cluster-name: USER_CLUSTER_NAME
        kubernetes.io/hostname: NODE_NAME
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      - effect: NoSchedule
        key: node.kubernetes.io/unschedulable
        operator: Exists
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-kube-etcd-MEMBER_NUMBER
      - name: etcd-certs
        secret:
          defaultMode: 420
          secretName: kube-etcd-certs
    

    The preceding manifest describes a utility Pod that you run temporarily to restore etcd data.

  3. Create the utility Pod in your admin cluster:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f etcd-utility-MEMBER_NUMBER.yaml
    
  4. Copy your backup file, snapshot.db, to the root directory of your utility Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG cp snapshot.db \
       USER_CLUSTER_NAME/etcd-utility-MEMBER_NUMBER:snapshot.db
    
  5. Get a shell into the etcd-utility container in the utility Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG exec -it --namespace USER_CLUSTER_NAME \
       etcd-utility-MEMBER_NUMBER --container etcd-utility -- bin/sh
    
  6. In your shell, in the root directory, use snapshot.db to restore the etcd data:

    ETCDCTL_API=3 etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etcd.local.config/certificates/etcdCA.crt \
        --cert=/etcd.local.config/certificates/etcd.crt \
        --key=/etcd.local.config/certificates/etcd.key \
        --name=kube-etcd-MEMBER_NUMBER \
        --initial-cluster=kube-etcd-0=https://kube-etcd-0.kube-etcd:2380,kube-etcd-1=https://kube-etcd-1.kube-etcd:2380,kube-etcd-2=https://kube-etcd-2.kube-etcd:2380 \
        --initial-cluster-token=etcd-cluster-1 \
        --initial-advertise-peer-urls=https://kube-etcd-MEMBER_NUMBER.kube-etcd:2380 \
        snapshot restore snapshot.db
    

    The preceding command stored etcd data in the /kube-etcd-MEMBER_NUMBER.etcd directory.

  7. In your shell, delete the old etcd data:

    rm -r var/lib/etcd/*
    
  8. In your shell, copy the restored etcd data to its permanent location:

    cp -r kube-etcd-MEMBER_NUMBER.etcd/* var/lib/etcd/
    
  9. In your shell, remove the temporary etcd directory and the backup file:

    rm -R kube-etcd-MEMBER_NUMBER.etcd/
    rm snapshot.db
    
  10. In your shell, enter exit to exit the shell.

  11. Delete the utility Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG delete pod \
        --namespace USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER
    

Restarting components

Now that you have deployed and deleted your utility Pods, you need to restart some cluster components.

  1. Restart the Pods in the kube-etcd StatefulSet:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG restart rollout statefulset \
        --namespace USER_CLUSTER_NAME kube-etcd
    
  2. Start the Kubernetes API servers for your user cluster:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG scale statefulset --replicas 3 \
       --namespace USER_CLUSTER_NAME kube-apiserver
    
  3. Start the etcd maintenance Deployment for your user cluster:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG scale deployment --replicas 1 \
        --namespace=USER_CLUSTER_NAME  gke-master-etcd-maintenance
    
  4. Restore the PDB for the Kubernetes API server:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONIFG apply -f kube-apiserver-pdb.yaml
    

Backing up an admin cluster

An admin cluster backup contains the following:

  • A snapshot of the admin cluster's etcd.
  • Admin control plane's Secrets, which are required for authenticating to the admin and user clusters.

Complete the following steps before you create an admin cluster backup:

  1. Find the admin cluster's external IP address, which is used to SSH in to the admin cluster control plane:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -n kube-system -o wide | grep master

    where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.

  2. Create an SSH key called vsphere_tmp from the admin cluster's private key.

    You can find the private key from the admin clusters Secrets:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secrets sshkeys -n kube-system -o yaml

    In the command output, you can find the private key in the vsphere_tmp field.

    Copy the private key to vsphere_tmp:

    echo "[PRIVATE_KEY]" | base64 -d > vsphere_tmp; chmod 600 vsphere_tmp
  3. Check that you can shell into the admin control plane using this private key:

    ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
    
  4. Exit the container:

    exit

Backing up an admin cluster's etcd store

To back up the admin cluster's etcd store:

  1. Get the etcd Pod's name:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get pods \
        -n kube-system -l component=etcd,tier=control-plane -ojsonpath='{$.items[*].metadata.name}{"\n"}'
  2. Shell into Pod's kube-etcd container:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG]  exec -it \
        -n kube-system [ADMIN_ETCD_POD] -- bin/sh

    where [ADMIN_ETCD_POD] is the name of the etcd Pod.

  3. From the shell, use etcdctl to a create backup named snapshot.db in the local directory:

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save snapshot.db
    
  4. Exit the container:

    exit
  5. Copy the backup out of the kube-etcd container using kubectl cp:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    kube-system/[ADMIN_ETCD_POD]:snapshot.db [RELATIVE_DIRECTORY]
    

    where [RELATIVE_DIRECTORY] is a path where you want to store your backup.

Backing up an admin cluster's Secrets

To back up the admin control plane's Secrets:

  1. Use SSH to connect to the admin control plane node:

    ssh -i vsphere_tmp ubuntu@EXTERNAL_IP
    

    Replace EXTERNAL_IP with the admin control plane's external IP address, which you noted previously.

  2. Optional but highly recommended: Create a local backup directory.

    You need to change the backup Secrets' permissions to copy them out of the node.

    mkdir backup
  3. Locally copy the Secrets to the local backup directory:

    sudo cp -r /etc/kubernetes/pki/* backup/
  4. Change the permissions of the backup Secrets:

    sudo chmod -R +rw backup/
  5. Exit the admin control plane node:

    exit
  6. Run scp to copy the backup folder out of the admin control plane node:

    sudo scp -r -i vsphere_tmp  ubuntu@EXTERNAL_IP:backup/ RELATIVE_DIRECTORY
    

    Replace RELATIVE_DIRECTORY with a path where you want to store your backup.

Restoring an admin cluster

The following procedure recreates a backed-up admin cluster and all of the user control planes it managed when its etcd snapshot was created.

  1. Run scp to copy snapshot.db to the admin control plane:

    sudo scp -i vsphere_tmp snapshot.db ubuntu@[EXTERNAL_IP]:

    where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.

  2. Shell into the admin control plane:

    sudo ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
    
  3. Copy snapshot.db/ to /mnt:

    sudo cp snapshot.db /mnt/
  4. Make temporary directory, like backup:

    mkdir backup
  5. Exit the admin control plane:

    exit
  6. Copy the certificates to backup/:

    sudo scp -r -i vsphere_tmp [BACKUP_CERT_FILE] ubuntu@[EXTERNAL_IP]:backup/
  7. Shell into the admin control plane node:

    ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
    

    where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.

  8. Stop kube-etcd and kube-apiserver.

    sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml
    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml
  9. Copy the backup Secrets to /etc/kubernetes/pki/:

    sudo cp -r backup/* /etc/kubernetes/pki/
  10. Run etcdctl restore with Docker:

    sudo docker run --rm \
    -v '/mnt:/backup' \
    -v '/var/lib/etcd:/var/lib/etcd' --env ETCDCTL_API=3 'gcr.io/gke-on-prem-release/etcd-util:GKE_ON_PREM_VERSION' /bin/sh -c "etcdctl snapshot restore '/backup/snapshot.db'; rm -r /var/lib/etcd/*; mv /default.etcd/member/ /var/lib/etcd/"
  11. Restart kube-etcd and kube-apiserver.

    sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
    sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
  12. Exit the admin control plane:

    exit
  13. Copy the newly generated kubeconfig file out of the admin node:

    sudo scp -i vsphere_tmp ubuntu@[EXTERNAL_IP]:[HOME]/.kube/config kubeconfig

    where:

    • [EXTERNAL_IP] is the admin control plane's external IP address.
    • [HOME] is the home directory on the admin node.

    Now you can use this new kubeconfig file to access the restored cluster.

Automatic cluster backup

You can use the script given here as an example on how to automatically back up your clusters. Note that the following script is not supported, and should only be used as reference to write a better, more robust and complete script. Before you run the script, fill in values for the five variables at the beginning of the script:

  • Set BACKUP_DIR to the path where you want to store the admin and user cluster backups. This path should not exist.
  • Set ADMIN_CLUSTER_KUBECONFIG to the path of the admin cluster's kubeconfig file
  • Set USER_CLUSTER_NAMESPACE to the name of your user cluster. The name of your user cluster is a namespace in the admin cluster.
  • Set EXTERNAL_IP to the VIP that you reserved for the admin control plane service.
  • Set SSH_PRIVATE_KEY to the path of the SSH key you created when you set up your admin workstation.
  • If you are using a private network, set JUMP_IP to your network's jump server's IP address.
#!/usr/bin/env bash
 
# Automates manual steps for taking backups of user and admin clusters.
# Fill in the variables below before running the script.
 
BACKUP_DIR=""                       # path to store user and admin cluster backups
ADMIN_CLUSTER_KUBECONFIG=""         # path to admin cluster kubeconfig
USER_CLUSTER_NAMESPACE=""           # user cluster namespace
EXTERNAL_IP=""                      # admin control plane node external ip - follow steps in documentation
SSH_PRIVATE_KEY=""                  # path to vsphere_tmp ssh private key - follow steps in documentation
JUMP_IP=""                          # network jump server IP - leave empty string if not using private network.
 
mkdir -p $BACKUP_DIR
mkdir $BACKUP_DIR/pki
 
# USER CLUSTER BACKUP
 
# Snapshot user cluster etcd
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n ${USER_CLUSTER_NAMESPACE} kube-etcd-0 -c kube-etcd -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key snapshot save ${USER_CLUSTER_NAMESPACE}_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp ${USER_CLUSTER_NAMESPACE}/kube-etcd-0:${USER_CLUSTER_NAMESPACE}_snapshot.db $BACKUP_DIR/user-cluster_${USER_CLUSTER_NAMESPACE}_snapshot.db 
 
# ADMIN CLUSTER BACKUP
 
# Set up ssh options
SSH_OPTS=(-oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY})
if [ "${JUMP_IP}" != "" ]; then
    SSH_OPTS+=(-oProxyCommand="ssh -oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY} -W %h:%p ubuntu@${JUMP_IP}")
fi
 
# Copy admin certs
ssh "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP} 'sudo chmod -R +rw /etc/kubernetes/pki/*'
scp -r "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP}:/etc/kubernetes/pki/* ${BACKUP_DIR}/pki/
 
# Snapshot admin cluster etcd
admin_etcd=$(kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} get pods -n kube-system -l component=etcd,tier=control-plane -ojsonpath='{$.items[*].metadata.name}{"\n"}')
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n kube-system ${admin_etcd} -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save admin_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp -n kube-system ${admin_etcd}:admin_snapshot.db $BACKUP_DIR/admin-cluster_snapshot.db

Troubleshooting

For more information, refer to Troubleshooting.

Diagnosing cluster issues using gkectl

Use gkectl diagnosecommands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.

Default logging behavior

For gkectl and gkeadm it is sufficient to use the default logging settings:

  • By default, log entries are saved as follows:

    • For gkectl, the default log file is /home/ubuntu/.config/gke-on-prem/logs/gkectl-$(date).log, and the file is symlinked with the logs/gkectl-$(date).log file in the local directory where you run gkectl.
    • For gkeadm, the default log file is logs/gkeadm-$(date).log in the local directory where you run gkeadm.
  • All log entries are saved in the log file, even if they are not printed in the terminal (when --alsologtostderr is false).
  • The -v5 verbosity level (default) covers all the log entries needed by the support team.
  • The log file also contains the command executed and the failure message.

We recommend that you send the log file to the support team when you need help.

Specifying a non-default location for the log file

To specify a non-default location for the gkectl log file, use the --log_file flag. The log file that you specify will not be symlinked with the local directory.

To specify a non-default location for the gkeadm log file, use the --log_file flag.

Locating Cluster API logs in the admin cluster

If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:

  1. Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
  2. Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:

    kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager

What's next