Shut down and power up the device

This page describes how to shut down and power up the Google Distributed Cloud (GDC) air-gapped appliance, for example, to move the device to a new location.

You might use the GDC air-gapped appliance in transient operational locations, where you must shut down the device to transport it between locations. You might also need to restore the device after a power failure, because generators might power it in rugged environments.

Before you begin

Stop all workloads before proceeding. Google cannot guarantee the state of workloads that are active during a shutdown.

Prerequisites

  1. You can run this runbook from a laptop or workstation connected to the Google Distributed Cloud (GDC) air-gapped appliance's network. Alternatively, you can connect a laptop or workstation to the switch by following Connect the device.
  2. Make sure that you have access to the kubeconfig file for the root-admin cluster.
  3. Set the correct KUBECONFIG environment variable by running export KUBECONFIG=PATH_TO_KUBECONFIG, replacing PATH_TO_KUBECONFIG with the path to the kubeconfig file. See the example after this list.
  4. Ensure that you have the SSH key and certificate.
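
For Step 3, a minimal setup-and-verification sketch might look like the following. PATH_TO_KUBECONFIG is a placeholder for the path to the root-admin cluster kubeconfig file:

    # Point kubectl at the root-admin cluster and confirm that the API server is reachable.
    export KUBECONFIG=PATH_TO_KUBECONFIG
    kubectl cluster-info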

Shut down the blades

  1. Get information about the nodes by running kubectl get nodes -A -o wide.

  2. Pause BareMetalHost sync by running the following command for each node, one node at a time. Replace NODE_NAME with the node names obtained in Step 1:

    kubectl annotate bmhost -n gpc-system NODE_NAME "baremetalhost.metal3.io/paused=true" --overwrite
    

    The output might look like this example:

    baremetalhost.metal3.io/**-**-bm01 annotated
    baremetalhost.metal3.io/**-**-bm02 annotated
    baremetalhost.metal3.io/**-**-bm03 annotated
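
    If you prefer, you can pause the sync for all nodes in a single loop. The following sketch assumes that each BareMetalHost object in the gpc-system namespace has the same name as its node, as in the example output above:

    # Sketch only: pause BareMetalHost sync for every node returned by kubectl get nodes.
    for NODE in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
      kubectl annotate bmhost -n gpc-system "${NODE}" "baremetalhost.metal3.io/paused=true" --overwrite
    done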
    
  3. Cordon all nodes one by one:

    kubectl cordon NODE_NAME
    

    The output might look like this example:

    node/**-**-bm01 cordoned
    node/**-**-bm02 cordoned
    node/**-**-bm03 cordoned
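
    To confirm that the cordon took effect, you can list the nodes again; cordoned nodes report SchedulingDisabled in the STATUS column:

    # Optional check: each cordoned node shows Ready,SchedulingDisabled.
    kubectl get nodes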
    
  4. To determine the etcd leader node and the follower nodes, run the following sub-steps for each node, one node at a time:

    1. Find the target IP addresses for SSH in the INTERNAL-IP column of the output from kubectl get nodes -A -o wide. Establish an SSH connection:

      ssh root@INTERNAL-IP
      
    2. To determine whether the current node is the etcd leader or a follower, run the following command inside the SSH session:

      ETCDCTL_API=3 etcdctl \
          --cacert /etc/kubernetes/pki/etcd/ca.crt \
          --cert /etc/kubernetes/pki/etcd/server.crt \
          --key /etc/kubernetes/pki/etcd/server.key \
          --write-out=table endpoint status
      

      Pay attention to the IS LEADER field.

      The output might look like this example for the etcd leader node:

      [root@**-**-bm0* ~]# ETCDCTL_API=3 etcdctl \
      >      --cacert /etc/kubernetes/pki/etcd/ca.crt \
      >      --cert /etc/kubernetes/pki/etcd/server.crt \
      >      --key /etc/kubernetes/pki/etcd/server.key \
      >      --write-out=table endpoint status
      +----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |    ENDPOINT    |        ID        |   VERSION    | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
      | ************** | **************** | 3.4.30-gke.1 |  162 MB |      true |      false |      3641 |   12957958 |           12957958 |        |
      +----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
      

      The output might look like this example for the two etcd follower nodes:

      [root@**-**-bm0* ~]# ETCDCTL_API=3 etcdctl \
      >      --cacert /etc/kubernetes/pki/etcd/ca.crt \
      >      --cert /etc/kubernetes/pki/etcd/server.crt \
      >      --key /etc/kubernetes/pki/etcd/server.key \
      >      --write-out=table endpoint status
      +----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |    ENDPOINT    |        ID        |   VERSION    | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
      | ************** | **************** | 3.4.30-gke.1 |  163 MB |     false |      false |      3641 |   12957404 |           12957404 |        |
      +----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
      

      Note which node is the etcd leader and which two nodes are the etcd followers.
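
    To check all nodes without repeating the sub-steps manually, you can run the etcdctl command over SSH in a loop. This is a sketch only; it assumes SSH access as root to each INTERNAL-IP and the certificate paths shown above:

    # Query etcd endpoint status on every node and print the table so that the
    # IS LEADER column can be read for each node.
    for IP in $(kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'); do
      echo "--- ${IP} ---"
      ssh root@"${IP}" 'ETCDCTL_API=3 etcdctl \
          --cacert /etc/kubernetes/pki/etcd/ca.crt \
          --cert /etc/kubernetes/pki/etcd/server.crt \
          --key /etc/kubernetes/pki/etcd/server.key \
          --write-out=table endpoint status'
    done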

  5. Drain the two etcd follower nodes. Do not drain the etcd leader node.

    kubectl drain NODE_NAME --delete-emptydir-data --grace-period 900 --ignore-daemonsets --disable-eviction
    

    The output might look like this:

    node/**-**-bm01 already cordoned
    WARNING: ignoring DaemonSet-managed Pods: kube-system/anetd-krj2z, kube-system/etcd-defrag-xh469, kube-system/ipam-controller-manager-2f4dz, kube-system/istio-cni-node-cgqv4, kube-system/kube-proxy-5mwf2, kube-system/localpv-mn2jh, kube-system/metallb-speaker-6l7sv, mon-system/mon-node-exporter-backend-nd8mp, netapp-trident/netapp-trident-node-linux-rrlmd, obs-system/anthos-audit-logs-forwarder-tpfqv, obs-system/anthos-log-forwarder-npjh4, obs-system/kube-control-plane-metrics-proxy-wp8nh, obs-system/log-failure-detector-crbnv, obs-system/oplogs-forwarder-sqwvj, vm-system/macvtap-v9pgp, vm-system/virt-handler-86khx
    pod/grafana-0 deleted
    pod/capi-kubeadm-bootstrap-controller-manager-1.30.400-gke.136lvgtf deleted
    pod/grafana-0 deleted
    pod/grafana-proxy-server-86d8fc4758-mkc4f deleted
    .
    .
    .
    
    node/**-**-bm02 already cordoned
    WARNING: ignoring DaemonSet-managed Pods: kube-system/anetd-v75jz, kube-system/etcd-defrag-t5jnc, kube-system/ipam-controller-manager-5958m, kube-system/istio-cni-node-ggv4c, kube-system/kube-proxy-r6x46, kube-system/localpv-g56xc, kube-system/metallb-speaker-tmw72, mon-system/mon-node-exporter-backend-9rs7k, netapp-trident/netapp-trident-node-linux-9jmfp, obs-system/anthos-audit-logs-forwarder-bwns9, obs-system/anthos-log-forwarder-lbskj, obs-system/kube-control-plane-metrics-proxy-grthl, obs-system/log-failure-detector-dzh4v, obs-system/oplogs-forwarder-vdn7z, vm-system/macvtap-mjwtc, vm-system/virt-handler-dlqvv
    pod/vai-web-plugin-backend-5dfd6d6597-nxxgn
    pod/vai-web-plugin-frontend-6b5468968b-mrr7g
    pod/grafana-proxy-server-64b759fbf6-b8pl8
    pod/iam-bundledidp-backend-0
    .
    .
    .
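
    For example, with the two follower names noted in Step 4 (FOLLOWER_1 and FOLLOWER_2 are placeholders for those node names), the drain can be scripted as a short loop:

    # Sketch only: drain the two etcd follower nodes; the leader is intentionally left alone.
    for NODE in FOLLOWER_1 FOLLOWER_2; do
      kubectl drain "${NODE}" --delete-emptydir-data --grace-period 900 --ignore-daemonsets --disable-eviction
    done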
    
  6. Gracefully shut down the two etcd follower nodes by following the next step for each of the two nodes, one at a time.

  7. Turn off NODE_NAME using iLO:

    1. Retrieve the username for iLO:

      kubectl get secret bmc-credentials-NODE_NAME -n gpc-system -o jsonpath="{.data.username}" | base64 --decode
      
    2. Retrieve the password for iLO:

      kubectl get secret bmc-credentials-NODE_NAME -n gpc-system -o jsonpath="{.data.password}" | base64 --decode
      
    3. Retrieve the BMC IP address for NODE_NAME from the BMC-IP column of the output of the following command:

      kubectl get servers -A
      
    4. Go to the BMC IP address obtained in the previous step and sign in with the username and password that you retrieved.

    5. Hold the pointer over the first button in the top row; it should display Power: ON. Click it, and in the drop-down menu that appears, click the first item, labeled Momentary Press. The button color changes from green to orange, which means the node is shutting down. Wait for the button to turn yellow, indicating that the machine has powered off. This takes a few minutes.
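
    As a convenience, the lookups in sub-steps 1 to 3 can be gathered into one short script. This is a sketch only; NODE_NAME is a placeholder for the node name, and the secret naming follows the commands above:

    # Collect the iLO username and password for one node into shell variables.
    NODE_NAME=NODE_NAME   # replace with the node name
    ILO_USERNAME=$(kubectl get secret bmc-credentials-"${NODE_NAME}" -n gpc-system -o jsonpath="{.data.username}" | base64 --decode)
    ILO_PASSWORD=$(kubectl get secret bmc-credentials-"${NODE_NAME}" -n gpc-system -o jsonpath="{.data.password}" | base64 --decode)
    echo "${ILO_USERNAME}"
    # Read the BMC-IP column of this output for NODE_NAME.
    kubectl get servers -A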

  8. After both etcd follower nodes have shut down, repeat Step 7 for the etcd leader node.

Remove Yubikeys for transport

If you need to transport the system after installation completes, remove the Yubikeys and transport them separately. Ensure that you tag the keys yourself.

Power up and connect

If power was lost unexpectedly, such as in a hard shutdown, the device automatically comes back up. In this case, start from Step 7 of the following procedure and skip Steps 1 to 6. You might lose data that was not persisted before the unexpected power loss.

Plan of action

  1. Insert the Yubikeys into each node.

  2. Plug the GDC air-gapped appliance machine into power, and press the power button on each node in any order.

  3. After the nodes are powered up, wait a few minutes for the control plane to become reachable. kubectl can connect to the control plane in under 30 minutes.
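
    If you want to wait programmatically, the following sketch polls until the control plane responds:

    # Sketch only: retry roughly every 30 seconds until the API server answers.
    until kubectl get nodes -A >/dev/null 2>&1; do
      echo "Control plane not reachable yet; retrying in 30 seconds..."
      sleep 30
    done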

  4. Get the names of nodes by running kubectl get nodes -A.

  5. Uncordon each node to enable scheduling:

    kubectl uncordon NODE_NAME
    
  6. Resume sync of the bare metal hosts for each node:

    kubectl annotate bmhost -n gpc-system NODE_NAME "baremetalhost.metal3.io/paused=false" --overwrite
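
    Steps 5 and 6 can also be combined in a single loop over all nodes. This sketch assumes that the BareMetalHost objects are named after the nodes, as during shutdown:

    # Uncordon every node and resume BareMetalHost sync in one pass.
    for NODE in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
      kubectl uncordon "${NODE}"
      kubectl annotate bmhost -n gpc-system "${NODE}" "baremetalhost.metal3.io/paused=false" --overwrite
    done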
    
  7. Check the status of the nodes using kubectl get nodes -A.

    • If all nodes are in the Ready state, wait for two hours for the reconciliation process to complete. The output might look like this:

      NAME         STATUS     ROLES           AGE     VERSION
      **-**-bm01   Ready      control-plane   4d13h   v1.30.6-gke.300
      **-**-bm02   Ready      control-plane   4d13h   v1.30.6-gke.300
      **-**-bm03   Ready      control-plane   4d13h   v1.30.6-gke.300
      

      In this case no further action is needed.

    • Otherwise, if one or more nodes are in the NotReady state, restart services on the affected nodes to get the cluster ready. The output might look like this:

      NAME         STATUS     ROLES           AGE     VERSION
      **-**-bm01   Ready      control-plane   4d13h   v1.30.6-gke.300
      **-**-bm02   Ready      control-plane   4d13h   v1.30.6-gke.300
      **-**-bm03   NotReady   control-plane   4d13h   v1.30.6-gke.300
      

      In this case, note the name of the node that is not ready and proceed to the next steps.

  8. Establish an SSH connection to the NotReady node. The target IP address is the value in the INTERNAL-IP column of the output from kubectl get nodes -A -o wide:

    ssh root@INTERNAL-IP
    
  9. Restart the containerd and kubelet services on the NotReady node. Run the following commands on the node, not on the laptop or workstation connected to the Google Distributed Cloud (GDC) air-gapped appliance:

    systemctl stop containerd
    systemctl daemon-reload
    systemctl restart containerd
    systemctl stop kubelet
    systemctl start kubelet
    
  10. To verify the status of the containerd and kubelet services, run the following commands on the NotReady node:

    systemctl status kubelet
    systemctl status containerd
    

    The output might look like this:

    # systemctl status kubelet
    kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
            └─00-standalone_containerd.conf, 10-kubeadm.conf
    Active: active (running) since Thu 2025-03-27 07:58:27 UTC; 34s ago
    .
    .
    .
    
    # systemctl status containerd
    containerd.service - containerd container runtime
    Loaded: loaded (/etc/systemd/system/containerd.service; disabled; vendor preset: disabled)
    Active: active (running) since Thu 2025-03-27 07:58:17 UTC; 52s ago
    .
    .
    .
    

    If the containerd and kubelet services are running correctly after the restart, wait for two hours for the reconciliation to complete.
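
    To keep an eye on the nodes while waiting, you can poll their status. This sketch checks every five minutes until no node reports NotReady:

    # Re-check node status until every node is Ready.
    while kubectl get nodes --no-headers | grep -q NotReady; do
      echo "Some nodes are still NotReady; checking again in 5 minutes..."
      sleep 300
    done
    echo "All nodes are Ready."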