This page describes how to shut down and power up a Google Distributed Cloud (GDC) air-gapped appliance, for example, to move the device to a new location.
You might use the GDC air-gapped appliance in transient operational locations, where you must shut down the device for transport between locations. You might also need to restore the device after a power failure, because generators might power it in rugged environments.
Before you begin
Stop all workloads before proceeding. Google cannot guarantee the state of workloads that are active during a shutdown.
Prerequisites
- You can run this runbook from a laptop or workstation connected to the Google Distributed Cloud (GDC) air-gapped appliance's network. Alternatively, connect a laptop or workstation to the switch by following Connect the device.
- Make sure that you have access to the kubeconfig for the root-admin cluster.
- Set the KUBECONFIG environment variable:
export KUBECONFIG=PATH_TO_KUBECONFIG
- Ensure that you have the SSH key and certificate.
Shut down the blades
1. Get information about the nodes:
kubectl get nodes -A -o wide
2. Pause BareMetalHost sync for all nodes, one by one. Replace NODE_NAME with each node name obtained in Step 1:
kubectl annotate bmhost -n gpc-system NODE_NAME "baremetalhost.metal3.io/paused=true" --overwrite
The output might look like this example:
baremetalhost.metal3.io/**-**-bm01 annotated
baremetalhost.metal3.io/**-**-bm02 annotated
baremetalhost.metal3.io/**-**-bm03 annotated
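The per-node annotate commands can also be scripted as a single loop. This is an unofficial sketch: the node names are placeholders for the names from Step 1, and the commands are echoed rather than executed so that you can review them first.

```shell
# Placeholder node names; substitute the names from `kubectl get nodes`.
NODES="my-node-bm01 my-node-bm02 my-node-bm03"

# Pause BareMetalHost sync for every node. Remove the `echo`
# prefix once you have verified the generated commands.
for node in $NODES; do
  echo kubectl annotate bmhost -n gpc-system "$node" \
    "baremetalhost.metal3.io/paused=true" --overwrite
done
```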
3. Cordon all nodes, one by one:
kubectl cordon NODE_NAME
The output might look like this example:
node/**-**-bm01 cordoned
node/**-**-bm02 cordoned
node/**-**-bm03 cordoned
4. Determine the etcd leader node and the follower nodes. Repeat the following for each node:
Find the target IP address for SSH by noting the value in the INTERNAL-IP column of the output from kubectl get nodes -A -o wide. Establish an SSH connection:
ssh root@INTERNAL-IP
To determine whether the current node is the etcd leader or a follower, run the following command in the SSH session:
ETCDCTL_API=3 etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  --write-out=table endpoint status
Pay attention to the IS LEADER field. The output might look like this example for the etcd leader node:
[root@**-**-bm0* ~]# ETCDCTL_API=3 etcdctl \
>   --cacert /etc/kubernetes/pki/etcd/ca.crt \
>   --cert /etc/kubernetes/pki/etcd/server.crt \
>   --key /etc/kubernetes/pki/etcd/server.key \
>   --write-out=table endpoint status
+----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |   VERSION    | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ************** | **************** | 3.4.30-gke.1 |  162 MB |      true |      false |      3641 |   12957958 |           12957958 |        |
+----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
The output might look like this example for the two etcd follower nodes:
[root@**-**-bm0* ~]# ETCDCTL_API=3 etcdctl \
>   --cacert /etc/kubernetes/pki/etcd/ca.crt \
>   --cert /etc/kubernetes/pki/etcd/server.crt \
>   --key /etc/kubernetes/pki/etcd/server.key \
>   --write-out=table endpoint status
+----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        |   VERSION    | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ************** | **************** | 3.4.30-gke.1 |  163 MB |     false |      false |      3641 |   12957404 |           12957404 |        |
+----------------+------------------+--------------+---------+-----------+------------+-----------+------------+--------------------+--------+
5. Note down the etcd leader and etcd follower status of each node.
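To avoid eyeballing the table, the IS LEADER value can be extracted with a small awk filter. This is an illustrative helper, not part of the product tooling; it assumes the default etcdctl table layout shown above:

```shell
# Print the IS LEADER value (6th pipe-delimited column of the data row)
# from `etcdctl --write-out=table endpoint status` output.
is_leader() {
  awk -F'|' '$6 ~ /true|false/ { gsub(/ /, "", $6); print $6; exit }'
}
```

Pipe the etcdctl command into the helper on each node, for example `ETCDCTL_API=3 etcdctl ... --write-out=table endpoint status | is_leader`.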
6. Drain the two etcd follower nodes. Do not drain the etcd leader node:
kubectl drain NODE_NAME --delete-emptydir-data --grace-period 900 --ignore-daemonsets --disable-eviction
The output might look like this:
node/**-**-bm01 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/anetd-krj2z, kube-system/etcd-defrag-xh469, kube-system/ipam-controller-manager-2f4dz, kube-system/istio-cni-node-cgqv4, kube-system/kube-proxy-5mwf2, kube-system/localpv-mn2jh, kube-system/metallb-speaker-6l7sv, mon-system/mon-node-exporter-backend-nd8mp, netapp-trident/netapp-trident-node-linux-rrlmd, obs-system/anthos-audit-logs-forwarder-tpfqv, obs-system/anthos-log-forwarder-npjh4, obs-system/kube-control-plane-metrics-proxy-wp8nh, obs-system/log-failure-detector-crbnv, obs-system/oplogs-forwarder-sqwvj, vm-system/macvtap-v9pgp, vm-system/virt-handler-86khx
pod/grafana-0 deleted
pod/capi-kubeadm-bootstrap-controller-manager-1.30.400-gke.136lvgtf deleted
pod/grafana-proxy-server-86d8fc4758-mkc4f deleted
. . .
node/**-**-bm02 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/anetd-v75jz, kube-system/etcd-defrag-t5jnc, kube-system/ipam-controller-manager-5958m, kube-system/istio-cni-node-ggv4c, kube-system/kube-proxy-r6x46, kube-system/localpv-g56xc, kube-system/metallb-speaker-tmw72, mon-system/mon-node-exporter-backend-9rs7k, netapp-trident/netapp-trident-node-linux-9jmfp, obs-system/anthos-audit-logs-forwarder-bwns9, obs-system/anthos-log-forwarder-lbskj, obs-system/kube-control-plane-metrics-proxy-grthl, obs-system/log-failure-detector-dzh4v, obs-system/oplogs-forwarder-vdn7z, vm-system/macvtap-mjwtc, vm-system/virt-handler-dlqvv
pod/vai-web-plugin-backend-5dfd6d6597-nxxgn
pod/vai-web-plugin-frontend-6b5468968b-mrr7g
pod/grafana-proxy-server-64b759fbf6-b8pl8
pod/iam-bundledidp-backend-0
. . .
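Once you have noted which node is the etcd leader, the follower-only drain can be sketched as a filter loop. This is an unofficial sketch: the node names and leader variable are placeholders, and the commands are echoed rather than executed so that you can review them first.

```shell
# Placeholders; substitute the real node names and the leader you identified.
NODES="my-node-bm01 my-node-bm02 my-node-bm03"
ETCD_LEADER="my-node-bm02"

# Drain every node except the etcd leader. Remove `echo` to execute.
for node in $NODES; do
  if [ "$node" = "$ETCD_LEADER" ]; then
    continue   # never drain the etcd leader
  fi
  echo kubectl drain "$node" --delete-emptydir-data --grace-period 900 \
    --ignore-daemonsets --disable-eviction
done
```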
7. Gracefully shut down the two etcd follower nodes. Repeat the following for each node to turn off NODE_NAME using iLO:
Retrieve the username for iLO:
kubectl get secret bmc-credentials-NODE_NAME -n gpc-system -o jsonpath="{.data.username}" | base64 --decode
Retrieve the password for iLO:
kubectl get secret bmc-credentials-NODE_NAME -n gpc-system -o jsonpath="{.data.password}" | base64 --decode
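The two lookups can be captured into shell variables for later use. This is a sketch, with NODE_NAME as a placeholder; the decode helper below mirrors the `base64 --decode` step in the commands above, and the live kubectl calls are shown commented out.

```shell
# Decode a base64-encoded secret field, as `kubectl ... | base64 --decode` does.
decode_field() {
  printf '%s' "$1" | base64 --decode
}

# Usage against a live cluster (commented out; NODE_NAME is a placeholder):
# ILO_USER=$(kubectl get secret bmc-credentials-NODE_NAME -n gpc-system \
#   -o jsonpath="{.data.username}" | base64 --decode)
# ILO_PASS=$(kubectl get secret bmc-credentials-NODE_NAME -n gpc-system \
#   -o jsonpath="{.data.password}" | base64 --decode)
```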
Retrieve the BMC-IP address for NODE_NAME from the values in the BMC-IP column:
kubectl get servers -A
Visit the BMC-IP address obtained in the previous step and sign in with the username and password that you retrieved. Hover over the first button in the top row; it should display Power: ON. Click it. In the drop-down menu that appears, click the first item, labelled Momentary Press. The button changes from green to orange, indicating that the node is shutting down. Wait for the button to turn yellow, indicating that the machine has powered off. This takes a few minutes.
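If you prefer scripting the power-off over using the iLO web UI, HPE iLO also exposes the DMTF Redfish API. Whether this path is supported on the appliance is an assumption, so treat this as an unofficial sketch; the IP, credentials, and system path are placeholders, and the command is echoed rather than executed.

```shell
BMC_IP="203.0.113.10"        # placeholder BMC-IP from `kubectl get servers -A`
ILO_USER="Administrator"     # placeholder credentials from the secrets above
ILO_PASS="example-password"

# Standard Redfish graceful-shutdown request, roughly equivalent to the
# Momentary Press action in the UI. Remove `echo` to send it for real.
echo curl -k -u "$ILO_USER:$ILO_PASS" \
  -H "Content-Type: application/json" \
  -X POST "https://$BMC_IP/redfish/v1/Systems/1/Actions/ComputerSystem.Reset" \
  -d '{"ResetType": "GracefulShutdown"}'
```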
8. After both etcd follower nodes have shut down, repeat Step 7 for the etcd leader node.
Remove Yubikeys for transport
If you need to transport the system after installation completes, remove the Yubikeys and transport the Yubikeys separately. Ensure that you tag the keys yourself.
Power up and connect
If power was lost unexpectedly, such as in a hard shutdown, the device comes back up automatically. In this case, start from Step 7 and skip Steps 1 to 6. You might lose data that was not persisted before the unexpected power loss.
Plan of action
1. Insert the Yubikeys into each node.
2. Plug the GDC air-gapped appliance into power, and press the power button on each node in any order.
3. After the nodes power up, wait a few minutes for the control plane to become reachable. kubectl can connect to the control plane in under 30 minutes.
4. Get the names of the nodes:
kubectl get nodes -A
5. Uncordon each node to enable scheduling:
kubectl uncordon NODE_NAME
6. Resume sync of the bare metal hosts for each node:
kubectl annotate bmhost -n gpc-system NODE_NAME "baremetalhost.metal3.io/paused=false" --overwrite
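The uncordon and resume-sync commands can be combined into one loop over the node names. This is an unofficial sketch with placeholder node names; the commands are echoed rather than executed so that you can review them first.

```shell
# Placeholders; substitute the names from `kubectl get nodes -A`.
NODES="my-node-bm01 my-node-bm02 my-node-bm03"

# Re-enable scheduling and resume BareMetalHost sync for every node.
# Remove the `echo` prefixes to execute.
for node in $NODES; do
  echo kubectl uncordon "$node"
  echo kubectl annotate bmhost -n gpc-system "$node" \
    "baremetalhost.metal3.io/paused=false" --overwrite
done
```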
7. Check the status of the nodes:
kubectl get nodes -A
If all nodes are in the Ready state, wait two hours for the reconciliation process to complete; no further action is needed. The output might look like this:
NAME         STATUS   ROLES           AGE     VERSION
**-**-bm01   Ready    control-plane   4d13h   v1.30.6-gke.300
**-**-bm02   Ready    control-plane   4d13h   v1.30.6-gke.300
**-**-bm03   Ready    control-plane   4d13h   v1.30.6-gke.300
Otherwise, if one or more nodes are in the NotReady state, restart some services to get the cluster ready. The output might look like this:
NAME         STATUS     ROLES           AGE     VERSION
**-**-bm01   Ready      control-plane   4d13h   v1.30.6-gke.300
**-**-bm02   Ready      control-plane   4d13h   v1.30.6-gke.300
**-**-bm03   NotReady   control-plane   4d13h   v1.30.6-gke.300
In this case, note down the name of each node that is not ready, and proceed to the next steps.
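Picking the NotReady nodes out of the listing can be automated with awk. An illustrative filter over the `kubectl get nodes` columns shown above:

```shell
# Print the NAME of every node whose STATUS column is NotReady,
# skipping the header row.
not_ready_nodes() {
  awk 'NR > 1 && $2 == "NotReady" { print $1 }'
}
```

For example, `kubectl get nodes -A | not_ready_nodes` prints one node name per line.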
8. Establish an SSH connection to the NotReady node. The target IP address for SSH is the value in the INTERNAL-IP column of the output from kubectl get nodes -A -o wide:
ssh root@INTERNAL-IP
9. Restart the containerd and kubelet services on the NotReady node. Run the following commands on the node, not on the laptop or workstation connected to the GDC air-gapped appliance:
systemctl stop containerd
systemctl daemon-reload
systemctl restart containerd
systemctl stop kubelet
systemctl start kubelet
10. To verify the status of the containerd and kubelet services, run the following commands on the NotReady node:
systemctl status kubelet
systemctl status containerd
The output might look like this:
# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─00-standalone_containerd.conf, 10-kubeadm.conf
   Active: active (running) since Thu 2025-03-27 07:58:27 UTC; 34s ago
. . .
# systemctl status containerd
● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2025-03-27 07:58:17 UTC; 52s ago
. . .
If the containerd and kubelet services are running after the restart, wait two hours for the reconciliation to complete.
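Rather than checking manually throughout the two-hour window, readiness can be polled. This is a generic retry helper, not part of the product tooling; the kubectl check in the comment is illustrative only.

```shell
# Retry a command until it succeeds or the attempts run out,
# sleeping one second between attempts.
# Illustrative usage (do not run unattended without reviewing):
#   wait_until 7200 sh -c '! kubectl get nodes | grep -q NotReady'
wait_until() {
  tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```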