Rolling out clusters on the edge at scale with Anthos on bare metal

This article introduces an advanced, ready-to-use solution for platform operators and developers that uses Anthos on bare metal and Anthos Config Management to deploy Kubernetes clusters on the edge at scale. For the purposes of this guide, we assume you have familiarity with the following:

  • Ansible playbooks.
  • Edge deployments and their challenges.
  • How to work with a Google Cloud project.
  • gcloud and kubectl command-line interfaces.

You can find the source code for both the application that makes up the edge workload and the scripts that set up the bare metal infrastructure in the point-of-sale GitHub repository. You can use the scripts to replicate this deployment on your own and then customize it for your own requirements.

About edge deployments

One key characteristic of operating at scale is having a centralized management plane for all parts of the platform. But enterprises that expand beyond traditional data centers into edge locations have unique requirements. Edge deployments must run their workloads in isolation, receive timely updates, report critical metrics, and be designed to enable expansion to more edge locations in the future. Anthos on bare metal is Google's answer for these large-scale edge deployment requirements.

Starting with version 1.8, Anthos clusters on bare metal includes an edge profile that minimizes system resource requirements and is recommended for edge devices with significant resource constraints. The edge profile is available only for standalone clusters. Standalone clusters are self-managing clusters that run workloads; they do not manage other clusters, which eliminates the need to run a separate admin cluster in resource-constrained scenarios. The edge profile reduces the minimum vCPU and RAM requirements of the cluster itself, excluding any user workload. The Anthos on bare metal edge profile offers:

  • A 75% reduction in CPU requirements: from 2 Nodes x 4 vCPUs to 1 Node x 2 vCPUs.
  • A 90% reduction in memory requirements: from 2 Nodes x 32 GiB to 1 Node x 4 GiB for Ubuntu.

For more information about configuring a cluster with the edge profile, see Creating standalone clusters. The following table shows how the minimum system requirements for Anthos on bare metal have been reduced to provide a smaller footprint that supports edge devices.

          Anthos 1.7               Anthos 1.8 and later     Anthos 1.8 and later with the edge profile
CPU       2 Nodes x (4 vCPUs)      1 Node x (4 vCPUs)       1 Node x (2 vCPUs)
RAM       2 Nodes x (32 GiB RAM)   1 Node x (16 GiB RAM)    1 Node x (4 GiB RAM for Ubuntu)
                                                            1 Node x (6 GiB RAM for RHEL/CentOS)
Storage   2 Nodes x (128 GiB)      1 Node x (128 GiB)       1 Node x (128 GiB)
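
In this guide, the Ansible playbook in the accompanying repository creates the clusters for you. For reference only, the following sketch shows where the edge profile is set when you configure a standalone cluster by hand; the bmctl flags and file paths follow bmctl defaults and are illustrative, not part of this guide's scripts.

    # Illustrative sketch: generate a standalone cluster configuration with bmctl
    bmctl create config -c edge-cluster-1 --project-id=${PROJECT_ID}

    # In the generated bmctl-workspace/edge-cluster-1/edge-cluster-1.yaml file,
    # enable the edge profile by setting "profile: edge" in the Cluster spec:
    #
    #   apiVersion: baremetal.cluster.gke.io/v1
    #   kind: Cluster
    #   spec:
    #     type: standalone
    #     profile: edge
    #     ...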

In the following sections, you use Compute Engine VMs to emulate nodes deployed on the edge, and a sample point of sale application as the edge workload. Anthos clusters on bare metal and Anthos Config Management provide centralized management and control for your edge cluster. Anthos Config Management dynamically pulls new configs from GitHub and applies these policies and configs to your clusters.

Edge rollout architecture diagram

The following diagram shows the architecture of an edge deployment in which the application runs on Anthos on bare metal and is managed by Anthos Config Management. In the diagram, ABM refers to Anthos on bare metal, and ACM refers to Anthos Config Management.

Architecture of an Anthos on bare metal edge deployment managed by Anthos Config Management

Edge workload architecture diagram

The following diagram shows the architecture of the simple point of sale application workload that we use in this guide. It also depicts how the application is deployed in Anthos on bare metal clusters in an emulated edge location. The Compute Engine VMs are analogous to the nodes running at the edge.

Architecture of the point of sale application

Solution workflow

In this solution, we do the following:

  • Emulate a bare metal infrastructure running in an edge location, using Compute Engine VMs.
  • Deploy an Anthos on bare metal cluster on the emulated edge infrastructure.
  • Connect and register the Anthos on bare metal cluster with Google Cloud.
  • Deploy a sample point of sale application workload on the Anthos on bare metal cluster.
  • Verify and monitor the application operating on the edge through the Cloud Console.
  • Use Anthos Config Management to update the application running on the Anthos on bare metal cluster.

This guide takes approximately 55 to 60 minutes to complete if you already have all the prerequisites (listed in the following Before you begin section) set up.

Before you begin

To complete the edge deployment described in this guide, you need the following:

Set up the workstation environment

  1. Fork the point-of-sale repository to create your own copy of the source code for this edge deployment solution. For more information about forks, including instructions for forking a repository, see Forking a repository.

  2. Create a personal access token for your forked repository as described in the Creating a personal access token GitHub documentation.

    • Select the public_repo scope only.
    • Save the access token you created in a safe place, because you need it later.
  3. Clone your forked repository to your workstation.

    git clone https://github.com/GITHUB_USERNAME/point-of-sale
    cd point-of-sale/anthos-baremetal-edge-deployment
    
    • Replace GITHUB_USERNAME with your GitHub username.
  4. Initialize the environment variables in a new shell instance.

    export PROJECT_ID="PROJECT_ID"
    export REGION="us-central1"
    export ZONE="us-central1-a"
    
    # path to which the Google Service Account key file is downloaded
    export LOCAL_GSA_FILE="$(pwd)/remote-gsa-key.json"
    
    # port on the admin Compute Engine instance we use to set up an nginx proxy to allow traffic into the Anthos on bare metal cluster
    export PROXY_PORT="8082"
    
    # should be a multiple of 3 since N/3 clusters are created with each having 3 nodes
    export MACHINE_COUNT="3"
    
    # url to the fork of: https://github.com/GoogleCloudPlatform/point-of-sale
    export ROOT_REPO_URL="https://github.com/GITHUB_USERNAME/point-of-sale"
    
    # this is the username used to authenticate to your fork of this repository
    export SCM_TOKEN_USER="GITHUB_USERNAME"
    
    # access token created in the earlier step
    export SCM_TOKEN_TOKEN="ACCESS_TOKEN"
    
    • PROJECT_ID: your Google Cloud project ID.
    • GITHUB_USERNAME: your GitHub username.
    • ACCESS_TOKEN: the personal access token you created for your GitHub repository.
  5. Initialize Cloud SDK.

    gcloud config set project "${PROJECT_ID}"
    gcloud services enable compute.googleapis.com
    
    gcloud config set compute/region "${REGION}"
    gcloud config set compute/zone "${ZONE}"
    
  6. Create the Google Cloud service account that is used by the Compute Engine instances.

    # when prompted "Create a new key for GSA? [y/n]" type "y" and press the return key
    # the service account key file is downloaded to the path referred to by $LOCAL_GSA_FILE
    ./scripts/create-primary-gsa.sh
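
    After the script completes, you can optionally confirm that the key file was written and that the service account was created. This quick check only uses standard gcloud commands:

    # optional: confirm the downloaded key file and list the project's service accounts
    ls -l "${LOCAL_GSA_FILE}"
    gcloud iam service-accounts list --project "${PROJECT_ID}"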
    

Provision the Compute Engine instances

  1. Create SSH keys and Compute Engine instances where Anthos on bare metal is installed.

    # press the return key when asked for a passphrase for the SSH key (i.e. empty string)
    ./scripts/cloud/easy-install.sh
    
  2. Test SSH connectivity to the Compute Engine instances.

    # If the checks fail the first time with errors like 
    # "sh: connect to host cnuc-1 port 22: Connection refused", 
    # then wait a few seconds and retry
    for i in `seq $MACHINE_COUNT`; do
        HOSTNAME="cnuc-$i"
        ssh abm-admin@${HOSTNAME} 'ping -c 3 google.com'
    done
    

    When the scripts run successfully, they produce output like the following:

    PING google.com (74.125.124.113) 56(84) bytes of data.
    64 bytes from jp-in-f113.1e100.net (74.125.124.113): icmp_seq=1 ttl=115 time=1.10 ms
    64 bytes from jp-in-f113.1e100.net (74.125.124.113): icmp_seq=2 ttl=115 time=1.10 ms
    64 bytes from jp-in-f113.1e100.net (74.125.124.113): icmp_seq=3 ttl=115 time=0.886 ms
    
    --- google.com ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2003ms
    rtt min/avg/max/mdev = 0.886/1.028/1.102/0.100 ms
    PING google.com (108.177.112.139) 56(84) bytes of data.
    ...
    ...
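
    You can also confirm from your workstation that the expected number of instances was created; for example:

    # optional: list the Compute Engine instances created for this guide
    gcloud compute instances list --project "${PROJECT_ID}" --filter="name:cnuc-"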
    

Install Anthos on bare metal with Ansible

The script used in this guide creates Anthos on bare metal clusters in groups of three Compute Engine instances. For example, setting the environment variable MACHINE_COUNT to 6 creates two Anthos on bare metal clusters with three instances each. The instances are named with the prefix cnuc- followed by a number. The first instance of each group acts as the admin instance from which the Anthos on bare metal installation is triggered. The Anthos on bare metal user clusters are named after their admin instances (for example, cnuc-1, cnuc-4, cnuc-7).
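
The following snippet is only an illustration of this grouping; it is not part of the installation and simply prints how six instances map to two clusters:

    # illustration only: show how instances are grouped into clusters of three
    MACHINE_COUNT=6
    for i in $(seq 1 3 ${MACHINE_COUNT}); do
      echo "cluster cnuc-${i}: nodes cnuc-${i}, cnuc-$((i+1)), cnuc-$((i+2))"
    done
    # prints:
    #   cluster cnuc-1: nodes cnuc-1, cnuc-2, cnuc-3
    #   cluster cnuc-4: nodes cnuc-4, cnuc-5, cnuc-6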

The Ansible playbook does the following:

  • Configures the Compute Engine instances with the necessary tools, such as docker, bmctl, gcloud, and nomos.
  • Installs Anthos on bare metal in the configured Compute Engine instances.
  • Creates an Anthos on bare metal user cluster called cnuc-1.
  • Registers the cnuc-1 cluster with Google Cloud.
  • Installs Anthos Config Management into the cnuc-1 cluster.
  • Configures Anthos Config Management to sync with the cluster configurations located at anthos-baremetal-edge-deployment/acm-config-sink in your forked repository.

Follow these steps to set up and initiate the installation process.

  1. Generate the Ansible inventory file from the template.

    # replace environment variables in the template
    envsubst < templates/inventory-cloud-example.yaml > inventory/gcp.yaml
    
  2. Verify the workstation setup and access to the Compute Engine hosts.

    # verify workstation environment setup
    ./scripts/verify-pre-installation.sh
    
    # verify access to hosts
    ./scripts/health-check.sh
    

    When the scripts run successfully, they produce output like the following:

    Proceed!!
    
    cnuc-1 | SUCCESS => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"},"changed": false,"ping": "pong"}
    cnuc-2 | SUCCESS => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"},"changed": false,"ping": "pong"}
    cnuc-3 | SUCCESS => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"},"changed": false,"ping": "pong"}
    
    SUCCESS!!
    
  3. Run the Ansible playbook for installing Anthos on bare metal on Compute Engine instances.

    ansible-playbook -i inventory cloud-full-install.yml
    

    When the scripts run successfully, they produce output like the following:

    ...
    ...
    PLAY RECAP ********************************************************************************************************
    cnuc-1                     : ok=136  changed=106  unreachable=0    failed=0    skipped=33   rescued=0    ignored=8
    cnuc-2                     : ok=86   changed=67   unreachable=0    failed=0    skipped=71   rescued=0    ignored=2
    cnuc-3                     : ok=86   changed=67   unreachable=0    failed=0    skipped=71   rescued=0    ignored=2
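
    At this point you can optionally verify that the cnuc-1 cluster is up by listing its nodes from the admin instance. The kubeconfig path below is the bmctl default and is an assumption; adjust it if your setup differs.

    # optional: list the nodes of the cnuc-1 cluster from the admin instance
    ssh abm-admin@cnuc-1 \
        'kubectl --kubeconfig bmctl-workspace/cnuc-1/cnuc-1-kubeconfig get nodes'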
    

Log in to the Anthos on bare metal cluster in the Cloud Console

  1. To copy the utility script into the admin Compute Engine instance and generate a Kubernetes service account token, run the following scripts and commands.

    # Copy the utility script into the admin node of the cluster
    scp -i ~/.ssh/cnucs-cloud scripts/cloud/cnuc-k8s-login-setup.sh abm-admin@cnuc-1:
    
    # Use SSH to connect to the admin node of the cluster
    ssh -i ~/.ssh/cnucs-cloud abm-admin@cnuc-1
    
    # execute the script and copy token that is printed out
    ./cnuc-k8s-login-setup.sh
    

    When the scripts run successfully, they produce output like the following:

    ...
    ...
    💡 Retrieving Kubernetes Service Account Token
    
    🚀 ------------------------------TOKEN-------------------------------- 🚀
    eyJhbGciOiJSUzI1NiIsImtpZCI6Imk2X3duZ3BzckQyWmszb09sZHFMN0FoWU9mV1kzOWNGZzMyb0x2WlMyalkifQ.eyJpc3MiOiJrdW
    mljZS1hY2NvdW50LnVpZCI6IjQwYWQxNDk2LWM2MzEtNDhiNi05YmUxLWY5YzgwODJjYzgzOSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYW
    iZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImVkZ2Etc2EtdG9rZW4tc2R4MmQiLCJrdWJlcm5ldGVzLmlvL3Nl
    cnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiZWRnYS1zYSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2Vyd
    4CwanGlof6s-fbu8IUy1_bTgCminylNKb3VudC5uYW1lIjoiZWRnYS1zYSIsImt1YmVybmV0ZXuaP-hDEKURb5O6IxulTXWH6dxYxg66x
    Njb3VudDpkZWZhdWx0OmVkZ2Etc2EifQ.IXqXwX5pg9RIyNHJZTM6cBKTEWOMfQ4IQQa398f0qwuYlSe12CA1l6P8TInf0S1aood7NJWx
    xe-5ojRvcG8pdOuINq2yHyQ5hM7K7R4h2qRwUznRwuzOp_eXC0z0Yg7VVXCkaqnUR1_NzK7qSu4LJcuLzkCYkFdSnvKIQABHSvfvZMrJP
    Jlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3V
    MgyLOd9FJyhZgjbf-a-3cbDci5YABEzioJlHVnV8GOX_q-MnIagA9-t1KpHA
    🚀 ------------------------------------------------------------------- 🚀
    
  2. Copy the token from the output.

  3. In the Cloud Console, go to the Kubernetes clusters page and use the copied token to log in to the cnuc-1 cluster.

    Go to the Kubernetes clusters page

    1. In the list of clusters, click Actions next to the cnuc-1 cluster, and then click Log in.
    2. Select Token and paste in the copied token.
    3. Click Login.
  4. In the Cloud Console, go to the Config Management page to check the Config spec status. Verify that the status is Synced. A Synced status indicates that Anthos Config Management has successfully synchronized your GitHub configs with your deployed cluster, cnuc-1.

    Go to the Config Management page

    Anthos Config Management Synced with the source repository.
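
    As an optional alternative to the console, you can query the sync status from the command line with nomos, which the playbook installs on the admin instance. The kubeconfig path below is the bmctl default and is an assumption; adjust it if your setup differs.

    # optional: check the Anthos Config Management sync status from the admin instance
    ssh abm-admin@cnuc-1 \
        'KUBECONFIG=bmctl-workspace/cnuc-1/cnuc-1-kubeconfig nomos status'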

Configure a proxy for external traffic

Anthos on bare metal, installed in the previous steps, uses a bundled load balancer called MetalLB. This load balancer service is accessible only through a Virtual Private Cloud (VPC) IP address. Therefore, we set up a reverse proxy on the admin host (cnuc-1) to route traffic arriving at its external IP address to the bundled load balancer. This lets us reach the API Server of the point of sale application through the external IP address of the admin host (cnuc-1).

The installation scripts from the earlier steps already installed nginx on the admin hosts, along with a sample configuration file. We update this file with the IP address of the load balancer service and then restart nginx.

  1. Set up the nginx reverse proxy configuration to route traffic to the API Server load balancer service.

    # get the IP address of the Load balancer type Kubernetes service
    ABM_INTERNAL_IP=$(kubectl get services api-server-lb -n pos | awk '{print $4}' | tail -n 1)
    
    # update the template configuration file with the fetched IP address
    sudo sh -c "sed 's/<K8_LB_IP>/${ABM_INTERNAL_IP}/g' /etc/nginx/nginx.conf.template > /etc/nginx/nginx.conf"
    
    # restart nginx to ensure the new configuration is picked up
    sudo systemctl restart nginx
    
    # check and verify the status of the nginx server to be "active (running)"
    sudo systemctl status nginx
    

    When the scripts run successfully, they produce output like the following:

    ● nginx.service - A high performance web server and a reverse proxy server
        Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
        Active: active (running) since Fri 2021-09-17 02:41:01 UTC; 2s ago
        Docs: man:nginx(8)
        Process: 92571 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
        Process: 92572 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Main PID: 92573 (nginx)
        Tasks: 17 (limit: 72331)
        Memory: 13.2M
        CGroup: /system.slice/nginx.service
                ├─92573 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
                ├─92574 nginx: worker process
                ├─92575 nginx: worker process
                ├─92577 nginx: ....
                ...
                ...
    
  2. Exit the SSH session to the admin instance.

    exit
    

Access the point of sale application

The following commands are run on your local workstation.

  1. Get the external IP address of the admin Compute Engine instance and access the UI of the point of sale application.

    EXTERNAL_IP=$(gcloud compute instances list --project ${PROJECT_ID} --filter="name:cnuc-1" --format="get(networkInterfaces[0].accessConfigs[0].natIP)")
    echo "Point the browser to: ${EXTERNAL_IP}:${PROXY_PORT}"
    

    When the scripts run successfully, they produce output like the following:

    Point the browser to: 34.134.194.84:8082
    
    Version 1 of the point of sale application deployed.
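
    If you prefer to verify reachability from the command line before opening the browser, a simple check like the following confirms that the nginx proxy on the admin instance responds with an HTTP status code:

    # optional: confirm that the proxy on the admin instance responds
    curl -s -o /dev/null -w "%{http_code}\n" "http://${EXTERNAL_IP}:${PROXY_PORT}"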

Update the API Server version and observe the change

  1. Update the image field to change the API Server version from v1 to v2. The YAML configuration for the deployment is in the file at anthos-baremetal-edge-deployment/acm-config-sink/namespaces/pos/api-server.yaml.

    containers:
    - name: api-server
      image: us-docker.pkg.dev/anthos-dpe-abm-edge-pos/abm-edge-pos-images/api-server:v1 # change v1 to v2
  2. Push the changes to your forked repository.

    git add acm-config-sink/namespaces/pos/api-server.yaml
    git commit -m "chore: updated api-server version to v2"
    git push
    
  3. In the Cloud Console, go to the Config Management page to check the Config spec status. Verify that the status is Synced.

    Go to the Config Management page

  4. In the Cloud Console, go to the Kubernetes Engine Workloads page to verify that the Deployment is updated.

    Go to the Kubernetes Engine Workloads page

  5. When the status of the Deployment is OK, point your browser to the IP address from the previous section to view the point of sale application. Note that the version in the title shows "V2", indicating that your application change was deployed.

    Version 2 of the point of sale application deployed.
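
    You can also confirm the rollout from the command line by listing the Deployments in the pos namespace together with the images they run. As before, the kubeconfig path is the bmctl default and is an assumption:

    # optional: show the Deployments in the pos namespace and their images
    ssh abm-admin@cnuc-1 \
        'kubectl --kubeconfig bmctl-workspace/cnuc-1/cnuc-1-kubeconfig \
            -n pos get deployments -o wide'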

Clean up

To avoid unnecessary Google Cloud charges, delete the resources used for this guide when you are done with it. You can delete these resources manually, or you can delete your Google Cloud project, which also deletes all of its resources. You might also want to clean up the changes made on your local workstation:

Local workstation

To clear the changes that the installation scripts made on your local workstation, update the following files (a command-line sketch follows this list):

  • Remove the Compute Engine VM IP addresses added to the /etc/hosts file.
  • Remove the SSH configuration for cnuc-* in the ~/.ssh/config file.
  • Remove the Compute Engine VM fingerprints from the ~/.ssh/known_hosts file.
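
The following is a minimal sketch of these steps; it assumes the environment variables from the setup are still set, and you should review each command before running it because it edits files on your workstation.

    # review before running: removes entries that the installation scripts added
    for i in $(seq "${MACHINE_COUNT}"); do
        ssh-keygen -R "cnuc-${i}"        # drop the host key from ~/.ssh/known_hosts
    done
    sudo sed -i '/cnuc-/d' /etc/hosts    # drop the VM IP address entries
    # remove the Host cnuc-* entries from ~/.ssh/config manually or with your editor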

Delete the project

If you created a dedicated project for this procedure, delete the Google Cloud project from the Cloud Console.

  • In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  • In the project list, select the project that you want to delete, and then click Delete.
  • In the dialog, type the project ID, and then click Shut down to delete the project.

Manual cleanup

If you used an existing project for this procedure, do the following:

  • Unregister all Kubernetes clusters with a name prefixed by cnuc-.
  • Delete all Compute Engine VMs with a name prefixed by cnuc-.
  • Delete the Cloud Storage bucket with a name prefixed by abm-edge-boot.
  • Delete the firewall rules allow-pod-ingress and allow-pod-egress.
  • Delete the Secret Manager secret install-pub-key.

What's next?

You can expand on this guide by adding another edge location. Setting the MACHINE_COUNT environment variable to 6 and re-running the same steps from the preceding sections creates three new Compute Engine instances (cnuc-4, cnuc-5, cnuc-6) and a new Anthos on bare metal user cluster called cnuc-4.

You can also try updating the cluster configurations in your forked repository to selectively apply different versions of the point of sale application to the two clusters, cnuc-1 and cnuc-4, by using ClusterSelectors.

For details about the individual steps in this guide, the scripts involved, and the implementation of the point of sale application, see the point-of-sale repository.