Version 1.9. This version is no longer supported. For more information see the version support policy.

Known issues

This document describes known issues for version 1.9 of Google Distributed Cloud.

/var/log/audit/ filling up disk space

Identified Versions

1.8.0+, 1.9.0+, 1.10.0+, 1.11.0+, 1.12.0+, 1.13.0+

Symptoms

/var/log/audit/ is filled with audit logs. You can check the disk usage by running sudo du -h -d 1 /var/log/audit.

Cause

Since Anthos v1.8, the Ubuntu image is hardened with CIS Level2 Benchmark. One of the compliance rules, "4.1.2.2 Ensure audit logs are not automatically deleted", enforces the auditd setting max_log_file_action = keep_logs. As a result, all audit logs are kept on the disk.

Workaround

Admin workstation

For the admin workstation, you can manually change the auditd settings to rotate the logs automatically, and then restart the auditd service:

sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
systemctl restart auditd

With the above settings, auditd rotates its logs automatically once it has generated more than 250 files (each 8M in size).
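
As a rough check of what these settings mean for disk usage, the following sketch multiplies num_logs by max_log_file. It assumes that max_log_file is expressed in megabytes and is set to 8, as the text above implies. The function name audit_log_cap_mb and the sample file are illustrative only.

```shell
# Print an approximate upper bound, in MB, on the space that auditd log
# files can occupy, based on an auditd.conf-style file.
# Assumption: max_log_file is expressed in megabytes.
audit_log_cap_mb() {
  conf="$1"
  num_logs=$(awk -F' *= *' '$1 == "num_logs" { print $2 }' "$conf")
  max_log_file=$(awk -F' *= *' '$1 == "max_log_file" { print $2 }' "$conf")
  echo $(( num_logs * max_log_file ))
}

# Sample config that mirrors the settings described above (illustrative values).
sample_conf=$(mktemp)
printf 'max_log_file = 8\nnum_logs = 250\nmax_log_file_action = rotate\n' > "$sample_conf"
audit_log_cap_mb "$sample_conf"   # prints 2000
```

On a real machine you would pass /etc/audit/auditd.conf instead of the sample file.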

Cluster nodes

For cluster nodes, apply the following DaemonSet to your cluster to prevent potential issues:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: change-auditd-log-action
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: change-auditd-log-action
  template:
    metadata:
      labels:
        app: change-auditd-log-action
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: update-audit-rule
        image: ubuntu
        command: ["chroot", "/host", "bash", "-c"]
        args:
        - |
          while true; do
            if grep -q "max_log_file_action = keep_logs" /etc/audit/auditd.conf; then
              echo "updating auditd max_log_file_action to rotate with a max of 250 files"
              sed -i 's/max_log_file_action = keep_logs/max_log_file_action = rotate/g' /etc/audit/auditd.conf
              sed -i 's/num_logs = .*/num_logs = 250/g' /etc/audit/auditd.conf
              echo "restarting auditd"
              systemctl restart auditd
            else
              echo "auditd setting is expected, skip update"
            fi
            sleep 600
          done
        volumeMounts:
        - name: host
          mountPath: /host
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /

Note that making this auditd config change would violate CIS Level2 rule 4.1.2.2 Ensure audit logs are not automatically deleted.
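
One possible way to roll out the DaemonSet above and confirm that it ran is sketched below. The manifest file name in the example comment is hypothetical, and the KUBECTL variable is only a convenience added here so that the binary can be overridden. The kubectl subcommands themselves (apply, rollout status, logs) are standard.

```shell
# KUBECTL can be overridden, for example to point at a specific binary.
KUBECTL="${KUBECTL:-kubectl}"

# Apply the DaemonSet manifest, wait for the rollout, then show recent logs.
# $1: path to the manifest file (hypothetical name in the example below)
# $2: path to the cluster kubeconfig file
apply_auditd_daemonset() {
  manifest="$1"
  kubeconfig="$2"
  "$KUBECTL" --kubeconfig "$kubeconfig" apply -f "$manifest" &&
  "$KUBECTL" --kubeconfig "$kubeconfig" -n kube-system \
    rollout status daemonset/change-auditd-log-action &&
  "$KUBECTL" --kubeconfig "$kubeconfig" -n kube-system \
    logs -l app=change-auditd-log-action --tail=5
}

# Example:
#   apply_auditd_daemonset change-auditd-log-action.yaml USER_CLUSTER_KUBECONFIG
```

The log lines should show either the "updating auditd" message or the "skip update" message that the DaemonSet script prints.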

systemd-timesyncd not running after reboot on Ubuntu Node

Identified Versions

1.7.1-1.7.5, 1.8.0-1.8.4, 1.9.0+

Symptoms

systemctl status systemd-timesyncd shows that the service is dead:

● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
Active: inactive (dead)

This can cause the node's clock to go out of sync.

Cause

chrony was incorrectly installed on the Ubuntu OS image, and it conflicts with systemd-timesyncd: every time the Ubuntu VM is rebooted, systemd-timesyncd becomes inactive and chrony becomes active. However, systemd-timesyncd should be the default NTP client for the VM.

Workaround

Option 1: Manually run systemctl restart systemd-timesyncd every time the VM is rebooted.

Option 2: Deploy the following DaemonSet so that systemd-timesyncd is always restarted if it's dead.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ensure-systemd-timesyncd
spec:
  selector:
    matchLabels:
      name: ensure-systemd-timesyncd
  template:
    metadata:
      labels:
        name: ensure-systemd-timesyncd
    spec:
      hostIPC: true
      hostPID: true
      containers:
      - name: ensure-systemd-timesyncd
        # Use your preferred image.
        image: ubuntu
        command:
        - /bin/bash
        - -c
        - |
          while true; do
            echo $(date -u)
            echo "Checking systemd-timesyncd status..."
            chroot /host systemctl status systemd-timesyncd
            if (( $? != 0 )) ; then
              echo "Restarting systemd-timesyncd..."
              chroot /host systemctl start systemd-timesyncd
            else
              echo "systemd-timesyncd is running."
            fi;
            sleep 60
          done
        volumeMounts:
        - name: host
          mountPath: /host
        resources:
          requests:
            memory: "10Mi"
            cpu: "10m"
        securityContext:
          privileged: true
      volumes:
      - name: host
        hostPath:
          path: /
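
For Option 1, the manual check on a node could look like the following sketch. It mirrors the logic of the DaemonSet but uses systemctl is-active instead of systemctl status. The function name and the SYSTEMCTL override variable are illustrative, and the restart requires root privileges on a real node.

```shell
# SYSTEMCTL can be overridden, for example with "sudo systemctl".
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Restart systemd-timesyncd only when it is not currently active.
ensure_timesyncd() {
  if $SYSTEMCTL is-active --quiet systemd-timesyncd; then
    echo "systemd-timesyncd is running."
  else
    echo "Restarting systemd-timesyncd..."
    $SYSTEMCTL restart systemd-timesyncd
  fi
}

# Example on a node after a reboot:
#   SYSTEMCTL="sudo systemctl" ensure_timesyncd
```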

ClientConfig custom resource

gkectl update reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.

gkectl check-config validation fails: can't find F5 BIG-IP partitions

Symptoms: Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
Potential causes: An issue with the F5 BIG-IP API can cause validation to fail.
Resolution: Try running gkectl check-config again.

Disruption for workloads with PodDisruptionBudgets

Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).

Nodes fail to complete their upgrade process

If you have PodDisruptionBudget objects configured that are unable to allow any additional disruptions, node upgrades might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration.

To see all PodDisruptionBudget objects that do not allow any disruptions:

kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
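
To see why scaling up helps, here is a small arithmetic sketch. It assumes a PodDisruptionBudget that uses an integer minAvailable value. The allowed disruptions are then the number of healthy Pods minus the required minimum, never below zero. The function name is illustrative.

```shell
# Disruptions that a minAvailable-style PodDisruptionBudget allows:
# healthy Pods minus the required minimum, never below zero.
disruptions_allowed() {
  healthy="$1"
  min_available="$2"
  allowed=$(( healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

disruptions_allowed 2 2   # prints 0: a node drain cannot evict any Pod
disruptions_allowed 3 2   # prints 1: after scaling up, one Pod can be evicted
```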

User cluster installation failed because of cert-manager/ca-injector's leader election issue in Anthos 1.9.0

You might see an installation failure due to cert-manager-cainjector in crashloop, when the apiserver/etcd is slow. The following command,

  kubectl logs --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployments/cert-manager-cainjector

might produce something like the following logs:

I0923 16:19:27.911174       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition
E0923 16:19:27.911110       1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded
I0923 16:19:27.911593       1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition
E0923 16:19:27.911629       1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"

Run the following commands to mitigate the problem.

First, scale down the monitoring-operator so it will not revert the changes to the cert-manager-cainjector Deployment.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0

Second, patch the cert-manager-cainjector Deployment to disable leader election. This is safe because only one replica is running, and leader election is not required for a single replica.

# Ensure that we run only 1 cainjector replica, even during rolling updates.
kubectl patch --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployment cert-manager-cainjector --type=strategic --patch '
spec:
  strategy:
    rollingUpdate:
      maxSurge: 0
'
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl patch --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system deployment cert-manager-cainjector --type=json --patch '[
    {
        "op": "add",
        "path": "/spec/template/spec/containers/0/args/-",
        "value": "--leader-elect=false"
    }
]'

Keep monitoring-operator replicas at 0 as a mitigation until the installation is finished. Otherwise it will revert the change.

After the installation is finished and the cluster is up and running, turn on the monitoring-operator for day-2 operations:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=1

After upgrading to 1.9.1 or above, these steps will no longer be necessary since Anthos will disable leader-election for cainjector.

Renewal of certificates might be required before an admin cluster upgrade

Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.

Admin cluster certificate renewal process

Make sure that OpenSSL is installed on the admin workstation before you begin.
Set the KUBECONFIG variable:
```
KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
```
Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.

Get the IP address and SSH keys for the admin master node:

kubectl --kubeconfig "${KUBECONFIG}" get secrets -n kube-system sshkeys \
-o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key

export MASTER_NODE_IP=$(kubectl --kubeconfig "${KUBECONFIG}" get nodes -o \
jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
--selector='node-role.kubernetes.io/master')

Check if the certificates are expired:
```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
"sudo kubeadm alpha certs check-expiration"
```
If the certificates are expired, you must renew them before upgrading the admin cluster.
Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.
- Back up the admin cluster kubeconfig file:
```
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
"sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
vi "${KUBECONFIG}"
```
- Replace client-certificate-data and client-key-data in kubeconfig with client-certificate-data and client-key-data in the new_admin.conf file that you created.

Back up old certificates:

This is an optional, but recommended, step.

# ssh into admin master if you didn't in the previous step
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"

# on admin master
sudo tar -czvf backup.tar.gz /etc/kubernetes
logout

# on admin workstation
sudo scp -i ~/.ssh/admin-cluster.key \
ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .

Renew the certificates with kubeadm:

 # ssh into admin master
 ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"
 # on admin master
 sudo kubeadm alpha certs renew all

Restart static Pods running on the admin master node:

  # on admin master
  cd /etc/kubernetes
  sudo mkdir tempdir
  sudo mv manifests/*.yaml tempdir/
  sleep 5
  echo "remove pods"
  # make sure kubelet detects the change and removes those pods
  # wait until the result of this command is empty
  sudo docker ps | grep kube-apiserver

  # move the manifests back so that kubelet starts those pods again
  echo "start pods again"
  sudo mv tempdir/*.yaml manifests/
  sleep 30
  # make sure kubelet has started those pods again
  # should show some results
  sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd

  # clean up
  sudo rm -rf tempdir

  logout

You must validate the renewed certificates, and validate the certificate of kube-apiserver.

Check certificate expiration:

ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
"sudo kubeadm alpha certs check-expiration"

Check certificate of kube-apiserver:

# Get the IP address of kube-apiserver
grep server "${KUBECONFIG}"
# Get the current kube-apiserver certificate; the address and port from the
# server field above go on either side of the colon
openssl s_client -showcerts -connect : \
| sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
> current-kube-apiserver.crt
# check expiration date of this cert
openssl x509 -in current-kube-apiserver.crt -noout -enddate
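
If you prefer a yes/no answer to reading the date by eye, a check along the following lines might help. It is a sketch that relies on the openssl x509 -checkend option. The function name and the 30-day window in the example are illustrative choices, not part of the documented procedure.

```shell
# Return success (0) if the certificate in $1 expires within $2 days.
# "openssl x509 -checkend N" exits non-zero when the certificate expires
# within the next N seconds.
cert_expires_within_days() {
  cert="$1"
  days="$2"
  if openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null; then
    return 1   # still valid after the window
  fi
  return 0     # expires within the window (or the file could not be read)
}

# Example, using the file written in the previous step:
#   if cert_expires_within_days current-kube-apiserver.crt 30; then
#     echo "kube-apiserver certificate expires within 30 days"
#   fi
```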

Restarting or upgrading vCenter for versions lower than 7.0U2

If vCenter, for versions lower than 7.0U2, is restarted, after an upgrade or otherwise, the network name in the VM information from vCenter is incorrect, which puts the machine in an Unavailable state. Eventually, the nodes are auto-repaired and replaced with new ones.

Related govmomi bug: https://github.com/vmware/govmomi/issues/2552

This workaround is provided by VMware support:

1. The issue is fixed in vCenter versions 7.0U2 and above.

2. For lower versions: right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.

SSH connection closed by remote host

For Google Distributed Cloud version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark. To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:

ClientAliveInterval 300
ClientAliveCountMax 0

The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the connection might be closed even if your SSH client is not idle, such as when you run a time-consuming command. Your command could then be terminated with the following message:

Connection to [IP] closed by remote host.
Connection to [IP] closed.

As a workaround, you can either:

Use nohup to prevent your command from being terminated on SSH disconnection:

nohup gkectl upgrade admin --config admin-cluster.yaml --kubeconfig kubeconfig

Update the sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value less than 3.
```
sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' /etc/ssh/sshd_config
sudo systemctl restart sshd
```
Make sure you reconnect your ssh session.
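
Before you touch the live file, you might rehearse the edit on a scratch copy. The sketch below runs the same sed expression against a sample file and prints the resulting ClientAlive settings. The helper name show_client_alive and the scratch file are illustrative.

```shell
# Print the ClientAlive* settings found in an sshd_config-style file.
show_client_alive() {
  grep -E '^[[:space:]]*ClientAlive(Interval|CountMax)[[:space:]]' "$1"
}

# Rehearse the edit on a scratch copy rather than on /etc/ssh/sshd_config.
scratch=$(mktemp)
printf 'ClientAliveInterval 300\nClientAliveCountMax 0\n' > "$scratch"
sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' "$scratch"
show_client_alive "$scratch"
```

After you edit the real file, running sudo sshd -t checks the configuration for syntax errors before you restart the service.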

Conflict with `cert-manager` when upgrading to version 1.9.0 or 1.9.1

If you have your own cert-manager installation with Google Distributed Cloud, you might experience a failure when you attempt to upgrade to versions 1.9.0 or 1.9.1. This is a result of a conflict between your version of cert-manager, which is likely installed in the cert-manager namespace, and the monitoring-operator version.

If you try to install another copy of cert-manager after upgrading to Google Distributed Cloud version 1.9.0 or 1.9.1, the installation might fail due to a conflict with the existing one managed by monitoring-operator.

The metrics-ca cluster issuer, which control-plane and observability components rely on for creation and rotation of cert secrets, requires a metrics-ca cert secret to be stored in the cluster resource namespace. This namespace is kube-system for the monitoring-operator installation, and likely to be cert-manager for your installation.

If you have experienced an installation failure, follow these steps to upgrade successfully to version 1.9.0 or 1.9.1:

Avoid conflicts during upgrade

Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.
Perform the upgrade.
Follow the instructions below to restore your own cert-manager.

Restore your own cert-manager in user clusters

Scale the monitoring-operator deployment to 0.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0

Scale the cert-manager deployments managed by monitoring-operator to 0.

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0

Reinstall cert-manager.
Restore your customized resources if you have them.

Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager. Your installed cert-manager namespace is cert-manager if using the upstream default cert-manager installation, but that depends on your installation.

relevant_fields='
{
apiVersion: .apiVersion,
kind: .kind,
metadata: {
name: .metadata.name,
namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
},
spec: .spec
}
'
f1=$(mktemp)
f2=$(mktemp)
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get issuer -n kube-system metrics-pki.cluster.local -o json | jq "${relevant_fields}" > $f1
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get certificate -n kube-system metrics-ca -o json | jq "${relevant_fields}" > $f2
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2

Restore your own cert-manager in admin clusters

In general, you shouldn't need to reinstall cert-manager in admin clusters because admin clusters only run Google Distributed Cloud control plane workloads. In the rare cases where you also need to install your own cert-manager in admin clusters, follow these instructions to avoid conflicts. Note that if you are an Apigee customer and you only need cert-manager for Apigee, you do not need to run the admin cluster commands.

Scale the monitoring-operator deployment to 0.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0

Scale the cert-manager deployments managed by monitoring-operator to 0.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0

Reinstall your own cert-manager. Restore your customized resources if you have any.

relevant_fields='
{
apiVersion: .apiVersion,
kind: .kind,
metadata: {
name: .metadata.name,
namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
},
spec: .spec
}
'
f3=$(mktemp)
f4=$(mktemp)
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get issuer -n kube-system metrics-pki.cluster.local -o json | jq "${relevant_fields}" > $f3
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get certificate -n kube-system metrics-ca -o json | jq "${relevant_fields}" > $f4
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4

Conflict with `cert-manager` when upgrading to version 1.9.2 or above

In releases 1.9.2 and above, monitoring-operator installs cert-manager in the cert-manager namespace. If you need to install your own cert-manager for any reason, follow these instructions to avoid conflicts:

Avoid conflicts during upgrade

Uninstall your version of cert-manager. If you defined your own resources, you may want to back them up.
Perform the upgrade.
Follow the instructions below to restore your own cert-manager.

Restore your own cert-manager in user clusters

Scale the monitoring-operator deployment to 0.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0

Scale the cert-manager deployments managed by monitoring-operator to 0.

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-webhook --replicas=0

Reinstall your own cert-manager. Restore your customized resources if you have any.

You can skip this step if you are using upstream default cert-manager installation, or you are sure your cert-manager is installed in the cert-manager namespace. Otherwise, copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager.

relevant_fields='
{
apiVersion: .apiVersion,
kind: .kind,
metadata: {
name: .metadata.name,
namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
},
spec: .spec
}
'
f1=$(mktemp)
f2=$(mktemp)
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get issuer -n cert-manager metrics-pki.cluster.local -o json | jq "${relevant_fields}" > $f1
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get certificate -n cert-manager metrics-ca -o json | jq "${relevant_fields}" > $f2
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f1
kubectl apply --kubeconfig USER_CLUSTER_KUBECONFIG -f $f2

Restore your own cert-manager in admin clusters

Scale the monitoring-operator deployment to 0.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0

Scale the cert-manager deployments managed by monitoring-operator to 0.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-webhook --replicas=0

Reinstall your own cert-manager. Restore your customized resources if you have any.

relevant_fields='
{
apiVersion: .apiVersion,
kind: .kind,
metadata: {
name: .metadata.name,
namespace: "YOUR_INSTALLED_CERT_MANAGER_NAMESPACE"
},
spec: .spec
}
'
f3=$(mktemp)
f4=$(mktemp)
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get issuer -n cert-manager metrics-pki.cluster.local -o json | jq "${relevant_fields}" > $f3
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get certificate -n cert-manager metrics-ca -o json | jq "${relevant_fields}" > $f4
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f3
kubectl apply --kubeconfig ADMIN_CLUSTER_KUBECONFIG -f $f4

False positives in docker, containerd, and runc vulnerability scanning

The docker, containerd, and runc in the Ubuntu OS images shipped with Google Distributed Cloud are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes will be qualified by Google Distributed Cloud before each release.

However, the special versions are unknown to the Ubuntu CVE Tracker, which various CVE scanning tools use as their vulnerability feed. Therefore, you will see false positives in docker, containerd, and runc vulnerability scanning results.

For example, your CVE scanning results might report CVEs that are already fixed in the latest patch versions of Google Distributed Cloud.

Refer to the release notes for any CVE fixes.

Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.

Unhealthy konnectivity server Pods when using the Seesaw or manual mode load balancer

If you are using Seesaw or the manual mode load balancer, you might notice the konnectivity server Pods are unhealthy. This happens because Seesaw does not support reusing an IP address across a service. For manual mode, creating a load balancer service does not automatically provision the service on your load balancer.

SSH tunneling is enabled in version 1.9 clusters. Thus, even if the konnectivity server is not healthy, you can still use the SSH tunnel, so that the connectivity to and within the cluster should not be affected. Therefore, you do not need to be concerned about these unhealthy Pods.

If you plan to upgrade from version 1.9.0 to 1.9.x, it is recommended you delete the unhealthy konnectivity server deployments before upgrading. Run this command.

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME delete Deployment konnectivity-server

`/etc/cron.daily/aide` CPU and memory spike issue

Starting from Google Distributed Cloud version 1.7.2, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.

As a result, the cron script /etc/cron.daily/aide is installed to schedule an aide check, which satisfies the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked".

The cron job runs daily at 6:25 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.

If the spikes are affecting your workload, you can disable the daily cron job:

`sudo chmod -x /etc/cron.daily/aide`.
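
To confirm the state of the job afterward, a check like the following sketch can be used. It relies on the fact that run-parts, which drives /etc/cron.daily, only executes files that have the executable bit set. The function name is illustrative.

```shell
# Report whether a cron.daily script would run: run-parts only executes
# files that have the executable bit set.
cron_script_enabled() {
  if [ -x "$1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

# Example on a node:
#   cron_script_enabled /etc/cron.daily/aide
# To re-enable the job later:
#   sudo chmod +x /etc/cron.daily/aide
```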

Load balancers and NSX-T stateful distributed firewall rules interact unpredictably

When you deploy Google Distributed Cloud version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap and cause gke-connect-agent Pods to be in a crash loop.

The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Google Distributed Cloud releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K. Follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.

If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.

Failure to register admin cluster during creation

If you create an admin cluster for version 1.9.x or 1.10.0, and if the admin cluster fails to register with the provided gkeConnect spec during its creation, you will get the following error.

  Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH

You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.

  failed to migrate to first admin trust chain: failed to parse current version "":
  invalid version: "" failed to migrate to first admin trust chain: failed to parse 
  current version "": invalid version: ""

If this error occurs, follow these steps to fix the cluster registration issue. After you do this fix, you can then upgrade your admin cluster.

Provide govc, the command line interface to vSphere, some variables declaring elements of your vCenter Server and vSphere environment.

Note: You can find many of these values in the admin cluster configuration file or by logging in to the vCenter Server.
```
export GOVC_URL=https://VCENTER_SERVER_ADDRESS
export GOVC_USERNAME=VCENTER_SERVER_USERNAME
export GOVC_PASSWORD=VCENTER_SERVER_PASSWORD
export GOVC_DATASTORE=VSPHERE_DATASTORE
export GOVC_DATACENTER=VSPHERE_DATACENTER
export GOVC_INSECURE=true
# DATA_DISK_NAME should not include the suffix ".vmdk"
export DATA_DISK_NAME=DATA_DISK_NAME
```
Replace the following:
- VCENTER_SERVER_ADDRESS is your vCenter Server's IP address or hostname.
- VCENTER_SERVER_USERNAME is the username of an account that holds the Administrator role or equivalent privileges in vCenter Server.
- VCENTER_SERVER_PASSWORD is the vCenter Server account's password.
- VSPHERE_DATASTORE is the name of the datastore you've configured in your vSphere environment.
- VSPHERE_DATACENTER is the name of the datacenter you've configured in your vSphere environment.
- DATA_DISK_NAME is the name of the data disk.

Download the DATA_DISK_NAME-checkpoint.yaml file.

govc datastore.download ${DATA_DISK_NAME}-checkpoint.yaml temp-checkpoint.yaml

Edit the checkpoint fields.

# Find out the gkeOnPremVersion
export KUBECONFIG=ADMIN_CLUSTER_KUBECONFIG
ADMIN_CLUSTER_NAME=$(kubectl get onpremadmincluster -n kube-system --no-headers | awk '{ print $1 }')
GKE_ON_PREM_VERSION=$(kubectl get onpremadmincluster -n kube-system $ADMIN_CLUSTER_NAME -o=jsonpath='{.spec.gkeOnPremVersion}')

# Replace the gkeOnPremVersion in temp-checkpoint.yaml
sed -i "s/gkeonpremversion: \"\"/gkeonpremversion: \"$GKE_ON_PREM_VERSION\"/" temp-checkpoint.yaml

# The steps below are only needed for upgrading from 1.9.x to 1.10.x clusters.

# Find out the provider ID of the admin control-plane VM
ADMIN_CONTROL_PLANE_MACHINE_NAME=$(kubectl get machines --no-headers | grep master | awk '{ print $1 }')
ADMIN_CONTROL_PLANE_PROVIDER_ID=$(kubectl get machines $ADMIN_CONTROL_PLANE_MACHINE_NAME -o=jsonpath='{.spec.providerID}' | sed 's/\//\\\//g')

# Fill in the providerID field in temp-checkpoint.yaml
sed -i "s/providerid: null/providerid: \"$ADMIN_CONTROL_PLANE_PROVIDER_ID\"/" temp-checkpoint.yaml

Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin cluster kubeconfig file.

Generate a new checksum.
- Change the last line of the checkpoint file to
```
checksum:$NEW_CHECKSUM
```
  Replace NEW_CHECKSUM with the output of the following command:
```
sha256sum temp-checkpoint.yaml
```

Upload the new checkpoint file.

govc datastore.upload temp-checkpoint.yaml ${DATA_DISK_NAME}-checkpoint.yaml

Using Anthos Identity Service can cause the Connect Agent to restart unpredictably

If you are using the Anthos Identity Service feature to manage Anthos Identity Service ClientConfig, the Connect Agent might restart unexpectedly.

If you have experienced this issue with an existing cluster, you can do one of the following:

- Disable Anthos Identity Service (AIS). Disabling AIS does not remove the deployed AIS binary or the AIS ClientConfig. To disable AIS, run this command:
```
gcloud beta container hub identity-service disable --project PROJECT_NAME
```
  Replace PROJECT_NAME with the name of the cluster's fleet host project.
- Update the cluster to version 1.9.3, or 1.10.1 or later, to upgrade the Connect Agent version.

High network traffic to monitoring.googleapis.com

You might see high network traffic to monitoring.googleapis.com, even in a new cluster that has no user workloads.

This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This issue is fixed in version 1.10.2 and 1.9.5.

To fix this issue, upgrade to version 1.10.2/1.9.5 or later.
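
To decide whether a given cluster version falls in the affected ranges, a comparison such as the following sketch could be used. It assumes plain X.Y.Z version strings and GNU sort with the -V (version sort) option. The function names are illustrative.

```shell
# Succeed if version $1 is less than or equal to version $2 (GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeed if the version is in 1.9.0-1.9.4 or 1.10.0-1.10.1.
is_affected() {
  v="$1"
  { version_le 1.9.0 "$v" && version_le "$v" 1.9.4; } ||
  { version_le 1.10.0 "$v" && version_le "$v" 1.10.1; }
}

# Example:
#   if is_affected 1.9.3; then echo "upgrade or apply the mitigation"; fi
```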

To mitigate this issue for an earlier version:

Scale down stackdriver-operator:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
   scale deployment stackdriver-operator --replicas=0

Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.

Open the gke-metrics-agent-conf ConfigMap for editing:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
   edit configmap gke-metrics-agent-conf

Increase the probe interval from 0.1 seconds to 13 seconds:

processors:
  disk_buffer/metrics:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-metrics
    probe_interval: 13s
    retention_size_mib: 6144
  disk_buffer/self:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-self
    probe_interval: 13s
    retention_size_mib: 200
  disk_buffer/uptime:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-uptime
    probe_interval: 13s
    retention_size_mib: 200

Close the editing session.

Change gke-metrics-agent DaemonSet version to 1.1.0-anthos.8:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
   edit daemonset gke-metrics-agent

image: gcr.io/gke-on-prem-release/gke-metrics-agent:1.1.0-anthos.8 # use 1.1.0-anthos.8
imagePullPolicy: IfNotPresent
name: gke-metrics-agent
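The probe-interval change in the steps above can also be scripted instead of made in an interactive editor. The following is a minimal sketch, not the documented procedure: `set_probe_interval` is a hypothetical helper, and the export and apply commands shown in the comments are an assumed alternative to `kubectl edit`. Scale down stackdriver-operator first, as in the steps above.

```shell
# Hypothetical filter: rewrites every 0.1s probe interval to 13s.
# Reads YAML on stdin and writes the result to stdout.
set_probe_interval() {
  sed 's/probe_interval: 0\.1s/probe_interval: 13s/'
}

# Assumed usage against a cluster (not run here):
#   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
#     get configmap gke-metrics-agent-conf -o yaml \
#     | set_probe_interval \
#     | kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system apply -f -

# Local demonstration on a sample fragment:
printf '    probe_interval: 0.1s\n    retention_size_mib: 200\n' | set_probe_interval
```

Because the helper is a plain text filter, you can run it on a saved copy of the ConfigMap and review the result before applying anything to the cluster.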

Missing metrics on some nodes

You might find that the following metrics are missing on some, but not all, nodes:

kubernetes.io/anthos/container_memory_working_set_bytes
kubernetes.io/anthos/container_cpu_usage_seconds_total
kubernetes.io/anthos/container_network_receive_bytes_total

To fix this issue:

[version 1.9.5+]: Increase the CPU for gke-metrics-agent by following steps 1 - 4.
[version 1.9.0-1.9.4]: Follow steps 1 - 9.

Open your stackdriver resource for editing:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system edit stackdriver stackdriver

To increase the CPU request for gke-metrics-agent from 10m to 50m, add the following resourceAttrOverride section to the stackdriver manifest:

spec:
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 100m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi

Your edited resource should look similar to the following:

spec:
  anthosDistribution: on-prem
  clusterLocation: us-west1-a
  clusterName: my-cluster
  enableStackdriverForApplications: true
  gcpServiceAccountSecretName: ...
  optimizedMetrics: true
  portable: true
  projectID: my-project-191923
  proxyConfigSecretName: ...
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 100m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi

Save your changes and close the text editor.

To verify your changes have taken effect, run the following command:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system get daemonset gke-metrics-agent -o yaml | grep "cpu: 50m"

The command finds cpu: 50m if your edits have taken effect.

To prevent the changes in the following steps from being reverted, scale down stackdriver-operator:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system scale deploy stackdriver-operator --replicas=0

Open gke-metrics-agent-conf for editing:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system edit configmap gke-metrics-agent-conf

Edit the configuration to change all instances of probe_interval: 0.1s to probe_interval: 13s:

processors:
  disk_buffer/metrics:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-metrics
    probe_interval: 13s
    retention_size_mib: 6144
  disk_buffer/self:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-self
    probe_interval: 13s
    retention_size_mib: 200
  disk_buffer/uptime:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-uptime
    probe_interval: 13s
    retention_size_mib: 200

Save your changes and close the text editor.

Change gke-metrics-agent DaemonSet version to 1.1.0-anthos.8:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
   edit daemonset gke-metrics-agent

image: gcr.io/gke-on-prem-release/gke-metrics-agent:1.1.0-anthos.8 # use 1.1.0-anthos.8
imagePullPolicy: IfNotPresent
name: gke-metrics-agent
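After finishing the steps above, you might want to confirm that both edits are in place. The following is a minimal sketch under stated assumptions: `probe_interval_ok` and `image_ok` are hypothetical helpers that scan manifest YAML piped in from `kubectl get ... -o yaml`. They are plain text checks, not kubectl features.

```shell
# Hypothetical check: succeeds if no 0.1s probe interval remains in the YAML on stdin.
probe_interval_ok() {
  if grep -q 'probe_interval: 0\.1s'; then return 1; fi
  return 0
}

# Hypothetical check: succeeds if the YAML on stdin references the 1.1.0-anthos.8 image.
image_ok() {
  grep -q 'gke-metrics-agent:1\.1\.0-anthos\.8'
}

# Assumed usage against a cluster (not run here):
#   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
#     get configmap gke-metrics-agent-conf -o yaml | probe_interval_ok \
#     || echo "0.1s probe interval still present"
#   kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
#     get daemonset gke-metrics-agent -o yaml | image_ok \
#     || echo "image not updated"
```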

Cisco ACI doesn't work with Direct Server Return (DSR)

Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning. A possible workaround is to disable IP learning by adding the Seesaw IP address as an L4-L7 Virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).

You can configure the L4-L7 Virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning will result in IP endpoint flapping between different locations in the Cisco ACI fabric.

gkectl diagnose checking certificates failure

If your workstation does not have access to user cluster worker nodes, gkectl diagnose reports the following failures. It is safe to ignore them.

Checking user cluster certificates...FAILURE
    Reason: 3 user cluster certificates error(s).
    Unhealthy Resources:
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
    Node kubelet CA and certificate on node xxx: failed to verify kubelet certificate on node xxx: dial tcp xxx.xxx.xxx.xxx:10250: connect: connection timed out
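To confirm that these failures come from missing network access rather than from the certificates themselves, you can test whether the workstation can open a TCP connection to the kubelet port (10250, as shown in the error output). The following is a minimal sketch that assumes bash and the coreutils `timeout` command are available. `can_reach` is a hypothetical helper, and NODE_IP is a placeholder for a worker node address.

```shell
# Hypothetical helper: succeeds if a TCP connection to HOST PORT opens within 5 seconds.
# Relies on bash's /dev/tcp redirection, so it invokes bash explicitly.
can_reach() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Assumed usage from the workstation (NODE_IP is a placeholder):
#   can_reach NODE_IP 10250 || echo "kubelet port unreachable from this workstation"
```

If the port is unreachable, the timeouts in the diagnose output are consistent with the access limitation described above.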
