This document describes known issues for version 1.9 of Anthos clusters on VMware (GKE on-prem).
ClientConfig custom resource
gkectl update reverts any manual changes that you have made to the ClientConfig custom resource. We strongly recommend that you back up the ClientConfig resource after every manual change.
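For example, the following sketch saves the resource to a local file. It assumes the ClientConfig is the default object in the kube-public namespace of the user cluster; adjust the name and namespace to match your cluster.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-public get clientconfig default -o yaml > clientconfig-backup.yaml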
gkectl check-config validation fails: can't find F5 BIG-IP partitions
- Symptoms
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
- Potential causes
An issue with the F5 BIG-IP API can cause validation to fail.
- Resolution
Try running gkectl check-config again.
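As an illustration only, the retry is simply the same invocation that failed; the file name below is a placeholder for the cluster configuration file you validated originally.
gkectl check-config --config CLUSTER_CONFIG_FILE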
Disruption for workloads with PodDisruptionBudgets
Upgrading clusters can cause disruption or downtime for workloads that use PodDisruptionBudgets (PDBs).
Nodes fail to complete their upgrade process
If you have PodDisruptionBudget objects configured that are unable to allow any additional disruptions, node upgrades might fail to upgrade to the control plane version after repeated attempts. To prevent this failure, we recommend that you scale up the Deployment or HorizontalPodAutoscaler to allow the node to drain while still respecting the PodDisruptionBudget configuration (a scale-up example follows the command below).
To see all PodDisruptionBudget objects that do not allow any disruptions:
kubectl get poddisruptionbudget --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.name}/{.metadata.namespace}{"\n"}{end}'
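If one of those PodDisruptionBudgets belongs to a Deployment that you control, scaling the Deployment up restores disruption headroom. This is a minimal sketch; DEPLOYMENT_NAME, NAMESPACE, and the replica count are placeholders that you must adapt to your workload.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n NAMESPACE scale deployment DEPLOYMENT_NAME --replicas=3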
User cluster installation failed because of cert-manager/ca-injector's leader election issue
You might see an installation failure due to cert-manager-cainjector being in a crash loop when the apiserver or etcd is slow:
# These are logs from `cert-manager-cainjector`, from the command
# `kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system logs cert-manager-cainjector-xxx`
I0923 16:19:27.911174 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election: timed out waiting for the condition
E0923 16:19:27.911110 1 leaderelection.go:321] error retrieving resource lock kube-system/cert-manager-cainjector-leader-election-core: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election-core": context deadline exceeded
I0923 16:19:27.911593 1 leaderelection.go:278] failed to renew lease kube-system/cert-manager-cainjector-leader-election-core: timed out waiting for the condition
E0923 16:19:27.911629 1 start.go:163] cert-manager/ca-injector "msg"="error running core-only manager" "error"="leader election lost"
Run the following commands to mitigate the problem.
First, scale down monitoring-operator so that it does not revert the changes to the cert-manager Deployment:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0
Second, edit the cert-manager-cainjector Deployment to disable leader election, which is not required because only one replica is running:
# Add a command line flag for cainjector: `--leader-elect=false`
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit -n kube-system deployment cert-manager-cainjector
The relevant YAML snippet of the cert-manager-cainjector Deployment should look like this:
...
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cert-manager-cainjector
  namespace: kube-system
...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: cert-manager
        image: "gcr.io/gke-on-prem-staging/cert-manager-cainjector:v1.0.3-gke.0"
        args:
        ...
        - --leader-elect=false
...
Keep monitoring-operator replicas at 0 as a mitigation until the installation is finished. Otherwise it will revert the change.
After the installation is finished and the cluster is up and running, turn monitoring-operator back on for day-2 operations:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=1
After each upgrade, the changes will be reverted. Perform the same steps again to mitigate the issue until this is fixed in a future release.
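To confirm that the mitigation worked, you can check that the cainjector Pod has stopped crash-looping. This is just one way to look and is not part of the official procedure:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system get pods | grep cert-manager-cainjector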
Renewal of certificates might be required before an admin cluster upgrade
Before you begin the admin cluster upgrade process, you should make sure that your admin cluster certificates are currently valid, and renew these certificates if they are not.
Admin cluster certificate renewal process
Make sure that OpenSSL is installed on the admin workstation before you begin.
Set the KUBECONFIG variable:
KUBECONFIG=ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG
Replace ABSOLUTE_PATH_ADMIN_CLUSTER_KUBECONFIG with the absolute path to the admin cluster kubeconfig file.
Get the IP address and SSH keys for the admin master node:
kubectl --kubeconfig "${KUBECONFIG}" get secrets -n kube-system sshkeys \
    -o jsonpath='{.data.vsphere_tmp}' | base64 -d > \
    ~/.ssh/admin-cluster.key && chmod 600 ~/.ssh/admin-cluster.key

export MASTER_NODE_IP=$(kubectl --kubeconfig "${KUBECONFIG}" get nodes -o \
    jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}' \
    --selector='node-role.kubernetes.io/master')
Check if the certificates are expired:
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
    "sudo kubeadm alpha certs check-expiration"
If the certificates are expired, you must renew them before upgrading the admin cluster.
Back up old certificates:
This is an optional, but recommended, step.
# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"

# on admin master
sudo tar -czvf backup.tar.gz /etc/kubernetes
logout

# on worker node
sudo scp -i ~/.ssh/admin-cluster.key \
    ubuntu@"${MASTER_NODE_IP}":/home/ubuntu/backup.tar.gz .
Renew the certificates with kubeadm:
# ssh into admin master
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}"

# on admin master
sudo kubeadm alpha certs renew all
Restart the admin master node:
# on admin master
cd /etc/kubernetes
sudo mkdir tempdir
sudo mv manifests/*.yaml tempdir/
sleep 5
echo "remove pods"
# ensure kubelet detects the change and removes those pods
# wait until the result of this command is empty
sudo docker ps | grep kube-apiserver

# ensure kubelet starts those pods again
echo "start pods again"
sudo mv tempdir/*.yaml manifests/
sleep 30

# ensure kubelet started those pods again
# should show some results
sudo docker ps | grep -e kube-apiserver -e kube-controller-manager -e kube-scheduler -e etcd

# clean up
sudo rm -rf tempdir

logout
Because the admin cluster kubeconfig file also expires if the admin certificates expire, you should back up this file before expiration.
Back up the admin cluster kubeconfig file:
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
    "sudo cat /etc/kubernetes/admin.conf" > new_admin.conf
vi "${KUBECONFIG}"
Replace client-certificate-data and client-key-data in the kubeconfig with the client-certificate-data and client-key-data values from the new_admin.conf file that you created.
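If you prefer not to locate the fields by eye, a quick way to print them from the new file is shown below. This is only a convenience, not part of the official procedure:
grep 'client-certificate-data' new_admin.conf
grep 'client-key-data' new_admin.conf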
You must validate the renewed certificates, and validate the certificate of kube-apiserver.
Check certificates expiration:
ssh -i ~/.ssh/admin-cluster.key ubuntu@"${MASTER_NODE_IP}" \
    "sudo kubeadm alpha certs check-expiration"
Check the certificate of kube-apiserver:
# Get the IP address of kube-apiserver
cat $KUBECONFIG | grep server

# Get the current kube-apiserver certificate
openssl s_client -showcerts -connect APISERVER_IP:APISERVER_PORT \
    | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
    > current-kube-apiserver.crt

# check the expiration date of this cert
openssl x509 -in current-kube-apiserver.crt -noout -enddate
Replace APISERVER_IP and APISERVER_PORT with the address and port from the server field of the kubeconfig.
Restarting or upgrading vCenter for versions lower than 7.0U2
If vCenter, for versions lower than 7.0U2, is restarted after an upgrade or otherwise, the network name in the VM information from vCenter is incorrect, which results in the machine being in an Unavailable state. This eventually leads to the nodes being auto-repaired to create new ones.
Related govmomi bug: https://github.com/vmware/govmomi/issues/2552
This workaround is provided by VMware support:
1. The issue is fixed in vCenter versions 7.0U2 and above.
2. For lower versions: Right-click the host, and then select Connection > Disconnect. Next, reconnect, which forces an update of the VM's portgroup.
SSH connection closed by remote host
For Anthos clusters on VMware version 1.7.2 and above, the Ubuntu OS images are hardened with CIS L1 Server Benchmark.
To meet the CIS rule "5.2.16 Ensure SSH Idle Timeout Interval is configured", /etc/ssh/sshd_config has the following settings:
ClientAliveInterval 300
ClientAliveCountMax 0
The purpose of these settings is to terminate a client session after 5 minutes of idle time. However, the ClientAliveCountMax 0 value causes unexpected behavior. When you use an SSH session on the admin workstation or a cluster node, the SSH connection might be disconnected even if your SSH client is not idle, such as when running a time-consuming command, and your command could get terminated with the following message:
Connection to [IP] closed by remote host.
Connection to [IP] closed.
As a workaround, you can either:
Use nohup to prevent your command from being terminated on SSH disconnection:
nohup gkectl upgrade admin --config admin-cluster.yaml --kubeconfig kubeconfig
Update sshd_config to use a non-zero ClientAliveCountMax value. The CIS rule recommends using a value of less than 3:
sudo sed -i 's/ClientAliveCountMax 0/ClientAliveCountMax 1/g' /etc/ssh/sshd_config
sudo systemctl restart sshd
Make sure you reconnect your SSH session.
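To confirm that the new value is active after the restart, you can print the effective sshd configuration. This check is a suggestion, not part of the documented workaround:
sudo sshd -T | grep -i clientalive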
Conflict with cert-manager when upgrading to a version higher than 1.8.2
If you have your own cert-manager installation with Anthos clusters on VMware, you might experience a failure when you attempt to upgrade to versions 1.8.2 or above. This is a result of a conflict between your version of cert-manager, which is likely installed in the cert-manager namespace, and the version managed by monitoring-operator.
If you try to install another copy of cert-manager after upgrading to Anthos clusters on VMware version 1.8.2 or above, the installation might fail due to a conflict with the existing one managed by monitoring-operator.
The metrics-ca cluster issuer, which control-plane and observability components rely on for creation and rotation of cert secrets, requires a metrics-ca cert secret to be stored in the cluster resource namespace. This namespace is kube-system for the monitoring-operator installation, and is likely to be cert-manager for your installation.
If you have experienced an installation failure, follow these steps to upgrade successfully to version 1.8.2 or later:
Uninstall your version of cert-manager.
Perform the upgrade.
If you want to restore your own installation of cert-manager, follow these steps:
Scale the monitoring-operator deployment to 0. For the admin cluster, run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0
For a user cluster, run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0
Scale the cert-manager deployments managed by monitoring-operator to 0. For the admin cluster, run these commands:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0
For a user cluster, run this command:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system scale deployment cert-manager-webhook --replicas=0
Reinstall your cert-manager.
Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from kube-system to the cluster resource namespace of your installed cert-manager. This namespace might be cert-manager if you are using the upstream cert-manager release, such as v1.0.3, but that depends on your installation.
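One possible way to copy those two objects is to export them, change the namespace, strip server-set metadata, and re-apply them. This is a sketch only: the resource names come from the step above, the cert-manager target namespace is an assumption that depends on your installation, and you may prefer to recreate the objects from your own manifests instead.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system get certificate metrics-ca -o yaml > metrics-ca.yaml
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n kube-system get issuer metrics-pki.cluster.local -o yaml > metrics-pki.yaml
# In both files: set metadata.namespace to cert-manager (or your namespace) and
# delete metadata.uid, metadata.resourceVersion, metadata.creationTimestamp, and the status section.
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG apply -f metrics-ca.yaml -f metrics-pki.yaml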
Managing a third-party cert-manager when upgrading to version 1.9.2 or higher
If you have experienced an installation failure, follow these steps to upgrade successfully to version 1.9.2 or later:
Uninstall your version of cert-manager.
Perform the upgrade.
If you want to restore your own installation of cert-manager, follow these steps:
Scale the monitoring-operator deployment to 0. For the admin cluster, run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n kube-system scale deployment monitoring-operator --replicas=0
For a user cluster, run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME scale deployment monitoring-operator --replicas=0
Scale the cert-manager deployments managed by monitoring-operator to 0. For the admin cluster, run these commands:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-webhook --replicas=0
For a user cluster, run this command:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-cainjector --replicas=0
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG -n cert-manager scale deployment cert-manager-webhook --replicas=0
Reinstall your cert-manager.
Copy the metrics-ca cert-manager.io/v1 Certificate and the metrics-pki.cluster.local Issuer resources from cert-manager to the cluster resource namespace of your installed cert-manager. This namespace might be cert-manager if you are using the upstream cert-manager release, such as v1.5.4, but that depends on your installation.
False positives in docker, containerd, and runc vulnerability scanning
The docker, containerd, and runc packages in the Ubuntu OS images shipped with Anthos clusters on VMware are pinned to special versions using Ubuntu PPA. This ensures that any container runtime changes are qualified by Anthos clusters on VMware before each release.
However, the special versions are unknown to the Ubuntu CVE Tracker, which is used as the vulnerability feed by various CVE scanning tools. Therefore, you will see false positives in docker, containerd, and runc vulnerability scanning results.
For example, you might see the following false positives from your CVE scanning results. These CVEs are already fixed in the latest patch versions of Anthos clusters on VMware.
Refer to the release notes for any CVE fixes.
Canonical is aware of this issue, and the fix is tracked at https://github.com/canonical/sec-cvescan/issues/73.
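If you want to see the exact pinned versions that a scanner is flagging, you can list the installed packages on a node. This is only a way to inspect the versions, not a remediation:
dpkg -l | grep -E 'docker|containerd|runc'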
Unhealthy konnectivity server Pods when using the Seesaw or manual mode load balancer
If you are using Seesaw or the manual mode load balancer, you might notice the konnectivity server Pods are unhealthy. This happens because Seesaw does not support reusing an IP address across a service. For manual mode, creating a load balancer service does not automatically provision the service on your load balancer.
SSH tunneling is enabled in version 1.9 clusters, so even if the konnectivity server is not healthy, you can still use the SSH tunnel, and connectivity to and within the cluster is not affected. Therefore, you do not need to be concerned about these unhealthy Pods.
If you plan to upgrade from version 1.9.0 to 1.9.x, we recommend that you delete the unhealthy konnectivity server Deployments before upgrading. Run this command:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME delete Deployment konnectivity-server
/etc/cron.daily/aide CPU and memory spike issue
Starting from Anthos clusters on VMware version 1.7.2, the Ubuntu OS images are hardened with the CIS L1 Server Benchmark.
As a result, the cron script /etc/cron.daily/aide is installed so that an aide check is scheduled to ensure that the CIS L1 Server rule "1.4.2 Ensure filesystem integrity is regularly checked" is followed.
The cron job runs daily at 6:00 AM UTC. Depending on the number of files on the filesystem, you may experience CPU and memory usage spikes around that time that are caused by this aide process.
If the spikes are affecting your workload, you can disable the daily cron job:
sudo chmod -x /etc/cron.daily/aide
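To re-enable the daily check later, restore the execute bit (the inverse of the command above):
sudo chmod +x /etc/cron.daily/aide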
Load balancers and NSX-T stateful distributed firewall rules interact unpredictably
When you deploy Anthos clusters on VMware version 1.9 or later with the Seesaw bundled load balancer in an environment that uses NSX-T stateful distributed firewall rules, stackdriver-operator might fail to create the gke-metrics-agent-conf ConfigMap, and gke-connect-agent Pods might end up in a crash loop.
The underlying issue is that the stateful NSX-T distributed firewall rules terminate the connection from a client to the user cluster API server through the Seesaw load balancer because Seesaw uses asymmetric connection flows. The integration issues with NSX-T distributed firewall rules affect all Anthos clusters on VMware releases that use Seesaw. You might see similar connection problems on your own applications when they create large Kubernetes objects whose sizes are bigger than 32K. Follow these instructions to disable NSX-T distributed firewall rules, or to use stateless distributed firewall rules for Seesaw VMs.
If your clusters use a manual load balancer, follow these instructions to configure your load balancer to reset client connections when it detects a backend node failure. Without this configuration, clients of the Kubernetes API server might stop responding for several minutes when a server instance goes down.
Failure to register admin cluster during creation
If you create an admin cluster for version 1.9.x or 1.10.0, and the admin cluster fails to register with the provided gkeConnect spec during its creation, you will get the following error:
Failed to create root cluster: failed to register admin cluster: failed to register cluster: failed to apply Hub Membership: Membership API request failed: rpc error: code = PermissionDenied desc = Permission 'gkehub.memberships.get' denied on PROJECT_PATH
You will still be able to use this admin cluster, but you will get the following error if you later attempt to upgrade the admin cluster to version 1.10.y.
failed to migrate to first admin trust chain: failed to parse current version "":
invalid version: "" failed to migrate to first admin trust chain: failed to parse
current version "": invalid version: ""
If this error occurs, follow these steps to fix the cluster registration issue. After you do this fix, you can then upgrade your admin cluster.
Provide govc, the command-line interface to vSphere, some variables declaring elements of your vCenter Server and vSphere environment:
export GOVC_URL=https://VCENTER_SERVER_ADDRESS
export GOVC_USERNAME=VCENTER_SERVER_USERNAME
export GOVC_PASSWORD=VCENTER_SERVER_PASSWORD
export GOVC_DATASTORE=VSPHERE_DATASTORE
export GOVC_DATACENTER=VSPHERE_DATACENTER
export GOVC_INSECURE=true

# DATA_DISK_NAME should not include the suffix ".vmdk"
export DATA_DISK_NAME=DATA_DISK_NAME
Replace the following:
- VCENTER_SERVER_ADDRESS is your vCenter Server's IP address or hostname.
- VCENTER_SERVER_USERNAME is the username of an account that holds the Administrator role or equivalent privileges in vCenter Server.
- VCENTER_SERVER_PASSWORD is the vCenter Server account's password.
- VSPHERE_DATASTORE is the name of the datastore you've configured in your vSphere environment.
- VSPHERE_DATACENTER is the name of the datacenter you've configured in your vSphere environment.
- DATA_DISK_NAME is the name of the data disk.
Download the DATA_DISK_NAME-checkpoint.yaml file:
govc datastore.download ${DATA_DISK_NAME}-checkpoint.yaml temp-checkpoint.yaml
Edit the checkpoint fields.
# Find out the gkeOnPremVersion
export KUBECONFIG=ADMIN_CLUSTER_KUBECONFIG
ADMIN_CLUSTER_NAME=$(kubectl get onpremadmincluster -n kube-system --no-headers | awk '{ print $1 }')
GKE_ON_PREM_VERSION=$(kubectl get onpremadmincluster -n kube-system $ADMIN_CLUSTER_NAME -o=jsonpath='{.spec.gkeOnPremVersion}')

# Replace the gkeOnPremVersion in temp-checkpoint.yaml
sed -i "s/gkeonpremversion: \"\"/gkeonpremversion: \"$GKE_ON_PREM_VERSION\"/" temp-checkpoint.yaml

# The steps below are only needed for upgrading from 1.9.x to 1.10.x clusters.
# Find out the provider ID of the admin control-plane VM
ADMIN_CONTROL_PLANE_MACHINE_NAME=$(kubectl get machines --no-headers | grep master)
ADMIN_CONTROL_PLANE_PROVIDER_ID=$(kubectl get machines $ADMIN_CONTROL_PLANE_MACHINE_NAME -o=jsonpath='{.spec.providerID}' | sed 's/\//\\\//g')

# Fill in the providerID field in temp-checkpoint.yaml
sed -i "s/providerid: null/providerid: \"$ADMIN_CONTROL_PLANE_PROVIDER_ID\"/" temp-checkpoint.yaml
Replace ADMIN_CLUSTER_KUBECONFIG with the path of your admin cluster kubeconfig file.
Generate a new checksum. Change the last line of the checkpoint file to:
checksum:$NEW_CHECKSUM
Replace NEW_CHECKSUM with the output of the following command (see the sketch after this step for one way to do the replacement in place):
sha256sum temp-checkpoint.yaml
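As a convenience, the replacement can be scripted. This is a sketch that follows the literal order of the step above and assumes the last line of temp-checkpoint.yaml is already a checksum: line:
NEW_CHECKSUM=$(sha256sum temp-checkpoint.yaml | awk '{print $1}')
# Rewrite the trailing checksum line in place (assumes it is the last line of the file).
sed -i '$ s/checksum:.*/checksum:'"$NEW_CHECKSUM"'/' temp-checkpoint.yaml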
Upload the new checkpoint file.
govc datastore.upload temp-checkpoint.yaml ${DATA_DISK_NAME}-checkpoint.yaml
Using Anthos Identity Service can cause the Connect Agent to restart unpredictably
If you are using the Anthos Identity Service feature to manage Anthos Identity Service ClientConfig, the Connect Agent might restart unexpectedly.
If you have experienced this issue with an existing cluster, you can do one of the following:
Disable Anthos Identity Service (AIS). If you disable AIS, that will not remove the deployed AIS binary or remove AIS ClientConfig. To disable AIS, run this command:
gcloud beta container hub identity-service disable --project PROJECT_NAME
Replace PROJECT_NAME with the name of the cluster's fleet host project.
Update the cluster to version 1.9.3 or later, or version 1.10.1 or later, to upgrade the Connect Agent.
High network traffic to monitoring.googleapis.com
You might see high network traffic to monitoring.googleapis.com, even in a new cluster that has no user workloads.
This issue affects version 1.10.0-1.10.1 and version 1.9.0-1.9.4. This issue is fixed in version 1.10.2 and 1.9.5.
To fix this issue, upgrade to version 1.10.2/1.9.5 or later.
To mitigate this issue for an earlier version:
Scale down stackdriver-operator:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
    scale deployment stackdriver-operator --replicas=0
Replace USER_CLUSTER_KUBECONFIG with the path of the user cluster kubeconfig file.
Open the gke-metrics-agent-conf ConfigMap for editing:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
    edit configmap gke-metrics-agent-conf
Increase the probe interval from 0.1 seconds to 13 seconds:
processors:
  disk_buffer/metrics:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-metrics
    probe_interval: 13s
    retention_size_mib: 6144
  disk_buffer/self:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-self
    probe_interval: 13s
    retention_size_mib: 200
  disk_buffer/uptime:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-uptime
    probe_interval: 13s
    retention_size_mib: 200
Close the editing session.
Change the gke-metrics-agent DaemonSet version to 1.1.0-anthos.8:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
    edit daemonset gke-metrics-agent

image: gcr.io/gke-on-prem-release/gke-metrics-agent:1.1.0-anthos.8 # use 1.1.0-anthos.8
imagePullPolicy: IfNotPresent
name: gke-metrics-agent
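To verify that the DaemonSet now uses the intended image, you can print it directly. This check is only a suggestion:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system get daemonset gke-metrics-agent -o yaml | grep "image:"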
Missing metrics on some nodes
You might find that the following metrics are missing on some, but not all, nodes:
kubernetes.io/anthos/container_memory_working_set_bytes
kubernetes.io/anthos/container_cpu_usage_seconds_total
kubernetes.io/anthos/container_network_receive_bytes_total
To fix this issue:
- For version 1.9.5 and later: increase the CPU for gke-metrics-agent by following steps 1 - 4.
- For versions 1.9.0 - 1.9.4: follow steps 1 - 9.
Open your stackdriver resource for editing:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system edit stackdriver stackdriver
To increase the CPU request for gke-metrics-agent from 10m to 50m, add the following resourceAttrOverride section to the stackdriver manifest:
spec:
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 100m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi
Your edited resource should look similar to the following:
spec:
  anthosDistribution: on-prem
  clusterLocation: us-west1-a
  clusterName: my-cluster
  enableStackdriverForApplications: true
  gcpServiceAccountSecretName: ...
  optimizedMetrics: true
  portable: true
  projectID: my-project-191923
  proxyConfigSecretName: ...
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      limits:
        cpu: 100m
        memory: 4608Mi
      requests:
        cpu: 50m
        memory: 200Mi
Save your changes and close the text editor.
To verify your changes have taken effect, run the following command:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system get daemonset gke-metrics-agent -o yaml | grep "cpu: 50m"
The command finds cpu: 50m if your edits have taken effect.
To prevent the following changes from being reverted, scale down stackdriver-operator:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system scale deploy stackdriver-operator --replicas=0
Open gke-metrics-agent-conf for editing:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system edit configmap gke-metrics-agent-conf
Edit the configuration to change all instances of probe_interval: 0.1s to probe_interval: 13s:
processors:
  disk_buffer/metrics:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-metrics
    probe_interval: 13s
    retention_size_mib: 6144
  disk_buffer/self:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-self
    probe_interval: 13s
    retention_size_mib: 200
  disk_buffer/uptime:
    backend_endpoint: https://monitoring.googleapis.com:443
    buffer_dir: /metrics-data/nsq-metrics-uptime
    probe_interval: 13s
    retention_size_mib: 200
Save your changes and close the text editor.
Change the gke-metrics-agent DaemonSet version to 1.1.0-anthos.8:
kubectl --kubeconfig USER_CLUSTER_KUBECONFIG --namespace kube-system \
    edit daemonset gke-metrics-agent

image: gcr.io/gke-on-prem-release/gke-metrics-agent:1.1.0-anthos.8 # use 1.1.0-anthos.8
imagePullPolicy: IfNotPresent
name: gke-metrics-agent
Cisco ACI doesn't work with Direct Server Return (DSR)
Seesaw runs in DSR mode, and by default it doesn't work in Cisco ACI because of data-plane IP learning. A possible workaround is to disable IP learning by adding the Seesaw IP address as an L4-L7 virtual IP in the Cisco Application Policy Infrastructure Controller (APIC).
You can configure the L4-L7 virtual IP option by going to Tenant > Application Profiles > Application EPGs or uSeg EPGs. Failure to disable IP learning will result in IP endpoint flapping between different locations in the Cisco ACI fabric.