Troubleshooting resource contention issues

This page describes how to identify and troubleshoot resource contention issues in your GKE on VMware environment.

If you need additional assistance, reach out to Cloud Customer Care.

Overview

Sometimes your GKE on VMware environment might experience resource contention, causing your containers to slow down, underperform, or get terminated. This can happen due to high CPU or memory consumption by the containers.

How CPU and memory management works

  • CPU:

    • A Pod is scheduled to a node based on the CPU requests specified by the containers in the Pod.
    • A container in a Pod can't use more CPU than the limit specified for the container.
    • The CPU usage of the container is throttled at the CPU limit.
    • If CPU usage is throttled at the node level, containers are automatically assigned CPU cycles proportional to their requests.

    Learn more about how Pods with resource requests are scheduled.

  • Memory:

    • A Pod is scheduled to a node based on the memory requests specified by the containers in the Pod.
    • A container can't use more memory than the limit specified by the container.
    • If no memory limit is specified, a container might consume all of the available memory on a node. Then the system might trigger the OOM-Killer (Out of Memory Killer) and evict low-priority Pods.

For more information, see Assign CPU resources, Assign memory resources in Kubernetes, and GKE Enterprise metrics.
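
As a quick command-line complement to these concepts, the following sketch prints the CPU and memory requests and limits configured for each container in a Pod. The placeholder names (POD_NAME, NAMESPACE_NAME, CLUSTER_KUBECONFIG) follow the conventions used elsewhere on this page.

  # Print the name and the configured resource requests and limits of every
  # container in the Pod. An empty resources field means no requests or limits
  # are set, so the container can consume node resources unchecked.
  kubectl get pod POD_NAME \
    --namespace NAMESPACE_NAME \
    --kubeconfig CLUSTER_KUBECONFIG \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'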

Issues

Container becomes slow

CPU contention issues can cause containers to become slow. The following are some potential reasons:

High CPU utilization on the container

A container can become slow if it doesn't get CPU cycles proportional to its CPU requests, or if its CPU requests are set too low for what the container needs. So check the ratio of the container's CPU utilization to its CPU limit.

In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query:

  fetch k8s_container
  | metric 'kubernetes.io/anthos/container/cpu/limit_utilization'
  | group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]
  | filter resource.cluster_name == 'CLUSTER_NAME'
  | filter resource.container_name == 'CONTAINER_NAME'
  | filter resource.pod_name == 'POD_NAME'
  | filter resource.namespace_name == 'NAMESPACE_NAME'
  | every 1m

Then do one of the following:

  • If this ratio is high, the CPU limit on the container might be too low for what the container needs, so consider increasing the CPU limit on the container.
  • If the ratio is not high for any container in the Pod, check the CPU utilization on the node, as described in the next section.
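
If the Kubernetes Metrics API is available in your cluster, you can also cross-check the current CPU usage of each container from the command line. This is a sketch, not a replacement for the metric above:

  # Show current CPU (and memory) usage per container in the Pod.
  # Requires the Metrics API (metrics-server) to be serving in the cluster.
  kubectl top pod POD_NAME \
    --containers \
    --namespace NAMESPACE_NAME \
    --kubeconfig CLUSTER_KUBECONFIG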

High CPU utilization on the node

If the ratio of CPU utilization to the CPU limit is not high for any individual container of the Pod, then the node might not have enough CPU cycles to allocate to the set of containers running on it. So follow these steps to check the ratio of actual CPU usage to the allocatable CPUs on the node:

  1. Get the node for the Pod that is working slow:

    kubectl get pod --kubeconfig CLUSTER_KUBECONFIG --namespace NAMESPACE POD --output wide
    
  2. In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query:

    fetch k8s_node
    | metric 'kubernetes.io/anthos/node/cpu/allocatable_utilization'
    | group_by 1m,
        [value_allocatable_utilization_mean: mean(value.allocatable_utilization)]
    | filter resource.cluster_name == 'CLUSTER_NAME'
    | filter resource.node_name == 'NODE_NAME'
    | every 1m
    

    If this ratio is high (>=0.8), then it means that the node doesn't have enough CPU cycles and is oversubscribed. So follow these steps to check the CPU usage for all the other Pods on that node, and investigate whether there is another container using more CPU.

    1. Get all Pods on the node:
    kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=NODE_NAME
    
    2. Check the CPU utilization on each container:
    fetch k8s_container
    | metric 'kubernetes.io/anthos/container/cpu/limit_utilization'
    | group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]
    | filter resource.cluster_name == 'CLUSTER_NAME'
    | filter resource.container_name == 'CONTAINER_NAME'
    | filter resource.pod_name == 'POD_NAME'
    | filter resource.namespace_name == 'NAMESPACE_NAME'
    | every 1m
    

    If there is another container using high CPU on the node, increase the CPU requests and limits on the container that's working slow. This recreates the Pod so that it can be scheduled on a node with the required CPU cycles.

If it's a system Pod that's working slow, contact Google support.
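
Whether the culprit is a user or a system Pod, you can also cross-check node-level CPU pressure from the command line, assuming the Metrics API is available in your cluster:

  # Current CPU and memory usage of the node, as a percentage of allocatable.
  kubectl top node NODE_NAME --kubeconfig CLUSTER_KUBECONFIG

  # CPU and memory requests and limits already allocated on the node.
  kubectl describe node NODE_NAME --kubeconfig CLUSTER_KUBECONFIG | grep -A 10 "Allocated resources"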

CPU oversubscription at the vSphere level

If the CPU consumption is not high on either the node or the Pod, and the container is still slow, then the VM might be oversubscribed at the vSphere level. Hence, the node is unable to get the expected CPU cycles from the underlying virtualization.

Follow these steps to check if the VM is oversubscribed. If oversubscription is detected, try the following:

  • Move some VMs to other hosts.
  • Evaluate and decrease the number of vCPUs per VM on the host.
  • Allocate more resources to the GKE Enterprise VMs.
  • Increase the CPU requests and limits on the container, as shown in the sketch after this list. This will recreate the Pod on another node to get the required CPU cycles.
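
The following sketch shows one way to raise a container's CPU requests and limits when the workload is managed by a Deployment. DEPLOYMENT_NAME and CONTAINER_NAME are hypothetical placeholders, and the values are examples to adjust to what your workload actually needs.

  # Raise the CPU request and limit for one container of a Deployment.
  # DEPLOYMENT_NAME and CONTAINER_NAME are placeholders; pick values that
  # match what the container actually needs.
  kubectl set resources deployment DEPLOYMENT_NAME \
    --containers=CONTAINER_NAME \
    --requests=cpu=500m \
    --limits=cpu=1 \
    --namespace NAMESPACE_NAME \
    --kubeconfig CLUSTER_KUBECONFIG

Changing the Pod template this way triggers a rolling restart, so the Pods are recreated and rescheduled with the new requests taken into account.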

Pod gets OOMkilled (Out of Memory-Killed)

Pods can get OOMkilled due to memory leaks or poor configuration of memory requests and limits on the containers. The following are some potential reasons:

High memory usage on the container

A Pod can get OOMkilled if any container in the Pod consumes more than its allocated memory. So check the ratio of the container's memory usage to its memory limit.

In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query:

fetch k8s_container
| metric 'kubernetes.io/anthos/container/memory/limit_utilization'
| filter (metric.memory_type == 'non-evictable')
| group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]
| filter resource.cluster_name == 'CLUSTER_NAME'
| filter resource.container_name == 'CONTAINER_NAME'
| filter resource.pod_name == 'POD_NAME'
| filter resource.namespace_name == 'NAMESPACE_NAME'
| every 1m

Then do one of the following:

  • If this ratio is high, the container is running close to its memory limit; increase the memory limit on the container, or investigate the container for a memory leak.
  • If the ratio is not high for any container in the Pod, check the memory usage on the node, as described in the next section.
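
To confirm from the command line that a container was actually OOMkilled, you can inspect the last termination reason recorded in the Pod status, as in this sketch:

  # Print each container's name, restart count, and the reason its previous
  # instance terminated. "OOMKilled" confirms an out-of-memory kill.
  kubectl get pod POD_NAME \
    --namespace NAMESPACE_NAME \
    --kubeconfig CLUSTER_KUBECONFIG \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'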

High memory usage on the node

A Pod can get OOMkilled if the memory usage of all the Pods running on the node exceeds the available memory. So check if the MemoryPressure condition on the node is True.

  1. Run the following command and inspect the Conditions section:

    kubectl describe nodes --kubeconfig CLUSTER_KUBECONFIG NODE_NAME
    
  2. If the MemoryPressure condition is True, then check memory utilization on the node:

    fetch k8s_node
    | metric 'kubernetes.io/anthos/node/memory/allocatable_utilization'
    | filter (metric.memory_type == 'non-evictable')
    | group_by 1m,
        [value_allocatable_utilization_mean: mean(value.allocatable_utilization)]
    | filter resource.cluster_name == 'CLUSTER_NAME'
    | filter resource.node_name == 'NODE_NAME'
    | every 1m
    

    If this ratio is high (>= 0.8), then it means that the node doesn't have enough memory to allocate to the Pod, possibly due to some processes or other Pods consuming high memory.

  3. In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query to check memory usage for the containers on the node:

    fetch k8s_node
    | metric 'kubernetes.io/anthos/container_memory_usage_bytes'
    | filter resource.cluster_name == 'CLUSTER_NAME'
    | filter resource.node_name == 'NODE_NAME'
    | group_by 1m,
        [value_container_memory_usage_bytes_mean:
          mean(value.container_memory_usage_bytes)]
    | every 1m
    

    If there is a container using high memory, investigate the functioning of the container or increase the memory request for the container, if needed.

If it's a system Pod that's consuming high memory, contact Google support.
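
To quickly scan the MemoryPressure condition across all nodes instead of describing each node individually, a jsonpath sketch such as the following can help:

  # List every node with the status of its MemoryPressure condition
  # (True indicates the node is running out of memory).
  kubectl get nodes --kubeconfig CLUSTER_KUBECONFIG \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'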

Additionally, you can enable the autoscaling feature in GKE on VMware to automatically scale up and scale down the node pools based on the demands of your workloads.

Learn how to enable the autoscaler.

Etcd issues

Sometimes your GKE on VMware clusters might experience container failures due to etcd server issues, and you might observe the following:

  • Repeated API server logs of the form:

    etcdserver: request timed out and etcdserver: leader changed

  • Repeated etcd logs of the form:

    W | wal: sync duration of 2.466870869s, expected less than 1s and W | etcdserver: read-only range request * took too long

The following are some potential reasons:

CPU throttling

The etcd server might be slow due to CPU throttling on the etcd server Pod and/or the node on which the etcd server is running. Refer to the steps in the Container becomes slow section to check for any CPU contention issues.

If you detect CPU contention on the etcd server Pod or on the node, add CPUs to the control plane node of the user cluster. Use gkectl update to edit the cpus field in the user cluster configuration file.

Etcd Pod OOMkilled

The etcd Pod might get OOMkilled due to resource contention issues. Refer to the steps in the Pod gets OOMkilled (Out of Memory-Killed) section to check for any memory contention issues with the etcd server Pod and/or the node on which the etcd server is running.

If you detect OOMkills for the etcd Pod, increase the memory available to the control plane node of the user cluster. Use gkectl update to edit the memoryMB field in the user cluster configuration file.
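
For both of the preceding fixes, the general workflow is to edit the user cluster configuration file and then apply it with gkectl update. The following is a sketch; USER_CLUSTER_CONFIG is a placeholder for your user cluster configuration file, and the exact flags may vary with your GKE on VMware version.

  # After editing the cpus or memoryMB fields for the control plane node in
  # the user cluster configuration file, apply the change to the cluster.
  gkectl update cluster \
    --kubeconfig ADMIN_CLUSTER_KUBECONFIG \
    --config USER_CLUSTER_CONFIG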

Disk slowness

If there are no issues with CPU or memory consumption on the etcd server Pod or the node, etcd might still be slow if the underlying datastore is slow or throttled.

Check for the following problems:

  • To check if the etcd server is taking too long to read from or write to the underlying disk:

    1. Fetch the etcd logs:

      kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG logs -n ETCD_POD_NAMESPACE ETCD_POD
      
    2. Look for the entries of the following pattern to detect if etcd is taking too long to read from the disk:

      W | etcdserver: read-only range request "key:\"/registry/configmaps/default/clusterapi-vsphere-controller-manager-leader-election\" " with result "range_response_count:1 size:685" took too long (6.893127339s) to execute

    3. Look for the entries of the following pattern to detect if etcd is taking too long to write to the disk:

      W | wal: sync duration of 2.466870869s, expected less than 1s

    If either or both of the preceding log patterns appear frequently in the etcd logs, it indicates disk slowness (see the log-filtering sketch after this list). Then check the performance of the datastore and disks.

  • To check the etcd metrics:

    1. Fetch the etcd WAL sync latencies:

      In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query:

      fetch k8s_container::kubernetes.io/anthos/etcd_disk_wal_fsync_duration_seconds
      | every 1m
      | filter resource.cluster_name == 'CLUSTER_NAME'
      | filter resource.pod_name == 'POD_NAME'
      | filter resource.namespace_name == 'NAMESPACE_NAME'
      | percentile 99
      
    2. Fetch the etcd write latencies:

      In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query:

      fetch k8s_container::kubernetes.io/anthos/etcd_disk_backend_commit_duration_seconds
      | every 1m
      | filter resource.cluster_name == 'CLUSTER_NAME'
      | filter resource.pod_name == 'POD_NAME'
      | filter resource.namespace_name == 'NAMESPACE_NAME'
      | percentile 99
      

    If p99 for etcd_disk_wal_fsync_duration_seconds is continuously over 10ms, and/or p99 for etcd_disk_backend_commit_duration_seconds is continuously over 25ms, it indicates disk slowness. Then check the performance of the datastore and disks.
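
To look for the disk-slowness log patterns described above without scanning the full log output, you can filter the etcd logs, as in the following sketch:

  # Filter the etcd logs for the slow-read and slow-fsync patterns.
  kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG logs -n ETCD_POD_NAMESPACE ETCD_POD \
    | grep -E "took too long|wal: sync duration"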

Read/write latencies on the VM disk

Follow these steps to check for read/write latencies on the VM virtual disk:

  1. Identify the node for the slow etcd Pod:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods -n ETCD_POD_NAMESPACE ETCD_POD -o wide
    
  2. Log in to vSphere and select the VM identified in the preceding step. In vSphere, go to Monitor > Performance > Advanced, and select Virtual Disk from the View section to identify the read and write latencies for the virtual disks.

    If the virtual disk read/write latency is high:

    • Examine other VMs running on the datastore to check for the high input/output operations per second (IOPS) usage. If any VM shows spikes in the IOPS, assess the functioning of that VM.
    • Consult your lab/infra team to make sure that the read and write bandwidth is not throttled or limited at any point.
    • Consult your lab/infra team to identify the disk performance and storage performance issues, if any.

For more information, see the best practices for scaling your resources.

API server issues

If the containers in your GKE on VMware environment experience latency while communicating with the API server, or if kubectl commands fail or take too long to respond, this might indicate issues with the API server.

The following are some potential reasons:

High volume of API requests

The API server might be slow to respond if the frequency and volume of requests to it are too high. The slow response time might persist even after the API server starts throttling the requests. So check the rate of API requests on the API server.

In Google Cloud Console > Monitoring > Metrics explorer, in the MQL editor, run the following query:

fetch k8s_container::kubernetes.io/anthos/apiserver_request_total
| filter resource.cluster_name == 'CLUSTER_NAME'
| filter resource.pod_name == 'APISERVER_POD_NAME'
| filter resource.namespace_name == 'NAMESPACE_NAME'
| align rate(1m)
| every 1m
| group_by [metric.verb]

If there is an unexpected increase in the API requests, use Cloud Audit Logs to identify the Pod that might be querying the API server too often.

  • If it's a system Pod, contact Google support.
  • If it's a user Pod, investigate further to determine if the API requests are expected.
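
As a quick, point-in-time cross-check of request counts, you can also read the API server's own metrics endpoint directly; this is a sketch and assumes your credentials allow access to the /metrics path:

  # Sample the apiserver_request_total counters directly from the API server.
  # The counters are cumulative, so compare two samples taken a minute apart
  # to estimate the request rate.
  kubectl get --raw /metrics --kubeconfig CLUSTER_KUBECONFIG \
    | grep '^apiserver_request_total' \
    | head -n 20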

CPU throttling

A high request rate on the API server can lead to CPU throttling. The API server might then become slow due to CPU contention on the API server Pod and/or the node.

Refer to the Container becomes slow section to check for any CPU contention issues with the Pod and/or the node.

API server Pod OOMkilled

The API server Pod might get OOMkilled due to resource contention issues. Refer to the steps in the Pod gets OOMkilled (Out of Memory-Killed) section to check for any memory contention issues with the Pod and/or the node.

Slow etcd responses

The API server relies on communication with the etcd cluster to serve read/write requests to clients. If etcd is slow or unresponsive, the API server also becomes slow.

Fetch the logs of the API server to check if the API server is slow because of the etcd issues:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG logs -n APISERVER_NAMESPACE APISERVER_POD_NAME

If you observe recurring log entries like etcdserver: request timed out or etcdserver: leader changed, follow the steps in Etcd issues to resolve any disk-related issues.
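
To narrow the API server logs down to these etcd-related errors, a simple filter such as the following can be used:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG logs -n APISERVER_NAMESPACE APISERVER_POD_NAME \
  | grep -E "etcdserver: request timed out|etcdserver: leader changed"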

What's next

If you need additional assistance, reach out to Cloud Customer Care.