Resolving resource limit issues in Anthos Service Mesh

This section explains common Anthos Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.

Anthos Service Mesh resource limit problems can be caused by any of the following:

  • LimitRange objects created in the istio-system namespace or any namespace with automatic sidecar injection enabled.
  • User-defined limits that are set too low.
  • Nodes run out of memory or other resources.
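If you suspect a LimitRange object is constraining sidecar resources, you can list and inspect the LimitRange objects in the affected namespaces. This is a minimal sketch; replace the namespace and object name placeholders with values from your cluster:

```shell
# List LimitRange objects in istio-system and in a namespace with
# sidecar injection enabled ("default" here is a placeholder).
kubectl get limitrange -n istio-system
kubectl get limitrange -n default

# Inspect a LimitRange to see the limits it imposes on containers.
# LIMIT_RANGE_NAME is a placeholder for a name from the output above.
kubectl describe limitrange LIMIT_RANGE_NAME -n default
```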

Potential symptoms of resource problems:

  • Sidecar proxies repeatedly fail to receive configuration from Istiod, indicated by the error Envoy proxy NOT ready. Seeing this error a few times at startup is normal, but otherwise it is a concern.
  • Networking problems with some pods or nodes that become unreachable.
  • istioctl proxy-status showing STALE statuses in the output.
  • OOMKilled messages in the logs of a node.
  • High memory usage by containers, as shown by kubectl top pod POD_NAME --containers.
  • High memory usage on a node, as shown by kubectl top node NODE_NAME.
  • Envoy running out of memory: kubectl get pods shows the status OOMKilled in the output.
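The checks above can be run from a shell. For example (POD_NAME and NODE_NAME are placeholders):

```shell
# Per-container memory usage for a pod (requires metrics-server).
kubectl top pod POD_NAME --containers

# Overall resource usage for a node.
kubectl top node NODE_NAME

# Sync status of each sidecar; STALE indicates configuration that
# Istiod has not successfully pushed to the proxy.
istioctl proxy-status

# Find pods whose containers were killed for exceeding memory limits.
kubectl get pods --all-namespaces | grep OOMKilled
```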

Istio sidecars take a long time to receive configuration

Slow configuration propagation can occur due to insufficient resources allocated to Istiod or an excessively large cluster size.

There are several possible solutions to this problem:

  1. If your monitoring tools (Prometheus, Stackdriver, and so on) show high utilization of a resource by Istiod, increase the allocation of that resource, for example increase the CPU or memory limit of the Istiod deployment. This is a temporary solution and we recommend that you investigate methods for reducing resource consumption.

  2. If you encounter this issue in a large cluster or deployment, reduce the amount of configuration state pushed to each proxy by configuring Sidecar resources.

  3. If the problem persists, try horizontally scaling Istiod.

  4. If all other troubleshooting steps fail to resolve the problem, report a bug detailing your deployment and the observed problems. Follow these steps to include a CPU/Memory profile in the bug report if possible, along with a detailed description of cluster size, number of pods, number of services, etc.
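The first three remediations above can be sketched as kubectl commands. This is an illustrative sketch, not a prescription: it assumes Istiod runs as the istiod deployment (container name discovery) in istio-system, the resource values are examples only, and YOUR_NAMESPACE is a placeholder. Size the limits from what your monitoring actually shows before applying anything:

```shell
# 1. Raise Istiod's resource limits (example values only).
kubectl -n istio-system patch deployment istiod --type merge -p '
spec:
  template:
    spec:
      containers:
      - name: discovery
        resources:
          limits:
            cpu: "2"
            memory: 4Gi'

# 2. Reduce the configuration pushed to proxies in a namespace with a
#    Sidecar resource; here, proxies only receive configuration for
#    hosts in their own namespace and in istio-system.
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: YOUR_NAMESPACE
spec:
  egress:
  - hosts:
    - "./*"
    - "istio-system/*"
EOF

# 3. Scale Istiod horizontally.
kubectl -n istio-system scale deployment istiod --replicas=2
```

Scoping egress with a namespace-wide Sidecar resource is usually the most durable of the three fixes, because it reduces both Istiod push load and per-proxy memory at the same time.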