Resolving resource limit issues in Cloud Service Mesh
This section explains common Cloud Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.
Cloud Service Mesh resource limit problems can be caused by any of the following:
- LimitRange objects created in the istio-system namespace or any namespace with automatic sidecar injection enabled.
- User-defined limits that are set too low.
- Nodes run out of memory or other resources.
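
To check the first cause, you can list the LimitRange objects in istio-system and in the namespaces labeled for injection. The following is a minimal sketch, not an official procedure. The function names are hypothetical, and it assumes injection is enabled through the commonly used istio-injection=enabled or istio.io/rev namespace labels, which this page does not specify.

```shell
# Sketch: find LimitRange objects that could constrain sidecar or istiod resources.
# Assumption: injection is enabled with the istio-injection=enabled label or
# with an istio.io/rev label (any value). Adjust the selectors for your mesh.

# Print the names of namespaces that match a label selector, one per line.
namespaces_with_label() {
  kubectl get namespaces --selector "$1" \
    --output jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Print every LimitRange in the given namespaces as "namespace/name".
limitranges_in() {
  for ns in "$@"; do
    kubectl get limitrange --namespace "$ns" \
      --output jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
  done
}

# Check istio-system plus every namespace labeled for injection.
# The command substitutions are left unquoted on purpose so that each
# namespace name becomes a separate argument.
check_mesh_limitranges() {
  limitranges_in istio-system \
    $(namespaces_with_label 'istio-injection=enabled') \
    $(namespaces_with_label 'istio.io/rev') | sort -u
}
```

Running check_mesh_limitranges prints one line per LimitRange found. Empty output suggests that LimitRange objects are not the cause, and you can then inspect any hit with kubectl describe limitrange NAME --namespace NAMESPACE.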
Potential symptoms of resource problems:
- Cloud Service Mesh repeatedly not receiving configuration from the control plane, indicated by the error Envoy proxy NOT ready. Seeing this error a few times at startup is normal, but otherwise it is a concern.
- Networking problems with some pods or nodes that become unreachable.
- istioctl proxy-status showing STALE statuses in the output.
- OOMKilled messages in the logs of a node.
- Memory usage by containers: kubectl top pod POD_NAME --containers
- Memory usage by pods inside a node: kubectl top node my-node
- Envoy out of memory: kubectl get pods shows status OOMKilled in the output.
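
Two of these symptoms can be checked by filtering command output. The helpers below are a rough sketch with hypothetical function names. They read the command output from standard input and assume the default column layout of kubectl get pods for a single namespace, where STATUS is the third column.

```shell
# Sketch: filter command output for two of the symptoms listed above.

# Count the lines of `istioctl proxy-status` output that contain STALE.
# Each proxy occupies one line, so this is the number of proxies with at
# least one stale configuration type. Prints 0 when there are none.
count_stale_proxies() {
  grep -c 'STALE' || true
}

# Print the names of pods whose STATUS column is OOMKilled.
# Assumes `kubectl get pods` output for one namespace (NAME READY STATUS ...).
# With --all-namespaces the columns shift right by one.
oomkilled_pods() {
  awk 'NR > 1 && $3 == "OOMKilled" { print $1 }'
}

# Example usage (requires access to the cluster):
#   istioctl proxy-status | count_stale_proxies
#   kubectl get pods | oomkilled_pods
```

A nonzero count from count_stale_proxies, or any pod name printed by oomkilled_pods, is a reason to look at the memory usage commands listed above.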
Sidecars take a long time to receive configuration
Slow configuration propagation can occur due to insufficient resources allocated to istiod or an excessively large cluster size.
There are several possible solutions to this problem:
- For in-cluster Cloud Service Mesh, if your monitoring tools (Prometheus, Stackdriver, etc.) show high utilization of a resource by istiod, increase the allocation of that resource, for example by raising the CPU or memory limit of the istiod deployment. This is a temporary solution, and we recommend that you investigate methods for reducing resource consumption.
- If you encounter this issue in a large cluster or deployment, reduce the amount of configuration state pushed to each proxy by configuring Sidecar resources.
- For in-cluster Cloud Service Mesh, if the problem persists, try horizontally scaling istiod.
- If all other troubleshooting steps fail to resolve the problem, report a bug detailing your deployment and the observed problems. Follow these steps to include a CPU/Memory profile in the bug report if possible, along with a detailed description of cluster size, number of pods, and number of services.
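
The first three solutions can be sketched as shell helpers. This is an illustration under stated assumptions rather than a recommended configuration. The function names and the example values are hypothetical, and the istiod deployment name is passed as an argument because in-cluster installations may use a revision-specific name. The Sidecar manifest follows the common Istio pattern of limiting each proxy to services in its own namespace and in istio-system, so adjust the hosts to match what your workloads actually call.

```shell
# Sketch: helpers for the first three solutions above (in-cluster control plane).

# Raise the CPU and memory limits of an istiod deployment.
# Usage: raise_istiod_limits DEPLOYMENT CPU MEMORY   (for example: istiod 2 4Gi)
# Assumption: direct edits are acceptable in your setup. An installer or
# upgrade may overwrite them.
raise_istiod_limits() {
  kubectl --namespace istio-system set resources deployment "$1" \
    --limits="cpu=$2,memory=$3"
}

# Print a Sidecar manifest that restricts the configuration pushed to
# proxies in one namespace to that namespace and istio-system.
# Usage: sidecar_manifest NAMESPACE | kubectl apply -f -
sidecar_manifest() {
  cat <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: $1
spec:
  egress:
  - hosts:
    - "./*"
    - "istio-system/*"
EOF
}

# Scale an istiod deployment horizontally.
# Usage: scale_istiod DEPLOYMENT REPLICAS
# Note: if a HorizontalPodAutoscaler manages the deployment, it may
# override a manually set replica count.
scale_istiod() {
  kubectl --namespace istio-system scale deployment "$1" --replicas="$2"
}
```

For example, sidecar_manifest my-namespace prints the manifest so that you can review it before piping it to kubectl apply -f -, where my-namespace is a placeholder.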