Insights: Understanding Anomalies
Stackdriver Monitoring continuously analyzes the monitoring data from your infrastructure. Based on this analysis, Stackdriver Monitoring sends you performance summaries containing recommended actions to optimize your application. The recommended actions are described below.
Investigate transient performance anomalies
This recommendation is based on the observation of short-lived anomalies in the performance of your application. Often, these anomalies are benign, resulting from expected events like a database backup or a code deployment. However, the anomalies might also reflect an instability in your application that could ripple through other parts of your system. Investigate these episodes, understand their cause and impact, and address them with software or deployment changes as necessary. Adjust your alerting policies so that you receive alerts on similar incidents in the future.
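To make "short-lived anomaly" concrete, the following is a minimal sketch of one way to flag brief deviations in a metric you have exported yourself. It is not how Stackdriver Monitoring detects anomalies; the function name `find_transient_anomalies`, the median/MAD technique, the `threshold` and `max_run` parameters, and the sample latencies are all assumptions chosen for illustration.

```python
from statistics import median


def find_transient_anomalies(samples, threshold=5.0, max_run=3):
    """Return (start, end) index pairs of short runs that deviate from the median.

    Hypothetical helper: threshold and max_run are illustrative defaults.
    """
    center = median(samples)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(x - center) for x in samples)
    if mad == 0:
        return []  # no measurable spread, so nothing can be scored
    flagged = [abs(x - center) / mad > threshold for x in samples]

    episodes = []
    start = None
    # The trailing False is a sentinel that closes a run ending at the last sample.
    for i, is_outlier in enumerate(flagged + [False]):
        if is_outlier and start is None:
            start = i
        elif not is_outlier and start is not None:
            if i - start <= max_run:  # keep only short-lived episodes
                episodes.append((start, i - 1))
            start = None
    return episodes


# Made-up request latencies in milliseconds, one sample per minute.
latency_ms = [20, 21, 19, 22, 20, 95, 98, 21, 20, 19, 22, 21]
episodes = find_transient_anomalies(latency_ms)
print(episodes)
```

Comparing the timestamps of each episode against your backup and deployment schedules is a quick way to separate the benign cases from the ones that need investigation.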
Reconfigure unbalanced clusters
This recommendation is based on the observation of variation in behavior across members in a cluster. Because members of a cluster are expected to exhibit uniform behavior, variation can indicate problems in the cluster. Check that clusters are configured properly. Confirm that all members are using the same software and the same versions; that they are running the same services; and that they are running on equivalent machines. Check that the workload is balanced evenly, and that no member is running out of resources.
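As a rough illustration of checking for uneven load, the sketch below compares each member's load against the cluster median and reports members that differ by more than a relative tolerance. The function name `find_unbalanced_members`, the 25% tolerance, and the member names and values are hypothetical; the actual analysis performed by Stackdriver Monitoring is not described here.

```python
from statistics import median


def find_unbalanced_members(load_by_member, tolerance=0.25):
    """Return members whose load differs from the cluster median by more than tolerance.

    Hypothetical helper: the result maps member name to relative deviation.
    """
    cluster_median = median(load_by_member.values())
    if cluster_median == 0:
        return {}  # relative deviation is undefined for an idle cluster
    outliers = {}
    for member, load in load_by_member.items():
        deviation = (load - cluster_median) / cluster_median
        if abs(deviation) > tolerance:
            outliers[member] = round(deviation, 2)
    return outliers


# Made-up average CPU utilization per cluster member (0.0 to 1.0).
cpu_by_member = {"web-1": 0.42, "web-2": 0.40, "web-3": 0.44, "web-4": 0.81}
unbalanced = find_unbalanced_members(cpu_by_member)
print(unbalanced)
```

The median is used instead of the mean so that a single overloaded member does not shift the baseline it is being compared against.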
Replace contended instances
This recommendation is based on the observation that instances are exhibiting signs of contention with other instances on the same physical host. Consider replacing the contended instances or deploying new instances to different physical infrastructure that has less contention and more consistent performance.
Upsize busy instances
This recommendation is based on the observation that certain instances are being fully utilized much of the time. Stackdriver Monitoring samples CPU utilization periodically and computes the percentage of time the instance is busy. We consider an instance to be "busy" or "fully utilized" when the system reports that the CPU is idle only a tiny fraction of the time. You might also observe that some instances show indications of being throttled, which suggests that your maximum available CPU is limited. Consider re-deploying these instances on larger hardware.
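The busy-time calculation described above can be sketched as follows. The function name `busy_fraction` and the sample data are hypothetical, and the 5% idle cutoff is an assumption: this document does not state the threshold that Stackdriver Monitoring uses.

```python
def busy_fraction(cpu_idle_samples, idle_threshold=0.05):
    """Return the fraction of samples in which the CPU was idle less than idle_threshold.

    Hypothetical helper: idle_threshold is an assumed cutoff, not a documented value.
    """
    if not cpu_idle_samples:
        return 0.0
    busy = sum(1 for idle in cpu_idle_samples if idle < idle_threshold)
    return busy / len(cpu_idle_samples)


# Made-up periodic samples of the CPU idle fraction for one instance.
idle = [0.01, 0.02, 0.40, 0.03, 0.00, 0.01, 0.55, 0.02]
fraction = busy_fraction(idle)
print(f"{fraction:.0%}")
```

An instance that is busy in most of its samples over a long window is a candidate for a larger machine type.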
Prevent resource exhaustion
This recommendation is based on our prediction that, by current usage trends, some of your resources will be exhausted soon. The impact of resource exhaustion can vary. Running out of memory can cause individual services to slow down or crash. Running out of disk space can render entire nodes unusable. Running out of disk space on a database can lead to catastrophic system failure. Monitor your resource consumption and add capacity before exhaustion causes serious problems.
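One simple way to project a usage trend forward is a straight-line fit, sketched below under the assumption of evenly spaced hourly samples. The function name `hours_until_full` and the sample data are hypothetical, and the linear model is an illustrative choice; the prediction method used by Stackdriver Monitoring is not specified in this document.

```python
def hours_until_full(used_fraction_by_hour):
    """Estimate hours from the last sample until usage reaches 100%.

    Hypothetical helper: fits a least-squares line to hourly samples of the
    used fraction (0.0 to 1.0). Returns None if usage is flat or shrinking,
    or if there are fewer than two samples.
    """
    n = len(used_fraction_by_hour)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_fraction_by_hour) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, used_fraction_by_hour))
    slope = sxy / sxx  # growth in used fraction per hour
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Solve intercept + slope * t = 1.0, then measure from the last sample.
    t_full = (1.0 - intercept) / slope
    return max(0.0, t_full - (n - 1))


# Made-up hourly samples of disk usage as a fraction of capacity.
disk = [0.70, 0.72, 0.74, 0.76, 0.78]
remaining = hours_until_full(disk)
print(f"about {remaining:.0f} hours until the disk is full")
```

A straight line underestimates the urgency when growth is accelerating, so treat the result as an upper bound on the time you have left in that case.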