This page dives deeper into the fleet and team resource utilization metrics described in Use the fleet overview and Use the team overview. These metrics describe how effectively your clusters are utilizing the physically available resources you pay for or resources that you allocate on on-premises hardware. You can use this information to understand resource utilization effectiveness at scale, on a fleet or team scope level. This can help you either optimize cluster size and resource allocation across clusters and namespaces, or optimize how application teams request and reserve resources.
This page explains how these metrics are calculated and provides some tips for how to use these metrics to optimize resource usage.
Use resource utilization metrics
The following tips can help you use the metrics in the console to identify and address problems:
- If your fleet's Total CPU/Memory/Disk utilization indicates unexpectedly high or low utilization over the last seven days, always check the corresponding CPU/Memory/Disk utilization by fleet chart to evaluate if the unexpected utilization is constant or caused by usage spikes.
- If Top CPU/Memory/Disk utilization by cluster indicates individual clusters that behave differently than the rest, consider investigating those particular clusters more closely. Consider resizing the clusters if possible.
- If Top CPU/Memory/Disk utilization by namespace shows an unexpected spike over the last seven days, consider investigating if a specific workload is causing the spike. A possible solution might be to redistribute workloads across resources.
- CPU/Memory/Disk utilization by fleet lets you observe the ratio between used and requested resources. A big difference between the two might mean that application teams are requesting and reserving too many resources.
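As a rough illustration of the last tip, the following Python sketch compares used, requested, and allocatable values such as those shown in the chart. The function names, the sample numbers, and the 0.5 threshold are hypothetical and are not part of the console or any API; choose a threshold that suits your own workloads.

```python
def request_headroom(used: float, requested: float, allocatable: float) -> dict:
    """Summarize how used, requested, and allocatable resources relate."""
    if requested <= 0 or allocatable <= 0:
        raise ValueError("requested and allocatable must be positive")
    return {
        # How much of what teams reserved is actually consumed.
        "used_to_requested": used / requested,
        # How much of the schedulable capacity is reserved.
        "requested_to_allocatable": requested / allocatable,
    }


def looks_over_requested(used: float, requested: float, threshold: float = 0.5) -> bool:
    """Return True when usage is below `threshold` of the request (assumed cutoff)."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return used / requested < threshold


# Hypothetical values in CPU cores.
summary = request_headroom(used=12.0, requested=48.0, allocatable=64.0)
print(summary)
print(looks_over_requested(12.0, 48.0))
```

In this made-up example, only a quarter of the requested CPU is used, which is the kind of gap that might justify reviewing resource requests with the application team.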
Understand resource utilization metrics
The following metrics are provided in the fleet and team scope overview, calculated using information from Cloud Monitoring on your fleet clusters.
CPU metrics
- Total CPU utilization:
- For the fleet Overview dashboard, the average over all points in time in a given time window, where each point in time is the ratio of used to allocatable resources across all clusters that are registered to a fleet.
- Allocatable: The amount of CPU allocated to all nodes across all clusters that are registered to a fleet. Calculated from the node/cpu/allocatable_cores metric.
- Used: The amount of CPU used by all containers across all clusters that are registered to a fleet. Calculated from the container/cpu/core_usage_time metric.
- For the team Monitoring dashboard, the average over all points in time in a given time window, where each point in time is the ratio of used to requested resources across all namespaces that are associated with a team scope.
- Requested: The amount of CPU requested by all containers across all namespaces that are associated with a team scope. Calculated from the container/cpu/request_cores metric.
- Used: The amount of CPU used by all containers across all namespaces that are associated with a team scope. Calculated from the container/cpu/core_usage_time metric.
- CPU utilization by fleet/team:
- For the fleet level, the relationship between used, requested, and allocatable resources.
- Used: The amount of CPU used by all containers across all clusters that are registered to a fleet. Calculated from the container/cpu/core_usage_time metric.
- Requested: The amount of CPU requested by all containers across all clusters that are registered to a fleet. Calculated from the container/cpu/request_cores metric.
- Allocatable: The amount of CPU allocated to all nodes across all clusters that are registered to a fleet. Calculated from the node/cpu/allocatable_cores metric.
- For the team level, the relationship between used resources, requested resources, and the resource limit.
- Used: The amount of CPU used by all containers across all namespaces that are associated with a team scope. Calculated from the container/cpu/core_usage_time metric.
- Requested: The amount of CPU requested by all containers across all namespaces that are associated with a team scope. Calculated from the container/cpu/request_cores metric.
- Limit: The maximum amount of CPU available to all containers across all namespaces that are associated with a team scope. Calculated from the container/cpu/limit_cores metric.
- Top CPU utilization by cluster: Cluster list sorted by the average over all points in time in a given time window, where each point in time is the ratio of used to allocatable resources for a particular cluster.
- Allocatable: The amount of CPU allocated to all nodes in a cluster. Calculated from the node/cpu/allocatable_cores metric.
- Used: The amount of CPU used by all containers in a cluster. Calculated from the container/cpu/core_usage_time metric.
- Top CPU utilization by namespace: Namespace list sorted by the average over all points in time in a given time window, where each point in time is the ratio of used to requested resources for a particular namespace.
- Used: The amount of CPU used by all containers in a namespace. Calculated from the container/cpu/core_usage_time metric.
- Requested: The amount of CPU requested by all containers in a namespace. Calculated from the container/cpu/request_cores metric.
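The Total CPU utilization calculation described above can be sketched in Python as an average of per-point ratios. This is a minimal illustration, not the console's implementation: the cluster names and sample values are made up, the helper names are hypothetical, and the used values are assumed to be already expressed in cores (the container/cpu/core_usage_time metric is cumulative CPU time, so it would first need to be converted to a rate).

```python
from statistics import mean


def average_utilization(used, capacity):
    """Average of per-point used/capacity ratios over a time window."""
    if len(used) != len(capacity):
        raise ValueError("series must have the same number of points")
    # Skip points with no capacity to avoid dividing by zero.
    ratios = [u / c for u, c in zip(used, capacity) if c > 0]
    return mean(ratios) if ratios else None


def fleet_series(clusters, key):
    """Sum one series across all clusters, point by point."""
    return [sum(values) for values in zip(*(c[key] for c in clusters.values()))]


# Hypothetical samples in cores, one value per point in time.
clusters = {
    "cluster-a": {"used": [2.0, 4.0, 6.0], "allocatable": [8.0, 8.0, 8.0]},
    "cluster-b": {"used": [6.0, 6.0, 6.0], "allocatable": [8.0, 8.0, 8.0]},
}

fleet_used = fleet_series(clusters, "used")
fleet_allocatable = fleet_series(clusters, "allocatable")
total_cpu_utilization = average_utilization(fleet_used, fleet_allocatable)
print(f"Total CPU utilization: {total_cpu_utilization:.1%}")
```

Averaging the per-point ratios gives every point in the window equal weight, which can differ from dividing total usage by total capacity when capacity changes during the window.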
Memory metrics
- Total memory utilization:
- For the fleet Overview dashboard, the average over all points in time in a given time window, where each point in time is the ratio of used to allocatable resources across all clusters that are registered to a fleet.
- Allocatable: The amount of memory allocated to all nodes across all clusters that are registered to a fleet. Calculated from the node/memory/allocatable_bytes metric.
- Used: The amount of non-evictable memory used by all containers across all clusters that are registered to a fleet. Calculated from the container/memory/used_bytes metric.
- For the team Monitoring dashboard, the average over all points in time in a given time window, where each point in time is the ratio of used to requested resources across all namespaces that are associated with a team scope.
- Requested: The amount of memory requested by all containers across all namespaces that are associated with a scope. Calculated from the container/memory/request_bytes metric.
- Used: The amount of non-evictable memory used by all containers across all namespaces that are associated with a scope. Calculated from the container/memory/used_bytes metric.
- Memory utilization by fleet/team:
- For the fleet level, the relationship between used, requested, and allocatable resources.
- Used: The amount of non-evictable memory used by all containers across all clusters that are registered to a fleet. Calculated from the container/memory/used_bytes metric.
- Requested: The amount of memory requested by all containers across all clusters that are registered to a fleet. Calculated from the container/memory/request_bytes metric.
- Allocatable: The amount of memory allocated to all nodes across all clusters that are registered to a fleet. Calculated from the node/memory/allocatable_bytes metric.
- For the team level, the relationship between used resources, requested resources, and the resource limit.
- Used: The amount of non-evictable memory used by all containers across all namespaces that are associated with a scope. Calculated from the container/memory/used_bytes metric.
- Requested: The amount of memory requested by all containers across all namespaces that are associated with a scope. Calculated from the container/memory/request_bytes metric.
- Limit: The maximum amount of memory available to all containers across all namespaces that are associated with a scope. Calculated from the container/memory/limit_bytes metric.
- Top memory utilization by cluster: Cluster list sorted by the average over all points in time in a given time window, where each point in time is the ratio of used to allocatable resources for a particular cluster.
- Allocatable: The amount of memory allocated to all nodes in a cluster. Calculated from the node/memory/allocatable_bytes metric.
- Used: The amount of non-evictable memory used by all containers in a cluster. Calculated from the container/memory/used_bytes metric.
- Top memory utilization by namespace: Namespace list sorted by the average over all points in time in a given time window, where each point in time is the ratio of used to requested resources for a particular namespace.
- Used: The amount of non-evictable memory used by all containers in a namespace. Calculated from the container/memory/used_bytes metric.
- Requested: The amount of memory requested by all containers in a namespace. Calculated from the container/memory/request_bytes metric.
Disk metrics
- Total disk utilization:
- For the fleet Overview dashboard, the average over all points in time in a given time window, where each point in time is the ratio of used to allocatable resources across all clusters that are registered to a fleet.
- Allocatable: The amount of local ephemeral storage allocated to all nodes across all clusters that are registered to a fleet. Calculated from the node/ephemeral_storage/allocatable_bytes metric.
- Used: The amount of local ephemeral storage used by all containers across all clusters that are registered to a fleet. Calculated from the container/ephemeral_storage/used_bytes metric.
- For the team Monitoring dashboard, the average over all points in time in a given time window, where each point in time is the ratio of used to requested resources across all namespaces that are associated with a team scope.
- Requested: The amount of local ephemeral storage requested by all containers across all namespaces that are associated with a scope. Calculated from the container/ephemeral_storage/request_bytes metric.
- Used: The amount of local ephemeral storage used by all containers across all namespaces that are associated with a scope. Calculated from the container/ephemeral_storage/used_bytes metric.
- Disk utilization by fleet/team:
- For the fleet level, the relationship between used, requested, and allocatable resources.
- Used: The amount of local ephemeral storage used by all containers across all clusters that are registered to a fleet. Calculated from the container/ephemeral_storage/used_bytes metric.
- Requested: The amount of local ephemeral storage requested by all containers across all clusters that are registered to a fleet. Calculated from the container/ephemeral_storage/request_bytes metric.
- Allocatable: The amount of local ephemeral storage allocated to all nodes across all clusters that are registered to a fleet. Calculated from the node/ephemeral_storage/allocatable_bytes metric.
- For the team level, the relationship between used resources, requested resources, and the resource limit.
- Used: The amount of local ephemeral storage used by all containers across all namespaces that are associated with a scope. Calculated from the container/ephemeral_storage/used_bytes metric.
- Requested: The amount of local ephemeral storage requested by all containers across all namespaces that are associated with a scope. Calculated from the container/ephemeral_storage/request_bytes metric.
- Limit: The maximum amount of local ephemeral storage available to all containers across all namespaces that are associated with a scope. Calculated from the container/ephemeral_storage/limit_bytes metric.
- Top disk utilization by cluster: Cluster list sorted by the average over all points in time in a given time window, where each point in time is the ratio of used to allocatable resources for a particular cluster.
- Allocatable: The amount of local ephemeral storage allocated to all nodes in a cluster. Calculated from the node/ephemeral_storage/allocatable_bytes metric.
- Used: The amount of local ephemeral storage used by all containers in a cluster. Calculated from the container/ephemeral_storage/used_bytes metric.
- Top disk utilization by namespace: Namespace list sorted by the average over all points in time in a given time window, where each point in time is the ratio of used to requested resources for a particular namespace.
- Used: The amount of local ephemeral storage used by all containers in a namespace. Calculated from the container/ephemeral_storage/used_bytes metric.
- Requested: The amount of local ephemeral storage requested by all containers in a namespace. Calculated from the container/ephemeral_storage/request_bytes metric.
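The "Top utilization by namespace" lists for CPU, memory, and disk all follow the same pattern: average the per-point ratio of used to requested resources, then sort. The following Python sketch illustrates that pattern under stated assumptions. The namespace names and byte values are invented, `rank_by_utilization` is a hypothetical helper, and skipping points with no request is a choice made for this sketch rather than documented console behavior.

```python
from statistics import mean

GIB = 1024 ** 3


def rank_by_utilization(samples):
    """Return (name, average used/requested ratio) pairs, highest first.

    `samples` maps a namespace name to a list of (used, requested) points.
    """
    averages = {}
    for name, points in samples.items():
        # Points with no request are skipped (assumption for this sketch).
        ratios = [used / requested for used, requested in points if requested > 0]
        if ratios:
            averages[name] = mean(ratios)
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (used, requested) samples in bytes.
namespace_samples = {
    "frontend": [(1 * GIB, 4 * GIB), (2 * GIB, 4 * GIB)],
    "backend": [(3 * GIB, 4 * GIB), (3 * GIB, 4 * GIB)],
    "batch": [(1 * GIB, 0)],  # no request set, so it is left out of the ranking
}

top_namespaces = rank_by_utilization(namespace_samples)
for name, ratio in top_namespaces:
    print(f"{name}: {ratio:.1%}")
```

The same helper would work for the cluster lists by passing (used, allocatable) points instead.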
Error distribution by namespace (team-level only)
Namespace list sorted by the highest number of error logs for a given time window. Logs are collected from Cloud Logging.
Restart counts distribution by namespace (team-level only)
Namespace list sorted by the highest number of container restarts for a given time window. Calculated from the container/restart_count metric.
Troubleshooting
Metrics fail to load for new clusters
If you have created new clusters, depending on the time window you select, you may see No Data throughout the Monitoring dashboard, or you may see metrics. For example, if you created a cluster within the last hour, and select a time window of 1 hour or 6 hours, the dashboard may return some metrics for your workloads. However, if you select a time window of 1 day or more, you may see No data displayed throughout the dashboard.
This is because Cloud Monitoring collects data in different periods (intervals) for different time windows. For time windows of 1 hour and 6 hours, data is collected in 1-minute periods. So if your cluster has existed for a few minutes, you will see metrics for these time windows.
For time windows of 1 day and 1 week, Cloud Monitoring collects data in 1-hour periods. If your cluster has existed for less than an hour, you may see no data for these time windows.
If you experience this error, check the dashboard after more time has elapsed since creating the new cluster.
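To make the relationship between time windows and collection periods concrete, here is a simplified Python model of the behavior described in this section. It assumes that a window shows data only once the cluster has existed for at least one complete period. The function names are hypothetical and the model is an approximation for illustration, not a description of how Cloud Monitoring is implemented.

```python
from datetime import timedelta

# Collection periods per time window, as described in this section.
PERIOD_BY_WINDOW = {
    timedelta(hours=1): timedelta(minutes=1),
    timedelta(hours=6): timedelta(minutes=1),
    timedelta(days=1): timedelta(hours=1),
    timedelta(weeks=1): timedelta(hours=1),
}


def complete_periods(cluster_age: timedelta, window: timedelta) -> int:
    """Count whole collection periods the cluster has existed within the window."""
    period = PERIOD_BY_WINDOW[window]
    observed = min(cluster_age, window)
    return observed // period


def likely_has_data(cluster_age: timedelta, window: timedelta) -> bool:
    """Assumed rule: at least one complete period is needed to show data."""
    return complete_periods(cluster_age, window) >= 1


age = timedelta(minutes=20)  # a cluster created 20 minutes ago
for window in PERIOD_BY_WINDOW:
    print(window, likely_has_data(age, window))
```

Under this model, a 20-minute-old cluster has 20 complete 1-minute periods for the 1-hour and 6-hour windows but no complete 1-hour period for the 1-day and 1-week windows.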