This document shows you how to diagnose and mitigate CPU, memory, and storage performance issues on Compute Engine virtual machine (VM) instances.
Before you begin
- If you want to use the command-line examples in this guide, do the following:
- Install the Ops Agent to view full VM performance metrics, such as memory and disk space utilization.
View performance metrics
To view performance metrics for a VM, use the Cloud Monitoring observability metrics available in the Google Cloud console.
In the Google Cloud console, go to the VM Instances page.
Click the name of the VM you want to view performance metrics for. The VM instance Details page opens.
Click the Observability tab to open the Observability Overview page.
Explore the VM's performance metrics. View the Overview, CPU, Memory, Network and Disk sections to see detailed metrics about each topic. The following are key metrics that indicate VM performance:
On the Overview page:
CPU Utilization. The percent of CPU used by the VM.
Memory Utilization. The percent of memory used by the VM, excluding disk caches. For Linux VMs, this also excludes kernel memory.
Network Traffic. The average rate of bytes sent and received in one-minute intervals.
New Connections with VMs/External/Google. The estimated number of distinct TCP/UDP flows in one minute, grouped by peer type.
Disk Throughput. The average rate of bytes written to and read from disks.
Disk IOPS. The average rate of I/O read and write operations to disks.
On the Network Summary page:
Sent to VMs/External/Google. The rate of network traffic sent to Google services, VMs, and external destinations, based on a sample of packets. The metric is scaled so that the sum matches the total sent network traffic.
Received from VMs/External/Google. The rate of network traffic received from Google services, VMs, and external sources, based on a sample of packets. The metric is scaled so that the sum matches the total received network traffic.
Network Packet Totals. The total rate of sent and received packets in one-minute intervals.
Packet Mean Size. The mean size of packets, in bytes, sent and received in one-minute intervals.
Firewall Incoming Packets Denied. The rate of incoming network packets sent to the VM, but not received by the VM, because they were denied by firewall rules.
On the Disks Performance page:
I/O Size Avg. The average size of I/O read and write operations to disks. Small (4-16 KiB) random I/Os are usually limited by IOPS and sequential/large (256 KiB-1 MiB) I/Os by throughput.
Queue Length Avg. The number of queued and running disk I/O operations, also called queue depth, for the top 5 devices. To reach the performance limits of your persistent disks, use a high I/O queue depth. Persistent disks are networked storage and generally have higher latency compared to physical disks or local SSDs.
I/O Latency Avg. The average latency of I/O read and write operations aggregated across operations of all disks attached to the VM, measured by the Ops Agent in the VM. This value includes operating system and file system processing latency, and is dependent on queue length and I/O size.
Understand performance metrics
VM performance is affected by the hardware the VM runs on, the workload running on the VM, and the VM's machine type. If the hardware cannot support the workload or network traffic of your VM, your VM's performance might be affected.
CPU and memory performance
CPU and memory performance is affected by the following hardware constraints:
- Each virtual CPU (vCPU) is implemented as a single hardware multithread on a CPU processor.
- Intel Xeon CPU processors support multiple app threads on a single processor core.
- VMs that use C2 machine types have fixed virtual-to-physical core mapping and expose NUMA cell architecture to the guest OS.
- Most VMs get the all-core turbo frequency listed on CPU platforms, even if only the base frequency is advertised to the guest environment.
- Shared-core machine types use context-switching to share a physical core between vCPUs for multitasking. They also offer bursting capabilities during which the CPU utilization for a VM can go over 100%. For more information, see Shared-core machine types.
To understand a VM's CPU and memory performance, view performance metrics for CPU Utilization and Memory Utilization. You can additionally use process metrics to view currently running processes, attribute anomalies in resource consumption to a specific process, or identify your VM's most expensive resource consumers.
Consistently high CPU or memory utilization indicates the need to scale up a VM. If the VM consistently uses more than 90% of its CPU or memory, change the VM's machine type to a machine type with more vCPUs or memory.
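The 90% guideline can be checked against utilization samples that you have already exported from Cloud Monitoring. The following is a minimal sketch, not an official tool: the function name `needs_scale_up` is hypothetical, samples are assumed to be fractions between 0 and 1, and "consistently" is interpreted here (as an assumption) to mean that at least 95% of the samples exceed the threshold.

```python
from typing import Sequence


def needs_scale_up(
    samples: Sequence[float],
    threshold: float = 0.90,     # the 90% guideline from this guide
    min_fraction: float = 0.95,  # assumption: what "consistently" means
) -> bool:
    """Return True if at least min_fraction of samples exceed threshold.

    samples are utilization values expressed as fractions (0.0 to 1.0).
    An empty sample set is treated as "no evidence", so it returns False.
    """
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


if __name__ == "__main__":
    # Hypothetical one-minute CPU utilization samples.
    cpu_samples = [0.93, 0.95, 0.97, 0.92, 0.96, 0.94]
    print("Consider a larger machine type:", needs_scale_up(cpu_samples))
```

Adjust `min_fraction` and the sampling window to match how tolerant your workload is of short spikes.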
Network performance
Network performance is affected by the following hardware constraints:
- Each machine type has a specific egress bandwidth cap. To find the maximum egress bandwidth for your VM's machine type, visit the page that corresponds to your VM's machine family:
- Adding network interfaces, or adding IP addresses per network interface, to a VM doesn't increase the VM's ingress or egress network bandwidth, but you can configure some machine types for higher bandwidth. For more information, see Configuring a VM with higher bandwidth.
To understand a VM's network performance, view performance metrics for Network Packet Totals, Packet Mean Size, New Connections with VMs/External/Google, Sent to VMs/External/Google, Received from VMs/External/Google, and Firewall Incoming Packets Denied.
Review whether Network Packet Totals, Packet Mean Size, and New Connections with VMs/External/Google are typical for your workload. For example, a web server might experience many connections and small packets, while a database might experience few connections and large packets.
Consistently high sent network traffic might indicate the need to change the VM's machine type to a machine type that has a higher egress bandwidth limit.
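To judge whether sent traffic is close to the egress limit, you can convert the observed byte rate to the same unit as the machine type's bandwidth cap. The following sketch assumes that you have read the sent rate (in bytes per second) from the metrics and looked up the cap (in Gbps) yourself; the function name `egress_utilization` is hypothetical, and 1 Gbps is taken as 10^9 bits per second.

```python
def egress_utilization(sent_bytes_per_second: float, egress_cap_gbps: float) -> float:
    """Return sent traffic as a fraction of the egress bandwidth cap.

    sent_bytes_per_second: observed sent traffic rate, in bytes per second.
    egress_cap_gbps: the machine type's egress cap, in gigabits per second
                     (assumed here to mean 10**9 bits per second).
    """
    if egress_cap_gbps <= 0:
        raise ValueError("egress_cap_gbps must be positive")
    sent_gbps = sent_bytes_per_second * 8 / 1e9  # bytes/s -> gigabits/s
    return sent_gbps / egress_cap_gbps


if __name__ == "__main__":
    # Hypothetical values: 1 GB/s sent on a VM with a 10 Gbps egress cap.
    fraction = egress_utilization(1.0e9, 10.0)
    print(f"Egress utilization: {fraction:.0%}")
```

A fraction that stays near 1.0 suggests the VM is constrained by its egress cap rather than by the workload.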
If you notice high numbers of incoming packets denied by firewalls, visit the Network Intelligence Firewall Insights page in the Google Cloud console to learn more about the origins of denied packets.
If you think your own traffic is being incorrectly denied by firewalls, try running connectivity tests.
If your VM exchanges a large amount of traffic with VMs in different zones or regions, consider modifying your workload to keep more data within a zone or region to decrease latency and costs. For more information, see "VM-VM egress pricing within Google Cloud" on the pricing page. If your VM sends a large amount of traffic to other VMs within the same zone, consider a placement policy to achieve low network latency.
Storage performance
Storage performance is affected by the following hardware constraints:
- The total size of all persistent disks, combined with the number of vCPUs, determines total storage performance. If there are different types of persistent disks attached to a VM, the SSD persistent disk performance limit is shared by all disks on the VM. For more information, see Block storage performance.
- When persistent disks compete with network egress traffic, 60% of the maximum egress bandwidth is used for persistent disks, and the remaining 40% can be used for network egress. For more information, see Other factors that affect performance.
- I/O size and queue depth performance are dependent on workloads. Some workloads might not be large enough to use full I/O size and queue depth performance limits.
- A VM's machine type affects its storage performance. For more information, see Machine type and vCPU count.
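As a worked example of the 60%/40% split described in the list above, the following sketch computes how much bandwidth is available to persistent disks and to network egress when the two compete. The function name `contended_bandwidth_split` and the example cap are hypothetical; only the 60% share comes from this guide.

```python
from typing import Tuple


def contended_bandwidth_split(
    max_egress_gbps: float,
    disk_share: float = 0.60,  # share used by persistent disks under contention
) -> Tuple[float, float]:
    """Return (disk_gbps, network_gbps) when disks and network egress compete."""
    if max_egress_gbps < 0:
        raise ValueError("max_egress_gbps must not be negative")
    disk_gbps = max_egress_gbps * disk_share
    network_gbps = max_egress_gbps - disk_gbps
    return disk_gbps, network_gbps


if __name__ == "__main__":
    # Hypothetical VM with a 10 Gbps maximum egress bandwidth.
    disk, network = contended_bandwidth_split(10.0)
    print(f"Persistent disks: {disk:.1f} Gbps, network egress: {network:.1f} Gbps")
```

The split applies only while disk and network traffic are competing; when one of them is idle, this sketch does not describe the available bandwidth.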
To understand a VM's storage performance, view performance metrics for Throughput, Operations (IOPS), I/O Size, I/O Latency, and Queue Length.
Disk throughput and IOPS indicate whether the VM workload is operating as expected. If throughput or IOPS is lower than the expected maximum listed in the disk type chart, then I/O size, queue length, or I/O latency performance issues might be present.
You can expect I/O size to be between 4 KiB and 16 KiB for workloads that require high IOPS and low latency, and between 256 KiB and 1 MiB for workloads that involve sequential or large writes. I/O sizes outside these ranges might indicate disk performance issues.
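Average I/O size is the ratio of throughput to IOPS, so you can estimate it from the two metrics and compare it with the ranges above. The following is an illustrative sketch under that assumption; the function names are hypothetical, and the range boundaries are the ones stated in this guide.

```python
KIB = 1024  # bytes per KiB


def average_io_size_kib(throughput_bytes_per_second: float, iops: float) -> float:
    """Estimate the average I/O size in KiB as throughput divided by IOPS."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_bytes_per_second / iops / KIB


def classify_io_size(size_kib: float) -> str:
    """Map an average I/O size to the workload ranges described in this guide."""
    if 4 <= size_kib <= 16:
        return "small: likely IOPS-limited"
    if 256 <= size_kib <= 1024:
        return "large: likely throughput-limited"
    return "outside typical ranges"


if __name__ == "__main__":
    # Hypothetical readings: 8,192,000 bytes/s at 1,000 IOPS.
    size = average_io_size_kib(8 * KIB * 1000, 1000)
    print(f"Average I/O size: {size:.1f} KiB ({classify_io_size(size)})")
```

A result that falls between the two ranges is not necessarily a problem; treat it as a prompt to check queue length and latency as well.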
Queue length, also known as queue depth, is a factor of throughput and IOPS. When a disk performs well, its queue length should be about the same as the queue length recommended to achieve a particular throughput or IOPS level, listed in the Recommended I/O queue depth chart.
I/O latency is dependent on queue length and I/O size. If the queue length or I/O size for a disk is high, the latency will also be high.
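One way to reason about this relationship is Little's law, which states that in a steady state the average number of operations in flight equals the completion rate multiplied by the average latency. This approximation is not taken from the Compute Engine documentation; it is a general queueing result offered here as a rough sanity check. The function name `estimated_latency_ms` is hypothetical.

```python
def estimated_latency_ms(queue_length: float, iops: float) -> float:
    """Estimate average I/O latency in milliseconds using Little's law.

    Little's law: queue_length = iops * latency_seconds (steady state),
    so latency_seconds = queue_length / iops.
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    return queue_length / iops * 1000.0


if __name__ == "__main__":
    # Hypothetical readings: average queue length of 32 at 16,000 IOPS.
    print(f"Estimated latency: {estimated_latency_ms(32, 16000):.2f} ms")
```

If the measured I/O Latency Avg is far higher than this estimate, the difference might come from operating system or file system overhead, which the metric includes.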
If any storage performance metrics indicate disk performance issues, do one or more of the following:
- Review Optimizing persistent disk performance and implement the best practices suggested to improve performance.
- Attach a new persistent disk to the VM to increase the disk performance limits. Disk performance is based on the total amount of storage attached to a VM. This option is the least disruptive because it does not require you to unmount the file system, or to restart or shut down the VM.
- Resize the persistent disks to increase the per-disk IOPS and throughput limits. Persistent disks do not have any reserved, unusable capacity, so you can use the full disk without performance degradation.
- Change the disk type to a disk type that offers higher performance. For more information, see Configure disks to meet performance requirements.