Troubleshooting VM performance issues


This document shows you how to diagnose and mitigate CPU, memory, and storage performance issues on Compute Engine virtual machine (VM) instances.

Before you begin

View performance metrics

To view performance metrics for your VMs, use the Cloud Monitoring observability metrics available in the Google Cloud console.

  1. In the Google Cloud console, go to the VM Instances page.

    Go to VM Instances

  2. You can view metrics for individual VMs or for the five VMs consuming the most of a resource.

    To view metrics for individual VMs, do the following:

    1. Click the name of the VM you want to view performance metrics for. The VM instance Details page opens.

    2. Click the Observability tab to open the Observability Overview page.

    To view metrics for the five VMs consuming the most of a resource, click the Observability tab on the VM instances page.

  3. Explore the VM's performance metrics. View the Overview, CPU, Memory, Network, and Disk sections to see detailed metrics about each topic. The following are key metrics that indicate VM performance:

    • On the Overview page:

      • CPU Utilization. The percent of CPU used by the VM.

      • Memory Utilization. The percent of memory used by the VM, excluding disk caches. For Linux VMs, this also excludes kernel memory.

      • Network Traffic. The average rate of bytes sent and received in one minute intervals.

      • New Connections with VMs/External/Google. The estimated number of distinct TCP/UDP flows in one minute, grouped by peer type.

      • Disk Throughput. The average rate of bytes written to and read from disks.

      • Disk IOPS. The average rate of I/O read and write operations to disks.

    • On the Network Summary page:

      • Sent to VMs/External/Google. The rate of network traffic sent to Google services, VMs, and external destinations, based on a sample of packets. The metric is scaled so that the sum matches the total sent network traffic.

      • Received from VMs/External/Google. The rate of network traffic received from Google services, VMs, and external sources, based on a sample of packets. The metric is scaled so that the sum matches the total received network traffic.

      • Network Packet Totals. The total rate of sent and received packets in one minute intervals.

      • Packet Mean Size. The mean size of packets, in bytes, sent and received in one minute intervals.

      • Firewall Incoming Packets Denied. The rate of incoming network packets sent to the VM, but not received by the VM, because they were denied by firewall rules.

    • On the Disks Performance page:

      • I/O Size Avg. The average size of I/O read and write operations to disks. Small (4-16 KiB) random I/Os are usually limited by IOPS and sequential/large (256 KiB-1 MiB) I/Os by throughput.

      • Queue Length Avg. The number of queued and running disk I/O operations, also called queue depth, for the top 5 devices. To reach the performance limits of your persistent disks, use a high I/O queue depth. Persistent disks are networked storage and generally have higher latency compared to physical disks or local SSDs.

      • I/O Latency Avg. The average latency of I/O read and write operations aggregated across operations of all disks attached to the VM, measured by the Ops Agent in the VM. This value includes operating system and file system processing latency, and is dependent on queue length and I/O size.
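
If you want to pull the same metrics outside the console, you can also query the Cloud Monitoring API directly. The following Python sketch reads a VM's CPU Utilization over the last hour; it assumes the google-cloud-monitoring client library is installed, and the project ID and instance name are placeholders.

  # pip install google-cloud-monitoring
  import time
  from google.cloud import monitoring_v3

  PROJECT_ID = "my-project"   # placeholder
  INSTANCE_NAME = "my-vm"     # placeholder

  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      end_time={"seconds": now},
      start_time={"seconds": now - 3600},  # last hour
  )

  # CPU utilization is reported as a fraction (0.0-1.0) of the vCPUs' capacity.
  results = client.list_time_series(
      request={
          "name": f"projects/{PROJECT_ID}",
          "filter": (
              'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
              f'AND metric.labels.instance_name = "{INSTANCE_NAME}"'
          ),
          "interval": interval,
          "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
      }
  )

  for series in results:
      for point in series.points:
          print(point.interval.end_time, f"{point.value.double_value:.1%}")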

Understand performance metrics

VM performance is affected by the hardware the VM runs on, the workload running on the VM, and the VM's machine type. If the hardware cannot support the workload or network traffic of your VM, your VM's performance might be affected.

CPU and memory performance

Hardware details

CPU and memory performance is affected by the following hardware constraints:

  • Each virtual CPU (vCPU) is implemented as a single hardware thread (hyper-thread) on a physical CPU core.
  • Intel Xeon processors support Intel Hyper-Threading Technology, which lets multiple application threads run on a single physical processor core.
  • VMs that use C2 machine types have fixed virtual-to-physical core mapping and expose NUMA cell architecture to the guest OS.
  • Most VMs get the all-core turbo frequency listed on CPU platforms, even if only the base frequency is advertised to the guest environment.
  • Shared-core machine types use context-switching to share a physical core between vCPUs for multitasking. They also offer bursting capabilities during which the CPU utilization for a VM can go over 100%. For more information, see Shared-core machine types.

To understand a VM's CPU and memory performance, view performance metrics for CPU Utilization and Memory Utilization. You can additionally use process metrics to view currently running processes, attribute anomalies in resource consumption to a specific process, or identify your VM's most expensive resource consumers.
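
For example, the following Python sketch ranks processes by resident memory by querying the Cloud Monitoring API; it assumes the Ops Agent is installed and collecting process metrics (agent.googleapis.com/processes/rss_usage, grouped here by its command label), and the project ID and numeric instance ID are placeholders.

  # pip install google-cloud-monitoring
  import time
  from google.cloud import monitoring_v3

  PROJECT_ID = "my-project"   # placeholder
  INSTANCE_ID = "1234567890"  # placeholder: numeric instance ID

  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      end_time={"seconds": now}, start_time={"seconds": now - 600}
  )

  # Average resident set size per process command over the last 10 minutes.
  aggregation = monitoring_v3.Aggregation(
      alignment_period={"seconds": 600},
      per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
      cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_MEAN,
      group_by_fields=["metric.labels.command"],
  )

  results = client.list_time_series(
      request={
          "name": f"projects/{PROJECT_ID}",
          "filter": (
              'metric.type = "agent.googleapis.com/processes/rss_usage" '
              f'AND resource.labels.instance_id = "{INSTANCE_ID}"'
          ),
          "interval": interval,
          "aggregation": aggregation,
      }
  )

  # Sort the per-command series by memory use and print the top consumers.
  usage = sorted(
      ((s.metric.labels["command"], s.points[0].value.double_value) for s in results),
      key=lambda kv: kv[1],
      reverse=True,
  )
  for command, rss_bytes in usage[:10]:
      print(f"{command}: {rss_bytes / 2**20:.1f} MiB")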

Consistently high CPU or memory utilization indicates that you need to scale up the VM. If the VM consistently uses more than 90% of its CPU or memory, change the VM's machine type to one with more vCPUs or memory.
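
As a sketch of that change, the following Python example uses the google-cloud-compute client library to stop a VM, set a larger machine type, and start it again; the project, zone, instance, and e2-standard-8 target are placeholders, and the resize requires stopping the VM, which interrupts the workload.

  # pip install google-cloud-compute
  from google.cloud import compute_v1

  PROJECT_ID = "my-project"       # placeholder
  ZONE = "us-central1-a"          # placeholder
  INSTANCE_NAME = "my-vm"         # placeholder
  NEW_MACHINE_TYPE = f"zones/{ZONE}/machineTypes/e2-standard-8"  # example target

  instances = compute_v1.InstancesClient()

  # The VM must be stopped before its machine type can be changed.
  instances.stop(project=PROJECT_ID, zone=ZONE, instance=INSTANCE_NAME).result()

  request = compute_v1.InstancesSetMachineTypeRequest(machine_type=NEW_MACHINE_TYPE)
  instances.set_machine_type(
      project=PROJECT_ID,
      zone=ZONE,
      instance=INSTANCE_NAME,
      instances_set_machine_type_request_resource=request,
  ).result()

  # Restart the VM on the new machine type.
  instances.start(project=PROJECT_ID, zone=ZONE, instance=INSTANCE_NAME).result()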

Network performance

Hardware details

Network performance is affected by the following hardware constraints:

  • A VM's maximum egress bandwidth depends on its machine type and number of vCPUs. For more information, see Network bandwidth.

To understand a VM's network performance, view performance metrics for Network Packet Totals, Packet Mean Size, New Connections with VMs/External/Google, Sent to VMs/External/Google, Received From VMs/External/Google, and Firewall Incoming Packets Denied.

Review whether Network Packet Totals, Packet Mean Size, and New Connections with VMs/External/Google are typical for your workload. For example, a web server might experience many connections and small packets, while a database might experience few connections and large packets.

Consistently high outgoing network traffic might indicate the need to change the VM's machine type to a machine type that has a higher egress bandwidth limit.
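
To see how close outgoing traffic gets to that limit, you can compare the sent-bytes metric against the machine type's documented egress bandwidth. The following Python sketch computes the peak egress rate over the last hour; the project ID, instance name, and the 10 Gbps figure are placeholders, so look up the actual limit for your machine type.

  # pip install google-cloud-monitoring
  import time
  from google.cloud import monitoring_v3

  PROJECT_ID = "my-project"    # placeholder
  INSTANCE_NAME = "my-vm"      # placeholder
  EGRESS_LIMIT_GBPS = 10       # example only: use your machine type's limit

  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      end_time={"seconds": now}, start_time={"seconds": now - 3600}
  )

  # Convert the per-minute sent-bytes deltas into a bytes-per-second rate.
  aggregation = monitoring_v3.Aggregation(
      alignment_period={"seconds": 60},
      per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
      cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
  )

  results = client.list_time_series(
      request={
          "name": f"projects/{PROJECT_ID}",
          "filter": (
              'metric.type = "compute.googleapis.com/instance/network/sent_bytes_count" '
              f'AND metric.labels.instance_name = "{INSTANCE_NAME}"'
          ),
          "interval": interval,
          "aggregation": aggregation,
      }
  )

  for series in results:
      if not series.points:
          continue
      peak_bytes_per_sec = max(p.value.double_value for p in series.points)
      peak_gbps = peak_bytes_per_sec * 8 / 1e9
      print(f"peak egress: {peak_gbps:.2f} Gbps "
            f"({peak_gbps / EGRESS_LIMIT_GBPS:.0%} of the {EGRESS_LIMIT_GBPS} Gbps limit)")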

If you notice high numbers of incoming packets denied by firewalls, visit the Network Intelligence Firewall Insights page in the Google Cloud console to learn more about the origins of denied packets.

Go to the Firewall Insights page

If you think your own traffic is being incorrectly denied by firewalls, try running connectivity tests.
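
You can also read the underlying metric for denied packets directly from Cloud Monitoring. The sketch below sums compute.googleapis.com/firewall/dropped_packets_count for one VM over the last hour; the project ID and instance name are placeholders, and the google-cloud-monitoring client library is assumed.

  # pip install google-cloud-monitoring
  import time
  from google.cloud import monitoring_v3

  PROJECT_ID = "my-project"   # placeholder
  INSTANCE_NAME = "my-vm"     # placeholder

  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      end_time={"seconds": now}, start_time={"seconds": now - 3600}
  )

  # Incoming packets dropped by firewall rules, summed per minute.
  aggregation = monitoring_v3.Aggregation(
      alignment_period={"seconds": 60},
      per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
  )

  results = client.list_time_series(
      request={
          "name": f"projects/{PROJECT_ID}",
          "filter": (
              'metric.type = "compute.googleapis.com/firewall/dropped_packets_count" '
              f'AND metric.labels.instance_name = "{INSTANCE_NAME}"'
          ),
          "interval": interval,
          "aggregation": aggregation,
      }
  )

  for series in results:
      dropped = sum(p.value.int64_value for p in series.points)
      print(f"packets denied by firewall rules in the last hour: {dropped}")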

If your VM exchanges a large amount of traffic with VMs in different zones or regions, consider modifying your workload to keep more data within a zone or region to decrease latency and costs. For more information, see "VM-VM outbound data transfer pricing within Google Cloud" on the pricing page. If your VM sends a large amount of traffic to other VMs within the same zone, consider a compact placement policy to achieve low network latency.
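
The following is a minimal sketch, assuming the google-cloud-compute client library, of creating a compact placement policy through the API; the project, region, policy name, and VM count are placeholders, and the policy still has to be attached to the VMs, for example when you create them.

  # pip install google-cloud-compute
  from google.cloud import compute_v1

  PROJECT_ID = "my-project"       # placeholder
  REGION = "us-central1"          # placeholder
  POLICY_NAME = "compact-policy"  # placeholder

  policy = compute_v1.ResourcePolicy(
      name=POLICY_NAME,
      group_placement_policy=compute_v1.ResourcePolicyGroupPlacementPolicy(
          collocation="COLLOCATED",  # place the VMs close together for low latency
          vm_count=4,                # placeholder: number of VMs in the group
      ),
  )

  client = compute_v1.ResourcePoliciesClient()
  client.insert(
      project=PROJECT_ID,
      region=REGION,
      resource_policy_resource=policy,
  ).result()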

Storage performance

Hardware details

Storage is affected by the following hardware constraints:

  • The total size of all persistent disks, combined with the number of vCPUs, determines total storage performance. If different types of persistent disks are attached to a VM, the SSD persistent disk performance limit is shared by all disks on the VM. For more information, see Block storage performance.
  • When Persistent Disk and Hyperdisk compete with outbound data transfer traffic, 60% of the maximum outbound network bandwidth is used for Persistent Disk and Hyperdisk, and the remaining 40% can be used for outbound network data transfer. For more information, see Other factors that affect performance.
  • I/O size and queue depth performance are dependent on workloads. Some workloads might not be large enough to use the full I/O size and queue depth performance limits.
  • A VM's machine type affects its storage performance. For more information, see Machine type and vCPU count.

To understand a VM's storage performance, view performance metrics for Throughput, Operations (IOPS), I/O Size, I/O Latency, and Queue Length.

Disk throughput and IOPS indicate whether the VM workload is operating as expected. If throughput or IOPS is lower than the expected maximum listed in the disk type chart, then I/O size, queue length, or I/O latency performance issues might be present.

You can expect an I/O size of 4-16 KiB for workloads that require high IOPS and low latency, and 256 KiB-1 MiB for workloads that involve sequential or large write sizes. An I/O size outside of these ranges can indicate disk performance issues.
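
Because average I/O size is approximately throughput divided by IOPS, you can derive it from the disk metrics themselves as a sanity check. The following Python sketch does this for read operations over the last hour; the project ID and instance name are placeholders, and the google-cloud-monitoring client library is assumed.

  # pip install google-cloud-monitoring
  import time
  from google.cloud import monitoring_v3

  PROJECT_ID = "my-project"   # placeholder
  INSTANCE_NAME = "my-vm"     # placeholder

  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      end_time={"seconds": now}, start_time={"seconds": now - 3600}
  )
  aggregation = monitoring_v3.Aggregation(
      alignment_period={"seconds": 3600},
      per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
      cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
  )

  def total(metric_type: str) -> int:
      """Sum a delta disk metric for the VM over the interval."""
      results = client.list_time_series(
          request={
              "name": f"projects/{PROJECT_ID}",
              "filter": (
                  f'metric.type = "{metric_type}" '
                  f'AND metric.labels.instance_name = "{INSTANCE_NAME}"'
              ),
              "interval": interval,
              "aggregation": aggregation,
          }
      )
      return sum(p.value.int64_value for s in results for p in s.points)

  read_bytes = total("compute.googleapis.com/instance/disk/read_bytes_count")
  read_ops = total("compute.googleapis.com/instance/disk/read_ops_count")
  if read_ops:
      # Average read I/O size in KiB = total bytes read / total read operations.
      print(f"average read I/O size: {read_bytes / read_ops / 1024:.0f} KiB")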

Queue length, also known as queue depth, is a factor of throughput and IOPS. When a disk performs well, its queue length should be about the same as the queue length recommended to achieve a particular throughput or IOPS level, listed in the Recommended I/O queue depth chart.

I/O latency is dependent on queue length and I/O size. If the queue length or I/O size for a disk is high, the latency will also be high.

If any storage performance metrics indicate disk performance issues, do one or more of the following: