This page shows you how to resolve issues with Cloud NAT packet loss from a VPC-native Google Kubernetes Engine (GKE) cluster with private nodes enabled.
Node VMs in VPC-native GKE clusters with private nodes don't have external IP addresses. This means that clients on the internet cannot connect to the IP addresses of the nodes. You can use Cloud NAT to allocate the external IP addresses and ports that allow clusters with private nodes to make public connections.
If a node VM runs out of its allocation of external ports and IP addresses from Cloud NAT, packets will drop. To avoid this, you can reduce the outbound packet rate or increase the allocation of available Cloud NAT source IP addresses and ports. The following sections describe how to diagnose and troubleshoot packet loss from Cloud NAT in the context of GKE clusters with private nodes.
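To make the exhaustion condition concrete, the following Python sketch estimates how many node VMs a gateway can serve and how many simultaneous connections one node can hold to a single destination. It is a rough model, not a statement of your gateway's limits. It assumes static port allocation, 64,512 usable source ports per NAT IP address per protocol, and one connection per allocated source address and port tuple for each destination IP address, port, and protocol. The function names and the example values are illustrative.

```python
# Assumption: each NAT IP address offers 64,512 usable source ports per
# protocol (65,536 minus the 1,024 well-known ports).
PORTS_PER_NAT_IP = 64_512


def max_vms_served(nat_ip_count: int, min_ports_per_vm: int) -> int:
    """Conservatively estimate how many VMs the gateway can serve.

    Assumes static port allocation and that a VM's ports all come from a
    single NAT IP address, so leftover ports on each address are ignored.
    """
    if nat_ip_count < 1 or min_ports_per_vm < 1:
        raise ValueError("counts must be positive")
    return nat_ip_count * (PORTS_PER_NAT_IP // min_ports_per_vm)


def max_connections_to_one_destination(ports_per_vm: int) -> int:
    """Each allocated source port carries at most one connection to a given
    destination IP address, port, and protocol, so the port count is the cap."""
    if ports_per_vm < 1:
        raise ValueError("ports_per_vm must be positive")
    return ports_per_vm


if __name__ == "__main__":
    # Hypothetical gateway: 2 NAT IP addresses, 64 ports per VM.
    print(max_vms_served(2, 64))                   # 2016
    print(max_connections_to_one_destination(64))  # 64
```

Under these assumptions, a node with 64 ports starts dropping packets once its Pods together hold 64 connections open to the same destination.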
Diagnose packet loss
The following sections explain how to log dropped packets by using Cloud Logging and how to diagnose the cause of dropped packets by using Cloud Monitoring.
Log dropped packets
You can log dropped packets with the following query in Cloud Logging:
resource.type="nat_gateway"
resource.labels.region=REGION
resource.labels.gateway_name=GATEWAY_NAME
jsonPayload.allocation_status="DROPPED"
Replace the following:
REGION: the name of the region that the cluster is in.
GATEWAY_NAME: the name of the Cloud NAT gateway.
This query returns a list of all packets dropped by the Cloud NAT gateway, but it does not identify the cause.
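If you run this query for several gateways, you can generate the filter text rather than edit it by hand. The following Python sketch only builds the filter string. The function name and the example region and gateway name are hypothetical, and the sketch assumes that label values are written as quoted strings.

```python
def build_dropped_packets_filter(region: str, gateway_name: str) -> str:
    """Return the Cloud Logging filter for packets dropped by one gateway."""
    for label, value in (("region", region), ("gateway_name", gateway_name)):
        # Reject empty values and embedded quotes so the filter stays valid.
        if not value or '"' in value:
            raise ValueError(f"invalid {label}: {value!r}")
    clauses = [
        'resource.type="nat_gateway"',
        f'resource.labels.region="{region}"',
        f'resource.labels.gateway_name="{gateway_name}"',
        'jsonPayload.allocation_status="DROPPED"',
    ]
    # Comparisons placed on separate lines are joined by an implicit AND.
    return "\n".join(clauses)


if __name__ == "__main__":
    # Hypothetical values; substitute your own region and gateway name.
    print(build_dropped_packets_filter("us-central1", "my-nat-gateway"))
```

You can paste the resulting text into the Cloud Logging query editor.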
Monitor causes for packet loss
To identify the causes of dropped packets, run the following queries in Metrics Explorer in Cloud Monitoring. Packets drop for one of three reasons:
OUT_OF_RESOURCES
ENDPOINT_INDEPENDENT_CONFLICT
NAT_ALLOCATION_FAILED
To identify packets dropped due to the OUT_OF_RESOURCES or ENDPOINT_INDEPENDENT_CONFLICT error codes, use the following query:
fetch nat_gateway
| metric 'router.googleapis.com/nat/dropped_sent_packets_count'
| filter (resource.gateway_name == GATEWAY_NAME)
| align rate(1m)
| every 1m
| group_by [metric.reason],
    [value_dropped_sent_packets_count_aggregate: aggregate(value.dropped_sent_packets_count)]
If you identify packets that drop because of these reasons, see Packets dropped with reason: out of resources and Packets dropped with reason: endpoint independent conflict for troubleshooting advice.
To identify packets dropped due to the NAT_ALLOCATION_FAILED error code, use the following query:
fetch nat_gateway
| metric 'router.googleapis.com/nat/nat_allocation_failed'
| group_by 1m,
    [value_nat_allocation_failed_count_true: count_true(value.nat_allocation_failed)]
| every 1m
If you identify packets that dropped for this reason, see Need to allocate more IP addresses.
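The mapping from each drop reason to its troubleshooting topic can be written as a small lookup. The following Python sketch assumes that you have copied the per-reason drop counts out of the query results by hand. The `triage` function and its input dictionary are hypothetical and do not call any Cloud Monitoring API.

```python
# Maps each drop reason to the troubleshooting topic that this page
# points to for that reason.
NEXT_STEP_BY_REASON = {
    "OUT_OF_RESOURCES": "Packets dropped with reason: out of resources",
    "ENDPOINT_INDEPENDENT_CONFLICT": (
        "Packets dropped with reason: endpoint independent conflict"
    ),
    "NAT_ALLOCATION_FAILED": "Need to allocate more IP addresses",
}


def triage(drop_counts: dict[str, float]) -> list[tuple[str, str]]:
    """Return (reason, topic) pairs for reasons with drops, largest first."""
    active = {reason: n for reason, n in drop_counts.items() if n > 0}
    ordered = sorted(active, key=lambda reason: active[reason], reverse=True)
    return [
        (reason, NEXT_STEP_BY_REASON.get(reason, "Unrecognized reason"))
        for reason in ordered
    ]


if __name__ == "__main__":
    # Hypothetical counts read from the query results.
    observed = {
        "OUT_OF_RESOURCES": 5,
        "ENDPOINT_INDEPENDENT_CONFLICT": 12,
        "NAT_ALLOCATION_FAILED": 0,
    }
    for reason, topic in triage(observed):
        print(f"{reason}: see '{topic}'")
```

Sorting by count puts the dominant cause first, which is usually the one to address before the others.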
Investigate Cloud NAT configuration
If the previous queries return empty results and GKE Pods are unable to communicate with external IP addresses, use the following table to help you troubleshoot your configuration:
Configuration | Troubleshooting |
Cloud NAT configured to apply only to the subnet's primary IP address range. | When Cloud NAT is configured only for the subnet's primary IP address range, packets sent from the cluster to external IP addresses must have a source node IP address. In this Cloud NAT configuration: |
Cloud NAT configured to apply only to the subnet's secondary IP address range used for Pod IPs. | When Cloud NAT is configured only for the subnet's secondary IP address range used by the cluster's Pod IPs, packets sent from the cluster to external IP addresses must have a source Pod IP address. In this Cloud NAT configuration: |
Reduce packet loss
After you have diagnosed the cause of your packet loss, consider the following recommendations to reduce the likelihood of the issue recurring:
Configure the Cloud NAT gateway to use dynamic port allocation and increase the maximum number of ports per VM.
If you're using static port allocation, increase the minimum number of ports per VM.
Reduce your application's outbound packet rate. When an application makes multiple outbound connections to the same destination IP address and port, it can quickly consume all of the connections that Cloud NAT can make to that destination with the allocated NAT source address and source port tuples.
For details about how Cloud NAT uses NAT source addresses and source ports to make connections, including limits on the number of simultaneous connections to a destination, refer to Ports and connections.
To reduce the rate of outbound connections from the application, reuse open connections. Common methods of reusing connections include connection pooling, multiplexing connections by using protocols such as HTTP/2, and establishing persistent connections that are reused for multiple requests.
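The effect of connection reuse can be observed locally. The following Python sketch is illustrative only. It starts a throwaway HTTP server on the loopback interface, sends the same number of requests once over a single persistent connection and once over a fresh connection per request, and counts the distinct client source ports that the server sees. It does not touch Cloud NAT, and the handler and function names are hypothetical. The point is that each new connection to the same destination needs its own source port, and that is the resource that Cloud NAT allocates per VM.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoPortHandler(BaseHTTPRequestHandler):
    """Replies with the client's source port so that reuse is observable."""

    protocol_version = "HTTP/1.1"  # allows keep-alive on the server side

    def do_GET(self):
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def source_ports_used(port: int, requests: int, reuse: bool) -> set[str]:
    """Send `requests` GETs and return the distinct client source ports."""
    seen = set()
    shared = http.client.HTTPConnection("127.0.0.1", port) if reuse else None
    for _ in range(requests):
        conn = shared or http.client.HTTPConnection("127.0.0.1", port)
        conn.request("GET", "/")
        seen.add(conn.getresponse().read().decode())
        if not reuse:
            conn.close()  # a new socket, and a new source port, next time
    if shared:
        shared.close()
    return seen


def run_demo(requests: int = 5) -> tuple[int, int]:
    """Return (ports used with reuse, ports used without reuse)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), EchoPortHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        reused = source_ports_used(port, requests, reuse=True)
        fresh = source_ports_used(port, requests, reuse=False)
        return len(reused), len(fresh)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    with_reuse, without_reuse = run_demo()
    print(f"source ports with reuse: {with_reuse}")
    print(f"source ports without reuse: {without_reuse}")
```

With reuse, all requests share one source port. Without it, the count grows with the number of requests, which is the pattern that exhausts a node's port allocation when every request goes to the same destination.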
What's next
If you need additional assistance, reach out to Cloud Customer Care.