Troubleshooting private clusters in GKE

This page shows you how to resolve issues with Google Kubernetes Engine (GKE) private clusters.

If you need additional assistance, reach out to Cloud Customer Care.

Private cluster not running

A private cluster stops functioning if you do any of the following: delete the VPC Network Peering connection between the cluster control plane and the cluster nodes, delete the firewall rules that allow ingress traffic from the cluster control plane to nodes on port 10250, or delete the default route to the default internet gateway. If you delete the default route, you must ensure that traffic to the necessary Google Cloud services is still routed. For more information, see custom routing.
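
To see which of these resources still exist, you can list them with gcloud. The following is a minimal sketch, not an official procedure. The function name is hypothetical, `$1` and `$2` stand for your cluster name and VPC network name, and the firewall filter assumes that the automatically created rules keep their default `gke-CLUSTER_NAME-` name prefix.

```shell
# Hypothetical helper: list the network resources a private cluster depends on.
# $1 = cluster name, $2 = VPC network name
check_private_cluster_network() {
  # VPC Network Peering connections on the cluster's network
  gcloud compute networks peerings list --network="$2"
  # Firewall rules created for the cluster (assumes the default name prefix)
  gcloud compute firewall-rules list --filter="name~^gke-$1-"
  # Default routes (0.0.0.0/0) in the project, if any remain
  gcloud compute routes list --filter='destRange="0.0.0.0/0"'
}
```

For example, `check_private_cluster_network my-cluster my-network` prints the three listings in order. An empty listing points to the resource that was deleted.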

Timeout when creating private cluster

Every private cluster requires a peering route between VPCs, but only one peering operation can happen at a time. If you attempt to create multiple private clusters at the same time, cluster creation might time out. To avoid this, create new private clusters serially so that the VPC peering routes already exist for each subsequent private cluster. Creating a single private cluster might also time out if other operations are running on your VPC.

VPC Network Peering connection on private cluster is accidentally deleted

Symptoms

When you accidentally delete a VPC Network Peering connection, the cluster goes into a repair state and all nodes show an UNKNOWN status. You can't perform any operation on the cluster because the control plane is unreachable. When you inspect the control plane logs, they display an error similar to the following:

error checking if node NODE_NAME is shutdown: unimplemented

Potential causes

You accidentally deleted the VPC Network Peering connection.

Resolution

Follow these steps:

- Create a new temporary private cluster that uses VPC Network Peering. Creating the cluster recreates the VPC Network Peering connection, and the old cluster is restored to normal operation.
- Delete the temporary cluster after the old cluster is restored to normal operation.

Cluster overlaps with active peer

Symptoms

Attempting to create a private cluster returns an error similar to the following:

Google Compute Engine: An IP range in the peer network overlaps with an IP
range in an active peer of the local network.

Potential causes

You chose an overlapping control plane CIDR.

Resolution

Delete and recreate the cluster using a different control plane CIDR.

Can't reach control plane of a private cluster

Increase the likelihood that your cluster control plane is reachable by implementing any of the cluster endpoint access configurations. For more information, see access to cluster endpoints.

Symptoms

After creating a private cluster, attempting to run kubectl commands against the cluster returns an error similar to one of the following:

Unable to connect to the server: dial tcp [IP_ADDRESS]: connect: connection
timed out.

Unable to connect to the server: dial tcp [IP_ADDRESS]: i/o timeout.

Potential causes

kubectl is unable to talk to the cluster control plane.

Resolution

Verify that credentials for the cluster have been generated for kubeconfig or that the correct context is activated. For more information about setting the cluster credentials, see generate kubeconfig entry.

Verify that accessing the control plane using its external IP address is permitted. Disabling external access to the cluster control plane isolates the cluster from the internet. This configuration is immutable after cluster creation. With this configuration, only authorized internal network CIDR ranges or reserved networks have access to the control plane.

Verify the origin IP address is authorized to reach the control plane:
```
  gcloud container clusters describe CLUSTER_NAME \
      --format="value(masterAuthorizedNetworksConfig)" \
      --location=COMPUTE_LOCATION
```
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_LOCATION: the Compute Engine location for the cluster.
If your origin IP address is not authorized, the output might be empty (only curly braces) or list CIDR ranges that don't include your origin IP address:
```
cidrBlocks:
  cidrBlock: 10.XXX.X.XX/32
  displayName: jumphost
  cidrBlock: 35.XXX.XXX.XX/32
  displayName: cloud shell
enabled: true
```
Add authorized networks to access the control plane.
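
One way to add authorized networks is with `gcloud container clusters update`. The following is a minimal sketch. The function name is hypothetical, and `$3` is a comma-separated list of CIDR ranges that you supply. It assumes that the list you pass becomes the full set of authorized networks, so include the ranges that you want to keep as well as the new one.

```shell
# Hypothetical helper: set the authorized networks for the control plane.
# $1 = cluster name, $2 = Compute Engine location,
# $3 = comma-separated CIDR ranges (existing ranges plus the new origin)
authorize_networks() {
  gcloud container clusters update "$1" \
      --enable-master-authorized-networks \
      --master-authorized-networks="$3" \
      --location="$2"
}
```

For example, `authorize_networks CLUSTER_NAME COMPUTE_LOCATION 10.0.0.5/32,203.0.113.7/32` authorizes two single-address ranges. The addresses shown are illustrative only.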

If you run the kubectl command from an on-premises environment or a region different from the cluster's location, ensure that control plane private endpoint global access is enabled. For more information, see accessing the control plane's private endpoint globally.

Describe the cluster to see its global access configuration:
```
gcloud container clusters describe CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --flatten "privateClusterConfig.masterGlobalAccessConfig"
```
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_LOCATION: the Compute Engine location for the cluster.
A successful output is similar to the following:
```
  enabled: true
```
If null is returned, enable global access to the control plane.
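
Enabling global access can be sketched as follows. The function name is hypothetical. The arguments are your cluster name and its Compute Engine location.

```shell
# Hypothetical helper: enable global access to the control plane's
# private endpoint.
# $1 = cluster name, $2 = Compute Engine location
enable_global_access() {
  gcloud container clusters update "$1" \
      --enable-master-global-access \
      --location="$2"
}
```

After the update finishes, rerunning the describe command should show `enabled: true`.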

Can't create cluster due to overlapping IPv4 CIDR block

Symptoms

gcloud container clusters create returns an error similar to the following:

The given master_ipv4_cidr 10.128.0.0/28 overlaps with an existing network
10.128.0.0/20.

Potential causes

You specified a control plane CIDR block that overlaps with an existing subnet in your VPC.

Resolution

Specify a CIDR block for --master-ipv4-cidr that does not overlap with an existing subnet.
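
To choose a block that doesn't overlap, it can help to list the primary ranges of the subnets that already exist in the network. The following is a minimal sketch. The function name is hypothetical, and `$1` is the name of your VPC network. The listing shows primary ranges only, so check secondary ranges separately.

```shell
# Hypothetical helper: show the primary IP range of each subnet in a network.
# $1 = VPC network name
list_subnet_ranges() {
  gcloud compute networks subnets list \
      --network="$1" \
      --format="table(name,region,ipCidrRange)"
}
```

In the example error above, 10.128.0.0/28 falls inside the existing 10.128.0.0/20 subnet, so any /28 outside the listed ranges would be a candidate for --master-ipv4-cidr.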

Can't create cluster due to services range already in use by another cluster

Symptoms

Attempting to create a private cluster returns an error similar to the following:

Services range [ALIAS_IP_RANGE] in network [VPC_NETWORK], subnetwork
[SUBNET_NAME] is already used by another cluster.

Potential causes

Either of the following:

- You chose a services range that is still in use by another cluster, or the cluster was not deleted.
- A cluster that used that services range was deleted, but the secondary ranges metadata was not properly cleaned up. Secondary ranges for a GKE cluster are saved in the Compute Engine metadata and should be removed once the cluster is deleted. Even when a cluster is successfully deleted, the metadata might not be removed.

Resolution

Follow these steps:

- Check if the services range is in use by an existing cluster. You can use the gcloud container clusters list command with the filter flag to search for the cluster. If an existing cluster uses the services range, you must delete that cluster or create a new services range.
- If the services range is not in use by an existing cluster, then manually remove the metadata entry that matches the services range you want to use.
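
The first step can be sketched as a listing of each cluster together with its services range. The function name is hypothetical. The sketch uses a table projection instead of a filter, and it assumes the `servicesIpv4Cidr` and `ipAllocationPolicy.servicesSecondaryRangeName` fields of the cluster resource are populated for your clusters.

```shell
# Hypothetical helper: list every cluster in the project with the services
# CIDR and the name of the secondary range it uses.
list_services_ranges() {
  gcloud container clusters list \
      --format="table(name,location,servicesIpv4Cidr,ipAllocationPolicy.servicesSecondaryRangeName)"
}
```

Compare the output against the range named in the error message to find the cluster that still holds it.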

Can't create subnet

Symptoms

When you attempt to create a private cluster with an automatic subnet, or to create a custom subnet, you might encounter the following error:

An IP range in the peer network overlaps
with an IP range in one of the active peers of the local network.

Potential causes

The control plane CIDR range you specified overlaps with another IP range in the cluster. This can also occur if you've recently deleted a private cluster and you're attempting to create a new private cluster using the same control plane CIDR.

Resolution

Try using a different CIDR range.

Can't pull image from public Docker Hub

Symptoms

A Pod running in your cluster displays a warning in kubectl describe:

Failed to pull image: rpc error: code = Unknown desc = Error response
from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled
while waiting for connection (Client.Timeout exceeded while awaiting
headers)

Potential causes

Nodes in a private cluster don't have external IP addresses, so they don't meet the internet access requirements. However, the nodes can access Google Cloud APIs and services, including Artifact Registry, if you have enabled Private Google Access and met its network requirements.

Resolution

Use one of the following solutions:

- Copy the images that your private cluster uses from Docker Hub to Artifact Registry. For more information, see Migrating containers from a third-party registry.
- GKE automatically checks mirror.gcr.io for cached copies of frequently accessed Docker Hub images.
- If you must pull images from Docker Hub or another public repository, use Cloud NAT or an instance-based proxy that is the target for a static 0.0.0.0/0 route.

API request that triggers admission webhook timing out

Symptoms

An API request that triggers an admission webhook configured to use a service with a targetPort other than 443 times out, causing the request to fail:

Error from server (Timeout): request did not complete within requested timeout 30s

Potential causes

By default, the firewall does not allow TCP connections to nodes except on ports 443 (HTTPS) and 10250 (kubelet). An admission webhook attempting to communicate with a Pod on a port other than 443 will fail if there is not a custom firewall rule that permits the traffic.

Resolution

Add a firewall rule for your specific use case.
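
Such a rule might look like the following minimal sketch. The function name and all argument values are hypothetical. You need to look up the control plane CIDR range and the network tag of your nodes for your own cluster.

```shell
# Hypothetical helper: allow the control plane to reach a webhook port on nodes.
# $1 = firewall rule name, $2 = VPC network name,
# $3 = control plane CIDR range, $4 = webhook targetPort,
# $5 = network tag applied to the cluster nodes
allow_webhook_port() {
  gcloud compute firewall-rules create "$1" \
      --network="$2" \
      --direction=INGRESS \
      --action=ALLOW \
      --rules="tcp:$4" \
      --source-ranges="$3" \
      --target-tags="$5"
}
```

Restricting the source range to the control plane CIDR and the target to the node tag keeps the rule as narrow as the webhook requires.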

Can't create cluster due to health check failing

Symptoms

After you create a private cluster, it gets stuck at the health check step and reports an error similar to one of the following:

All cluster resources were brought up, but only 0 of 2 have registered.

All cluster resources were brought up, but: 3 nodes out of 4 are unhealthy

Potential causes

Any of the following:

- Cluster nodes cannot download required binaries from the Cloud Storage API (storage.googleapis.com).
- Firewall rules restricting egress traffic.
- Shared VPC IAM permissions are incorrect.
- Private Google Access requires you to configure DNS for *.gcr.io.

Resolution

Use one of the following solutions:

- Enable Private Google Access on the subnet for node network access to storage.googleapis.com, or enable Cloud NAT to allow nodes to communicate with storage.googleapis.com endpoints. For more information, see How to Troubleshoot GKE private cluster creation issues.
- For node read access to storage.googleapis.com, confirm that the service account assigned to the cluster node has storage read access.
- Ensure that you either have a Google Cloud firewall rule that allows all egress traffic, or configure a firewall rule that allows egress traffic from nodes to the cluster control plane and *.googleapis.com.
- Create the DNS configuration for *.gcr.io.
- If you have a non-default firewall or route setup, configure Private Google Access.
- If you use VPC Service Controls, set up Container Registry or Artifact Registry for GKE private clusters.
- Ensure that you have not deleted or modified the automatically created firewall rules for Ingress.
- If you use Shared VPC, ensure that you have configured the required IAM permissions.

kubelet Failed to create pod sandbox

Symptoms

After you create a private cluster, it reports an error similar to one of the following:

Warning  FailedCreatePodSandBox  12s (x9 over 4m)      kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.k8s.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Potential causes

The calico-node or netd Pod cannot reach *.gcr.io.

Resolution

Use the following solution:

Ensure you have completed the required setup for Container Registry or Artifact Registry.

Private cluster nodes created but not joining the cluster

When you use custom routing and third-party network appliances on the VPC that your private cluster uses, the default route (0.0.0.0/0) is often redirected to the appliance instead of the default internet gateway. In addition to control plane connectivity, you need to ensure that the following destinations are reachable:

- *.googleapis.com
- *.gcr.io
- gcr.io

Configure Private Google Access for all three domains. This best practice allows the new nodes to start up and join the cluster while keeping internet-bound traffic restricted.
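
The subnet-level part of that configuration can be sketched as follows. The function name is hypothetical. The arguments are the name and region of the subnet that your cluster nodes use. This sketch only turns on the subnet setting. The DNS and routing requirements for the three domains still need to be met separately.

```shell
# Hypothetical helper: enable Private Google Access on the cluster's subnet.
# $1 = subnet name, $2 = region
enable_private_google_access() {
  gcloud compute networks subnets update "$1" \
      --region="$2" \
      --enable-private-ip-google-access
}
```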

Workloads on private GKE clusters unable to access internet

Pods in private GKE clusters cannot access the internet. For example, running the apt update command from the Pod exec shell reports an error similar to the following:

0% [Connecting to deb.debian.org (199.232.98.132)] [Connecting to security.debian.org (151.101.130.132)]

If the subnet secondary IP address range used for Pods in the cluster is not configured on the Cloud NAT gateway, the Pods cannot connect to the internet because they don't have an external IP address configured on the Cloud NAT gateway.

Ensure you configure the Cloud NAT gateway to apply at least the following subnet IP address ranges for the subnet that your cluster uses:

- Subnet primary IP address range (used by nodes)
- Subnet secondary IP address range used for Pods in the cluster
- Subnet secondary IP address range used for Services in the cluster

To learn more, see how to add the secondary subnet IP range used for Pods.
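
One simple way to cover all three ranges is to have the gateway apply to every primary and secondary range of the subnets in its region. The following is a minimal sketch under that assumption. The function name is hypothetical, and the arguments are the names of your existing Cloud NAT gateway and Cloud Router and their region.

```shell
# Hypothetical helper: make an existing Cloud NAT gateway apply to all primary
# and secondary subnet ranges in its region.
# $1 = NAT gateway name, $2 = Cloud Router name, $3 = region
nat_cover_all_ranges() {
  gcloud compute routers nats update "$1" \
      --router="$2" \
      --region="$3" \
      --nat-all-subnet-ip-ranges
}
```

If you prefer to list ranges individually, the same command also accepts a custom list of subnet ranges instead of this flag.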