Control communication between Pods and Services using network policies


This page explains how to control communication between your cluster's Pods and Services using GKE's network policy enforcement.

You can also control Pods' egress traffic to any endpoint or Service outside of the cluster using fully qualified domain name (FQDN) network policies. For more information, see Control communication between Pods and Services using FQDNs.

About GKE network policy enforcement

Network policy enforcement lets you create Kubernetes Network Policies in your cluster. By default, all Pods within a cluster can communicate with each other freely. Network policies create Pod-level firewall rules that determine which Pods and Services can access one another inside your cluster.

Defining network policy helps you enable things like defense in depth when your cluster is serving a multi-level application. For example, you can create a network policy to ensure that a compromised front-end service in your application cannot communicate directly with a billing or accounting service several levels down.

Network policy can also make it easier for your application to host data from multiple users simultaneously. For example, you can provide secure multi-tenancy by defining a tenant-per-namespace model. In such a model, network policy rules can ensure that Pods and Services in a given namespace cannot access other Pods or Services in a different namespace.
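
As a minimal sketch of the tenant-per-namespace model described above, the following manifest allows Pods in one namespace to receive traffic only from Pods in the same namespace. The namespace name tenant-a and the policy name are hypothetical, and each tenant namespace would need its own copy of the policy.

```yaml
# Hypothetical example: the namespace "tenant-a" and the policy name are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: tenant-a
spec:
  # An empty podSelector selects every Pod in the tenant-a namespace.
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    # A podSelector without a namespaceSelector matches only Pods in the
    # policy's own namespace, so ingress from other namespaces is not allowed.
    - podSelector: {}
```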

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Requirements and limitations

The following requirement applies to both Autopilot and Standard clusters:

  • You must allow egress to the metadata server if you use network policy with Workload Identity Federation for GKE.

The following requirements and limitations only apply to Standard clusters:

  • Enabling network policy enforcement increases the memory footprint of the kube-system process by approximately 128 MB, and requires approximately 300 millicores of CPU. This means that if you enable network policies for an existing cluster, you might need to increase the cluster's size to continue running your scheduled workloads.
  • Enabling network policy enforcement requires that your nodes be re-created. If your cluster has an active maintenance window, your nodes are not automatically re-created until the next maintenance window. If you prefer, you can manually upgrade your cluster at any time.
  • The recommended minimum cluster size to run network policy enforcement is three e2-medium instances.
  • Network policy is not supported for clusters whose nodes are f1-micro or g1-small instances, as the resource requirements are too high.

For more information about node machine types and allocatable resources, see Standard cluster architecture - Nodes.

Enable network policy enforcement

Network policy enforcement is enabled by default for Autopilot clusters, so you can skip to Create a network policy.

You can enable network policy enforcement in Standard by using the gcloud CLI, the Google Cloud console, or the GKE API.

Network policy enforcement is built into GKE Dataplane V2. You do not need to enable network policy enforcement in clusters that use GKE Dataplane V2.

This change requires re-creating the nodes, which can disrupt your running workloads. For details about this specific change, find the corresponding row in the table of manual changes that recreate the nodes using a node upgrade strategy and respecting maintenance policies. To learn more about node updates, see Planning for node update disruptions.

gcloud

  1. In the Google Cloud console, activate Cloud Shell.

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. To enable network policy enforcement when creating a new cluster, run the following command:

    gcloud container clusters create CLUSTER_NAME --enable-network-policy
    

    Replace CLUSTER_NAME with the name of the new cluster.

    To enable network policy enforcement for an existing cluster, perform the following tasks:

    1. Run the following command to enable the add-on:

      gcloud container clusters update CLUSTER_NAME --update-addons=NetworkPolicy=ENABLED
      

      Replace CLUSTER_NAME with the name of the cluster.

    2. Run the following command to enable network policy enforcement on your cluster, which in turn re-creates your cluster's node pools with network policy enforcement enabled:

      gcloud container clusters update CLUSTER_NAME --enable-network-policy
      

Console

To enable network policy enforcement when creating a new cluster:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

  2. Click Create.

  3. In the Create cluster dialog, for GKE Standard, click Configure.

  4. Configure your cluster as needed.

  5. From the navigation pane, under Cluster, click Networking.

  6. Select the Enable network policy checkbox.

  7. Click Create.

To enable network policy enforcement for an existing cluster:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Under Networking, in the Network policy field, click Edit network policy.

  4. Select the Enable network policy for master checkbox and click Save Changes.

  5. Wait for your changes to apply, and then click Edit network policy again.

  6. Select the Enable network policy for nodes checkbox.

  7. Click Save Changes.

API

To enable network policy enforcement, perform the following:

  1. Specify the networkPolicy object inside the cluster object that you provide to projects.zones.clusters.create or projects.zones.clusters.update.

  2. The networkPolicy object requires an enum that specifies which network policy provider to use, and a boolean value that specifies whether to enable network policy. If you enable network policy but do not set the provider, the create and update methods return an error.

Disable network policy enforcement in a Standard cluster

You can disable network policy enforcement by using the gcloud CLI, the Google Cloud console, or the GKE API. You cannot disable network policy enforcement in Autopilot clusters or clusters that use GKE Dataplane V2.

This change requires re-creating the nodes, which can disrupt your running workloads. For details about this specific change, find the corresponding row in the table of manual changes that recreate the nodes using a node upgrade strategy and respecting maintenance policies. To learn more about node updates, see Planning for node update disruptions.

gcloud

  1. In the Google Cloud console, activate Cloud Shell.

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  2. To disable network policy enforcement, perform the following tasks:

    1. Disable network policy enforcement on your cluster:

      gcloud container clusters update CLUSTER_NAME --no-enable-network-policy
    

    Replace CLUSTER_NAME with the name of the cluster.

    After you run this command, GKE re-creates your cluster node pools with network policy enforcement disabled.

  3. Verify that all your nodes were re-created:

    kubectl get nodes -l projectcalico.org/ds-ready=true
    

    If the operation is successful, the output is similar to the following:

    No resources found
    

    If the output is similar to the following, then you must wait for GKE to finish updating the node pools:

    NAME                                             STATUS                     ROLES    AGE     VERSION
    gke-calico-cluster2-default-pool-bd997d68-pgqn   Ready,SchedulingDisabled   <none>   15m     v1.22.10-gke.600
    gke-calico-cluster2-np2-c4331149-2mmz            Ready                      <none>   6m58s   v1.22.10-gke.600
    

    When you disable network policy enforcement, GKE might not update the nodes immediately if your cluster has a configured maintenance window or exclusion. For more information, see Cluster slow to update.

  4. After all of the nodes are re-created, disable the add-on:

    gcloud container clusters update CLUSTER_NAME --update-addons=NetworkPolicy=DISABLED
    

Console

To disable network policy enforcement for an existing cluster, perform the following:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Under Networking, in the Network policy field, click Edit network policy.

  4. Clear the Enable network policy for nodes checkbox and click Save Changes.

  5. Wait for your changes to apply, and then click Edit network policy again.

  6. Clear the Enable network policy for master checkbox.

  7. Click Save Changes.

API

To disable network policy enforcement for an existing cluster, do the following:

  1. Update your cluster to use networkPolicy.enabled: false using the setNetworkPolicy API.

  2. Verify that all your nodes were re-created by using kubectl:

    kubectl get nodes -l projectcalico.org/ds-ready=true
    

    If the operation is successful, the output is similar to the following:

    No resources found
    

    If the output is similar to the following, then you must wait for GKE to finish updating the node pools:

    NAME                                             STATUS                     ROLES    AGE     VERSION
    gke-calico-cluster2-default-pool-bd997d68-pgqn   Ready,SchedulingDisabled   <none>   15m     v1.22.10-gke.600
    gke-calico-cluster2-np2-c4331149-2mmz            Ready                      <none>   6m58s   v1.22.10-gke.600
    

    When you disable network policy enforcement, GKE might not update the nodes immediately if your cluster has a configured maintenance window or exclusion. For more information, see Cluster slow to update.

  3. Update your cluster to use update.desiredAddonsConfig.NetworkPolicyConfig.disabled: true using the updateCluster API.

Create a network policy

You can create a network policy using the Kubernetes Network Policy API.

For further details on creating a network policy, see the network policy topics in the Kubernetes documentation.
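
For illustration, the following manifest is a minimal sketch of a network policy that lets Pods labeled app=nginx receive traffic only from Pods labeled app=store in the same namespace. The policy name is hypothetical, and the labels are assumptions chosen to match the sample output in the Troubleshooting section.

```yaml
# Hypothetical example: the policy name is an assumption; replace the labels
# with the labels that your own workloads use.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nginx-allow-from-store
spec:
  # The policy applies to Pods that have the app=nginx label.
  podSelector:
    matchLabels:
      app: nginx
  policyTypes:
  - Ingress
  ingress:
  # No ports are listed, so the allowed Pods can reach every port.
  - from:
    - podSelector:
        matchLabels:
          app: store
```

You would typically save the manifest to a file and apply it with kubectl apply -f, and then inspect the result with kubectl describe networkpolicy.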

Network policy and Workload Identity Federation for GKE

If you use network policy with Workload Identity Federation for GKE, you must allow egress to the following IP addresses so your Pods can communicate with the GKE metadata server.

  • For clusters running GKE version 1.21.0-gke.1000 and later, allow egress to 169.254.169.252/32 on port 988.
  • For clusters running GKE versions earlier than 1.21.0-gke.1000, allow egress to 127.0.0.1/32 on port 988.
  • For clusters running GKE Dataplane V2, allow egress to 169.254.169.254/32 on port 80.

If you don't allow egress to these IP addresses and ports, you might experience disruptions during auto-upgrades.
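
The following manifest is a sketch of an egress policy that allows the metadata server addresses listed above for clusters running GKE version 1.21.0-gke.1000 and later, with or without GKE Dataplane V2. The policy name, the use of TCP, and the choice to select every Pod in the namespace are assumptions. Because the policy declares the Egress type, the selected Pods can't reach any other destination (including DNS) unless another rule or policy allows it.

```yaml
# Hypothetical example: the policy name and the TCP protocol are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metadata-server-egress
spec:
  # An empty podSelector selects every Pod in the policy's namespace.
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  # Clusters running GKE version 1.21.0-gke.1000 and later.
  - to:
    - ipBlock:
        cidr: 169.254.169.252/32
    ports:
    - protocol: TCP
      port: 988
  # Clusters running GKE Dataplane V2.
  - to:
    - ipBlock:
        cidr: 169.254.169.254/32
    ports:
    - protocol: TCP
      port: 80
```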

Migrating from Calico to GKE Dataplane V2

If you migrate your network policies from Calico to GKE Dataplane V2, consider the following limitations:

  • You cannot use a Pod or Service IP address in the ipBlock.cidr field of a NetworkPolicy manifest. You must reference workloads using labels. For example, the following configuration is invalid:

    - ipBlock:
        cidr: 10.8.0.6/32
    
  • You cannot specify an empty ports.port field in a NetworkPolicy manifest. If you specify a protocol, you must also specify a port. For example, the following configuration is invalid:

    ingress:
    - ports:
      - protocol: TCP
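
By contrast, the following sketch shows one way to express the same intent in a form that avoids both limitations: the peer is selected by label instead of by IP address, and the protocol is paired with an explicit port. The policy name, labels, and port number are hypothetical.

```yaml
# Hypothetical example: the policy name, labels, and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dataplane-v2-compatible
spec:
  podSelector:
    matchLabels:
      app: nginx
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Reference the workload by label instead of by Pod or Service IP address.
    - podSelector:
        matchLabels:
          app: store
    ports:
    # The protocol is specified together with a port.
    - protocol: TCP
      port: 80
```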
    

Working with Application Load Balancers

When an Ingress is applied to a Service to build an Application Load Balancer, you must configure the network policy applied to Pods behind that Service to allow the appropriate Application Load Balancer health check IP ranges. If you are using an internal Application Load Balancer, you must also configure the network policy to allow the proxy-only subnet.

If you are not using container-native load balancing with network endpoint groups, node ports for a Service might forward connections to Pods on other nodes unless they are prevented from doing so by setting externalTrafficPolicy to Local in the Service definition. If externalTrafficPolicy is not set to Local, the network policy must also allow connections from other node IPs in the cluster.
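
As a sketch of the health check requirement, the following manifest allows ingress from the health check ranges that Google Cloud documents for Application Load Balancers (35.191.0.0/16 and 130.211.0.0/22 at the time of writing; confirm the current ranges in the load balancing documentation). The policy name, the app=store label, the serving port 8080, and the PROXY_ONLY_SUBNET_RANGE placeholder are assumptions.

```yaml
# Hypothetical example: the policy name, Pod label, and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-lb-health-checks
spec:
  podSelector:
    matchLabels:
      app: store
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Health check probe ranges; verify them against current documentation.
    - ipBlock:
        cidr: 35.191.0.0/16
    - ipBlock:
        cidr: 130.211.0.0/22
    # Internal Application Load Balancers only: replace the placeholder with
    # the CIDR range of your proxy-only subnet, or remove this entry.
    - ipBlock:
        cidr: PROXY_ONLY_SUBNET_RANGE
    ports:
    - protocol: TCP
      port: 8080
```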

Inclusion of Pod IP ranges in ipBlock rules

To control traffic for specific Pods, always select Pods by their namespace or Pod labels by using the namespaceSelector and podSelector fields in your NetworkPolicy ingress or egress rules. Don't use the ipBlock.cidr field to intentionally select Pod IP addresses, which are ephemeral in nature. The Kubernetes project doesn't explicitly define the behavior of the ipBlock.cidr field when it includes Pod IP address ranges. Specifying broad CIDR ranges in this field, like 0.0.0.0/0 (which includes the Pod IP address ranges), might have unexpected results in different implementations of NetworkPolicy.

The following sections describe how the different implementations of NetworkPolicy in GKE evaluate the IP address ranges that you specify in the ipBlock.cidr field, and how that might impact Pod IP address ranges that are inherently included in broad CIDR ranges. Understanding the different behavior between implementations will help you to prepare for the results when you migrate to another implementation.

ipBlock behavior in GKE Dataplane V2

With the GKE Dataplane V2 implementation of NetworkPolicy, Pod traffic is never covered by an ipBlock rule. Therefore, even if you define a broad rule such as cidr: '0.0.0.0/0', the rule does not include Pod traffic. This is useful because it lets you, for example, allow Pods in a namespace to receive traffic from the internet without also allowing traffic from Pods. To also include Pod traffic, select Pods explicitly by using an additional Pod or namespace selector in the ingress or egress rule definitions of the NetworkPolicy.
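
The following sketch shows how a policy on GKE Dataplane V2 might combine a broad ipBlock rule with an explicit selector so that Pod traffic is also allowed. The policy name is hypothetical, and the empty namespaceSelector (which matches Pods in every namespace) is an assumption chosen for illustration; a narrower selector is usually preferable.

```yaml
# Hypothetical example: the policy name and selectors are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internet-and-pod-traffic
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    # On GKE Dataplane V2, this broad range does not cover Pod traffic.
    - ipBlock:
        cidr: '0.0.0.0/0'
    # Pod traffic is therefore allowed explicitly. An empty namespaceSelector
    # matches all namespaces, so it selects Pods in the whole cluster.
    - namespaceSelector: {}
```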

ipBlock behavior in Calico

For the Calico implementation of NetworkPolicy, the ipBlock rules do cover Pod traffic. With this implementation, to configure a broad CIDR range without allowing Pod traffic, explicitly exclude the cluster's Pod CIDR range, like in the following example:

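apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-non-pod-traffic
spec:
  # An empty podSelector applies the policy to every Pod in the namespace.
  podSelector: {}
  ingress:
  - from:
    - ipBlock:
        # Allow all sources except the cluster's Pod IP address range.
        cidr: '0.0.0.0/0'
        except: ['POD_IP_RANGE']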

In this example, POD_IP_RANGE is your cluster's Pod IPv4 address range, for example 10.95.0.0/17. If you have multiple IP ranges, these can be included individually in the array, for example ['10.95.0.0/17', '10.108.128.0/17'].

Troubleshooting

Pods can't communicate with control plane on clusters that use Private Service Connect

Pods on GKE clusters that use Private Service Connect might experience a communication issue with the control plane if the Pod's egress to the control plane's internal IP address is restricted in egress network policies.

To mitigate this issue:

  1. Confirm that your cluster uses Private Service Connect. On clusters that use Private Service Connect, if you use the master-ipv4-cidr flag when creating the subnet, GKE assigns each control plane an internal IP address from the values you defined in master-ipv4-cidr. Otherwise, GKE uses the cluster node subnet to assign each control plane an internal IP address.

  2. Configure your cluster's egress policy to allow traffic to the control plane's internal IP address.

    To find the control plane's internal IP address:

    gcloud

    To look for privateEndpoint, run the following command:

    gcloud container clusters describe CLUSTER_NAME
    

    Replace CLUSTER_NAME with the name of the cluster.

    The output of this command includes the privateEndpoint of the specified cluster.

    Console

    1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    2. From the navigation pane, under Clusters, click the cluster whose internal IP address you want to find.

    3. Under Cluster basics, navigate to Internal endpoint, where the internal IP address is listed.

    After you locate the privateEndpoint or Internal endpoint value, configure your cluster's egress policy to allow traffic to the control plane's internal IP address. For more information, see Create a network policy.
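
As a minimal sketch of such an egress rule, the following manifest allows the selected Pods to reach the control plane's internal IP address. CONTROL_PLANE_IP is a placeholder for the privateEndpoint or Internal endpoint value; the policy name, the empty podSelector, and TCP port 443 (the usual HTTPS port of the Kubernetes API server) are assumptions.

```yaml
# Hypothetical example: the policy name, selector, and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-control-plane-egress
spec:
  # An empty podSelector selects every Pod in the policy's namespace.
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        # Replace CONTROL_PLANE_IP with the control plane's internal IP address.
        cidr: CONTROL_PLANE_IP/32
    ports:
    - protocol: TCP
      port: 443
```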

Cluster slow to update

When you enable or disable network policy enforcement on an existing cluster, GKE might not update the nodes immediately if the cluster has a configured maintenance window or exclusion.

You can manually upgrade a node pool by setting the --cluster-version flag to the same GKE version that the control plane is running. You must use the Google Cloud CLI to perform this operation. For more information, see caveats for maintenance windows.

Manually deployed Pods unscheduled

When you enable network policy enforcement on the control plane of an existing cluster, GKE unschedules any ip-masquerade-agent or calico node Pods that you manually deployed.

GKE does not reschedule these Pods until network policy enforcement is enabled on the cluster nodes and the nodes are recreated.

If you have configured a maintenance window or exclusion, this might cause an extended disruption.

To minimize the duration of this disruption, you can manually assign the following labels to the cluster nodes:

  • node.kubernetes.io/masq-agent-ds-ready=true
  • projectcalico.org/ds-ready=true

Network policy not taking effect

If a NetworkPolicy is not taking effect, you can troubleshoot using the following steps:

  1. Confirm that network policy enforcement is enabled. The command that you use depends on whether your cluster has GKE Dataplane V2 enabled.

    If your cluster has GKE Dataplane V2 enabled, run the following command:

    kubectl -n kube-system get pods -l k8s-app=cilium
    

    If the output is empty, network policy enforcement is not enabled.

    If your cluster does not have GKE Dataplane V2 enabled, run the following command:

    kubectl get nodes -l projectcalico.org/ds-ready=true
    

    If the output is empty, network policy enforcement is not enabled.

  2. Check the Pod labels:

    kubectl describe pod POD_NAME
    

    Replace POD_NAME with the name of the Pod.

    The output is similar to the following:

    Labels:        app=store
                   pod-template-hash=64d9d4f554
                   version=v1
    
  3. Confirm that the labels on the policy match the labels on the Pod:

    kubectl describe networkpolicy
    

    The output is similar to the following:

    PodSelector: app=store
    

    In this output, the app=store label in the policy's Pod selector matches the app=store label on the Pod from the previous step.

  4. Check if there are any network policies selecting your workloads:

    kubectl get networkpolicy
    

    If the output is empty, no NetworkPolicy was created in the namespace and nothing is selecting your workloads. If the output is not empty, check if the policy selects your workloads:

    kubectl describe networkpolicy
    

    The output is similar to the following:

    ...
    PodSelector:     app=nginx
    Allowing ingress traffic:
       To Port: <any> (traffic allowed to all ports)
       From:
          PodSelector: app=store
    Not affecting egress traffic
    Policy Types: Ingress
    

Known issues

StatefulSet Pod termination with Calico

GKE clusters with Calico network policy enabled might experience an issue where a StatefulSet Pod drops existing connections when the Pod is deleted. After a Pod enters the Terminating state, the terminationGracePeriodSeconds configuration in the Pod spec is not honored, which disrupts other applications that have an existing connection with the StatefulSet Pod. For more information about this issue, see Calico issue #4710.

This issue affects the following GKE versions:

  • 1.18
  • 1.19 to 1.19.16-gke.99
  • 1.20 to 1.20.11-gke.1299
  • 1.21 to 1.21.4-gke.1499

To mitigate this issue, upgrade your GKE control plane to one of the following versions:

  • 1.19.16-gke.100 or later
  • 1.20.11-gke.1300 or later
  • 1.21.4-gke.1500 or later

Pod stuck in ContainerCreating state

GKE clusters with Calico network policy enabled might experience an issue where Pods get stuck in the ContainerCreating state.

Under the Pod Events tab, you see a message similar to the following:

plugin type="calico" failed (add): ipAddrs is not compatible with
configured IPAM: host-local

To mitigate this issue, use host-local IPAM for Calico instead of calico-ipam in GKE clusters.
