Troubleshooting and operations for Ingress for Anthos

The Anthos Ingress controller manages Compute Engine resources. MultiClusterIngress (MCI) and MultiClusterService (MCS) resources map to different Compute Engine resources, so understanding how they relate helps you troubleshoot. For example, examine the following MCI resource:

apiVersion: networking.gke.io/v1beta1
kind: MultiClusterIngress
metadata:
  name: foo-ingress
spec:
  template:
    spec:
      rules:
      - host: store.foo.com
        http:
          paths:
          - backend:
              serviceName: store-foo
              servicePort: 80
      - host: search.foo.com
        http:
          paths:
          - backend:
              serviceName: search-foo
              servicePort: 80

Compute Engine to Ingress for Anthos resource mappings

The table below shows the mapping of Hub resources to resources created in the Kubernetes clusters and Google Cloud:

| Kubernetes resource | Google Cloud resource | Description |
|---|---|---|
| MultiClusterIngress | Forwarding rule | HTTP(S) load balancer VIP. |
| | Target proxy | HTTP(S) termination settings taken from annotations and the TLS block. |
| | URL map | Virtual host path mapping from the rules section. |
| MultiClusterService | Kubernetes Service | Derived resource from template. |
| | Backend service | A backend service is created for each (Service, ServicePort) pair. |
| | Network endpoint groups | Set of backend Pods participating in the Service. |

The following diagram demonstrates the relationship between MCI/MCS and the Compute Engine load balancer resources across two member clusters:

[Diagram: MCI/MCS load balancer relationship]

Inspecting Compute Engine LB resources

After the load balancer is created, the MultiClusterIngress status contains the name of every Compute Engine resource that was created to construct it. For example:

Name:         shopping-service
Namespace:    prod
Labels:       <none>
Annotations:  <none>
API Version:  networking.gke.io/v1beta1
Kind:         MultiClusterIngress
Metadata:
  Creation Timestamp:  2019-07-16T17:23:14Z
  Finalizers:
    mci.finalizer.networking.gke.io
Spec:
  Template:
    Spec:
      Backend:
        Service Name:  shopping-service
        Service Port:  80
Status:
  VIP:  34.102.212.68
  CloudResources:
    Firewalls: "mci-l7"
    ForwardingRules: "mci-abcdef-myforwardingrule"
    TargetProxies: "mci-abcdef-mytargetproxy"
    UrlMap: "mci-abcdef-myurlmap"
    HealthChecks: "mci-abcdef-80-myhealthcheck"
    BackendServices: "mci-abcdef-80-mybackendservice"
    NetworkEndpointGroups: "k8s1-neg1", "k8s1-neg2", "k8s1-neg3"
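
To inspect one of these resources with gcloud, you first need its name from the status. The following is a minimal sketch, not part of Ingress for Anthos: `mci_cloud_resource` is a hypothetical helper that reads `kubectl describe mci` output on standard input and prints the names listed for one CloudResources key, assuming the output layout shown above.

```shell
# Hypothetical helper: print the resource names listed under one
# CloudResources key (for example UrlMap or NetworkEndpointGroups).
# Reads `kubectl describe mci` output from standard input.
mci_cloud_resource() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" |  # keep only the value after "KEY:"
    tr -d '"' |                                  # drop the surrounding quotes
    tr ',' '\n' |                                # put each name on its own line
    sed 's/^[[:space:]]*//'                      # trim leading spaces
}
```

For example, `kubectl describe mci shopping-service -n prod | mci_cloud_resource UrlMap` would print the URL map name, which you could then pass to `gcloud compute url-maps describe NAME --global`.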

VIP not created

If you do not see a VIP, then an error may have occurred during its creation. To see if an error did occur, run the following command:

kubectl describe mci shopping-service

The output may look similar to:

Name:         shopping-service
Namespace:    prod
Labels:       <none>
Annotations:  <none>
API Version:  networking.gke.io/v1beta1
Kind:         MultiClusterIngress
Metadata:
  Creation Timestamp:  2019-07-16T17:23:14Z
  Finalizers:
    mci.finalizer.networking.gke.io
Spec:
  Template:
    Spec:
      Backend:
        Service Name:  shopping-service
        Service Port:  80
Status:
  VIP:  34.102.212.68
Events:
  Type     Reason  Age   From                              Message
  ----     ------  ----  ----                              -------
  Warning  SYNC    29s   multi-cluster-ingress-controller  error translating MCI prod/shopping-service: exceeded 4 retries with final error: error translating MCI prod/shopping-service: multiclusterservice prod/shopping-service does not exist

In this example, the MultiClusterIngress references a MultiClusterService resource that was never created.

In most cases, the error message indicates the underlying issue. In case it does not, please contact gke-mci-feedback@google.com for assistance.
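
When the describe output is long, it can help to show only the warning events. The sketch below assumes the Events table layout shown above, where the first column is the event type; `describe_warnings` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only the Warning rows from the Events table
# of `kubectl describe` output read on standard input.
describe_warnings() {
  awk '$1 == "Warning"'
}
```

A possible invocation is `kubectl describe mci shopping-service -n prod | describe_warnings`.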

502 response

If your load balancer acquired a VIP but is consistently serving a 502 response, the load balancer health checks may be failing. Health checks could fail for two reasons:

  1. Application Pods are not healthy (see Cloud Console debugging for example).
  2. A misconfigured firewall is blocking Google health checkers from performing health checks.

In the case of #1, make sure that your application is in fact serving a 200 response on the "/" path.

In the case of #2, make sure that a firewall named "mci-default-l7" exists in your VPC. The Ingress controller creates the firewall in your VPC to ensure Google health checkers can reach your backends. If the firewall does not exist, make sure there is no external automation that deletes this firewall upon its creation.
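
The firewall check can be scripted. This is a sketch under assumptions: `check_mci_firewall` is a hypothetical helper, the default rule name is the one quoted above, and the `GCLOUD` variable is only a convenience for dry runs or tests, not a gcloud feature.

```shell
# Hypothetical helper: report whether the health check firewall rule exists.
# $1 is the rule name (defaults to the name mentioned in the text above).
# Set GCLOUD to substitute another command for gcloud, for example in tests.
check_mci_firewall() {
  fw_name="${1:-mci-default-l7}"
  if "${GCLOUD:-gcloud}" compute firewall-rules describe "$fw_name" >/dev/null 2>&1; then
    echo "firewall $fw_name exists"
  else
    echo "firewall $fw_name is missing"
    return 1
  fi
}
```

For the first cause, `gcloud compute backend-services get-health BACKEND_SERVICE --global` reports the health of each endpoint, where BACKEND_SERVICE is a name taken from the MultiClusterIngress status.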

Traffic not added to or removed from cluster

When you add a new Membership, traffic should reach the backends in the underlying cluster when applicable. Similarly, when a Membership is removed, no traffic should reach the backends in the underlying cluster. If you do not observe this behavior, check for errors on the MultiClusterIngress and MultiClusterService resources.

Common cases in which this error would occur include adding a new Membership on a GKE cluster that is not in VPC-native mode or adding a new Membership but not deploying an application in the GKE cluster.

  1. Describe the MultiClusterService:

    kubectl describe mcs zone-svc
    
  2. Describe the MultiClusterIngress:

    kubectl describe mci zone-mci
    

If the above commands do not reveal the error, contact gke-mci-feedback@google.com for assistance.

Config cluster migration

To understand more about the use cases for migration, see the Config cluster design concept.

Config cluster migration can be a disruptive operation if not handled correctly. Follow these guidelines when performing a config cluster migration:

  1. Make sure to use the static-ip annotation on your MCI resources. Failing to do so will result in disrupted traffic while migrating. Ephemeral IPs will be recreated when migrating config clusters.
  2. The MultiClusterIngress and MultiClusterService resources must be deployed identically to both the existing and the new config cluster. Any MCS or MCI resource that differs in the new config cluster will be reconciled after the migration, which changes the load balancer.
  3. Only a single config cluster is active at any time. Until the config cluster is changed, the MCI and MCS resources in the new config cluster will not impact LB resources.
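
One way to check the second guideline before migrating is to export the MCI and MCS specs from both config clusters and compare them. The following is a rough sketch, not an official procedure: `compare_spec_exports` is a hypothetical helper, CONTEXT is a placeholder for a kubeconfig context, and the export covers only the spec, so annotations such as the static IP still need a separate check.

```shell
# Export one line per resource (kind, namespace/name, spec) from a cluster:
#   kubectl --context CONTEXT get mci,mcs --all-namespaces \
#     -o jsonpath='{range .items[*]}{.kind}{" "}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.spec}{"\n"}{end}' \
#     > export.txt
# Run it once per config cluster, then compare the two files.

# Hypothetical helper: compare two export files regardless of line order.
compare_spec_exports() {
  old_sorted="$(sort "$1")"
  new_sorted="$(sort "$2")"
  if [ "$old_sorted" = "$new_sorted" ]; then
    echo "MCI and MCS specs match"
  else
    echo "MCI and MCS specs differ"
    return 1
  fi
}
```

If the helper reports a difference, running `diff` on the two sorted files shows which resources to reconcile by hand.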

To migrate the config cluster, run the following command, replacing PROJECT_ID and NEW_CONFIG_CLUSTER with your project ID and the Membership name of the new config cluster:

  gcloud alpha container hub ingress update \
    --config-membership=projects/PROJECT_ID/locations/global/memberships/NEW_CONFIG_CLUSTER

Verify the command worked by ensuring there are no visible errors in the Feature state:

  gcloud alpha container hub ingress describe

Console debugging

In most cases, checking the exact state of the load balancer is helpful when debugging an issue. You can find the load balancer by going to Load balancing in the Google Cloud Console.

Error/Warning Codes

Ingress for Anthos emits error and warning codes for known issues on MCI and MCS resources, as well as in the gcloud multiclusteringress Description field. The codes are documented so that you can tell what went wrong when something is not operating as expected. Each code has the format AVMBR123, where 123 is a unique number that identifies an error or warning, and each is listed here with suggestions on how to resolve it.
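
Because the codes share a fixed format, they are easy to pull out of command output. The sketch below assumes the codes appear verbatim in the text as AVMBR followed by three digits; `extract_avmbr_codes` is a hypothetical helper name.

```shell
# Hypothetical helper: list the distinct AVMBR codes found in any text
# read on standard input (describe output, Feature state, logs).
extract_avmbr_codes() {
  grep -o -E 'AVMBR[0-9]{3}' | sort -u
}
```

It could be used as `kubectl describe mci zone-mci | extract_avmbr_codes` or `gcloud alpha container hub ingress describe | extract_avmbr_codes`.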

AVMBR101: "Annotation [NAME] not recognized"

This error displays when an annotation specified on an MCI or MCS resource is not recognized. There are two reasons why the annotation may not be recognized:

  1. The annotation is not supported in Ingress for Anthos. This may be expected if you annotate resources that are not meant to be used by the Anthos Ingress controller.

  2. The annotation is supported, but is misspelled and thus not recognized.

In both cases, refer to the documentation for the supported annotations and how to specify them.

AVMBR102: "[RESOURCE_NAME] not found"

This error displays when a resource referenced by an MCI or MCS cannot be found in the Config Membership. For example, this error is thrown when an MCI refers to an MCS that cannot be found or an MCS refers to a BackendConfig that cannot be found. There are several reasons why a resource might not be found:

  1. It is not in the proper namespace. Ensure that resources which reference each other are all in the same namespace.
  2. The resource name is misspelled.
  3. The resource does not exist with the given namespace and name. In this case, create it.

AVMBR103: "[CLUSTER_SELECTOR] is invalid"

This error displays when a cluster selector specified on an MCS is invalid. There are two reasons why this selector could be invalid:

  1. The provided string contains a typo.
  2. The provided string refers to a Membership that no longer exists in the Hub.

AVMBR104: "Cannot find NEGs for Service Port [SERVICE_PORT]"

This error is thrown when the NetworkEndpointGroups (NEGs) for a given MCS Service and port pair cannot be found. NEGs are the resources that contain the Pod endpoints in each of your backend clusters. The main reason why the NEGs may not exist is an error creating or updating the derived Services in your backend clusters. Check the Events on your MCS resource for more information.

AVMBR105: "Missing Anthos license."

This error is displayed under Feature state. It means that the Anthos API (anthos.googleapis.com) is not enabled.

AVMBR106: "Derived service is invalid: [REASON]."

This error is displayed in the events of the MultiClusterService resource. One common reason for this error is that the Service resource derived from the MultiClusterService has an invalid Spec.

For example, this MultiClusterService does not define any ServicePort in its Spec:

```yaml
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: zone-mcs
  namespace: zoneprinter
spec:
  clusters:
  - link: "us-central1-a/gke-us"
  - link: "europe-west1-c/gke-eu"
```
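
For comparison, a version of this resource that defines a port might look like the sketch below. The selector label and the port numbers are illustrative assumptions, not values from the example above, and `print_zone_mcs` is a hypothetical helper that simply prints the manifest.

```shell
# Hypothetical helper: print a MultiClusterService manifest that includes
# a Service template with one port. The app label and port 8080 are
# placeholders; replace them with the values your workload uses.
print_zone_mcs() {
  cat <<'EOF'
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: zone-mcs
  namespace: zoneprinter
spec:
  template:
    spec:
      selector:
        app: zoneprinter
      ports:
      - name: web
        protocol: TCP
        port: 8080
        targetPort: 8080
  clusters:
  - link: "us-central1-a/gke-us"
  - link: "europe-west1-c/gke-eu"
EOF
}
```

The output can be reviewed first and then applied to the config cluster with `print_zone_mcs | kubectl apply -f -`.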

AVMBR107: "Missing GKE cluster resource link in Membership."

This error is displayed under Feature state. It means that there is no GKE cluster underlying the Membership resource. You can verify this by running

gcloud container hub memberships describe membership-name

and confirming that no GKE cluster resource link appears under the endpoint field.
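
The same check can be narrowed to the one field of interest. This is a sketch under assumptions: `membership_cluster_link` is a hypothetical helper, the field path `endpoint.gkeCluster.resourceLink` is assumed to match the Membership output, and `GCLOUD` is only a convenience for substituting the command in tests.

```shell
# Hypothetical helper: report whether a Membership has a GKE cluster
# resource link. $1 is the Membership name.
membership_cluster_link() {
  link="$("${GCLOUD:-gcloud}" container hub memberships describe "$1" \
    --format='value(endpoint.gkeCluster.resourceLink)')"
  if [ -n "$link" ]; then
    echo "linked to $link"
  else
    echo "no GKE cluster resource link"
    return 1
  fi
}
```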

AVMBR108: "GKE cluster [NAME] not found."

This error is displayed under Feature state and thrown if the underlying GKE cluster for the Membership does not exist.

AVMBR109: "[NAME] is not a VPC-native GKE cluster."

This error is displayed under Feature state. It is thrown if the specified GKE cluster is a routes-based cluster. The Ingress for Anthos controller creates a container-native load balancer using NEGs, and clusters must be VPC-native to use container-native load balancing.

For more information, see Creating a VPC-native cluster.
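
To confirm which mode a cluster uses, you can read its IP allocation policy. The following sketch assumes a zonal cluster and that the `ipAllocationPolicy.useIpAliases` field is set only on VPC-native clusters; `cluster_network_mode` is a hypothetical helper, and `GCLOUD` exists only so the command can be substituted in tests.

```shell
# Hypothetical helper: print the network mode of a zonal GKE cluster.
# $1 is the cluster name, $2 is its zone.
cluster_network_mode() {
  if ! aliases="$("${GCLOUD:-gcloud}" container clusters describe "$1" --zone "$2" \
      --format='value(ipAllocationPolicy.useIpAliases)')"; then
    echo "lookup failed"
    return 2
  fi
  case "$aliases" in
    [Tt]rue) echo "VPC-native" ;;
    *)       echo "routes-based" ;;
  esac
}
```

For a regional cluster, the `--zone` flag would need to be replaced with `--region`.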

AVMBR110: "[IAM_PERMISSION] permission missing for GKE cluster [NAME]."

This error is displayed under Feature state. There are two possible reasons for this error:

  1. The underlying GKE cluster for the Membership is located in a different project from the Membership itself.
  2. The specified IAM permission was removed from the MultiClusterIngress Service Agent.

AVMBR111: "Failed to get Config Membership: [REASON]."

This error is displayed under Feature state. The most common cause is that the Config Membership was deleted while the Feature was enabled.

You should never need to delete the Config Membership. If you would like to change it, then please follow the config cluster migration steps.

AVMBR112: "HTTPLoadBalancing Addon is disabled in GKE Cluster [NAME]."

This error is displayed under Feature state and occurs when the HTTPLoadBalancing addon is disabled in a GKE cluster. To fix it, update your GKE cluster to enable the HTTPLoadBalancing addon:

gcloud container clusters update name --update-addons=HttpLoadBalancing=ENABLED

AVMBR113: "This resource is orphaned."

In some cases, the usefulness of a resource depends on it being referenced by another resource. This error is thrown when a Kubernetes resource is created but is not referenced by another resource. For example, you will see this error if you create a BackendConfig resource that is not referenced by a MultiClusterService.