Standalone Network Endpoint Groups sync fails

Problem

After you upgrade a Google Kubernetes Engine cluster, the standalone Network Endpoint Groups (NEGs) are not being populated with endpoints from Google Kubernetes Engine services. 

Previously, a NEG was created with CUSTOM_NAME. The NEG was added to the backend service of a load balancer, and the cloud.google.com/neg annotation was specified in the service definition with NEG_NAME=CUSTOM_NAME. Google Kubernetes Engine would then populate the NEG with the service endpoints.
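For illustration, the annotation value described above can be generated as in the sketch below. The helper name neg_annotation and the port 80 are hypothetical and used only for this example; the JSON shape follows the documented exposed_ports form of the cloud.google.com/neg annotation.

```python
import json


def neg_annotation(port: int, neg_name: str) -> dict:
    """Build the Service annotation that exposes one port through a named NEG.

    `port` and `neg_name` are illustrative inputs; use the values from your
    own service definition.
    """
    value = {"exposed_ports": {str(port): {"name": neg_name}}}
    # The annotation value is a JSON string, not a nested object.
    return {"cloud.google.com/neg": json.dumps(value)}


if __name__ == "__main__":
    annotation = neg_annotation(80, "CUSTOM_NAME")
    print(annotation["cloud.google.com/neg"])
```

The resulting key and value go under metadata.annotations in the service definition.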

Currently, when you run the command kubectl describe svc neg_svc, you get the following error message:
Failed to sync NEG_NAME (will not retry): neg name NEG_NAME is already in use, found a custom named neg with an empty description.

Environment

  • Google Kubernetes Engine v1.18.19
  • Google Kubernetes Engine v1.19.19+
  • Standalone NEG

Solution

Workaround

  1. Recreate the NEG manually, this time with the description that the NEG Controller expects. The NEG Controller will then populate the NEG with the service endpoints.
    • Note: The description of an existing NEG cannot be updated, so this workaround only applies when the NEG is populated from a fixed combination of [cluster, service name, namespace, port] that does not change.
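The following is a minimal sketch of how the description for the recreated NEG might be assembled. The JSON key names (cluster-uid, namespace, service-name, port) are an assumption and are not confirmed by this article; compare them against the description of a NEG that the controller created itself before relying on them. The function names and sample values are hypothetical. The body fields name, description and networkEndpointType belong to the Compute Engine NetworkEndpointGroup resource; the network and subnetwork fields are omitted here and would be needed in a real request.

```python
import json


def expected_neg_description(cluster_uid: str, namespace: str,
                             service_name: str, port: int) -> str:
    """Return a description string identifying one cluster/service/port.

    ASSUMPTION: the key names below are illustrative; verify the exact
    format against a controller-created NEG in your environment.
    """
    return json.dumps(
        {
            "cluster-uid": cluster_uid,
            "namespace": namespace,
            "service-name": service_name,
            "port": str(port),
        },
        sort_keys=True,
    )


def neg_insert_body(neg_name: str, description: str) -> dict:
    """Build a partial request body for creating a zonal NEG."""
    return {
        "name": neg_name,
        # The description must be set at creation time; it cannot be
        # changed on an existing NEG.
        "description": description,
        "networkEndpointType": "GCE_VM_IP_PORT",
    }


if __name__ == "__main__":
    desc = expected_neg_description("example-cluster-uid", "default", "neg_svc", 80)
    print(json.dumps(neg_insert_body("CUSTOM_NAME", desc), indent=2))
```

Because the description is fixed once the NEG exists, any change to the cluster, service name, namespace or port means that the NEG has to be recreated again.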

Cause

The check that produces this error was introduced as a fix because different Google Kubernetes Engine clusters could conflict when adding their service endpoints to the same NEG if they specified the same NEG_NAME and were in the same zone. The NEG Controller now reads the NEG_NAME specified in the service definition and verifies whether a NEG with that name exists in the endpoints' zone. If it does, the controller checks whether the NEG description contains the cluster's UUID, SERVICE_NAME, NAMESPACE and PORT that match the service definition. If the description matches, the NEG Controller populates the NEG with endpoints; otherwise, it does not.
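The ownership check described above can be sketched roughly as follows. This is an illustration of the logic only, not the NEG Controller's actual code: the ServicePort type, the function name and the JSON key names in the description are assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePort:
    """Identity that the controller expects to find in the NEG description."""
    cluster_uid: str
    namespace: str
    service_name: str
    port: str


def description_matches(description: str, want: ServicePort) -> bool:
    """Return True only if an existing NEG's description names this service port.

    ASSUMPTION: the description is a JSON object with the keys used below.
    """
    if not description:
        # A custom-named NEG with an empty description is treated as
        # belonging to someone else, which is the case in this article.
        return False
    try:
        found = json.loads(description)
    except json.JSONDecodeError:
        return False
    if not isinstance(found, dict):
        return False
    return (
        found.get("cluster-uid") == want.cluster_uid
        and found.get("namespace") == want.namespace
        and found.get("service-name") == want.service_name
        and found.get("port") == want.port
    )


if __name__ == "__main__":
    want = ServicePort("example-cluster-uid", "default", "neg_svc", "80")
    print(description_matches("", want))
```

In this sketch, a NEG created by hand without a description fails the check, as does a NEG whose description names a different cluster, which is the conflict the change was meant to prevent.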