API proxy deployments fail with apigee-serving-cert is not found or expired

Symptoms

API proxy deployments fail with the following error messages.

Error Messages

If the TLS certificate of the apigee-webhook-service.apigee-system.svc service has expired or is not yet valid, an error message similar to the following appears in the apigee-watcher logs:

{"level":"error","ts":1687991930.7745812,"caller":"watcher/watcher.go:60",
"msg":"error during watch","name":"ingress","error":"INTERNAL: INTERNAL: failed
to update ApigeeRoute [org-env]-group-84a6bb5, namespace apigee:
Internal error occurred: failed calling webhook
\"mapigeeroute.apigee.cloud.google.com\": Post
\"https://apigee-webhook-service.apigee-system.svc:443/mutate-apigee-cloud-google-com-v1alpha1-apigeeroute?timeout=30s\":
x509:
certificate has expired or is not yet valid: current time
2023-06-28T22:38:50Z is after 2023-06-17T17:14:13Z, INTERNAL: failed to update
ApigeeRoute [org-env]-group-e7b3ff6, namespace apigee 
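
To confirm quickly that watcher logs contain this error, the two timestamps can be pulled out of the message. The following is a minimal sketch, not an official tool: the function name extract_x509_expiry is hypothetical, and it assumes the error text has the "current time ... is after ..." form shown above. It joins wrapped lines and reports only the last match in its input.

```shell
#!/bin/sh
# Hypothetical helper: print the "current time" and expiry timestamps from an
# x509 expiry error read on standard input (for example, apigee-watcher logs).
extract_x509_expiry() {
  # Join lines first, because the message may be wrapped across lines.
  tr '\n' ' ' |
    sed -n 's/.*current time *\([0-9TZ:.-]*\) is after *\([0-9TZ:.-]*\).*/now=\1 expired=\2/p'
}

# Example usage (the namespace and pod name are assumptions; substitute your own):
#   kubectl -n apigee logs APIGEE_WATCHER_POD | extract_x509_expiry
```

For the sample message above, the helper would report the current time 2023-06-28T22:38:50Z and the expiry time 2023-06-17T17:14:13Z.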

Possible Causes

Cause | Description
The apigee-serving-cert is not found | If the apigee-serving-cert is not found in the apigee-system namespace, this issue could occur.
Duplicate certificate requests were created for renewing apigee-serving-cert | If there are duplicate certificate requests created for renewing the apigee-serving-cert certificate, the apigee-serving-cert certificate may not get renewed.
cert-manager is not healthy | If cert-manager is not healthy, the apigee-serving-cert certificate may not get renewed.

Cause: The apigee-serving-cert is not found

Diagnosis

  1. Check the availability of the apigee-serving-cert certificate in the apigee-system namespace:

    kubectl -n apigee-system get certificates apigee-serving-cert
    

    If this certificate is available, output similar to the following is displayed:

    NAME                  READY   SECRET                AGE
    apigee-serving-cert   True    webhook-server-cert   2d10h
  2. If the apigee-serving-cert certificate is not found in the apigee-system namespace, its absence is the likely cause of this issue.
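
The check above can also be scripted. The following sketch wraps the same kubectl command and reads the Ready condition with a jsonpath filter. The function name serving_cert_status and its messages and return codes are illustrative assumptions, not part of Apigee or cert-manager.

```shell
#!/bin/sh
# Hypothetical helper: report whether the apigee-serving-cert Certificate
# exists in the apigee-system namespace and whether it is Ready.
# Returns 0 if Ready, 1 if not found (or kubectl failed), 2 if not Ready.
serving_cert_status() {
  if ! ready=$(kubectl -n apigee-system get certificates apigee-serving-cert \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null); then
    echo "apigee-serving-cert: not found (or kubectl failed)"
    return 1
  fi
  case "$ready" in
    True) echo "apigee-serving-cert: present and Ready" ;;
    *)    echo "apigee-serving-cert: present but not Ready (status: ${ready:-unknown})"
          return 2 ;;
  esac
}

# Example usage:
#   serving_cert_status
```

A return code of 1 corresponds to the "not found" cause described in this section. A return code of 2 suggests that the certificate exists but one of the other causes may apply.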

Resolution

  1. The apigee-serving-cert certificate is created by the apigeectl init command during the Apigee hybrid installation. To recreate it, run that command with the relevant overrides.yaml file:
    apigeectl init -f overrides/overrides.yaml
  2. Verify that the apigee-serving-cert certificate has been created:
    kubectl -n apigee-system get certificates apigee-serving-cert

Cause: Duplicate certificate requests were created for renewing apigee-serving-cert

Diagnosis

  1. Check the cert-manager controller logs for an error message about duplicate certificate requests.

    List all cert-manager pods:

    kubectl -n cert-manager get pods

    An example output:

    NAME                                       READY   STATUS    RESTARTS        AGE
    cert-manager-66d9545484-772cr              1/1     Running   0               6d19h
    cert-manager-cainjector-7d8b6bd6fb-fpz6r   1/1     Running   0               6d19h
    cert-manager-webhook-669b96dcfd-6mnm2      1/1     Running   0               6d19h

    Check cert-manager controller logs:

    kubectl -n cert-manager logs cert-manager-66d9545484-772cr | grep "issuance is skipped until there are no more duplicates"

    An example output:

    1 controller.go:163] cert-manager/certificates-readiness "msg"="re-queuing item due to error processing" "error"="multiple CertificateRequests were found for the 'next' revision 3, issuance is skipped until there are no more duplicates" "key"="apigee-system/apigee-serving-cert"

    If an error message similar to this is displayed, the duplicate requests are preventing renewal of the apigee-serving-cert certificate.

  2. List all certificate requests in the apigee-system namespace and check to see if there are multiple certificate requests created for renewing the same apigee-serving-cert certificate revision:
    kubectl -n apigee-system get certificaterequests

For background, see the cert-manager issue titled "cert-manager created multiple CertificateRequest objects with the same certificate-revision".
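
Spotting duplicates by eye can be error-prone when many requests are listed. The following sketch prints every certificate name and revision pair that occurs more than once. It assumes that each CertificateRequest carries the cert-manager.io/certificate-name and cert-manager.io/certificate-revision annotations, which may vary by cert-manager version. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "certificate-name revision" pairs on standard
# input and print each pair that occurs more than once.
find_duplicate_revisions() {
  sort | uniq -d
}

# Hypothetical helper: list one "certificate-name revision" pair per
# CertificateRequest in the apigee-system namespace.
# Assumption: the cert-manager.io annotations below are set on each request.
list_request_revisions() {
  kubectl -n apigee-system get certificaterequests \
    -o jsonpath='{range .items[*]}{.metadata.annotations.cert-manager\.io/certificate-name}{" "}{.metadata.annotations.cert-manager\.io/certificate-revision}{"\n"}{end}'
}

# Example usage:
#   list_request_revisions | find_duplicate_revisions
```

Any output from the pipeline points to a revision with more than one request, which matches the condition reported in the controller log above.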

Resolution

  1. Delete all certificate requests in the apigee-system namespace:
    kubectl -n apigee-system delete certificaterequests --all
  2. Verify that the duplicate certificate requests have been deleted and that only one certificate request exists for the apigee-serving-cert certificate in the apigee-system namespace:
    kubectl -n apigee-system get certificaterequests
  3. Verify that the apigee-serving-cert certificate has been renewed:
    kubectl -n apigee-system get certificates apigee-serving-cert -o yaml

    An example output:

    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      creationTimestamp: "2023-06-26T13:25:10Z"
      generation: 1
      name: apigee-serving-cert
      namespace: apigee-system
      resourceVersion: "11053"
      uid: e7718341-b3ca-4c93-a6d4-30cf70a33e2b
    spec:
      dnsNames:
      - apigee-webhook-service.apigee-system.svc
      - apigee-webhook-service.apigee-system.svc.cluster.local
      issuerRef:
        kind: Issuer
        name: apigee-selfsigned-issuer
      secretName: webhook-server-cert
    status:
      conditions:
      - lastTransitionTime: "2023-06-26T13:25:11Z"
        message: Certificate is up to date and has not expired
        observedGeneration: 1
        reason: Ready
        status: "True"
        type: Ready
      notAfter: "2023-09-24T13:25:11Z"
      notBefore: "2023-06-26T13:25:11Z"
      renewalTime: "2023-08-25T13:25:11Z"
      revision: 1
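
The renewal check can be reduced to comparing status.notAfter with the current time. The following sketch relies on the fact that fixed-width UTC timestamps, such as those in the output above, sort in chronological order when compared as strings. The function names is_after and check_serving_cert_expiry and their return codes are illustrative assumptions.

```shell
#!/bin/sh
# Succeed (exit 0) when UTC timestamp $2 is later than UTC timestamp $1.
# Both must use the fixed-width form shown in status.notAfter,
# for example 2023-09-24T13:25:11Z, so that string order is time order.
is_after() {
  expr "$2" \> "$1" >/dev/null
}

# Hypothetical helper: compare the certificate's notAfter with the current time.
# Returns 0 if still valid, 1 if expired, 2 if notAfter could not be read.
check_serving_cert_expiry() {
  not_after=$(kubectl -n apigee-system get certificates apigee-serving-cert \
    -o jsonpath='{.status.notAfter}') || return 2
  [ -n "$not_after" ] || { echo "status.notAfter is not set"; return 2; }
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  if is_after "$not_after" "$now"; then
    echo "expired: notAfter=$not_after now=$now"
    return 1
  fi
  echo "valid until $not_after"
}

# Example usage:
#   check_serving_cert_expiry
```

If the helper reports an expired certificate after the duplicate requests were deleted, renewal may not have completed yet, or another cause in this document may apply.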

Cause: cert-manager is not healthy

Diagnosis

  1. Check the health of the cert-manager pods in the cert-manager namespace:
    kubectl -n cert-manager get pods

    If the cert-manager pods are healthy, all of them are ready (1/1) and in the Running state, as in the following example. If any pod is not, an unhealthy cert-manager could be the reason for this issue:

    NAME                                       READY   STATUS    RESTARTS   AGE
    cert-manager-59cf78f685-mlkvx              1/1     Running   0          15d
    cert-manager-cainjector-78cc865768-krjcp   1/1     Running   0          15d
    cert-manager-webhook-77c4fb46b6-7g9g6      1/1     Running   0          15d
  2. cert-manager can fail for many reasons. Check the cert-manager logs to identify the reason for the failure, and resolve it accordingly.

    One known reason is that cert-manager fails if it cannot communicate with the Kubernetes API. In this case, an error message similar to the following is displayed:

    E0601 00:10:27.841516       1 leaderelection.go:330] error retrieving
    resource lock kube-system/cert-manager-controller: Get
    "https://192.168.0.1:443/api/v1/namespaces/kube-system/configmaps/cert-manager-controller":
    dial tcp 192.168.0.1:443: i/o timeout
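
The pod health check in step 1 can be filtered so that only problem pods are shown. The following sketch reads the output of kubectl get pods with the --no-headers flag and prints any pod that is not Running or whose READY column shows fewer ready containers than expected. The function name unhealthy_pods is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods --no-headers` output on standard
# input and print the name, READY, and STATUS columns of each pod that is not
# Running or that has fewer ready containers than its total.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Example usage:
#   kubectl -n cert-manager get pods --no-headers | unhealthy_pods
```

No output suggests that all cert-manager pods are ready and running. Any listed pod is a candidate for a closer look at its logs.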

Resolution

  1. Check the health of the Kubernetes cluster and fix any issues found. See "Troubleshooting Clusters" in the Kubernetes documentation.
  2. Refer to the cert-manager "Troubleshooting" documentation for additional troubleshooting information.

Must gather diagnostic information

If the problem persists even after following the above instructions, gather the following diagnostic information, and then contact Google Cloud Customer Care.

  1. Google Cloud Project ID
  2. Apigee hybrid organization
  3. Apigee hybrid overrides.yaml file, masking any sensitive information.
  4. Kubernetes pod status in all namespaces:
    kubectl get pods -A > kubectl-pod-status`date +%Y.%m.%d_%H.%M.%S`.txt
  5. Kubernetes cluster-info dump:
    # generate kubernetes cluster-info dump
    kubectl cluster-info dump -A --output-directory=/tmp/kubectl-cluster-info-dump
    # zip kubernetes cluster-info dump
    zip -r kubectl-cluster-info-dump`date +%Y.%m.%d_%H.%M.%S`.zip /tmp/kubectl-cluster-info-dump/*