Diagnosing issues with guardrails

Guardrails overview

Apigee hybrid Guardrails is a mechanism that alerts customers to a potential issue before it can impact a hybrid instance. Guardrails stops a command if that command risks the stability of the instance. Whether the cause is an incorrect configuration or insufficient resources, Guardrails blocks modifications to the hybrid instance until the risk is removed. This saves customers from spending time on issues that would otherwise take hours or days to resolve.

Using Guardrails with Apigee hybrid

To use Hybrid Guardrails, execute the same Hybrid Helm install or Hybrid Helm upgrade commands documented in the Hybrid installation instructions. No additional commands are needed to run Guardrails.

When you issue a Helm command for Apigee hybrid, the following happens before the helm command applies the configuration to your hybrid instance:

  • Helm creates a temporary Guardrails pod with your applied configuration. If the Guardrails pod reaches a healthy state, it tests your hybrid instance against that configuration.
  • If testing passes, the Guardrails pod is terminated and your configuration is applied to your Apigee hybrid instance.
  • If testing fails, the Guardrails pod is left in an unhealthy state so that you can diagnose it. The helm command displays an error message reporting that the Guardrails pod has failed.

The following example shows using Guardrails to test network connectivity from a Hybrid instance to the Apigee Control Plane as part of the installation of the apigee-datastore component. You can use the same sequence for all Apigee hybrid components:

Install the apigee-datastore component using the following command:

helm upgrade datastore apigee-datastore/ \
  --install \
  --namespace apigee \
  --atomic \
  -f overrides.yaml

If a Guardrails check fails, the Helm command exits with an error message reporting that the Guardrails pod failed, as in the following example:

helm upgrade datastore apigee-datastore/ \
  --install \
  --namespace apigee \
  -f ../my-overrides.yaml

  . . .
    Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
      * pod apigee-hybrid-helm-guardrail-datastore failed

To see which check failed and why, view the Guardrails pod logs, as in the following example:

kubectl logs -n apigee apigee-hybrid-helm-guardrail-datastore
{"level":"INFO","timestamp":"2024-02-01T20:28:55.934Z","msg":"logging enabled","log-level":"INFO"}
{"level":"INFO","timestamp":"2024-02-01T20:28:55.935Z","msg":"","checkpoint":"upgrade","component":"apigee-datastore"}
{"level":"INFO","timestamp":"2024-02-01T20:28:55.935Z","msg":"initiating pre-install checks"}
{"level":"INFO","timestamp":"2024-02-01T20:28:55.935Z","msg":"check validation starting...","check":"controlplane_connectivity"}
{"level":"ERROR","timestamp":"2024-02-01T20:28:55.961Z","msg":"connectivity test failed","check":"controlplane_connectivity","host":"https://apigee.googleapis.com","error":"Get \"https://apigee.googleapis.com\": dial tcp: lookup apigee.googleapis.com on 10.92.0.10:53: no such host"}

In this example, the actual test failure message is this part:

{"level":"ERROR","timestamp":"2024-02-01T20:28:55.961Z","msg":"connectivity test failed","check":"controlplane_connectivity","host":"https://apigee.googleapis.com","error":"Get \"https://apigee.googleapis.com\": dial tcp: lookup apigee.googleapis.com on 10.92.0.10:53: no such host"}
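When the log is long, it can help to show only the failing entries. The following is a minimal sketch, not part of the product tooling. It assumes each log entry is a single JSON line whose "level" field is formatted exactly as in the sample log, and the function name guardrails_errors is hypothetical:

```shell
# Print only ERROR-level entries from Guardrails log lines read on stdin.
# Assumption: one JSON object per line, with the level written as "level":"ERROR".
guardrails_errors() {
  # -F matches the text literally; "|| true" keeps the exit status 0 when nothing matches.
  grep -F '"level":"ERROR"' || true
}

# Usage (requires access to the cluster):
#   kubectl logs -n apigee apigee-hybrid-helm-guardrail-datastore | guardrails_errors
```

An empty result means that no entry in the log was written at the ERROR level.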

The Guardrails pod is automatically provisioned when you issue the helm command. If the Apigee Control Plane connectivity test passes, the Guardrails pod is terminated at the end of execution.

Check the status of the pods immediately after issuing the helm command, because the Guardrails pod exists only briefly when the checks pass. The following example output shows the Guardrails pod completing successfully, meaning the Control Plane connectivity test passed:

kubectl get pods -n apigee -w
NAME                                      READY    STATUS             RESTARTS    AGE
apigee-hybrid-helm-guardrail-datastore    0/1      Pending            0           0s
apigee-hybrid-helm-guardrail-datastore    0/1      Pending            0           1s
apigee-hybrid-helm-guardrail-datastore    0/1      ContainerCreating  0           1s
apigee-hybrid-helm-guardrail-datastore    0/1      Completed          0           2s
apigee-hybrid-helm-guardrail-datastore    0/1      Completed          0           3s
apigee-hybrid-helm-guardrail-datastore    0/1      Terminating        0           3s
apigee-hybrid-helm-guardrail-datastore    0/1      Terminating        0           3s

If the Apigee Control Plane connectivity test fails, the Guardrails pod remains in the Error state, similar to the following example output:

kubectl get pods -n apigee -w
NAME                                      READY    STATUS             RESTARTS    AGE
apigee-hybrid-helm-guardrail-datastore    0/1      Pending            0           0s
apigee-hybrid-helm-guardrail-datastore    0/1      Pending            0           0s
apigee-hybrid-helm-guardrail-datastore    0/1      ContainerCreating  0           0s
apigee-hybrid-helm-guardrail-datastore    0/1      Error              0           4s
apigee-hybrid-helm-guardrail-datastore    0/1      Error              0           5s
apigee-hybrid-helm-guardrail-datastore    0/1      Error              0           6s
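
To check the outcome from a script instead of watching the table, you could extract the latest STATUS value for the Guardrails pod. The following is a rough sketch that assumes the column layout shown in the example outputs above (pod name in the first column, STATUS in the third). The function name last_pod_status is hypothetical:

```shell
# Print the most recent STATUS reported for a pod, given `kubectl get pods`
# table output on stdin. $1 is the pod name to look for.
# Assumption: NAME is column 1 and STATUS is column 3, as in the sample output.
last_pod_status() {
  awk -v pod="$1" '$1 == pod { status = $3 } END { if (status != "") print status }'
}

# Usage (requires access to the cluster; run without -w so the command exits):
#   kubectl get pods -n apigee | last_pod_status apigee-hybrid-helm-guardrail-datastore
```

Based on the example outputs, a value of Error indicates a failed check. An empty result means the pod is not listed, which is expected once a passing pod has been terminated.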

Temporarily disabling Guardrails

If you need to disable the Guardrails checks, add the --no-hooks flag to the Helm command. The following example shows the --no-hooks flag in a Helm command:

helm upgrade datastore apigee-datastore/ \
  --install \
  --namespace apigee \
  -f ../my-overrides.yaml \
  --no-hooks

Configuring Guardrails in the overrides file

Starting in Apigee hybrid version 1.12, Guardrails are configured by default in each chart. You can override the image URL, tag, and image pull policy in your overrides file.

For example, to set the Guardrails image URL, tag, and pull policy, add a guardrails stanza like the following to your overrides file:

# Apigee Ingressgateway
ingressGateway:
  image:
    pullPolicy: Always

## NOTE: The Guardrails config is below. The ingressgateway config above is for position reference only and is NOT required for Guardrails config.

# Apigee Guardrails
guardrails:
  image:
    url: "gcr.io/ng-hybrid/guardrails/apigee-watcher"
    tag: "12345_6789abcde"
    pullPolicy: Always

Using Kubernetes tolerations with Guardrails

You can also add tolerations to Guardrails in your overrides file. If no tolerations are defined under the Guardrails overrides configuration, Guardrails will use any globally defined tolerations.

For example, to include tolerations specifically in the Guardrails section of your overrides file, you would add something similar to the following stanza:

  # Apigee Guardrails
  guardrails:
    image:
      url: "gcr.io/ng-hybrid/guardrails/apigee-watcher"
      tag: "12345_6789abcde"
      pullPolicy: Always
    tolerations:
    - key: "say"
      operator: "Equal"
      value: "taunt"
      effect: "NoSchedule"
  

Troubleshooting Guardrails

Environment variable checkpoint missing or empty

If you see the Client.Timeout exceeded error in the Apigee operator Guardrails pod log, use the following troubleshooting steps to determine whether the problem lies with Guardrails or with the underlying infrastructure.

  1. Create a new YAML file with the following content. You can give the file any name.

        apiVersion: v1
        kind: Pod
        metadata:
          labels:
          name: apigee-simple-client
        spec:
          containers:
          - name: apigee-simple-client
            image: "gcr.io/apigee-release/hybrid/apigee-hybrid-cassandra-client:1.10.1"
            imagePullPolicy: Always
            command:
            - sleep
            - "3650d"
          restartPolicy: Never
          hostNetwork: false

  2. Apply the new YAML file with the following command, replacing YAML_FILE_NAME with the name of your file:

        kubectl apply -n apigee-system -f YAML_FILE_NAME

  3. Exec into the pod with the following command:

        kubectl exec -it -n apigee-system apigee-simple-client -- /bin/bash

  4. Run the following command inside the apigee-simple-client pod and check the output:

        curl -v -I --proxy http://cspnaproxy1.wlb2.nam.nsroot.net:8882 https://apigee.googleapis.com
        

    In this example, cspnaproxy1.wlb2.nam.nsroot.net:8882 is the proxy address and port. Replace it with any proxy that you have access to. If you do not have access to a proxy, use the following command to test network connectivity directly:

        curl -v -I https://apigee.googleapis.com
        

    If the curl command connects to the HTTP proxy and reaches apigee.googleapis.com, it returns an HTTP response code. If it cannot reach the proxy, or cannot connect to apigee.googleapis.com through the proxy, it shows an error.
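
    When curl reports an error, its exit code can help separate DNS, connection, and timeout problems. The following is a small sketch, not part of the Apigee tooling. The function name explain_curl_exit is hypothetical, and the codes covered are a subset of those listed in the curl man page:

```shell
# Map a curl exit code ($1) to a short hint about where the failure occurred.
# The function name is illustrative; exit codes are those documented by curl.
explain_curl_exit() {
  case "$1" in
    0)  echo "connected: an HTTP response was received" ;;
    5)  echo "could not resolve the proxy host" ;;
    6)  echo "could not resolve the target host" ;;
    7)  echo "failed to connect to the host or proxy" ;;
    28) echo "operation timed out" ;;
    *)  echo "curl failed with exit code $1; see the curl man page" ;;
  esac
}

# Usage inside the apigee-simple-client pod (needs network access):
#   curl -sS -I --max-time 10 https://apigee.googleapis.com; explain_curl_exit $?
```

    A timeout or connection failure from this pod suggests that the problem is in the network path rather than in Guardrails itself. Note that exit code 0 only means an HTTP response arrived, so still check the response code that curl prints.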