Best practices for high availability with OpenShift


This document describes best practices to achieve high availability (HA) with Red Hat OpenShift Container Platform workloads on Compute Engine. This document focuses on application-level strategies to help you ensure that your workloads remain highly available when failures occur. These strategies help you eliminate single points of failure and implement mechanisms for automatic failover and recovery.

This document is intended for platform and application architects and assumes that you have some experience in deploying OpenShift. For more information about how to deploy OpenShift, see the Red Hat documentation.

Spread deployments across multiple zones

We recommend that you deploy OpenShift across multiple zones within a Google Cloud region. This approach helps ensure that if a zone experiences an outage, the cluster's control plane nodes in the remaining zones continue to function. To deploy OpenShift across multiple zones, specify a list of Google Cloud zones from the same region in your install-config.yaml file.
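
For example, the following excerpt from an install-config.yaml file spreads control plane and compute nodes across three zones. The region, zones, and replica counts shown here are illustrative; adjust them to match your environment:

apiVersion: v1
controlPlane:
  name: master
  replicas: 3
  platform:
    gcp:
      zones:
      - europe-west1-b
      - europe-west1-c
      - europe-west1-d
compute:
- name: worker
  replicas: 3
  platform:
    gcp:
      zones:
      - europe-west1-b
      - europe-west1-c
      - europe-west1-d
platform:
  gcp:
    region: europe-west1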

For fine-grained control over the locations where nodes are deployed, we recommend that you define VM placement policies that spread the VMs across different failure domains in the same zone. Applying a spread placement policy to your cluster nodes helps reduce the number of nodes that are simultaneously impacted by location-specific disruptions. For more information about how to create a spread placement policy for existing clusters, see Create and apply spread placement policies to VMs.

Similarly, to prevent multiple replicas of an application from being scheduled on the same node or in the same zone, we recommend that you use pod anti-affinity rules. The following example uses a required anti-affinity rule to spread application replicas across multiple zones:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app-namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Pod anti-affinity: require new replicas to be scheduled in zones that
      # don't already run a pod with the app=my-app label.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: my-app-container
        image: quay.io/myorg/my-app:latest
        ports:
        - containerPort: 8080

For stateless services like web front ends or REST APIs, we recommend that you run multiple pod replicas for each service or route. This approach ensures that traffic is automatically routed to pods in available zones.
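
For example, the following sketch exposes the my-app replicas from the earlier Deployment behind a Service and an OpenShift Route, so that the router distributes requests across the available pods. The Service name and port values are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app-namespace
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: my-app
  namespace: my-app-namespace
spec:
  # Route traffic to the Service; the router balances across its endpoints.
  to:
    kind: Service
    name: my-app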

Proactively manage load to prevent resource over-commitment

We recommend that you proactively manage your application's load to prevent resource over-commitment. Over-commitment can lead to poor service performance under load. You can help prevent over-commitment by setting resource requests and limits for your containers; for a more detailed explanation, see Managing resources for your pod. Additionally, you can use the horizontal pod autoscaler to automatically scale replicas up or down based on CPU, memory, or custom metrics.
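
For example, the following sketch configures a horizontal pod autoscaler for the my-app Deployment from the earlier example. The replica bounds and the 70% CPU target are illustrative assumptions; the Deployment's container must declare resources.requests (and ideally resources.limits) for CPU so that the autoscaler can calculate utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: my-app-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  # Scale out when average CPU utilization across replicas exceeds 70%.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70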

We also recommend that you use the following load balancing services:

  • OpenShift Ingress Operator. The Ingress Operator deploys HAProxy-based Ingress Controllers to handle routing to your pods. Specifically, we recommend that you configure global access for the Ingress Controller, which enables clients in any region within the same VPC network as the load balancer to reach the workloads running on your cluster (see the sketch after this list). Additionally, we recommend that you implement Ingress Controller health checks to monitor the health of your pods so that traffic is routed only to healthy pods.
  • Google Cloud Load Balancing. Load Balancing distributes traffic across Google Cloud zones. Choose a load balancer that meets your application's needs.
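
The following sketch shows global access configured on the default Ingress Controller when it is published through an internal load balancer on Google Cloud. It assumes the LoadBalancerService endpoint publishing strategy with internal scope:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      # Global access applies to internal load balancers on Google Cloud.
      scope: Internal
      providerParameters:
        type: GCP
        gcp:
          clientAccess: Global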

Define pod disruption budgets

We recommend that you define pod disruption budgets to specify the minimum number of pods that your application requires to remain available during voluntary disruptions, such as maintenance events or updates. The following example shows how to define a pod disruption budget:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: my-app-namespace
spec:
  # Define how many pods need to remain available during a disruption.
  # At least one of "minAvailable" or "maxUnavailable" must be specified.
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

For more information, see Specifying a Disruption Budget for your Application.

Use storage that supports HA and data replication

For stateful workloads that require persistent data storage outside of containers, we recommend the following best practices.

Disk best practices

If you require disk storage, use one of the following:

After you select a storage option, install its driver in your cluster:

Finally, set a StorageClass for your disk:

The following example shows how to set a StorageClass:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: regionalpd-balanced
provisioner: PROVISIONER
parameters:
  type: DISK-TYPE
  replication-type: REPLICATION-TYPE
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
allowVolumeExpansion: true
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - europe-west1-b
          - europe-west1-c
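
A workload can then request a volume from this StorageClass through a PersistentVolumeClaim. The following is a minimal sketch; the claim name, namespace, and size are illustrative:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-app-namespace
spec:
  accessModes:
  - ReadWriteOnce
  # Bind to the regionalpd-balanced StorageClass defined above.
  storageClassName: regionalpd-balanced
  resources:
    requests:
      storage: 10Gi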

Database best practices

If you require a database, use one of the following:

After you install your database operator, configure a cluster with multiple instances. The following example shows the configuration for a cluster with the following attributes:

  • A PostgreSQL cluster named my-postgres-cluster is created with three instances for high availability.
  • The cluster uses the regionalpd-balanced storage class for durable and replicated storage across zones.
  • A database named mydatabase is initialized with a user myuser, whose credentials are stored in a Kubernetes secret called my-database-secret.
  • Superuser access is disabled for enhanced security.
  • Monitoring is enabled for the cluster.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres-cluster
  namespace: postgres-namespace
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: regionalpd-balanced
  bootstrap:
    initdb:
      database: mydatabase
      owner: myuser
      secret:
        name: my-database-secret
  enableSuperuserAccess: false
  monitoring:
    enablePodMonitor: true
---
apiVersion: v1
kind: Secret
metadata:
  name: my-database-secret
  namespace: postgres-namespace
type: Opaque
data:
  username: bXl1c2Vy # Base64-encoded value of "myuser"
  password: c2VjdXJlcGFzc3dvcmQ= # Base64-encoded value of "securepassword"

Externalize application state

We recommend that you move session state or caching to shared in-memory stores (for example, Redis) or persistent datastores (for example, Postgres, MySQL) that are configured to run in HA mode.
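
For example, the following sketch points the my-app container at an external Redis instance by reading connection details from a Secret, rather than holding session state inside the pod. The redis-connection Secret and the REDIS_* environment variable names are illustrative assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app-namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: quay.io/myorg/my-app:latest
        env:
        # Connection details for an external, HA-configured Redis instance.
        - name: REDIS_HOST
          valueFrom:
            secretKeyRef:
              name: redis-connection
              key: host
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: redis-connection
              key: password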

Summary of best practices

In summary, implement the following best practices to achieve high availability with OpenShift:

  • Spread deployments across multiple zones
  • Proactively manage load to prevent resource over-commitment
  • Define pod disruption budgets
  • Use storage that supports HA and data replication
  • Externalize application state

What's next