Perform failover tests for highly resilient environments


This page describes how to perform database and cluster failover tests for highly resilient environments.

Failover tests for your environment simulate a complete outage of a zone in a data center. In such a scenario, a zonal outage of a cluster and a zonal outage of a database might happen at the same time. By performing the two failover tests, you can monitor how your highly resilient environment performs a failover and check how this affects your DAGs and tasks.

Before you begin

  • To perform failover tests, your account must have the following roles and permissions:

    • composer.environments.update permission. See Access control with IAM for a list of roles with this permission.

    • Kubernetes Engine Cluster Admin (roles/container.clusterAdmin) role to run kubectl commands on the environment's cluster. As an alternative, you can provision Kubernetes RBAC roles directly in GKE.

  • If you use authorized networks, you must run kubectl commands from a machine that can access the GKE cluster's control plane endpoint. Several options are available, depending on how you set up access to your environment's control plane endpoint. For more information, see Running commands on a private IP environment.

Check that your environment is healthy

Perform database and cluster failover tests only on healthy environments.

To check that your environment is healthy:

  1. In the Google Cloud console, go to the Environments page.

    Go to Environments

  2. In the list of environments, click the name of your environment. The Environment details page opens.

  3. Go to the Monitoring tab.

  4. Make sure that all health metrics are green.
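
If you script your failover tests, a quick command-line precheck can complement the console steps above. The following is a minimal sketch, not a replacement for reviewing the health metrics: it only checks that the environment's reported state is RUNNING, and the function name env_is_running is hypothetical.

```shell
# env_is_running ENVIRONMENT_NAME LOCATION
# Returns 0 only if the environment's reported state is RUNNING.
# Note: a RUNNING state does not guarantee that all health metrics are green.
env_is_running() {
  local state
  state="$(gcloud composer environments describe "$1" \
    --location "$2" --format="value(state)")" || return 1
  [ "$state" = "RUNNING" ]
}

# Example usage:
#   env_is_running example-environment us-central1 \
#     || echo "Environment is not in the RUNNING state" >&2
```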

Perform a database failover test

You can trigger a database failover test, which simulates a zonal outage, with a Google Cloud CLI command. For example, you can use this test to measure how long your environment's database takes to switch to another zone.

To perform a database failover test for your environment:

  1. Make sure that your environment is healthy.

  2. Get the primary zone of your environment's database:

    gcloud composer environments fetch-database-properties \
        ENVIRONMENT_NAME \
        --location LOCATION
    

    Replace the following:

    • ENVIRONMENT_NAME: the name of your Cloud Composer environment.
    • LOCATION: the region where the environment is located.

    Example:

    gcloud composer environments fetch-database-properties \
        example-environment \
        --location us-central1
    
  3. Start the database failover test:

    gcloud composer environments database-failover \
        ENVIRONMENT_NAME \
        --location LOCATION
    

    Replace the following:

    • ENVIRONMENT_NAME: the name of your Cloud Composer environment.
    • LOCATION: the region where the environment is located.

    Example:

    gcloud composer environments database-failover \
        example-environment \
        --location us-central1
    
  4. Wait until the database failover test is finished. The process can take up to 3 minutes.

  5. Check that the primary zone of your environment's database is changed:

    gcloud composer environments fetch-database-properties \
        ENVIRONMENT_NAME \
        --location LOCATION
    
  6. Check your environment's health metrics to make sure that your environment is healthy.

  7. Your environment's database becomes ready for another failover when the Database available for failover (composer.googleapis.com/environment/database/available_for_failover) environment metric becomes True. For more information about viewing your environment's metrics in Cloud Monitoring, see Monitor environments.
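
To measure how long the switch takes, the steps above can be wrapped in a small polling helper. The following is a minimal sketch under stated assumptions: the function name wait_for_change, the timeout, and the polling interval are hypothetical, and the primaryGceZone field used in the example's --format expression is an assumption about the output of fetch-database-properties that you should verify against what the command prints for your environment.

```shell
# wait_for_change OLD_VALUE TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until its output is non-empty and differs from OLD_VALUE.
# Prints the new value and returns 0, or returns 1 once the timeout is reached.
wait_for_change() {
  local old="$1"
  local timeout="$2"
  local interval="$3"
  shift 3
  local start
  local current
  start="$(date +%s)"
  while :; do
    # Ignore transient command failures and keep polling.
    current="$("$@" 2>/dev/null || true)"
    if [ -n "$current" ] && [ "$current" != "$old" ]; then
      printf '%s\n' "$current"
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example usage (the primaryGceZone field name is an assumption):
#   primary_zone() {
#     gcloud composer environments fetch-database-properties \
#       example-environment --location us-central1 \
#       --format="value(primaryGceZone)"
#   }
#   before="$(primary_zone)"
#   gcloud composer environments database-failover \
#     example-environment --location us-central1
#   wait_for_change "$before" 300 10 primary_zone
```

The helper takes the command to poll as arguments, so the same loop can be reused for other values that you expect to change during a test.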

Perform your environment's cluster failover test

You can perform a failover test for your environment's cluster, which simulates a zonal outage. For example, you can use this test to measure how long your environment takes to switch to another zone.

Check that your environment is healthy

Before starting the test, make sure that your environment is healthy.

Configure credentials for your environment's cluster

To obtain cluster credentials:

  1. In the Google Cloud console, go to the Environments page.

    Go to Environments

  2. In the list of environments, click the name of your environment. The Environment details page opens.

  3. Go to the Environment configuration tab.

  4. Click View cluster details.

  5. Click Connect.

  6. Copy and execute the displayed Google Cloud CLI command.

    For example:

    gcloud container clusters get-credentials \
      us-central1-exam-db23ee12-gke \
      --region us-central1 \
      --project example-project
    

Inspect your environment's cluster

Check zones and nodes where the workloads run in your environment's cluster. You use this information to simulate a zonal outage later. You can also run these commands again while you perform the failover test to see how your environment's cluster performs the failover.

  1. Check nodes and zones:

    kubectl get nodes \
      -o=custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.topology\\.gke\\.io/zone
    
  2. Check pods:

    kubectl get pods --all-namespaces \
    -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName \
    --field-selector metadata.namespace!=kube-system
    
  3. View more detailed information about pods:

    kubectl get pods --all-namespaces -o wide \
    --field-selector metadata.namespace!=kube-system
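
To compare the cluster's layout before and during the test, the node listing can be condensed into a per-zone count. The following is a minimal sketch that assumes two-column input with the node name first and the zone second; the function name count_nodes_by_zone is hypothetical.

```shell
# count_nodes_by_zone
# Reads "NODE_NAME ZONE" lines on stdin and prints "ZONE COUNT" lines,
# sorted by zone name.
count_nodes_by_zone() {
  awk 'NF >= 2 { count[$2]++ } END { for (z in count) print z, count[z] }' | sort
}

# Example usage:
#   kubectl get nodes --no-headers \
#     -o=custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.topology\\.gke\\.io/zone \
#     | count_nodes_by_zone
```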
    

Drain nodes

Choose a zone where you want to simulate an outage. If you perform the cluster failover test together with the database failover test, you might want to choose the primary zone of your environment's highly available Cloud SQL instance. For example, if the primary Cloud SQL instance runs in us-central1-a, then you can simulate an outage of the whole us-central1-a zone by first performing the database failover test and then performing the cluster failover test in us-central1-a.

The following command simulates a set of nodes becoming unavailable in a specific zone. It forcefully evicts Pods from nodes in the specified zone and prevents rescheduling Pods on these nodes. Because new Pods can't be scheduled, new nodes are added to the cluster.

This command does not affect workloads that run in the composer-system namespace. You might see related error messages in the command output. This does not affect the failover test. The nodes that exist in the selected zone are still marked as unschedulable.

To simulate a cluster zone failure in the selected zone:

kubectl get nodes -o name -l "topology.gke.io/zone=ZONE" | \
xargs kubectl drain \
--ignore-daemonsets --delete-emptydir-data --force --disable-eviction

Replace the following:

  • ZONE: the zone where you want to simulate a cluster zone failure.
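
Because the pipeline above passes whatever the label selector returns directly to kubectl drain, it can help to confirm that the selector matches nodes before draining. The following is a sketch with a hypothetical function name, drain_zone, that uses the same flags as the command above but stops if no nodes are found in the zone.

```shell
# drain_zone ZONE
# Lists the nodes labeled with the given zone and drains them with the same
# flags as the documented command. Returns 1 without draining if none match.
drain_zone() {
  local zone="$1"
  local nodes
  nodes="$(kubectl get nodes -o name -l "topology.gke.io/zone=${zone}")" || return 1
  if [ -z "$nodes" ]; then
    echo "No nodes found in zone ${zone}; nothing to drain." >&2
    return 1
  fi
  echo "Draining nodes in ${zone}:" >&2
  echo "$nodes" >&2
  # Word splitting of $nodes is intentional: one argument per node name.
  # shellcheck disable=SC2086
  kubectl drain $nodes \
    --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
}

# Example usage:
#   drain_zone us-central1-a
```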

Check environment metrics

Figure 1. Environment metrics during a simulated zone outage

  1. In the Google Cloud console, go to the Environments page.

    Go to Environments

  2. In the list of environments, click the name of your environment. The Environment details page opens.

  3. Go to the Monitoring tab.

  4. Check that the following metrics either stay green during the failover operation or remain red for at most several minutes:

    • Environment health
    • Scheduler heartbeat
    • Web server health
    • Database health
    • Active workers
    • Active schedulers
    • Active web servers
    • Active triggers

    Note that the simulated outage is marked as a "Cluster maintenance operation".

  5. You do not need to take any additional actions to return your environment's cluster to failover readiness after the test. During the test, the environment's cluster automatically adds new nodes to replace the ones affected by the simulated outage.

What's next