This page describes how to perform database and cluster failover tests for highly resilient environments.
Failover tests for your environment simulate a complete outage of a zone in a data center. In such a scenario, a zonal outage of a cluster and a zonal outage of a database might happen at the same time. By performing the two failover tests, you can monitor how your highly resilient environment performs a failover and check how this affects your DAGs and tasks.
Before you begin
To perform failover tests, your account must have the following roles and permissions:
- The composer.environments.update permission. See Access control with IAM for a list of roles with this permission.
- The Kubernetes Engine Cluster Admin (roles/container.clusterAdmin) role to run kubectl commands on the environment's cluster. As an alternative, you can provision Kubernetes RBAC roles directly in GKE.
If you use authorized networks, then you must run kubectl commands from a machine that can access the GKE cluster's control plane endpoint. Depending on how you set up access to your environment's control plane endpoint, you can use several options. For more information, see Running commands on a private IP environment.
Check that your environment is healthy
Perform database and cluster failover tests only on healthy environments.
To check that your environment is healthy:
In the Google Cloud console, go to the Environments page.
In the list of environments, click the name of your environment. The Environment details page opens.
Go to the Monitoring tab.
Make sure that all health metrics are green.
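The console check above can be complemented with a quick command-line gate. The following is a minimal sketch, assuming that the state field reported by gcloud composer environments describe is a useful first check. A RUNNING state only means that the environment is not being created, updated, or in an error state, so it does not replace checking the health metrics on the Monitoring tab. The function name env_is_running is hypothetical.

```shell
# Hypothetical helper: succeeds only if the environment reports state RUNNING.
# This is a coarse pre-check, not a substitute for the health metrics.
# Usage: env_is_running ENVIRONMENT_NAME LOCATION
env_is_running() {
  local env_name="$1" location="$2" state
  state="$(gcloud composer environments describe "$env_name" \
    --location "$location" \
    --format="value(state)")" || return 1
  [ "$state" = "RUNNING" ]
}
```

For example, `env_is_running example-environment us-central1 && echo "ok to test"` prints the message only when the environment reports the RUNNING state.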
Perform a database failover test
You can perform a database failover test, which simulates a zonal outage, by triggering it with a Google Cloud CLI command. For example, you might want to do it to measure the amount of time it takes for your environment's database to switch to another zone.
To perform a database failover test for your environment:
Make sure that your environment is healthy.
Get the primary zone of your environment's database:
gcloud composer environments fetch-database-properties \
  ENVIRONMENT_NAME \
  --location LOCATION
Replace the following:
- ENVIRONMENT_NAME: the name of your Cloud Composer environment.
- LOCATION: the region where the environment is located.
Example:
gcloud composer environments fetch-database-properties \
  example-environment \
  --location us-central1
Start the database failover test:
gcloud composer environments database-failover \
  ENVIRONMENT_NAME \
  --location LOCATION
Replace the following:
- ENVIRONMENT_NAME: the name of your Cloud Composer environment.
- LOCATION: the region where the environment is located.
Example:
gcloud composer environments database-failover \
  example-environment \
  --location us-central1
Wait until the database failover test is finished. The process can take up to 3 minutes.
Check that the primary zone of your environment's database is changed:
gcloud composer environments fetch-database-properties \
  ENVIRONMENT_NAME \
  --location LOCATION
Check your environment's health metrics to make sure that your environment is healthy.
Your environment's database becomes ready for another failover when the Database available for failover (composer.googleapis.com/environment/database/available_for_failover) environment metric becomes True. For more information about viewing your environment's metrics in Cloud Monitoring, see Monitor environments.
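To measure how long the switch to another zone takes, the before-and-after zone check in the steps above can be wrapped in a polling loop. The following is a sketch under stated assumptions: it assumes that the output of fetch-database-properties exposes the primary zone in a field named primaryGceZone, and the helper names db_primary_zone and wait_for_zone_change are hypothetical. The reported time counts only the sleep intervals, so treat it as approximate.

```shell
# Hypothetical helper: prints the primary zone of the environment's database.
# Assumption: the command output has a primaryGceZone field.
# Usage: db_primary_zone ENVIRONMENT_NAME LOCATION
db_primary_zone() {
  gcloud composer environments fetch-database-properties "$1" \
    --location "$2" \
    --format="value(primaryGceZone)"
}

# Polls until the primary zone differs from OLD_ZONE, then prints
# "NEW_ZONE SECONDS_WAITED". Returns 1 if TIMEOUT seconds pass first.
# Usage: wait_for_zone_change ENVIRONMENT_NAME LOCATION OLD_ZONE [TIMEOUT] [INTERVAL]
wait_for_zone_change() {
  local env_name="$1" location="$2" old_zone="$3"
  local timeout="${4:-300}" interval="${5:-10}"
  local waited=0 zone
  while [ "$waited" -le "$timeout" ]; do
    # A failed lookup is treated as "no answer yet" and retried.
    zone="$(db_primary_zone "$env_name" "$location")" || zone=""
    if [ -n "$zone" ] && [ "$zone" != "$old_zone" ]; then
      echo "$zone $waited"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```

A typical use is to record the zone with db_primary_zone before starting the failover, start the failover, and then call wait_for_zone_change with the recorded zone. If the failover has already completed by then, the loop exits on the first check.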
Perform your environment's cluster failover test
You can perform a failover test for your environment's cluster, which simulates a zonal outage. For example, you might want to do it to measure the amount of time it takes for your environment to switch to another zone.
Check that your environment is healthy
Before starting the test, make sure that your environment is healthy.
Configure credentials for your environment's cluster
To obtain cluster credentials:
In the Google Cloud console, go to the Environments page.
In the list of environments, click the name of your environment. The Environment details page opens.
Go to the Environment configuration tab.
Click View cluster details.
Click Connect.
Copy and execute the displayed Google Cloud CLI command.
For example:
gcloud container clusters get-credentials \
  us-central1-exam-db23ee12-gke \
  --region us-central1 \
  --project example-project
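Before running any disruptive kubectl command, it can help to confirm that kubectl points at the environment's cluster and not at another cluster in your kubeconfig. The following is a minimal sketch, assuming the context was created by gcloud container clusters get-credentials, which names contexts gke_PROJECT_LOCATION_CLUSTER. The function name context_is_cluster is hypothetical.

```shell
# Hypothetical helper: succeeds if the current kubectl context refers to the
# given GKE cluster name.
# Assumption: the context follows the gke_PROJECT_LOCATION_CLUSTER naming
# that `gcloud container clusters get-credentials` produces.
# Usage: context_is_cluster CLUSTER_NAME
context_is_cluster() {
  local cluster="$1" context
  context="$(kubectl config current-context)" || return 1
  case "$context" in
    gke_*_"$cluster") return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `context_is_cluster us-central1-exam-db23ee12-gke || echo "wrong cluster"` prints a warning when the active context belongs to a different cluster.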
Inspect your environment's cluster
Check zones and nodes where the workloads run in your environment's cluster. You use this information to simulate a zonal outage later. You can also run these commands again while you perform the failover test to see how your environment's cluster performs the failover.
Check nodes and zones:
kubectl get nodes \
  -o=custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.topology\\.gke\\.io/zone
Check pods:
kubectl get pods --all-namespaces \
  -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName \
  --field-selector metadata.namespace!=kube-system
View more detailed information about pods:
kubectl get pods --all-namespaces -o wide \
  --field-selector metadata.namespace!=kube-system
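When you rerun the inspection during the test, a per-zone node count is often easier to compare than the full node list. The following is a small sketch that summarizes the same topology.gke.io/zone label used above; the function name nodes_per_zone is hypothetical.

```shell
# Hypothetical helper: prints the number of nodes in each zone,
# one "COUNT ZONE" pair per line, sorted by zone name.
nodes_per_zone() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.gke\.io/zone}{"\n"}{end}' |
    sort | uniq -c | awk '{print $1, $2}'
}
```

Running nodes_per_zone before the drain and again a few minutes after it shows whether replacement nodes appeared and in which zones.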
Drain nodes
Choose a zone where you want to simulate an outage. If you perform the cluster failover test together with the database failover test, you might want to choose the primary zone of your environment's highly available Cloud SQL instance.
For example, if the primary Cloud SQL instance runs in us-central1-a, then you can simulate an outage in the whole us-central1-a zone by first performing the database failover test and then the cluster's failover test in us-central1-a.
The following command simulates a set of nodes becoming unavailable in a specific zone. It forcefully evicts Pods from nodes in the specified zone and prevents rescheduling Pods on these nodes. Because new Pods can't be scheduled, new nodes are added to the cluster.
This command does not affect workloads that run in the composer-system namespace. You might see related error messages in the command output. This does not affect the failover test. The nodes that exist in the selected zone are still marked as unschedulable.
To simulate a cluster zone failure in the selected zone:
kubectl get nodes -o name -l "topology.gke.io/zone=ZONE" | \
xargs kubectl drain \
--ignore-daemonsets --delete-emptydir-data --force --disable-eviction
Replace the following:
- ZONE: the zone where you want to simulate a cluster zone failure.
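Because the drain is disruptive, you might prefer to wrap the drain command above in a guard that refuses to run with an empty zone or when no nodes match the zone label. The following is a sketch only: drain_zone is a hypothetical wrapper, and it passes the same flags to kubectl drain as the command above.

```shell
# Hypothetical wrapper around the drain command above.
# Returns 2 if no zone is given and 1 if no nodes match the zone.
# Usage: drain_zone ZONE
drain_zone() {
  local zone="$1" nodes
  if [ -z "$zone" ]; then
    echo "usage: drain_zone ZONE" >&2
    return 2
  fi
  nodes="$(kubectl get nodes -o name -l "topology.gke.io/zone=$zone")" || return 1
  if [ -z "$nodes" ]; then
    echo "no nodes found in zone $zone" >&2
    return 1
  fi
  # Show what is about to be drained before doing it.
  echo "Draining nodes in $zone:" >&2
  echo "$nodes" >&2
  # Word splitting is intentional: each node name becomes one argument.
  # shellcheck disable=SC2086
  kubectl drain $nodes \
    --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
}
```

The explicit empty check matters because piping an empty node list into xargs can still invoke kubectl drain once with no node names, which fails with a less obvious error.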
Check environment metrics
In the Google Cloud console, go to the Environments page.
In the list of environments, click the name of your environment. The Environment details page opens.
Go to the Monitoring tab.
Check that the following metrics either stay green during the failover operation or stay red for at most several minutes:
- Environment health
- Scheduler heartbeat
- Web server health
- Database health
- Active workers
- Active schedulers
- Active web servers
- Active triggers
Note that the simulated outage is marked as a "Cluster maintenance operation".
You do not need to take any additional actions to return your environment's cluster to failover readiness after the test. During the test, the environment's cluster automatically adds new nodes that replace the ones affected by the simulated outage.
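Alongside the console metrics, you can watch the cluster side of the recovery by counting Pods that are still waiting for a node. The following is a minimal sketch that reuses the field selector from the inspection commands; the function name pending_pod_count is hypothetical, and a count of zero is only an indication that evicted Pods were rescheduled, not a replacement for the health metrics.

```shell
# Hypothetical helper: prints the number of Pending pods outside the
# kube-system namespace.
pending_pod_count() {
  kubectl get pods --all-namespaces -o name \
    --field-selector 'metadata.namespace!=kube-system,status.phase=Pending' |
    awk 'NF {n++} END {print n+0}'
}
```

Calling pending_pod_count every few seconds after the drain shows the count rising as Pods are evicted and falling back to zero as new nodes join the cluster.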