Manage and monitor your Google Cloud infrastructure

Last reviewed 2023-11-13 UTC

After you deploy an application to production in Google Cloud, you might need to modify the infrastructure that it uses. For example, you might need to change the machine types of your VMs or change the storage class of your Cloud Storage buckets. This part of the Google Cloud infrastructure reliability guide summarizes change-management guidelines that you can follow to reduce the reliability risk of changes to your infrastructure resources. This part also describes how you can monitor the availability of Google Cloud infrastructure.

Deploy infrastructure changes progressively

When you need to change your Google Cloud infrastructure, deploy the changes to production progressively whenever possible. For example, if you need to change the machine types of the VMs, deploy the changes to a few VMs in one zone, and monitor the effects of the changes. If you observe any issues, quickly revert the infrastructure to the previous stable state. Diagnose and resolve the issues, and then restart the progressive deployment process. After verifying that your workload runs as expected, gradually deploy the changes across all of your infrastructure.
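
The progressive approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names (plan_waves, rollout), the canary_size parameter, and the apply_change, is_healthy, and revert callbacks are all hypothetical. In practice, the callbacks would wrap whatever tooling you use to modify and monitor your VMs.

```python
from typing import Callable, Dict, List


def plan_waves(vms_by_zone: Dict[str, List[str]], canary_size: int = 2) -> List[List[str]]:
    """Split VMs into waves: a few VMs in one zone first, then the rest of
    that zone, then one wave for each remaining zone."""
    zones = sorted(vms_by_zone)
    if not zones:
        return []
    first_zone_vms = vms_by_zone[zones[0]]
    waves = [first_zone_vms[:canary_size], first_zone_vms[canary_size:]]
    for zone in zones[1:]:
        waves.append(list(vms_by_zone[zone]))
    # Drop empty waves (for example, a zone that has no VMs).
    return [wave for wave in waves if wave]


def rollout(
    waves: List[List[str]],
    apply_change: Callable[[str], None],   # hypothetical: changes one VM
    is_healthy: Callable[[str], bool],     # hypothetical: checks one VM
    revert: Callable[[str], None],         # hypothetical: restores one VM
) -> bool:
    """Apply the change wave by wave. If any VM in a wave is unhealthy,
    revert every VM changed so far (most recent first) and stop."""
    changed: List[str] = []
    for wave in waves:
        for vm in wave:
            apply_change(vm)
            changed.append(vm)
        if not all(is_healthy(vm) for vm in wave):
            for vm in reversed(changed):
                revert(vm)
            return False
    return True
```

A return value of False indicates that the infrastructure was reverted to its previous state, at which point you would diagnose the issue before restarting the rollout from the first wave.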

For more information about strategies to reliably test and deploy changes to your Google Cloud infrastructure and applications, see Application deployment and testing strategies.

Control changes to global resources

When you modify global resources such as VPC networks and global load balancers, take extra care to verify the changes before deploying them to production.

Because global resources are resilient to zone and region outages, you might decide to use single instances of certain global resources in your architecture. In such deployments, the global resources can become single points of failure. For example, if you inadvertently misconfigure a forwarding rule of your global load balancer, the frontend can stop receiving or processing user requests. In this case, the application is effectively unavailable to users even though the backend is intact. To avoid such situations, exercise rigorous control over changes to global resources. For example, in your change-review process, you can classify any modifications to global resources as high-risk changes that additional reviewers must verify and approve.
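
As one way to picture the change-review classification described above, the following Python sketch marks any change that touches a global resource as high-risk and raises the number of approvals that it needs. Everything here is an assumption made for illustration: the resource type labels, the Change class, and the approval counts are hypothetical and are not taken from any Google Cloud API or tool.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical labels for resource types that your team treats as global.
GLOBAL_RESOURCE_TYPES = frozenset({
    "vpc_network",
    "global_forwarding_rule",
    "global_backend_service",
})


@dataclass(frozen=True)
class Change:
    resource_type: str  # for example, "global_forwarding_rule"
    resource_name: str  # for example, "web-frontend-rule"


def risk_level(change: Change) -> str:
    """Return "high" for changes to global resources, otherwise "normal"."""
    return "high" if change.resource_type in GLOBAL_RESOURCE_TYPES else "normal"


def required_approvals(changes: Iterable[Change], normal: int = 1, high_risk: int = 2) -> int:
    """A change set needs the high-risk approval count if any single
    change in it touches a global resource."""
    if any(risk_level(change) == "high" for change in changes):
        return high_risk
    return normal
```

A check like this could run as part of a review pipeline so that a change set containing, for example, a forwarding-rule modification cannot be merged until the additional reviewers have approved it.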

Monitor availability of Google Cloud infrastructure

You can monitor the current status of Google Cloud services across all regions by using the Google Cloud Service Health Dashboard. You can also view a history of the infrastructure failures (called incidents) for each service. The history page provides the details of each incident, such as the incident duration, affected zones and regions, affected services, and any recommended workarounds.
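
If you record incident details such as the affected service, regions, and duration for your own analysis, a small amount of code can summarize them. The following Python sketch assumes a hypothetical Incident record that you populate yourself from the incident history; the field names are illustrative and do not reflect the format of any Google Cloud data source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Incident:
    # Hypothetical record; fields mirror the details listed on the history page.
    service: str
    regions: List[str]
    begin: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.begin


def incidents_for(incidents: Iterable[Incident], service: str, region: str) -> List[Incident]:
    """Return the incidents that affected the given service in the given region."""
    return [i for i in incidents if i.service == service and region in i.regions]


def total_duration(incidents: Iterable[Incident]) -> timedelta:
    """Add up the durations of the given incidents."""
    return sum((i.duration for i in incidents), timedelta())
```

For example, calling total_duration(incidents_for(history, "Compute Engine", "us-central1")) would give the combined incident duration for one service in one region, where history is a list of records that you maintain.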

You can also view incidents relevant to your project using Personalized Service Health. Service Health also lets you request incident information using an API on a per-project or per-organization basis and lets you configure alerts.

Google provides regular updates about the status of each incident, including an estimated time for the next update. You can programmatically get status updates for incidents by using an RSS feed. For more information, see Incidents and the Google Cloud Service Health Dashboard.
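
To illustrate how status updates from a feed might be consumed programmatically, the following Python sketch extracts the title, timestamp, and link of each entry from feed text. It is a sketch under stated assumptions: it assumes that the feed is in Atom format (adjust the element names if the feed you use is RSS 2.0), the function name parse_incident_feed is hypothetical, and the sample feed is made-up data rather than real incident output. Fetching the feed text from the URL that the dashboard provides is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# XML namespace used by Atom feeds (assumption: the feed is Atom-formatted).
ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample data for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example status feed</title>
  <entry>
    <title>UPDATE: Example incident affecting an example service</title>
    <link href="https://example.com/incidents/abc123"/>
    <updated>2023-11-13T10:00:00+00:00</updated>
  </entry>
</feed>"""


def parse_incident_feed(feed_xml: str) -> List[Dict[str, str]]:
    """Return one dict per feed entry with its title, update time, and link."""
    root = ET.fromstring(feed_xml)
    updates = []
    for entry in root.iter(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        updates.append({
            "title": entry.findtext(f"{ATOM}title", default=""),
            "updated": entry.findtext(f"{ATOM}updated", default=""),
            # Compare with None explicitly: childless elements are falsy.
            "link": link.get("href", "") if link is not None else "",
        })
    return updates


if __name__ == "__main__":
    for update in parse_incident_feed(SAMPLE_FEED):
        print(update["updated"], update["title"], update["link"])
```

A script like this could run on a schedule and forward new entries to your team's notification channel, which complements the alerts that you configure in Personalized Service Health.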