[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Translation issue","translationIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated (UTC): 2025-08-19."],[[["\u003cp\u003eResilient systems on Compute Engine are designed to withstand failures and disruptions without service interruption.\u003c/p\u003e\n"],["\u003cp\u003eImplementing diversity across regions and zones, alongside load balancing, is crucial for mitigating zone or region failures.\u003c/p\u003e\n"],["\u003cp\u003eUsing managed instance groups (MIGs) enables autoscaling and autohealing, improving the system's ability to handle traffic fluctuations and VM failures.\u003c/p\u003e\n"],["\u003cp\u003eEmploying startup and shutdown scripts automates tasks like software installation, updates, and data backups, enhancing recovery and maintenance.\u003c/p\u003e\n"],["\u003cp\u003eRegularly backing up data to multiple locations, such as Cloud Storage or disk snapshots, is essential for preparing against data loss.\u003c/p\u003e\n"]]],[],null,["*** ** * ** ***\n\nThis document describes best practices for designing resilient systems\non Compute Engine. It provides general advice and covers some features\nin Compute Engine that can help mitigate instance downtime and prepare\nfor times when your Compute Engine instances unexpectedly fail.\n\nA resilient system is a system that can withstand a certain number of failures\nor disruptions without interrupting your service or affecting your users'\nexperience using your service. 
While Compute Engine makes every\neffort to prevent such disruptions, certain events are unpredictable, and it's\nbest to be prepared for these events.\n\nTypes of failures\n\nAt some point, one or more of your compute instances might be lost due to\nsystem or hardware failures. The following list contains some types of failure\nscenarios that you can mitigate:\n\n- **Unexpected single instance failure**\n\n Unexpected single instance failures can be due to hardware or system\n failure. You can mitigate these events by using\n [persistent disks](/compute/docs/disks/add-persistent-disk) and\n [startup scripts](#startup) to save your data and re-enable software after\n you restart the VM.\n- **Unexpected single VM reboot**\n\n At some point, you might experience an unexpected single VM failure\n and reboot. Unlike in an unexpected single instance failure, Compute Engine\n automatically reboots your VM after it fails. To help mitigate these events,\n [back up your data](#backup), use\n [Hyperdisk](/compute/docs/disks/add-hyperdisk)\n or [Persistent Disk](/compute/docs/disks/add-persistent-disk),\n and use\n [startup scripts](#startup) to quickly reconfigure software.\n- **Zone or region failures**\n\n [Zone and region](/docs/geography-and-regions) failures are rare failures\n that can cause all of your instances in a given zone or region to be inaccessible\n or fail. To mitigate these failures, create\n [diversity across regions and zones](#distribute) and implement\n [load balancing](#loadbalancing). You should also\n [back up your data](#backup) or\n [replicate your disks](/compute/docs/disks/about-regional-persistent-disk)\n across multiple zones.\n\nTips for designing resilient systems\n\nTo help mitigate compute instance failures, design your application to be\nresilient against failures, network interruptions, and unexpected disasters. 
A\nresilient system gracefully handles failures, for example, by redirecting\ntraffic from an inaccessible instance to a live instance, or by automating\ntasks on reboot.\n\nThe following general tips can help you design a system that is resilient against\nfailures.\n\nUse live migration\n\nGoogle Cloud periodically performs maintenance on its infrastructure by patching\nsystems with the latest software, performing routine tests and preventative\nmaintenance, and generally ensuring that its infrastructure is as secure, fast,\nand efficient as possible. Compute Engine employs **live migration**\nto ensure that this infrastructure maintenance is transparent by default to your\ncompute instances.\n\n[Live migration](/compute/docs/instances/live-migration-process) is a technology\nthat moves your running instances away from systems that are about to undergo\nmaintenance work. Compute Engine does this automatically for supported\ninstance types.\n\nDuring live migration, your instance might experience a decrease in performance\nfor a short period of time. For instances that demand constant, maximum\nperformance, you can configure the instances to be restarted on another host\ninstead of undergoing live migration. If you choose this option,\nCompute Engine stops the instance and restarts it on a host that isn't\ninvolved in a maintenance event. Stopping and restarting the instance is\nsuitable for applications that are built to handle instance\nfailures or reboots.\n\nTo configure your instances for live migration or to configure them to restart\ninstead of migrating, see\n[Set the host maintenance policy for a compute instance](/compute/docs/instances/setting-vm-host-options).\n\nDistribute your instances\n\nCreate instances across more than one region and zone so that you have\nalternative compute instances to point to if a zone or region containing one of\nyour instances is disrupted. 
If you create all your instances in the same zone\nor region, then you won't be able to access any of those instances if that\nzone or region becomes unreachable.\n\nUse zone-specific internal DNS names\n\nSet the default [internal DNS type](/compute/docs/internal-dns) for your project\nor organization to zonal DNS. In your applications, use zonal DNS names when\naccessing other compute instances. Internal DNS servers are distributed across\nall zones, so you can rely on zonal DNS names to resolve even if there are\nfailures in other locations.\n\nGlobal DNS is less resilient because it is exposed to single points of failure. Zonal DNS mitigates\nthe risk of cross-regional outages. Zonal DNS also doesn't require instance names\nto be unique across all regions in a project, which allows for faster instance\ncreation.\n\nTo check if an instance uses zonal DNS names or global DNS names, see\n[Determine the internal DNS name for a VM](/compute/docs/networking/using-internal-dns#view_instance_dns_name).\n\nIf your project uses global DNS names, you can switch to using\nzonal DNS names. For more information, see\n[Use Zonal DNS for your internal DNS type](/compute/docs/networking/zonal-dns).\n\nCreate groups of VMs\n\nUse [managed instance groups](/compute/docs/instance-groups#managed_instance_groups)\nto create homogeneous groups of VMs so that load balancers can direct traffic to\nmore than one VM in case a single VM becomes unhealthy.\n\nManaged instance groups (MIGs) also offer features like [autoscaling](/compute/docs/autoscaler)\nand [autohealing](/compute/docs/instance-groups/autohealing-instances-in-migs).\nAutoscaling lets you deal with spikes in traffic by scaling the number of VMs up\nor down based on specific signals. Autohealing performs health checking and, if\nnecessary, automatically recreates unhealthy VMs.\n\nMIGs are also available for regions, so you can create a group of VMs\ndistributed across multiple zones within a single region. 
For more information,\nsee [Creating and managing regional MIGs](/compute/docs/instance-groups/distributing-instances-with-regional-instance-groups).\n\nUse load balancing\n\nGoogle Cloud offers a load balancing service that helps you support periods of\nheavy traffic so that you don't overload your compute instances. With\n[Cloud Load Balancing](/load-balancing/docs/load-balancing-overview), you can\ndo the following:\n\n- Deploy your application on VMs within multiple zones using\n [regional MIGs](/compute/docs/instance-groups#types_of_managed_instance_groups).\n Then, you can configure a\n [forwarding rule](/load-balancing/docs/forwarding-rule-concepts) that can\n spread traffic across all VMs in all zones within the region. Each forwarding\n rule can define one entry point to your application using an external IP\n address.\n\n- Deploy VMs across multiple regions using global load balancing.\n HTTP(S) load balancing enables your traffic to enter the Google Cloud system\n at the location nearest the client.\n [Cross-regional load balancing](/load-balancing/docs/https/cross-region-example)\n provides redundancy so that if a region is unreachable, traffic is\n automatically diverted to another region. In this way, your service remains\n reachable using the same external IP address.\n\n- Use [autoscaling](/compute/docs/autoscaler) to automatically add or delete\n VMs from a MIG based on increases or decreases in load.\n\nAdditionally, Cloud Load Balancing offers VM health checking, providing\nsupport in detecting and handling VM failures.\n\nUse startup and shutdown scripts\n\nCompute Engine offers startup and shutdown scripts that run when an\ninstance boots up or shuts down, respectively. Startup and shutdown scripts can\nautomate tasks like installing software, running updates, making backups, and\nlogging data.\n\nBoth startup and shutdown scripts are an efficient and invaluable way to\nbootstrap or cleanly shut down your instances. 
Instead of configuring your\ninstances using custom images, it can be beneficial to configure instances\nusing startup scripts.\n\nStartup scripts run whenever the instance reboots or restarts due to failures,\nand can be used to install software and updates. You can also use startup\nscripts to ensure that services are running within the instance. Coding the\nchanges to configure an instance in a startup script is often easier than\ntrying to figure out what files or bytes have changed on a custom image.\n\nShutdown scripts run when your instance shuts down, either intentionally or not.\nThey can perform last-minute tasks like backing up data, saving logs, and\ngracefully closing connections before the instance stops.\n\nFor more information, see [Running startup scripts](/compute/docs/instances/startup-scripts)\nand [Running shutdown scripts](/compute/docs/shutdownscript).\n\nBack up your data\n\nBack up your data regularly and in multiple locations. You can\n[upload your files to Cloud Storage](/storage/docs/uploading-objects),\n[create disk snapshots](/compute/docs/disks/create-snapshots), or\nreplicate your data to a disk in another zone using\n[synchronous replication](/compute/docs/disks/about-regional-persistent-disk) or\nto another region using [asynchronous replication](/compute/docs/disks/async-pd/about)."]]