How Cloud Storage delivers 11 nines of durability—and how you can help
One of the most fundamental aspects of any storage solution is durability: how well is your data protected from loss or corruption? That question can feel especially important in a cloud environment. Cloud Storage has been designed for at least 99.999999999% annual durability, or 11 nines. That means that even with one billion objects, you would expect to lose, on average, no more than about one object every hundred years!
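As a back-of-the-envelope check on what 11 nines means, the sketch below multiplies out the numbers. It assumes object losses are independent and that the annual loss probability sits exactly at the 11-nines floor of 10^-11 per object; the function name is illustrative only.

```python
# Annual probability of losing any one object at exactly 11 nines of durability.
ANNUAL_LOSS_PROBABILITY = 1e-11  # i.e. 1 - 0.99999999999


def expected_losses(num_objects: int, years: float,
                    p_loss: float = ANNUAL_LOSS_PROBABILITY) -> float:
    """Expected number of objects lost over the period.

    Assumes every object is lost independently with the same annual
    probability, which is a simplification made for this illustration.
    """
    return num_objects * years * p_loss


if __name__ == "__main__":
    one_billion = 1_000_000_000
    print("expected losses per year:   ", expected_losses(one_billion, 1))
    print("expected losses per century:", expected_losses(one_billion, 100))
```

Under these assumptions, one billion objects give an expected loss of about 0.01 objects per year, or roughly one object per century. Because the design target is "at least" 11 nines, this is an upper bound on the expectation rather than a prediction.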
We take achieving our durability targets very seriously. In this post, we’ll explore the top ways we protect Cloud Storage data. At the same time, data protection is ultimately a shared responsibility (the most common cause of data loss is accidental deletion by a user or storage administrator), so we’ll provide best practices to help protect your data against risks like natural disasters and user errors.
Most people think about durability in the context of protecting against network, server, and storage hardware failures.
At Google, our philosophy is that software is ultimately the best way to protect against hardware failures. This allows us to attain higher reliability at an attractive cost, instead of depending on exotic hardware solutions. We assume hardware will fail all the time—because it does! But that doesn’t mean durability has to suffer.
To store an object in Cloud Storage, we break it up into a number of ‘data chunks’, which we place on different servers with different power sources. We also create a number of ‘code chunks’ for redundancy. In the event of a hardware failure (e.g., server, disk), we use data and code chunks to reconstruct the entire object. This technique is called erasure coding. In addition, we store several copies of the metadata needed to find and read the object, so that if one or more metadata servers fail, we can continue to access the object.
The key requirement here is that we always store data redundantly across multiple availability zones before a write is acknowledged as successful. The encodings we use provide sufficient redundancy to support a target of more than 11 nines of durability against a hardware failure. Once stored, we regularly verify checksums to guard data at rest from certain types of data errors. In the case of a checksum mismatch, data is automatically repaired using the redundancy present in our encodings.
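To make the idea of data chunks, code chunks, and checksum-driven repair concrete, here is a deliberately simplified sketch. It is not the encoding Cloud Storage uses (this post does not name one): it swaps in the simplest possible erasure code, a single XOR parity chunk, which can rebuild only one missing or corrupted chunk, and it uses `zlib.crc32` as a stand-in checksum. All function names are made up for illustration.

```python
import zlib
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> tuple[list[bytes], bytes]:
    """Split data into k equal-size data chunks plus one XOR code chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")  # zero-pad the final chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks, parity


def scrub_and_repair(chunks: list, parity: bytes, checksums: list[int]):
    """Verify each data chunk against its checksum and rebuild a bad one.

    Returns the index of the repaired chunk, or None if all chunks were
    healthy. A single parity chunk can repair at most one data chunk, and
    this sketch assumes the parity chunk itself is intact.
    """
    bad = [i for i, c in enumerate(chunks)
           if c is None or zlib.crc32(c) != checksums[i]]
    if not bad:
        return None
    if len(bad) > 1:
        raise ValueError("single XOR parity can repair only one chunk")
    survivors = [c for i, c in enumerate(chunks) if i != bad[0]]
    # XOR of the parity with all surviving chunks yields the missing chunk.
    chunks[bad[0]] = reduce(xor_bytes, survivors, parity)
    return bad[0]


if __name__ == "__main__":
    payload = b"hello, durable world"
    chunks, parity = encode(payload, k=4)
    checksums = [zlib.crc32(c) for c in chunks]

    chunks[2] = None  # simulate a failed disk holding chunk 2
    repaired = scrub_and_repair(chunks, parity, checksums)
    restored = b"".join(chunks)[:len(payload)]
    print("repaired chunk:", repaired, "| intact:", restored == payload)
```

Production-grade erasure codes typically keep several code chunks so that multiple simultaneous failures can be tolerated; the single-parity scheme here only shows the principle of reconstructing data from redundancy.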
Best practice: use dual-region or multi-region locations
These layers of protection against physical durability risks are well and good, but they may not protect against substantial physical destruction of a region—think acts of war, an asteroid hit, or other large-scale disasters.
Cloud Storage’s 11 nines durability target applies to a single region. To go further and protect against natural disasters that could wipe out an entire region, consider storing your most important data in dual-region or multi-region buckets. These buckets automatically ensure redundancy of your data across geographic regions. Using these buckets requires no additional configuration or API changes to your applications, while providing added durability against very rare, but potentially catastrophic, events. As an added benefit, these location types also come with significantly higher availability SLAs, because we can transparently serve your objects from more than one location if a region is temporarily inaccessible.
Durability in transit
Another class of durability risks concerns corruption of data in transit. This could affect data transferred across networks within the Cloud Storage service itself, or data being uploaded to or downloaded from Cloud Storage.
To protect against this source of corruption, data in transit within Cloud Storage is designed to be always checksum-protected, without exception. In the case of a checksum-validation error, the request is automatically retried, or an error is returned, depending on the circumstances.
Best practice: use checksums for uploads and downloads
While Google Cloud checksums all Cloud Storage objects that travel within our service, to achieve end-to-end protection, we recommend that you provide checksums when you upload your data to Cloud Storage, and validate these checksums on the client when you download an object.
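The client-side half of this practice can be sketched with the standard library alone. The example assumes an MD5 digest in base64 form, which is one of the checksum formats Cloud Storage works with (CRC32C is the other, but it is not available in the Python standard library). The helper names are hypothetical, and the actual upload and download calls are left out: you would pass the computed value along with your upload request and compare it against the bytes you later download.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Return the MD5 digest of data, base64-encoded."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def verify_download(data: bytes, expected_md5_b64: str) -> None:
    """Raise ValueError if downloaded bytes do not match the upload checksum."""
    actual = md5_base64(data)
    if actual != expected_md5_b64:
        raise ValueError(
            f"checksum mismatch: expected {expected_md5_b64}, got {actual}"
        )


if __name__ == "__main__":
    body = b"quarterly-report contents"
    checksum_at_upload = md5_base64(body)      # send this with the upload

    verify_download(body, checksum_at_upload)  # passes silently
    try:
        verify_download(body + b"!", checksum_at_upload)  # simulated corruption
    except ValueError as err:
        print("rejected:", err)
```

Checking on the client closes the gap between your application and the service boundary, which the service's internal checksums cannot cover on their own.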
Human-induced durability risks
Arguably the biggest risk of data loss is due to human error—not only errors made by us as developers and operators of the service, but also errors made by Cloud Storage users!
Software bugs are potentially the single biggest risk to data durability. To avoid durability loss from software bugs, we take steps to avoid introducing data-corrupting or data-erasing bugs in the first place. We then maintain safeguards to detect these types of bugs quickly, with the aim of catching them before durability degradation turns into durability loss.
To catch bugs up front, we only release a new version of Cloud Storage to production after it passes a large set of integration tests. These include exercising a variety of edge-case failure scenarios such as an availability zone going down, and comparing the behaviors of data encoding and placement APIs to previous versions to screen for regressions.
Once a new software release is approved, we roll out upgrades in stages by availability zone, starting with a very limited initial area of impact and slowly ramping up until it is in widespread use. This allows us to catch issues before they have a large impact and while there are still additional copies of data (or a sufficient number of erasure code chunks) from which to recover, if needed. These software rollouts are monitored closely with plans in place for quick rollbacks, if necessary.
There’s a lot you can do, too, to protect your data from being lost.
Best practice: turn on object versioning
One of the most common sources of data loss is accidental deletion of data by a storage administrator or end-user. When you turn on object versioning, Cloud Storage preserves deleted objects in case you need to restore them at a later time. By configuring Object Lifecycle Management policies, you can limit how long you keep versioned objects before they are permanently deleted in order to better control your storage costs.
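As an illustration of pairing versioning with a lifecycle policy, the sketch below builds a lifecycle configuration as a Python dictionary and prints it as JSON. The field names (`rule`, `action`, `condition`, `daysSinceNoncurrentTime`, `numNewerVersions`) follow the Object Lifecycle Management configuration format as we understand it; treat them as assumptions and check them against the current documentation before use. The function name and the thresholds are purely illustrative.

```python
import json


def noncurrent_version_cleanup(days_noncurrent: int,
                               newer_versions: int) -> dict:
    """Build a lifecycle config that deletes old noncurrent object versions.

    A version is deleted only when BOTH conditions hold: it has been
    noncurrent for at least `days_noncurrent` days and at least
    `newer_versions` newer versions of the object exist.
    """
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {
                    "daysSinceNoncurrentTime": days_noncurrent,
                    "numNewerVersions": newer_versions,
                },
            }
        ]
    }


if __name__ == "__main__":
    # Example thresholds only: keep deleted or overwritten data restorable
    # for 30 days, and always retain the 3 most recent versions.
    print(json.dumps(noncurrent_version_cleanup(30, 3), indent=2))
```

Combining both conditions keeps a recovery window open after an accidental deletion while still bounding how many old versions accumulate.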
Best practice: back up your data
Cloud Storage’s 11-nines durability target does not obviate the need to back up your data. For example, consider what a malicious hacker might do if they obtained access to your Cloud Storage account. Depending on your goals, a backup may be a second data copy in another region or cloud, on-premises, or even physically isolated with an air gap on tape or disk.
Best practice: use data access retention policies and audit logs
For long-term data retention, use the Cloud Storage bucket lock feature to set data retention policies and ensure data is locked for specific periods of time. Doing so prevents accidental modification or deletion and, when combined with data access audit logging, can satisfy regulatory and compliance requirements such as those of FINRA, the SEC, and the CFTC, as well as certain health care industry retention regulations.
Best practice: use role-based access control policies
You can limit the blast radius of malicious hackers and accidental deletions by ensuring that IAM data access control policies follow the principles of separation of duties and least privilege. For example, separate those with the ability to create buckets from those who can delete projects.
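One way to audit separation of duties is to scan an IAM policy for principals that hold roles from two groups you want kept apart. The sketch below works on a plain dictionary shaped like an IAM policy (a `bindings` list of `role` and `members` entries). How you group roles is a judgment call for your organization; the two groups and the member names used here are hypothetical examples.

```python
def members_violating_separation(policy: dict,
                                 role_group_a: set,
                                 role_group_b: set) -> set:
    """Return members holding at least one role from each of the two groups."""

    def holders(roles: set) -> set:
        # Collect every member bound to any role in the given group.
        return {
            member
            for binding in policy.get("bindings", [])
            if binding["role"] in roles
            for member in binding["members"]
        }

    return holders(role_group_a) & holders(role_group_b)


if __name__ == "__main__":
    # Hypothetical grouping: who may upload vs. who may delete objects.
    upload_roles = {"roles/storage.objectCreator"}
    delete_capable_roles = {"roles/storage.objectAdmin", "roles/storage.admin"}

    policy = {
        "bindings": [
            {"role": "roles/storage.objectCreator",
             "members": ["user:alice@example.com", "user:carol@example.com"]},
            {"role": "roles/storage.objectAdmin",
             "members": ["user:bob@example.com", "user:carol@example.com"]},
        ]
    }
    print(members_violating_separation(policy, upload_roles,
                                       delete_capable_roles))
```

Any member the check returns is a candidate for having one of the two roles removed, in line with least privilege.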
Encryption keys and durability
All Cloud Storage data is designed to always be encrypted at rest and in transit within the cloud. Because objects are unreadable without their encryption keys, the loss of encryption keys is a significant risk to durability—after all, what use is highly durable data if you can’t read it? With Cloud Storage, you have three choices for key management: 1) trust Google to manage the encryption keys for you, 2) use Customer Managed Encryption Keys (CMEK) with Cloud KMS, or 3) use Customer Supplied Encryption Keys (CSEK) with an external key server.
Google takes steps similar to those described earlier (including erasure coding and consistency checking) to protect the durability of the encryption keys under its control.
Best practice: safeguard your encryption keys
By choosing either CMEK or CSEK, you take direct control of managing your own keys. It is vital in these cases that you protect your keys in a manner that also provides at least 11 nines of durability. For CSEK, this means maintaining off-site backups of your keys so that you have a path to recovery even if your keys are lost or corrupted in some way. If such precautions are not taken, the durability of the encryption keys will determine the durability of the data.
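A short calculation shows why key durability becomes the limiting factor. The sketch assumes that data loss and key loss are independent events and that an object is unreadable if either occurs. The key-loss probability of 1 in 10,000 per year is a made-up figure for illustration, not a measured one.

```python
import math


def combined_annual_loss(p_data_loss: float, p_key_loss: float) -> float:
    """Probability that an object becomes unreadable in a year.

    The object is lost if the data is lost OR its key is lost, with the
    two events assumed independent.
    """
    return p_data_loss + p_key_loss - p_data_loss * p_key_loss


def nines(p_loss: float) -> float:
    """Express a loss probability as a (fractional) number of nines."""
    return -math.log10(p_loss)


if __name__ == "__main__":
    data_loss = 1e-11  # 11 nines for the stored data
    key_loss = 1e-4    # hypothetical: keys protected to only 4 nines

    effective = combined_annual_loss(data_loss, key_loss)
    print(f"effective durability: about {nines(effective):.2f} nines")
```

With these inputs the effective durability collapses to roughly 4 nines: the weaker of the two protections dominates, which is why keys need to be protected to the same standard as the data they unlock.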
Going beyond 11 nines
Google Cloud takes the responsibility of protecting your data extremely seriously. In practice, the numerous techniques outlined here have allowed Cloud Storage to exceed 11 nines of annual durability to date. Add to that the best practices we shared in this guide, and you’ll help to ensure that your data is here when you need it—whether that be later today or decades in the future. To get started, check out this comprehensive collection of Cloud Storage how-to guides.
Thanks to Dean Hildebrand, Technical Director, Office of the CTO, who is a coauthor of the document on which this post is based.