Cloud Storage encourages you to validate the data you transfer to and from your buckets. This page describes best practices for performing validations using either CRC32C or MD5 checksums.
Protect against data corruption by using hashes
Data can be corrupted in several ways while it is uploaded to or downloaded from Cloud Storage:
- Memory errors on client or server computers, or routers along the path
- Software bugs (e.g., in a library that customers use)
- Changes to the source file when an upload occurs over an extended period of time
Cloud Storage supports two types of hashes you can use to check the integrity of your data: CRC32C and MD5. CRC32C is the recommended validation method for performing integrity checks. Customers that prefer MD5 can use that hash, but MD5 hashes are not supported for composite objects or objects created from an XML API multipart upload.
CRC32C
All Cloud Storage objects have a CRC32C hash. Libraries for computing CRC32C include:
- Google's CRC32C for C++.
- hash/crc32 for Go.
- Google's Guava for Java.
- google-crc32c for Python.
- digest-crc for Ruby.
The Base64-encoded CRC32C value represents the 32-bit checksum in big-endian byte order.
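To make the byte order concrete, the following is a minimal, dependency-free Python sketch that computes a CRC32C and Base64-encodes its four bytes in big-endian order. The names crc32c and crc32c_base64 are illustrative only and are not part of any Cloud Storage library. A pure-Python loop like this is slow on large objects, so one of the libraries listed above is likely a better fit in practice.

```python
import base64
import struct


def _make_table():
    """Build the lookup table for the reflected Castagnoli polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data, crc=0):
    """Return the CRC32C of data; pass a previous result as crc to continue it."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(data):
    """Return the checksum as Base64 text of its 4 bytes in big-endian order."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")
```

For example, crc32c(b"123456789") returns 0xE3069283, the standard check value for CRC32C, and crc32c_base64(b"123456789") encodes those four bytes (E3 06 92 83) in that order.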
MD5
Cloud Storage supports an MD5 hash for objects that meet the following criteria:
- The object is not a composite object
- The object was not uploaded using an XML API multipart upload
This hash applies only to a complete object, so it cannot be used to verify the integrity of a partial download made with a range GET.
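Because the MD5 covers the complete object, a local value has to be computed over every byte of the file. The following sketch, using only the Python standard library, hashes a binary stream in chunks and returns the Base64 form that Cloud Storage reports. The function name md5_base64 and the 1 MiB default chunk size are arbitrary choices for illustration.

```python
import base64
import hashlib
import io


def md5_base64(stream, chunk_size=1024 * 1024):
    """Return the Base64-encoded MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns empty bytes.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


# Usage with an in-memory stream; a file opened in "rb" mode works the same way.
print(md5_base64(io.BytesIO(b"hello world")))
```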
ETags
An object's ETag header returns the object's MD5 value if all the following are true:
- The request is being made through the XML API
- The object only uses Google-managed encryption keys for server-side encryption
- The object is not a composite object and was not uploaded using an XML API multipart upload
In all other cases, users should make no assumptions about the value used in an ETag except that it changes whenever the underlying data or metadata changes, per the specification.
The same object can have a different ETag value when it's requested from the XML API compared to the JSON API.
Validation
A download integrity check can be performed by hashing downloaded data on the fly and comparing your results to server-supplied hashes. You should discard downloaded data with incorrect hash values, and you should cap the number of retries so that repeated failures do not turn into a potentially expensive infinite loop.
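The download check described above can be sketched as follows. This is an illustration under stated assumptions, not a Cloud Storage client: fetch_chunks is a hypothetical callable that starts a fresh download and yields the object's bytes in chunks, and expected_md5_b64 is the Base64 MD5 reported by the server. MD5 is used here only because it is in the Python standard library; the same structure applies to CRC32C.

```python
import base64
import hashlib


class IntegrityError(Exception):
    """Raised when every download attempt fails the hash comparison."""


def download_validated(fetch_chunks, expected_md5_b64, max_attempts=3):
    """Download via fetch_chunks (hypothetical), hashing on the fly.

    Data from an attempt whose hash does not match is discarded, and the
    number of attempts is bounded so a persistent mismatch cannot loop forever.
    """
    for _ in range(max_attempts):
        digest = hashlib.md5()
        chunks = []
        for chunk in fetch_chunks():
            digest.update(chunk)  # hash each chunk as it arrives
            chunks.append(chunk)
        actual = base64.b64encode(digest.digest()).decode("ascii")
        if actual == expected_md5_b64:
            return b"".join(chunks)
        # Mismatch: drop this attempt's data and try again.
    raise IntegrityError(
        "hash mismatch after %d attempts" % max_attempts
    )
```

For large objects you would likely write chunks to a temporary file instead of holding them in memory, and delete that file on a mismatch.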
Cloud Storage performs server-side validation in the following cases:
- When you perform a copy or rewrite request within Cloud Storage.
- When supplying an object's expected MD5 or CRC32C hash in an upload request. Cloud Storage only creates the object if the hash you provide matches the value Cloud Storage calculates. If it does not match, the request is rejected with a BadRequestException: 400 error.
Alternatively, users can choose to perform client-side validation of their uploads by issuing a request for the uploaded object's metadata, comparing the reported hash value, and deleting the object in case of a mismatch. This method is useful if the object's MD5 or CRC32C hash isn't known at the start of the upload. To avoid race conditions where independent processes delete or replace each other's data, use object generations and preconditions.
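The client-side flow above can be sketched as follows. The two callables are hypothetical stand-ins for whatever client you use: get_metadata is assumed to return a dict with the JSON API object properties md5Hash and generation, and delete_object is assumed to accept an if_generation_match argument that maps to the ifGenerationMatch precondition.

```python
def validate_uploaded_object(get_metadata, delete_object, local_md5_b64):
    """Compare the server-reported MD5 with a locally computed one.

    get_metadata and delete_object are hypothetical callables supplied by
    the caller. Returns True if the hashes match; otherwise deletes the
    object and returns False.
    """
    metadata = get_metadata()
    if metadata["md5Hash"] == local_md5_b64:
        return True
    # Delete only the generation that was just checked, so an object that
    # another process has since replaced is left untouched.
    delete_object(if_generation_match=metadata["generation"])
    return False
```

Passing the generation that was inspected as the precondition is what guards against the race: if the object has been overwritten in the meantime, the conditional delete fails instead of removing someone else's data.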
In the case of parallel composite uploads, users should perform an integrity check for each component upload and then use component preconditions with their compose requests to protect against race conditions. Object composition offers no server-side MD5 validation, so users who wish to perform an end-to-end integrity check should apply client-side validation to the new composite object.
XML API
In the XML API, base64-encoded MD5 and CRC32C hashes are exposed and accepted via the x-goog-hash header. In the past, MD5s were used as object ETags, but users should avoid assuming this since some objects use opaque ETag values that make no guarantees outside of changing when the object changes.
Server-side upload validation can be performed by supplying locally computed hashes via the x-goog-hash request header. Additionally, the MD5 can be supplied using the standard HTTP Content-MD5 header (see the specification).
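As a rough illustration of supplying locally computed hashes on an XML API upload, the sketch below builds the request headers from Base64 hash strings you have already computed. The helper name upload_hash_headers is hypothetical, and the value layout (crc32c=VALUE and md5=VALUE, one x-goog-hash header per hash) is an assumption you should confirm against the XML API header reference.

```python
def upload_hash_headers(md5_b64=None, crc32c_b64=None):
    """Return (name, value) header pairs for the hashes that were provided.

    The "crc32c=" / "md5=" value layout is an assumption to verify against
    the XML API reference; this helper is illustrative only.
    """
    headers = []
    if crc32c_b64 is not None:
        headers.append(("x-goog-hash", "crc32c=" + crc32c_b64))
    if md5_b64 is not None:
        headers.append(("x-goog-hash", "md5=" + md5_b64))
    return headers


# Example with the Base64 MD5 of an empty payload.
for name, value in upload_hash_headers(md5_b64="1B2M2Y8AsgTpgAmY7PhCfg=="):
    print(name + ": " + value)
```

If you prefer the standard header, Content-MD5 carries the same Base64 MD5 string without the md5= prefix.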
JSON API
In the JSON API, the objects resource md5Hash and crc32c properties contain base64-encoded MD5 and CRC32C hashes, respectively. Providing either metadata property is optional. Supplying either property as part of a resumable upload or JSON API multipart upload triggers server-side validation for the new object. If Cloud Storage calculates a value for either property that does not match a supplied value, the object is not created. If the properties are not provided in an upload, Cloud Storage calculates the values and writes them to the object's metadata.
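The metadata part of such an upload might be assembled as in the sketch below. The helper name object_resource_json is hypothetical; name, md5Hash, and crc32c are objects resource properties. Because both hash properties are optional, the helper includes only the ones you pass in.

```python
import json


def object_resource_json(name, md5_b64=None, crc32c_b64=None):
    """Serialize object metadata, including only the hashes that were given.

    Any hash included here asks Cloud Storage to validate the upload
    against it; omitted hashes are calculated by the server instead.
    """
    resource = {"name": name}
    if md5_b64 is not None:
        resource["md5Hash"] = md5_b64
    if crc32c_b64 is not None:
        resource["crc32c"] = crc32c_b64
    return json.dumps(resource)


# Example: metadata for an empty object, validated by MD5 only.
print(object_resource_json("example.txt", md5_b64="1B2M2Y8AsgTpgAmY7PhCfg=="))
```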
gcloud storage and gsutil
For both the gcloud storage and gsutil command line tools, data copied to or from a Cloud Storage bucket is validated. This applies to cp, mv, and rsync commands. If the checksum of the source data does not match the checksum of the destination data, the tools delete the invalid copy and print a warning message. This very rarely happens. If it does, you should retry the operation.
For both CLIs, this automatic validation occurs after the object itself is finalized, so invalid objects are visible for 1-3 seconds before they're identified and deleted. Additionally, there is a chance that the tools could be interrupted after the upload completes but before they perform their validation, leaving the invalid object in place. These issues can be avoided when uploading single files to Cloud Storage by using server-side validation, which occurs when using the following flags:
- For gcloud storage, use the flag --content-md5=MD5.
- For gsutil, use the flag -h Content-MD5:MD5.
What's next
- Explore upload and download options for Cloud Storage.
- Learn about retry strategies for Cloud Storage.