Hashes and ETags: best practices

Cloud Storage encourages users to validate the data they transfer to and from their buckets using either CRC32C or MD5 checksums. This section describes best practices for performing these validations.

Using hashes for integrity checking

Data can be corrupted in several ways while it is uploaded to or downloaded from the Cloud:

  • Noisy network links
  • Memory errors on client or server computers, or routers along the path
  • Software bugs (e.g., in a library that customers use)

To protect against data corruption, Cloud Storage supports two types of hashes: CRC32C and MD5. CRC32C is the recommended validation method. Customers who prefer MD5 can use that hash, but MD5 hashes are not supported for composite objects.

CRC32C

All Cloud Storage objects have a CRC32C hash. Libraries for computing CRC32C are available for most widely used programming languages.

The Base64-encoded CRC32C is in big-endian byte order.
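As a minimal sketch of the encoding described above, the following pure-Python snippet computes a CRC32C (Castagnoli polynomial) with a lookup table and then Base64-encodes the 4-byte big-endian form of the result. The names crc32c and crc32c_base64 are illustrative and are not part of any Cloud Storage library; a production client would normally rely on an optimized library instead.

```python
import base64
import struct

# Reflected form of the Castagnoli polynomial used by CRC32C.
_POLY = 0x82F63B78


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data; pass a previous result to continue a stream."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(crc: int) -> str:
    """Base64-encode a CRC32C value over its big-endian byte representation."""
    return base64.b64encode(struct.pack(">I", crc)).decode("ascii")
```

Because crc32c accepts a running value, a large file can be hashed chunk by chunk without holding the whole object in memory.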

MD5

Cloud Storage supports an MD5 hash for non-composite objects. This hash applies only to a complete object, so it cannot be used to check the integrity of a partial download that results from a range GET.
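A short sketch of computing an MD5 in the Base64 form that Cloud Storage reports, assuming the complete object is available as a sequence of byte chunks. The helper name md5_base64 is hypothetical. The digest covers every chunk, which is why it cannot validate a byte range on its own.

```python
import base64
import hashlib


def md5_base64(chunks) -> str:
    """Return the Base64-encoded MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)  # the digest accumulates over the whole object
    return base64.b64encode(digest.digest()).decode("ascii")
```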

ETags

For non-composite objects, the XML API uses the object's MD5 for the value of the ETag header. In all other cases, users should make no assumptions about the value used in an ETag except that it changes whenever the underlying data changes, per the specification.

The same object can have a different ETag value when it's requested from the XML API compared to the JSON API.

Validation

A download integrity check can be performed by hashing downloaded data on the fly and comparing your results to server-supplied hashes. You should discard downloaded data with incorrect hash values, and you should retry with a bounded number of attempts so that persistent corruption does not lead to a potentially expensive infinite loop.
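The download check described above can be sketched as follows. This is an illustration under stated assumptions rather than a reference client: fetch_chunks is a hypothetical zero-argument callable that starts a fresh download and yields the object's bytes in chunks, and the expected hash is the Base64-encoded MD5 supplied by the server. MD5 is used here only because it is in the Python standard library; the same shape applies to CRC32C.

```python
import base64
import hashlib


class IntegrityError(Exception):
    """Raised when every attempt produced data with a mismatched hash."""


def download_with_validation(fetch_chunks, expected_md5_b64, max_attempts=3):
    """Hash the download on the fly and retry a bounded number of times.

    fetch_chunks is a hypothetical callable; each call restarts the download.
    """
    for _ in range(max_attempts):
        digest = hashlib.md5()
        received = []
        for chunk in fetch_chunks():
            digest.update(chunk)  # hash while the data streams in
            received.append(chunk)
        actual = base64.b64encode(digest.digest()).decode("ascii")
        if actual == expected_md5_b64:
            return b"".join(received)
        # Mismatch: this attempt's data is discarded before retrying.
    raise IntegrityError(f"hash mismatch after {max_attempts} attempts")
```

Capping the attempts with max_attempts is what keeps a persistently corrupted transfer from looping forever.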

Cloud Storage performs server-side validation in the following cases:

  • When you perform a copy or rewrite request within Cloud Storage.

  • When you supply an object's expected MD5 or CRC32C hash in an upload request. Cloud Storage only creates the object if the hash you provide matches the value Cloud Storage calculates.

Alternatively, users can choose to perform client-side validation of their uploads by issuing a request for the uploaded object's metadata, comparing the reported hash value, and deleting the object in case of a mismatch. This method is useful if the object's MD5 or CRC32C hash isn't known at the start of the upload. To avoid race conditions where independent processes delete or replace each other's data, we also recommend that you use object generations and preconditions.
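A minimal sketch of this client-side check, assuming two hypothetical callables that stand in for whichever client library is in use: get_metadata(name) returns a dict with the JSON API property names md5Hash and generation, and delete_object(name, if_generation_match=...) deletes the object only if its generation still matches.

```python
def validate_upload(get_metadata, delete_object, name, local_md5_b64):
    """Compare the stored hash with the local one; delete the object on mismatch.

    get_metadata and delete_object are hypothetical stand-ins for real
    client calls; they are not functions of any particular library.
    """
    metadata = get_metadata(name)
    if metadata["md5Hash"] == local_md5_b64:
        return True
    # Delete only the generation that was checked, so a newer object written
    # by another process under the same name is left alone.
    delete_object(name, if_generation_match=metadata["generation"])
    return False
```

Passing the checked generation as a precondition is the design choice that avoids the race: if another process has replaced the object in the meantime, the delete is expected to fail instead of removing the newer data.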

In the case of parallel composite uploads, users should perform an integrity check for each component upload and then use component preconditions with their compose requests to protect against race conditions. Object composition offers no server-side MD5 validation, so users who wish to perform an end-to-end integrity check should apply client-side validation to the new composite object.

At the end of each copy operation, the gsutil cp and rsync commands validate that the checksum of the local file matches the checksum of the object stored in Cloud Storage. If it does not, gsutil deletes the invalid copy and prints a warning message. This happens very rarely. If it does, you can retry the operation.

XML API

In the XML API, base64-encoded MD5 and CRC32C hashes are exposed and accepted via the x-goog-hash header. In the past, MD5s were used as object ETags, but users should avoid assuming this, since some objects use opaque ETag values that guarantee only that they change when the object changes.

Server-side upload validation can be performed by supplying locally computed hashes via the x-goog-hash request header. Additionally, the MD5 can be supplied using the standard HTTP Content-MD5 header (see the specification).
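As a hedged illustration, the request headers could be assembled as shown below. The sketch assumes that x-goog-hash takes comma-separated crc32c= and md5= pairs with Base64 values; confirm the exact format against the XML API reference. The function name upload_hash_headers and its parameters are hypothetical, and the CRC32C is passed in as an integer computed elsewhere.

```python
import base64
import hashlib
import struct


def upload_hash_headers(data: bytes, crc32c_value=None, use_content_md5=False):
    """Build hash headers for an upload request (illustrative helper).

    crc32c_value is an optional precomputed CRC32C integer; it is encoded
    over its big-endian bytes. With use_content_md5=True the MD5 is sent in
    the standard Content-MD5 header instead of inside x-goog-hash.
    """
    md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    headers = {}
    parts = []
    if crc32c_value is not None:
        crc_b64 = base64.b64encode(struct.pack(">I", crc32c_value)).decode("ascii")
        parts.append(f"crc32c={crc_b64}")
    if use_content_md5:
        headers["Content-MD5"] = md5_b64
    else:
        parts.append(f"md5={md5_b64}")
    if parts:
        headers["x-goog-hash"] = ",".join(parts)
    return headers
```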

JSON API

In the JSON API, the objects resource md5Hash and crc32c properties contain base64-encoded MD5 and CRC32C hashes, respectively. Providing either metadata property is optional. Supplying either property as part of a resumable upload or JSON API multipart upload triggers server-side validation for the new object. If Cloud Storage calculates a value for either property that does not match a supplied value, the object is not created. If the properties are not provided in an upload, Cloud Storage calculates the values and writes them to the object's metadata.
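A small sketch of the metadata body for such an upload. The md5Hash and crc32c property names come from the text above; the helper name object_metadata, the name property as used here, and the choice to pass the CRC32C as a precomputed integer are assumptions made for illustration.

```python
import base64
import hashlib
import json
import struct


def object_metadata(name: str, data: bytes, crc32c_value=None) -> str:
    """Return a JSON metadata body carrying the optional hash properties."""
    metadata = {
        "name": name,
        "md5Hash": base64.b64encode(hashlib.md5(data).digest()).decode("ascii"),
    }
    if crc32c_value is not None:
        # CRC32C is encoded over its 4-byte big-endian representation.
        packed = struct.pack(">I", crc32c_value)
        metadata["crc32c"] = base64.b64encode(packed).decode("ascii")
    return json.dumps(metadata)
```

Either property can be left out, in which case the server-calculated value would simply be written to the object's metadata.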