Cloud Storage encourages you to validate the data you transfer
to and from your buckets. This page describes best practices for performing
validations using either CRC32C or MD5 checksums and describes the change
detection algorithm used by the gcloud storage rsync command.
Protect against data corruption by using hashes
Data can be corrupted in several ways while it is being uploaded to or downloaded from the cloud:
- Memory errors on client or server computers, or routers along the path
- Software bugs (e.g., in a library that customers use)
- Changes to the source file when an upload occurs over an extended period of time
Cloud Storage supports two types of hashes you can use to check the integrity of your data: CRC32C and MD5. CRC32C is the recommended validation method for performing integrity checks. Customers who prefer MD5 can use that hash, but MD5 hashes are not supported for all objects; for example, composite objects don't have MD5 checksums.
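As a rough illustration of the two hash types, the following Python sketch computes both values for a byte string. It assumes, beyond what this page states, that Cloud Storage reports the CRC32C as the base64 encoding of the 4-byte big-endian checksum and the MD5 as the base64 encoding of the 16-byte digest. The function names are hypothetical, and the pure-Python CRC32C is meant for readability; a hardware-accelerated library is a better fit for large objects.

```python
import base64
import hashlib
import struct

# Reflected form of the Castagnoli polynomial used by CRC32C.
_POLY = 0x82F63B78


def _build_table():
    """Precompute the CRC32C value of every possible byte."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data, optionally continuing from a prior value."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_b64(data: bytes) -> str:
    """Base64 of the 4-byte big-endian CRC32C (assumed wire format)."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")


def md5_b64(data: bytes) -> str:
    """Base64 of the 16-byte MD5 digest (assumed wire format)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
```

Because crc32c accepts a running value, a large file can be hashed chunk by chunk by passing each result back in as the crc argument for the next chunk.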
Client-side validation
You can perform an integrity check for downloaded data by hashing the data on the fly and comparing your results to the server-supplied checksums. Note, however, that the server-supplied checksums are based on the complete object as it's stored in Cloud Storage, which means the following types of downloads can't be validated using the server-supplied hashes:
- Downloads which undergo decompressive transcoding, because the server-supplied checksum represents the object in its compressed state, while the served data has compression removed and consequently a different hash value.
- A response that contains only a portion of the object data, which occurs when making a range request. Cloud Storage recommends using range requests only for restarting the download of a full object after the last received offset, because in that case you can compute and validate the checksum when the full download completes.
In cases where you can validate the download using checksums, you should discard downloaded data with incorrect hash values and use the recommended retry logic to retry the request.
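The download check described above can be sketched in Python as follows. This is a minimal illustration, not a client library API: download_validated and its fetch parameter are hypothetical names, fetch stands in for whatever call streams the full object, and the expected value is assumed to be the base64-encoded MD5 supplied by the server. The plain retry loop is a placeholder; production code would follow the recommended retry strategy, typically with backoff between attempts.

```python
import base64
import hashlib
from typing import Callable, Iterable


def download_validated(
    fetch: Callable[[], Iterable[bytes]],
    expected_md5_b64: str,
    max_attempts: int = 3,
) -> bytes:
    """Stream a full object via fetch(), hashing chunks as they arrive.

    fetch is a hypothetical callable that starts a new full-object
    download and yields its bytes in chunks.
    """
    for _ in range(max_attempts):
        digest = hashlib.md5()
        chunks = []
        for chunk in fetch():
            digest.update(chunk)  # hash on the fly, no second pass needed
            chunks.append(chunk)
        actual = base64.b64encode(digest.digest()).decode("ascii")
        if actual == expected_md5_b64:
            return b"".join(chunks)
        # Mismatch: discard this copy and request the object again.
    raise ValueError(f"checksum mismatch after {max_attempts} attempts")
```

The sketch buffers the object in memory for brevity. When writing to disk instead, the same idea applies: write to a temporary file and only move it into place once the hash matches.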
Server-side validation
Cloud Storage performs server-side validation in the following cases:
- When you perform a copy or rewrite request within Cloud Storage.
- When supplying an object's expected MD5 or CRC32C hash in an upload request. Cloud Storage only creates the object if the hash you provide matches the value Cloud Storage calculates. If it does not match, the request is rejected with a BadRequestException: 400 error.
  - In applicable JSON API requests, you supply checksums as part of the objects resource.
  - In applicable XML API requests, you supply checksums using the x-goog-hash header. The XML API also accepts the standard HTTP Content-MD5 header (see the specification).
Alternatively, you can perform client-side validation of your uploads by issuing a request for the uploaded object's metadata, comparing the uploaded object's hash value to the expected value, and deleting the object in case of a mismatch. This method is useful if the object's MD5 or CRC32C hash isn't known at the start of the upload.
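To make the two upload approaches more concrete, here is a small Python sketch. It assumes, beyond what this page states, that the x-goog-hash header value takes the form md5=<base64 digest> and that the JSON API object resource exposes the MD5 in a field named md5Hash. The helper names are hypothetical, and no network calls are made.

```python
import base64
import hashlib


def md5_b64(data: bytes) -> str:
    """Base64-encoded MD5 digest of data."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def x_goog_hash_header(data: bytes) -> dict:
    """Header for an XML API upload so the server can validate the data.

    Assumes the "md5=<base64 digest>" value format.
    """
    return {"x-goog-hash": "md5=" + md5_b64(data)}


def upload_matches(metadata: dict, data: bytes) -> bool:
    """Client-side check after an upload, given the object's metadata.

    metadata is assumed to be the parsed JSON object resource. A missing
    md5Hash (as for composite objects) is treated as a mismatch here;
    such objects would need a CRC32C comparison instead.
    """
    return metadata.get("md5Hash") == md5_b64(data)
```

With the header approach the server rejects a bad upload before the object is created. With the metadata approach the caller is responsible for deleting the object when upload_matches returns False.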
In the case of parallel composite uploads, users should perform an integrity check for each component upload and then use preconditions with their compose requests to protect against race conditions. Compose requests don't perform server-side validation, so users who want an end-to-end integrity check should perform client-side validation on the new composite object.
Google Cloud CLI validation
The Google Cloud CLI validates data copied to or from a Cloud Storage
bucket. This applies to the cp, mv, and rsync commands. If the
checksum of the source data does not match the checksum of the destination data,
the gcloud CLI deletes the invalid copy and prints a warning message.
This very rarely happens. If it does, you should retry the operation.
This automatic validation occurs after the object itself is finalized, so
invalid objects are visible for 1-3 seconds before they're identified and
deleted. Additionally, there is a chance that the gcloud CLI could be
interrupted after the upload completes but before it performs the validation,
leaving the invalid object in place. These issues can be avoided when uploading
single files to Cloud Storage by using server-side validation,
which occurs when using the --content-md5 flag.
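One way to prepare a value for the --content-md5 flag is sketched below in Python. The sketch assumes the flag expects the base64-encoded MD5 digest of the file, which this page does not spell out, so check the gcloud storage cp reference before relying on it. The helper names are hypothetical, and the command is only built as an argument list, not executed.

```python
import base64
import hashlib


def file_md5_b64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Base64-encoded MD5 of a local file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def cp_command(path: str, destination: str) -> list:
    """Argument list for a single-file upload with server-side validation.

    Assumption: --content-md5 takes the base64-encoded digest.
    """
    return [
        "gcloud", "storage", "cp",
        "--content-md5=" + file_md5_b64(path),
        path, destination,
    ]
```

The resulting list can be passed to subprocess.run. Note that if the file changes between hashing and uploading, the server rejects the upload, which is the intended outcome.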
Change detection for rsync
The gcloud storage rsync command can also use MD5 or CRC32C checksums to
determine if there is a difference between the version of an object found at the
source and the version found at the destination. The command compares checksums
in the following cases:
- The source and destination are both cloud buckets and the object has an MD5 or CRC32C checksum in both buckets.
- The object does not have a file modification time (mtime) in either the source or destination.
In cases where the relevant object has an mtime value in both the source and
destination, such as when the source and destination are file systems, the
rsync command compares the objects' size and mtime, instead of using
checksums. Similarly, if the source is a cloud bucket and the destination is a
local file system, the rsync command uses the time created for the source
object as a substitute for mtime, and the command does not use checksums.
If neither mtime nor checksums are available, rsync only compares file sizes
when determining if there is a change between the source version of an object
and the destination version. For example, neither mtime nor checksums are
available when comparing composite objects with objects at a cloud provider
that doesn't support CRC32C, because composite objects don't have MD5 checksums.
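The rules in this section can be modeled roughly as the following Python sketch. It is one possible reading of the prose above, not the gcloud CLI's implementation: the Entry type, the function names, the order in which the rules are checked, and the preference for CRC32C over MD5 when both are present are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical view of one object or file being compared."""
    is_cloud: bool
    size: int
    mtime: Optional[float] = None
    md5: Optional[str] = None
    crc32c: Optional[str] = None


def shared_checksum(src: Entry, dst: Entry) -> Optional[str]:
    """Name of a checksum type both sides have, or None."""
    for name in ("crc32c", "md5"):  # CRC32C first is an assumption
        if getattr(src, name) and getattr(dst, name):
            return name
    return None


def comparison_method(src: Entry, dst: Entry) -> str:
    """Return which attributes this model would compare."""
    checksum = shared_checksum(src, dst)
    if src.is_cloud and dst.is_cloud and checksum:
        return checksum
    if src.mtime is not None and dst.mtime is not None:
        return "size+mtime"
    if src.is_cloud and not dst.is_cloud:
        # Source creation time stands in for the missing mtime.
        return "size+time_created"
    if checksum:
        return checksum
    return "size"
```

For instance, a composite object (CRC32C only) compared with an object that has only an MD5 and no mtime falls through every rule and ends at the size-only comparison, which matches the example in the paragraph above.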
What's next
- Explore upload and download options for Cloud Storage.
- Learn about retry strategies for Cloud Storage.