One strategy for uploading large files is called parallel composite uploads. In such an upload, a file is divided into up to 32 chunks, the chunks are uploaded in parallel to temporary objects, the final object is recreated using the temporary objects, and the temporary objects are deleted.
Parallel composite uploads can be significantly faster if network and disk speed are not limiting factors; however, the final object stored in your bucket is a composite object, which only has a CRC32C hash and not an MD5 hash. As a result, you must use crcmod to perform integrity checks when downloading the object with gsutil or other Python applications. You should only perform parallel composite uploads if the following apply:
- Any Python user who needs to download your objects has either google-crc32c or crcmod installed.
- Any gsutil user who needs to download your objects has crcmod installed (you can verify this with the check shown after this list). For example, if you use gsutil to upload video assets that are only served by a Java application, parallel composite uploads are a good choice because there are efficient CRC32C implementations available in Java.
- You do not need the uploaded objects to have an MD5 hash.
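If you're unsure whether a given environment meets these requirements, a quick check is sketched below; the pip package names are the standard PyPI distributions, but your Python setup may differ:

    # Print gsutil's long version info; look for "compiled crcmod: True"
    # to confirm a fast CRC32C implementation is available.
    gsutil version -l

    # Install the checksum libraries used by Python downloaders.
    pip install crcmod google-crc32c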
gcloud support
You can configure how and when gcloud storage cp
performs parallel
composite uploads by modifying the following properties:
- storage/parallel_composite_upload_enabled: Property for enabling parallel composite uploads. If False, parallel composite uploads are disabled. If True or None, gcloud storage performs parallel composite uploads for objects that meet the criteria defined in the other properties. The default setting is None.
- storage/parallel_composite_upload_compatibility_check: Property for toggling safety checks. If True, gcloud storage only performs parallel composite uploads when all of the following conditions are met:
  - The storage class for the uploaded object is STANDARD.
  - The destination bucket does not have a retention policy.
  - The destination bucket does not have default object holds enabled.
  If False, gcloud storage does not perform any checks. The default setting is True.
- storage/parallel_composite_upload_threshold: The minimum total file size for performing a parallel composite upload. The default setting is 150 MiB.
- storage/parallel_composite_upload_component_size: The maximum size for each temporary object. The property is ignored if the total file size is so large that it would require more than 32 chunks at this size.
You can modify these properties by creating a named configuration and
applying the configuration either on a per-command basis by using the
--configuration
project-wide flag or for all gcloud commands by using
the gcloud config set
command.
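For example, the following sketch shows both approaches; the configuration name, file name, and bucket are placeholders, and the size value assumes gcloud accepts suffixes such as 100M for this property:

    # Apply the settings to the active configuration for all gcloud commands.
    gcloud config set storage/parallel_composite_upload_enabled True
    gcloud config set storage/parallel_composite_upload_threshold 100M

    # Or keep them in a dedicated named configuration and apply it
    # on a per-command basis with the --configuration flag.
    gcloud config configurations create pcu-uploads
    gcloud config set storage/parallel_composite_upload_enabled True --configuration=pcu-uploads
    gcloud storage cp my-large-file gs://my-bucket --configuration=pcu-uploads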
No additional local disk space is required when using gcloud to perform parallel composite uploads. If a parallel composite upload fails prior to composition, run the gcloud command again to take advantage of resumable uploads for the temporary objects that failed. Any temporary objects that uploaded successfully before the failure do not get re-uploaded when you resume the upload.
Temporary objects are named in the following fashion:
gcloud/tmp/parallel_composite_uploads/see_gcloud_storage_cp_help_for_details/RANDOM_PREFIX_HEX_DIGEST_COMPONENT_ID
Where:
- RANDOM_PREFIX is a random numerical value.
- HEX_DIGEST is a hash derived from the name of the source resource.
- COMPONENT_ID is the sequential number of the component.
Generally, temporary objects are deleted at the end of a parallel composite upload, but to avoid leaving temporary objects around, you should check the exit status from the gcloud command, and you should manually delete any temporary objects that were uploaded as part of any aborted upload.
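For example, if you decide not to retry an aborted upload, a cleanup sketch along these lines can help; the bucket and file names are placeholders, and the wildcard assumes the temporary objects still carry the prefix shown above:

    # Attempt the upload; on failure, list and then remove any leftover
    # temporary component objects under the documented prefix.
    gcloud storage cp my-large-file gs://my-bucket || {
      gcloud storage ls "gs://my-bucket/gcloud/tmp/parallel_composite_uploads/**"
      gcloud storage rm "gs://my-bucket/gcloud/tmp/parallel_composite_uploads/**"
    }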
gsutil support
You can configure how and when gsutil cp
performs parallel composite
uploads, which are disabled by default, by modifying the following two
parameters:
- parallel_composite_upload_threshold: The minimum total file size for performing a parallel composite upload. You can disable all parallel composite uploads in gsutil by setting this value to 0.
- parallel_composite_upload_component_size: The maximum size for each temporary object. The parameter is ignored if the total file size is so large that it would require more than 32 chunks at this size.
You can modify both parameters either on a per-command basis by using the -o global option or for all gsutil commands by editing the .boto configuration file.
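For example (the file and bucket names are placeholders):

    # Per-command: perform parallel composite uploads for files over
    # 150 MiB, split into components of at most 50 MiB.
    gsutil \
      -o "GSUtil:parallel_composite_upload_threshold=150M" \
      -o "GSUtil:parallel_composite_upload_component_size=50M" \
      cp my-large-file gs://my-bucket

    # Persistent: the equivalent settings in the .boto configuration file.
    # [GSUtil]
    # parallel_composite_upload_threshold = 150M
    # parallel_composite_upload_component_size = 50M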
No additional local disk space is required when using gsutil to perform parallel composite uploads. If a parallel composite upload fails prior to composition, run the gsutil command again to take advantage of resumable uploads for the temporary objects that failed. Any temporary objects that uploaded successfully before the failure do not get re-uploaded when you resume the upload.
Temporary objects are named in the following fashion:
RANDOM_PREFIX/gsutil/tmp/parallel_composite_uploads/for_details_see/gsutil_help_cp/HASH
Where RANDOM_PREFIX is a numerical value, and HASH is an MD5 hash (not related to the hash of the contents of the file or object).
Generally, temporary objects are deleted at the end of a parallel composite upload, but to avoid leaving temporary objects around, you should check the exit status from the gsutil command, and you should manually delete any temporary objects that were uploaded as part of any aborted upload.
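Because the random prefix appears at the start of each temporary object name, one way to find leftovers from an aborted upload that you don't plan to retry is to filter a recursive listing on the fixed path fragment. A sketch, with a placeholder bucket:

    # List every object in the bucket, keep only leftover temporary
    # components, and review the matches before deleting them.
    gsutil ls "gs://my-bucket/**" | grep "gsutil/tmp/parallel_composite_uploads"

    # Once reviewed, pipe the same listing into rm (-I reads URLs from stdin).
    gsutil ls "gs://my-bucket/**" | grep "gsutil/tmp/parallel_composite_uploads" | gsutil -m rm -I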
JSON and XML support
Both the JSON API and XML API support uploading object chunks in parallel and
recombining them into a single object using the compose
operation.
Keep the following in mind when designing code for parallel composite uploads:
- When using the compose operation, the source objects are unaffected by the composition process. This means that if they are meant to be temporary, you must explicitly delete them once you've successfully completed the composition; otherwise, the source objects remain in your bucket and are billed accordingly.
- To protect against changes to source objects between the upload and compose requests, you should provide an expected generation number for each source, as in the sketch after this list.
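For example, with the JSON API you can pass each source's expected generation in objectPreconditions.ifGenerationMatch within the compose request body. A minimal sketch using curl, where the bucket, object names, and generation numbers are placeholders:

    # Compose two uploaded chunks into a final object; the request fails
    # if either source's generation changed since upload.
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -d '{
            "sourceObjects": [
              {"name": "chunk-0", "objectPreconditions": {"ifGenerationMatch": 1234567890000001}},
              {"name": "chunk-1", "objectPreconditions": {"ifGenerationMatch": 1234567890000002}}
            ],
            "destination": {"contentType": "application/octet-stream"}
          }' \
      "https://storage.googleapis.com/storage/v1/b/my-bucket/o/final-object/compose"

    # After a successful compose, delete each temporary source object so
    # it is no longer stored and billed.
    curl -X DELETE \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://storage.googleapis.com/storage/v1/b/my-bucket/o/chunk-0"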