- rsync - Synchronize content of two buckets/directories
- Be Careful When Using -d Option!
- Be Careful When Synchronizing Over OS-Specific File Types (Symlinks, Devices, Etc.)
- Eventual Consistency With Non-Google Cloud Providers
- Checksum Validation And Failure Handling
- Change Detection Algorithm
- Copying In The Cloud And Metadata Preservation
- Slow Checksums
rsync - Synchronize content of two buckets/directories
gsutil rsync [OPTION]... src_url dst_url
The gsutil rsync command makes the contents under dst_url the same as the contents under src_url, by copying any missing files/objects (or those whose data has changed), and (if the -d option is specified) deleting any extra files/objects. src_url must specify a directory, bucket, or bucket subdirectory. For example, to make gs://mybucket/data match the contents of the local directory "data" you could do:
gsutil rsync -d data gs://mybucket/data
To recurse into directories use the -r option:
gsutil rsync -d -r data gs://mybucket/data
To copy only new/changed files without deleting extra files from gs://mybucket/data leave off the -d option:
gsutil rsync -r data gs://mybucket/data
If you have a large number of objects to synchronize you might want to use the gsutil -m option, to perform parallel (multi-threaded/multi-processing) synchronization:
gsutil -m rsync -d -r data gs://mybucket/data
The -m option typically will provide a large performance boost if either the source or destination (or both) is a cloud URL. If both source and destination are file URLs the -m option will typically thrash the disk and slow synchronization down.
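The guidance on when -m helps can be sketched as a small command builder. This is only an illustration: the function names are hypothetical, and the list of cloud schemes is limited to the two that appear in this document (gs:// and s3://).

```python
# Hypothetical helper: decide whether to add the top-level -m option.
# Assumption: only gs:// and s3:// URLs count as cloud URLs here.
CLOUD_SCHEMES = ("gs://", "s3://")


def is_cloud_url(url):
    """Return True if the URL starts with one of the assumed cloud schemes."""
    return url.startswith(CLOUD_SCHEMES)


def rsync_args(src_url, dst_url, delete=False, recursive=True):
    """Build a gsutil rsync argument list, adding -m only when it should help."""
    args = ["gsutil"]
    # -m is a top-level gsutil option, so it goes before the rsync command.
    # It is skipped when both sides are local, where it tends to thrash the disk.
    if is_cloud_url(src_url) or is_cloud_url(dst_url):
        args.append("-m")
    args.append("rsync")
    if delete:
        args.append("-d")
    if recursive:
        args.append("-r")
    args.extend([src_url, dst_url])
    return args


print(" ".join(rsync_args("data", "gs://mybucket/data", delete=True)))
print(" ".join(rsync_args("dir1", "dir2", delete=True)))
```

The resulting list can be passed to subprocess.run if you want to launch gsutil from a script.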
To make the local directory "data" the same as the contents of gs://mybucket/data:
gsutil rsync -d -r gs://mybucket/data data
To make the contents of gs://mybucket2 the same as gs://mybucket1:
gsutil rsync -d -r gs://mybucket1 gs://mybucket2
You can also mirror data across local directories:
gsutil rsync -d -r dir1 dir2
To mirror your content across clouds:
gsutil rsync -d -r gs://my-gs-bucket s3://my-s3-bucket
Note 1: Shells (like bash, zsh) sometimes attempt to expand wildcards in ways that can be surprising. Also, attempting to copy files whose names contain wildcard characters can result in problems. For more details about these issues see the section "POTENTIALLY SURPRISING BEHAVIOR WHEN USING WILDCARDS" under gsutil help wildcards.
Note 2: If you are synchronizing a large amount of data between clouds you might consider setting up a Google Compute Engine account and running gsutil there. Since cross-provider gsutil data transfers flow through the machine where gsutil is running, doing this can make your transfer run significantly faster than running gsutil on your local workstation.
Be Careful When Using -d Option!
The rsync -d option is very useful and commonly used, because it provides a means of making the contents of a destination bucket or directory match those of a source bucket or directory. However, please exercise caution when you use this option: It's possible to delete large amounts of data accidentally if, for example, you erroneously reverse source and destination. For example, if you meant to synchronize a local directory from a bucket in the cloud but instead run the command:
gsutil -m rsync -r -d ./your-dir gs://your-bucket
and your-dir is currently empty, you will quickly delete all of the objects in gs://your-bucket.
You can also cause large amounts of data to be lost quickly by specifying a subdirectory of the destination as the source of an rsync. For example, the command:
gsutil -m rsync -r -d gs://your-bucket/data gs://your-bucket
would cause most or all of the objects in gs://your-bucket to be deleted (some objects may survive if there are any with names that sort lower than "data" under gs://your-bucket/data).
In addition to paying careful attention to the source and destination you specify with the rsync command, there are two more safety measures you can take when using gsutil rsync -d:
Try running the command with the rsync -n option first, to see what it would do without actually performing the operations. For example, if you run the command:
gsutil -m rsync -r -d -n gs://your-bucket/data gs://your-bucket
it will be immediately evident that running that command without the -n option would cause many objects to be deleted.
Enable object versioning in your bucket, which will allow you to restore objects if you accidentally delete them. For more details see gsutil help versions.
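A script that wraps gsutil rsync -d could guard against the subdirectory mistake described above before running anything. The following is a minimal sketch with hypothetical function names. It compares URL strings only, so it does not resolve relative local paths or symlinks.

```python
def _with_trailing_slash(url):
    """Normalize a URL so prefix checks respect path boundaries."""
    return url.rstrip("/") + "/"


def source_is_inside_destination(src_url, dst_url):
    """Return True if src_url names a subdirectory of dst_url (string check only)."""
    src = _with_trailing_slash(src_url)
    dst = _with_trailing_slash(dst_url)
    return src != dst and src.startswith(dst)


def dry_run_command(src_url, dst_url):
    """Build the -n (dry run) form of a deleting rsync, refusing risky inputs."""
    if source_is_inside_destination(src_url, dst_url):
        raise ValueError(
            "source %r is under destination %r; rsync -d would delete "
            "most of the destination" % (src_url, dst_url)
        )
    return ["gsutil", "-m", "rsync", "-r", "-d", "-n", src_url, dst_url]


print(source_is_inside_destination("gs://your-bucket/data", "gs://your-bucket"))
print(dry_run_command("gs://your-bucket", "./your-dir"))
```

Reviewing the dry-run output and then rerunning without -n keeps the destructive step an explicit, separate action.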
Eventual Consistency With Non-Google Cloud Providers
While Google Cloud Storage is strongly consistent, some cloud providers only support eventual consistency. You may encounter scenarios where rsync synchronizes using stale listing data when working with these other cloud providers. For example, if you run rsync immediately after uploading an object to an eventually consistent cloud provider, the added object may not yet appear in the provider's listing. Consequently, rsync will not copy the object to the destination. If this happens you can rerun the rsync operation later (after the object listing has "caught up").
Checksum Validation And Failure Handling
At the end of every upload or download, the gsutil rsync command validates that the checksum of the source file/object matches the checksum of the destination file/object. If the checksums do not match, gsutil will delete the invalid copy and print a warning message. This very rarely happens, but if it does, please contact email@example.com.
The rsync command will retry when failures occur, but if enough failures happen during a particular copy or delete operation the command will fail.
If the -C option is provided, the command will instead skip the failing object and move on. At the end of the synchronization run, if any failures were not successfully retried, the rsync command will report the count of failures and exit with non-zero status. At this point you can run the rsync command again, and it will attempt any remaining needed copy and/or delete operations.
Note that there are cases where retrying will never succeed, such as if you don't have write permission to the destination bucket or if the destination path for some objects is longer than the maximum allowed length.
For more details about gsutil's retry handling, please see gsutil help retries.
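Because a failed run exits with non-zero status and can simply be run again, a wrapper script might rerun the command a bounded number of times. The sketch below is one possible approach, not part of gsutil. The function name and the runner parameter are hypothetical, and the attempt limit exists because some failures (such as missing write permission) will never succeed.

```python
import subprocess


def run_until_clean(command, max_attempts=3, runner=None):
    """Rerun command until it exits with status 0, up to max_attempts times.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt exits with non-zero status.
    """
    if runner is None:
        # Default: really execute the command and use its exit status.
        def runner(cmd):
            return subprocess.run(cmd).returncode

    for attempt in range(1, max_attempts + 1):
        if runner(command) == 0:
            return attempt
    raise RuntimeError(
        "command still failing after %d attempts: %s"
        % (max_attempts, " ".join(command))
    )


# Example with a stand-in runner that fails twice and then succeeds.
exit_codes = iter([1, 1, 0])
attempts = run_until_clean(
    ["gsutil", "rsync", "-C", "-r", "data", "gs://mybucket/data"],
    runner=lambda cmd: next(exit_codes),
)
print(attempts)
```

Passing a stand-in runner, as in the example, lets you exercise the loop without invoking gsutil.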
Change Detection Algorithm
To determine if a file or object has changed, gsutil rsync first checks whether the file modification time (mtime) of both the source and destination is available. It then applies the following rules:
- If mtime is available at both source and destination, and the destination mtime is different than the source, or if the source and destination file size differ, gsutil rsync will update the destination.
- If the source is a cloud bucket and the destination is a local file system, and if mtime is not available for the source, gsutil rsync will use the time created for the cloud object as a substitute for mtime.
- Otherwise, if mtime is not available for either the source or the destination, gsutil rsync will fall back to using checksums.
- If the source and destination are both cloud buckets with checksums available, gsutil rsync will use these hashes instead of mtime. However, gsutil rsync will still update mtime at the destination if it is not present.
- If the source and destination have matching checksums and only the source has an mtime, gsutil rsync will copy the mtime to the destination.
- If neither mtime nor checksums are available, gsutil rsync will resort to comparing file sizes.
Checksums will not be available when comparing composite Google Cloud Storage objects with objects at a cloud provider that does not support CRC32C (which is the only checksum available for composite objects). See gsutil help compose for details about composite objects.
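The comparison rules in this section can be approximated in a few lines. The sketch below is a simplified model, not gsutil's actual implementation. The Entry class and function names are hypothetical, checksums are assumed to be precomputed strings, and the size check is done first as a shortcut. It also treats negative mtimes as unavailable, in line with the restriction to non-negative modification times.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical description of one file or object being compared."""
    size: int
    is_cloud: bool
    mtime: Optional[int] = None         # seconds since the epoch, if known
    time_created: Optional[int] = None  # cloud object creation time, if known
    checksum: Optional[str] = None      # precomputed hash, if available


def usable_mtime(mtime):
    """Treat missing or negative (pre-1970) mtimes as unavailable."""
    if mtime is None or mtime < 0:
        return None
    return mtime


def needs_update(src, dst):
    """Return True if dst should be overwritten with src under this model."""
    # A size difference always means the destination is out of date.
    if src.size != dst.size:
        return True
    both_have_checksums = src.checksum is not None and dst.checksum is not None
    # Cloud to cloud with checksums available: hashes are used instead of mtime.
    if src.is_cloud and dst.is_cloud and both_have_checksums:
        return src.checksum != dst.checksum
    src_mtime = usable_mtime(src.mtime)
    dst_mtime = usable_mtime(dst.mtime)
    # Cloud to local without a source mtime: substitute the creation time.
    if src_mtime is None and src.is_cloud and not dst.is_cloud:
        src_mtime = usable_mtime(src.time_created)
    if src_mtime is not None and dst_mtime is not None:
        return src_mtime != dst_mtime
    # No usable mtime pair: fall back to checksums when both sides have one.
    if both_have_checksums:
        return src.checksum != dst.checksum
    # Nothing left to compare except size, which already matched.
    return False


local_file = Entry(size=10, is_cloud=False, mtime=100)
print(needs_update(local_file, Entry(size=10, is_cloud=True, mtime=100)))
print(needs_update(local_file, Entry(size=10, is_cloud=True, mtime=200)))
```

The model only answers whether to copy. It leaves out the side effect of writing a missing mtime to the destination.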
Copying In The Cloud And Metadata Preservation
If both the source and destination URL are cloud URLs from the same provider, gsutil copies data "in the cloud" (i.e., without downloading to and uploading from the machine where you run gsutil). In addition to the performance and cost advantages of doing this, copying in the cloud preserves metadata (like Content-Type and Cache-Control). In contrast, when you download data from the cloud it ends up in a file, which has no associated metadata, other than file modification time (mtime). Thus, unless you have some way to hold on to or re-create that metadata, synchronizing a bucket to a directory in the local file system will not retain the metadata other than mtime.
Note that by default, the gsutil rsync command does not copy the ACLs of objects being synchronized and instead will use the default bucket ACL (see gsutil help defacl). You can override this behavior with the -p option (see OPTIONS below).
Slow Checksums
If you find that CRC32C checksum computation runs slowly, this is likely because you don't have a compiled CRC32C on your system. Try running:
gsutil ver -l
If the output contains:
compiled crcmod: False
you are running a Python library for computing CRC32C, which is much slower than using the compiled code. For information on getting a compiled CRC32C implementation, see gsutil help crc32c.
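If you want to perform this check from a script, one option is to capture the output of gsutil ver -l and look for the "compiled crcmod" line. The helper below is a hypothetical sketch. It assumes the line has the "compiled crcmod: True" or "compiled crcmod: False" form shown above, and it returns None when no such line is found.

```python
def has_compiled_crcmod(ver_output):
    """Parse `gsutil ver -l` output.

    Returns True or False based on the "compiled crcmod" line, or None
    if that line is not present in the text.
    """
    for line in ver_output.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "compiled crcmod":
            return value.strip() == "True"
    return None


# Sample text containing only the line discussed in this section.
print(has_compiled_crcmod("compiled crcmod: False"))
```

The text to parse could come from subprocess.run(["gsutil", "ver", "-l"], capture_output=True, text=True).stdout.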
- The gsutil rsync command will only allow non-negative file modification times to be used in its comparisons. This means gsutil rsync will resort to using checksums for any file with a timestamp before 1970-01-01 UTC.
- The gsutil rsync command considers only the current object generations in the source and destination buckets when deciding what to copy / delete. If versioning is enabled in the destination bucket then gsutil rsync's overwriting or deleting objects will end up creating versions, but the command doesn't try to make the archived generations match in the source and destination buckets.
- The gsutil rsync command does not support copying special file types such as sockets, device files, named pipes, or any other non-standard files intended to represent an operating system resource. If you run gsutil rsync on a source directory that includes such files (for example, copying the root directory on Linux that includes /dev), you should use the -x flag to exclude these files. Otherwise, gsutil rsync may fail or hang.
- The gsutil rsync command copies changed files in their entirety and does not employ the rsync delta-transfer algorithm to transfer portions of a changed file. This is because cloud objects are immutable and no facility exists to read partial cloud object checksums or perform partial overwrites.
|-a canned_acl||Sets the named canned_acl on uploaded objects when they are created. See "gsutil help acls" for further details. Note that rsync will decide whether or not to perform a copy based only on object size and modification time, not current ACL state. Also see the -p option below.|
|-c||Causes the rsync command to compute and compare checksums (instead of comparing mtime) for files if the size of source and destination match. This option increases local disk I/O and run time if either src_url or dst_url is on the local file system.|
|-C||If an error occurs, continue to attempt to copy the remaining files. If errors occurred, gsutil's exit status will be non-zero even if this flag is set. This option is implicitly set when running "gsutil -m rsync...". Note: -C only applies to the actual copying operation. If an error occurs while iterating over the files in the local directory (e.g., invalid Unicode file name) gsutil will print an error message and abort.|
|-d||Delete extra files under dst_url not found under src_url. By default extra files are not deleted. Note: this option can delete data quickly if you specify the wrong source/destination combination. See the help section above, "BE CAREFUL WHEN USING -d OPTION!".|
|-e||Exclude symlinks. When specified, symbolic links will be ignored. Note that gsutil does not follow directory symlinks, regardless of whether -e is specified.|
|-j <ext,...>||Applies gzip transport encoding to any file upload whose extension matches the -j extension list. This is useful when uploading files with compressible content (such as .js, .css, or .html files) because it saves network bandwidth while also leaving the data uncompressed in Google Cloud Storage.|
When you specify the -j option, files being uploaded are compressed in-memory and on-the-wire only. Both the local files and Cloud Storage objects remain uncompressed. The uploaded objects retain the Content-Type and name of the original files.
Note that if you want to use the top-level -m option to parallelize copies along with the -j/-J options, you should prefer using multiple processes instead of multiple threads; when using -j/-J, multiple threads in the same process are bottlenecked by Python's GIL. Thread and process count can be set using the "parallel_thread_count" and "parallel_process_count" boto config options, e.g.:
gsutil -o "GSUtil:parallel_process_count=8" -o "GSUtil:parallel_thread_count=1" -m rsync -j js,css,html /local/source/dir gs://bucket/path
|-J||Applies gzip transport encoding to file uploads. This option works like the -j option described above, but it applies to all uploaded files, regardless of extension.|
Warning: If you use this option and some of the source files don't compress well (e.g., that's often true of binary data), this option may result in longer uploads.
|-n||Causes rsync to run in "dry run" mode, i.e., just outputting what would be copied or deleted without actually doing any copying/deleting.|
|-p||Causes ACLs to be preserved when objects are copied. Note that rsync will decide whether or not to perform a copy based only on object size and modification time, not current ACL state. Thus, if the source and destination differ in size or modification time and you run gsutil rsync -p, the file will be copied and ACL preserved. However, if the source and destination don't differ in size or checksum but have different ACLs, running gsutil rsync -p will have no effect.|
Note that this option has performance and cost implications when using the XML API, as it requires separate HTTP calls for interacting with ACLs. The performance issue can be mitigated to some degree by using gsutil -m rsync to cause parallel synchronization. Also, this option only works if you have OWNER access to all of the objects that are copied.
You can avoid the additional performance and cost of using rsync -p if you want all objects in the destination bucket to end up with the same ACL by setting a default object ACL on that bucket instead of using rsync -p. See gsutil help defacl.
|-P||Causes POSIX attributes to be preserved when objects are copied. With this feature enabled, gsutil rsync will copy fields provided by stat. These are the user ID of the owner, the group ID of the owning group, the mode (permissions) of the file, and the access/modification time of the file. For downloads, these attributes will only be set if the source objects were uploaded with this flag enabled.|
On Windows, this flag will only set and restore access time and modification time. This is because Windows doesn't have a notion of POSIX uid/gid/mode.
|-R, -r||The -R and -r options are synonymous. Causes directories, buckets, and bucket subdirectories to be synchronized recursively. If you neglect to use this option gsutil will make only the top-level directory in the source and destination URLs match, skipping any sub-directories.|
|-u||When a file/object is present in both the source and destination, if mtime is available for both, do not perform the copy if the destination mtime is newer.|
|-U||Skip objects with unsupported object types instead of failing. Unsupported object types are Amazon S3 objects in the GLACIER storage class.|
|-x pattern||Causes files/objects matching pattern to be excluded, i.e., any matching files/objects will not be copied or deleted. Note that the pattern is a Python regular expression, not a wildcard (so, matching any string ending in "abc" would be specified using ".*abc$" rather than "*abc"). Note also that the exclude path is always relative (similar to Unix rsync or tar exclude options). For example, if you run the command:|
gsutil rsync -x "data./.*\.txt$" dir gs://my-bucket
it will skip the file dir/data1/a.txt.
You can use regex alternation to specify multiple exclusions, for example:
gsutil rsync -x ".*\.txt$|.*\.jpg$" dir gs://my-bucket
NOTE: When using this on the Windows command line, use ^ as an escape character instead of \ and escape the | character.
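Since -x patterns are Python regular expressions, you can try a pattern in Python before using it with rsync. The sketch below uses hypothetical helper names and assumes the pattern is applied from the start of the path relative to the source directory (re.match). That assumption is consistent with the examples above, but treat it as an approximation of gsutil's behavior.

```python
import re


def is_excluded(pattern, relative_path):
    """Return True if the regex matches from the start of the relative path."""
    return re.match(pattern, relative_path) is not None


def filter_paths(pattern, relative_paths):
    """Keep only the paths that the exclude pattern does not match."""
    return [path for path in relative_paths if not is_excluded(pattern, path)]


# Paths are relative to the source directory, e.g. dir/data1/a.txt -> data1/a.txt.
paths = ["data1/a.txt", "data1/a.jpg", "notes/b.txt", "photos/c.png"]

print(filter_paths(r"data./.*\.txt$", paths))
print(filter_paths(r".*\.txt$|.*\.jpg$", paths))
```

Using raw strings (the r prefix) keeps Python from interpreting the backslashes before they reach the regex engine. On the command line, shell quoting serves the same purpose.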