How Subdirectories Work
This section provides details about how subdirectories work in gsutil. Most users probably don't need to know these details, and can simply use the commands (like cp -r) that work with subdirectories. We provide this additional documentation to help users understand how gsutil handles subdirectories differently than most GUI / web-based tools, and also to explain cost and performance implications of the gsutil approach, for those interested in such details.
gsutil provides the illusion of a hierarchical file tree atop the "flat"
name space supported by the Cloud Storage service. To the service,
gs://your-bucket/abc/def.txt is just an object that happens to
have "/" characters in its name. There is no "abc" directory, just a single
object with the given name. The following diagram illustrates how gsutil
provides a hierarchical view of objects in a bucket:
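As a rough illustration of the same idea, the sketch below groups a flat list of object names into a nested view by splitting on "/". The function names (build_tree, render) and the sample object names are hypothetical, and this is not how gsutil itself is implemented; it only shows that the "directories" are derived from the names.

```python
def build_tree(object_names):
    """Group flat object names into a nested dict by splitting on "/"."""
    tree = {}
    for name in object_names:
        parts = name.split("/")
        node = tree
        # Every path component except the last becomes an implied "directory".
        for part in parts[:-1]:
            node = node.setdefault(part + "/", {})
        if parts[-1]:
            node[parts[-1]] = None  # leaf: an actual object in the bucket
    return tree


def render(tree, indent=0):
    """Return the tree as indented lines, one entry per line."""
    lines = []
    for key in sorted(tree):
        lines.append("  " * indent + key)
        if tree[key] is not None:
            lines.extend(render(tree[key], indent + 1))
    return lines


# Hypothetical object names in a bucket; the service stores only these strings.
names = ["abc/def.txt", "abc/ghi/jkl.txt", "readme.txt"]
print("\n".join(render(build_tree(names))))
```

Note that "abc/" and "abc/ghi/" appear in the rendered view even though no object with either name exists.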
gsutil achieves the hierarchical file tree illusion by applying a variety of rules to make naming work the way users would expect. For example, to determine whether to treat a destination URL as an object name or as the root of a directory under which objects should be copied, gsutil uses these rules:
1. If the destination URL ends with a "/", gsutil treats it as a directory. For example, if you run the command:
gsutil cp your-file gs://your-bucket/abc/
gsutil creates the object gs://your-bucket/abc/your-file.
2. If you attempt to copy multiple source files to a destination URL, gsutil treats the destination URL as a directory. For example, if you run the command:
gsutil cp -r your-dir gs://your-bucket/abc
gsutil creates objects like gs://your-bucket/abc/your-dir/file1, etc. (assuming file1 is a file under the source directory your-dir).
3. If neither of the above rules applies, gsutil performs a bucket listing to determine whether the target of the operation is a prefix match to the specified string. For example, if you run the command:
gsutil cp your-file gs://your-bucket/abc
gsutil makes a bucket listing request for the named bucket, using prefix="abc". It then examines the bucket listing results and determines whether there are objects in the bucket whose path starts with gs://your-bucket/abc/. If so, gsutil treats the target as a directory name. This in turn affects the name of the object you create: if the check indicates there is an abc directory, you end up with the object gs://your-bucket/abc/your-file; otherwise you end up with the object gs://your-bucket/abc. (See "HOW NAMES ARE CONSTRUCTED" under gsutil help cp for more details.)
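The three rules can be summarized in a small sketch. This is a simplified model written for illustration, not gsutil's actual code: the function names and parameters are hypothetical, object names are given relative to the bucket, and an in-memory list of names stands in for the bucket listing.

```python
def treats_destination_as_directory(dest, num_sources, existing_objects):
    """Apply the three rules to decide whether dest names a directory."""
    if dest.endswith("/"):
        return True  # rule 1: trailing "/" always means directory
    if num_sources > 1:
        return True  # rule 2: multiple sources need a directory target
    # Rule 3: stand-in for the bucket listing; look for objects under dest + "/".
    prefix = dest + "/"
    return any(name.startswith(prefix) for name in existing_objects)


def single_file_destination(dest, file_name, existing_objects):
    """Name of the object created when copying one file to dest."""
    if treats_destination_as_directory(dest, 1, existing_objects):
        return dest.rstrip("/") + "/" + file_name
    return dest


# The same command yields different names depending on bucket contents.
print(single_file_destination("abc", "your-file", []))            # abc
print(single_file_destination("abc", "your-file", ["abc/other"]))  # abc/your-file
```

An object named abcd/other does not make abc a directory in this model, because the check looks for names starting with abc/ rather than abc.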
This rule-based approach stands in contrast to the way many tools work, which
create 0-byte objects to mark the existence of folders.
gsutil understands several conventions used by such tools, such as the convention of
adding _$folder$ to the end of the name of the 0-byte object, but
gsutil does not require such marker objects to implement naming behavior consistent
with UNIX commands.
A downside of the gsutil subdirectory naming approach is that it requires an extra bucket listing before performing the needed cp or mv command. However, those listings are relatively inexpensive, because they use the delimiter and prefix parameters to limit the result data. Moreover, gsutil makes only one bucket listing request per cp/mv command, and thus amortizes the bucket listing cost across all transferred objects (e.g., when performing a recursive copy of a directory to the cloud).
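To see why the prefix and delimiter parameters keep such listings small, the following sketch simulates a listing over an in-memory list of names. It is an approximation of the listing semantics for illustration only; the function name and the sample bucket contents are hypothetical, and no request is made to the service.

```python
def list_with_prefix(object_names, prefix, delimiter="/"):
    """Simulate a listing limited by prefix and delimiter.

    Returns (objects, prefixes): names that match the prefix and contain no
    further delimiter, plus the collapsed "subdirectory" prefixes.
    """
    objects, prefixes = [], set()
    for name in sorted(object_names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            objects.append(name)
        else:
            # Everything below the first delimiter collapses into one entry.
            prefixes.add(prefix + rest[:cut + len(delimiter)])
    return objects, sorted(prefixes)


# Hypothetical bucket contents.
bucket = ["abc/a", "abc/b/c", "abc/b/d", "abcd", "zzz"]
print(list_with_prefix(bucket, "abc"))
```

However many objects live under abc/, the simulated result contains a single "abc/" entry for them, which is all the directory check needs.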
Potential For Surprising Destination Subdirectory Naming
The above rules-based approach for determining how destination paths are constructed can lead to the following surprise: Suppose you start by trying to upload everything under a local directory to a bucket "subdirectory" that doesn't yet exist:
gsutil cp -r ./your-dir/* gs://your-bucket/new
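where there are directories under your-dir (say, dir1 and dir2). The first time you run this command, it copies the contents of dir1 and dir2 directly under the destination, because gs://your-bucket/new doesn't yet exist. For illustration, if dir1 contains a file named a and dir2 contains a file named b, it creates the objects:
gs://your-bucket/new/a
gs://your-bucket/new/b
If you run the same command again, because gs://your-bucket/new does now exist, it creates the additional objects:
gs://your-bucket/new/dir1/a
gs://your-bucket/new/dir2/b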
Beyond the fact that this naming behavior can surprise users, one case that needs particular care is scripting gsutil uploads with a retry loop. If the first attempt copies some but not all files, the second attempt will encounter an already existing destination subdirectory and result in the above-described naming problem.
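The run-to-run difference can be reproduced with a toy model. The sketch below assumes, following Unix cp -r semantics, that a source directory's contents land directly under the destination when the destination prefix does not yet exist, and that the source directory name is kept when it does. The function name, the set standing in for the bucket, and the file names a and b are all hypothetical.

```python
def copy_dirs(bucket, sources, dest):
    """Toy model of a recursive copy of several directories to dest.

    bucket:  set of existing object names (relative to the bucket)
    sources: dict mapping a local directory name to its relative file paths
    """
    # One existence check per command, made before anything is copied.
    dest_exists = any(name.startswith(dest + "/") for name in bucket)
    created = []
    for dir_name, files in sources.items():
        for file_name in files:
            if dest_exists:
                created.append(f"{dest}/{dir_name}/{file_name}")
            else:
                created.append(f"{dest}/{file_name}")
    bucket.update(created)
    return created


bucket = set()
sources = {"dir1": ["a"], "dir2": ["b"]}  # hypothetical local layout
first = copy_dirs(bucket, sources, "new")
second = copy_dirs(bucket, sources, "new")
print(first)
print(second)
```

In this model a partially completed first attempt is enough to flip the second attempt into the "destination exists" branch, which is the retry-loop hazard described above.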
There are a couple of ways to avoid this problem:
1. Use gsutil rsync. Since rsync doesn't use the Unix cp-defined directory naming rules, it will work consistently whether the destination subdirectory exists or not.
2. If using rsync won't work for you, you can start by creating a "placeholder" object to establish that the destination is a subdirectory, by running a command such as:
gsutil cp some-file gs://your-bucket/new/placeholder
At this point, running the gsutil cp -r command noted above will consistently treat gs://your-bucket/new as a subdirectory. Once you have at least one other object under that subdirectory, you can delete the placeholder object, and subsequent uploads to that subdirectory will continue to be named as you'd expect.
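For scripted uploads, the first option might look like the following sketch: a retry loop around gsutil rsync -r. The function name, its parameters, and the injectable run argument (included so the loop can be exercised without gsutil installed) are assumptions made for this example; adapt the attempt count and delay to your environment.

```python
import subprocess
import time


def sync_with_retry(src, dst, attempts=3, delay=5.0, run=subprocess.run):
    """Run "gsutil rsync -r src dst", retrying on a non-zero exit status.

    Returns the number of the attempt that succeeded.
    """
    cmd = ["gsutil", "rsync", "-r", src, dst]
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)  # simple fixed pause between attempts
    raise RuntimeError(f"gsutil rsync failed after {attempts} attempts")


# Example call (requires gsutil on PATH and access to the bucket):
# sync_with_retry("./your-dir", "gs://your-bucket/new")
```

Because each attempt issues the same rsync command, a partially completed earlier attempt does not change where later attempts place objects.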