gsutil supports URI wildcards. For example, the command:
gsutil cp gs://bucket/data/abc* .
will copy all objects that start with gs://bucket/data/abc followed by any number of characters within that subdirectory.
Directory By Directory Vs Recursive Wildcards
The "*" wildcard only matches up to the end of a path within a subdirectory. For example, if bucket contains objects named gs://bucket/data/abcd, gs://bucket/data/abcdef, and gs://bucket/data/abcxyx, as well as an object in a sub-directory (gs://bucket/data/abc/def), the above gsutil cp command would match the first three object names but not the last one.
If you want matches to span directory boundaries, use a '**' wildcard:
gsutil cp gs://bucket/data/abc** .
will match all four objects above.
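The difference between the two wildcards can be sketched in Python. This is an illustrative model only, not gsutil's actual implementation: the helper names wildcard_to_regex and match_names are hypothetical, and the sketch assumes that "*" behaves like "any run of characters other than /" while "**" behaves like "any run of characters".

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern containing '*' and '**' into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # '*' stops at the next '/'
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def match_names(pattern, names):
    """Return the names that the whole pattern matches, in input order."""
    regex = wildcard_to_regex(pattern)
    return [n for n in names if regex.fullmatch(n)]

# The four object names used in the example above.
objects = [
    "gs://bucket/data/abcd",
    "gs://bucket/data/abcdef",
    "gs://bucket/data/abcxyx",
    "gs://bucket/data/abc/def",
]
single = match_names("gs://bucket/data/abc*", objects)      # first three only
recursive = match_names("gs://bucket/data/abc**", objects)  # all four
```

In this model the only difference between the two wildcards is whether the generated expression is allowed to consume a "/" character.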
Note that gsutil supports the same wildcards for both objects and file names. Thus, for example:
gsutil cp data/abc* gs://bucket
will copy all local files whose names match data/abc*. Most command shells also support wildcards, so if you run the above command your shell is probably expanding the matches before gsutil runs. However, most shells do not support recursive wildcards ('**'). To have gsutil perform the wildcard expansion in such shells, single-quote the argument so that the shell passes it to gsutil uninterpreted:
gsutil cp 'data/abc**' gs://bucket
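If you build gsutil command lines from a script, the same quoting can be produced programmatically. The following is a minimal sketch that assumes a POSIX shell and uses Python's standard shlex module; the helper name gsutil_command_line is hypothetical.

```python
import shlex

def gsutil_command_line(*args: str) -> str:
    """Join arguments into one shell command line, quoting each as needed."""
    return " ".join(shlex.quote(a) for a in ["gsutil", *args])

# The wildcard argument is single-quoted, so a POSIX shell will not expand it.
command = gsutil_command_line("cp", "data/abc**", "gs://bucket")
print(command)
```

Splitting the resulting string with shlex.split returns the original arguments, with the wildcard intact, which is what gsutil would receive from the shell.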
You can specify wildcards for bucket names within a single project. For example:
gsutil ls gs://data*.example.com
will list the contents of all buckets whose name starts with "data" and ends with ".example.com" in the default project. The -p option can be used to specify a project other than the default. For example:
gsutil ls -p other-project gs://data*.example.com
You can also combine bucket and object name wildcards. For example, this command will remove all objects whose names end in ".txt" from every Google Cloud Storage bucket in the default project:
gsutil rm gs://*/**.txt
Other Wildcard Characters
In addition to '*', you can use these wildcards:
- ? : Matches a single character. For example "gs://bucket/??.txt" only matches objects whose names are two characters followed by .txt.
- [chars] : Matches any one of the specified characters. For example "gs://bucket/[aeiou].txt" matches objects whose names are a single vowel followed by .txt.
- [char range] : Matches any one character in the given range. For example "gs://bucket/[a-m].txt" matches objects whose names are a single letter a, b, c, ... or m, followed by .txt.
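The single-character, character-set, and character-range wildcards follow the usual shell conventions, so their behavior can be tried out with Python's standard fnmatch module. This is only an approximation for experimenting with names that contain no "/" characters: fnmatch's own "*" also matches "/", so it is not a model of gsutil's "*" wildcard. The object names below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical object names at the root of a bucket (bucket prefix omitted).
names = ["ab.txt", "abc.txt", "a.txt", "e.txt", "z.txt", "m.txt"]

# '??' requires exactly two characters before '.txt'.
two_chars = [n for n in names if fnmatchcase(n, "??.txt")]

# '[aeiou]' requires exactly one character, and it must be a vowel.
vowels = [n for n in names if fnmatchcase(n, "[aeiou].txt")]

# '[a-m]' requires exactly one character in the range a through m.
a_to_m = [n for n in names if fnmatchcase(n, "[a-m].txt")]
```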
You can combine wildcards to provide more powerful matches, for example:
gs://bucket/[a-m]??.j*g
Different Behavior For "Dot" Files In Local File System
Per standard Unix behavior, the wildcard "*" only matches files that don't start with a "." character (to avoid confusion with the "." and ".." directories present in all Unix directories). gsutil provides this same behavior when using wildcards over a file system URI, but does not provide this behavior over cloud URIs. For example, the following command will copy all objects from gs://bucket1 to gs://bucket2:
gsutil cp gs://bucket1/* gs://bucket2
but the following command will copy only files that don't start with a "." from the directory "dir" to gs://bucket1:
gsutil cp dir/* gs://bucket1
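The Unix dot-file convention described above can be observed with Python's standard glob module, which follows the same rule by default. This sketch illustrates the convention itself, not gsutil's code; the file names are made up and the helper name star_matches_in is hypothetical.

```python
import glob
import os
import tempfile

def star_matches_in(directory: str) -> list:
    """Return the base names that the pattern '*' matches inside directory."""
    pattern = os.path.join(glob.escape(directory), "*")
    return sorted(os.path.basename(m) for m in glob.glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    # Create one dot file and two ordinary files.
    for name in (".hidden", "a.txt", "b.txt"):
        open(os.path.join(tmp, name), "w").close()
    star_matches = star_matches_in(tmp)   # dot file is skipped
    all_names = sorted(os.listdir(tmp))   # dot file is present on disk
```

To include dot files in a local copy, a pattern that names the leading "." explicitly (such as dir/.*) is needed in addition to dir/*.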
Efficiency Consideration: Using Wildcards Over Many Objects
It is faster and less network-intensive to use wildcards that have a non-wildcard object-name prefix, like:
gs://bucket/abc*.txt
than it is to use wildcards as the first part of the object name, like:
gs://bucket/*abc.txt
This is because the request for "gs://bucket/abc*.txt" asks the server to send back only the subset of results whose object names start with "abc" at the bucket root, and gsutil then filters that result list for objects whose names end with ".txt". In contrast, "gs://bucket/*abc.txt" asks the server for the complete list of objects in the bucket root, and gsutil then filters for those objects whose names end with "abc.txt". This efficiency consideration becomes increasingly noticeable for buckets containing thousands of objects or more. It is sometimes possible to name your objects to fit the wildcard patterns you expect to use, and so take advantage of server-side prefix requests. See "gsutil help prod" for a concrete use case.
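The effect of a literal prefix can be sketched with a small simulation. This is a simplified model under stated assumptions, not gsutil's implementation: the helper names literal_prefix and list_matching are hypothetical, the object names are made up, and a plain startswith test stands in for the server-side prefix request.

```python
import re
from fnmatch import fnmatchcase

WILDCARD_CHARS = re.compile(r"[*?\[]")

def literal_prefix(pattern: str) -> str:
    """Return the part of pattern before its first wildcard character."""
    m = WILDCARD_CHARS.search(pattern)
    return pattern if m is None else pattern[:m.start()]

def list_matching(all_names, pattern):
    """Return (number of names the server sends back, names that match)."""
    prefix = literal_prefix(pattern)
    # Stand-in for the server-side prefix request.
    returned = [n for n in all_names if n.startswith(prefix)]
    # Stand-in for the client-side filtering of the returned names.
    matched = [n for n in returned if fnmatchcase(n, pattern)]
    return len(returned), matched

# Hypothetical object names at the bucket root (no '/' in any name).
bucket_root = ["abc1.txt", "abc2.log", "xabc.txt", "zzz.txt", "abc.txt"]
fast = list_matching(bucket_root, "abc*.txt")   # server returns 3 names
slow = list_matching(bucket_root, "*abc.txt")   # server returns all 5 names
```

In the second call the literal prefix is empty, so every name in the bucket root has to be transferred before any filtering can happen.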
Efficiency Consideration: Using Mid-Path Wildcards
Suppose you have a bucket with these objects:
gs://bucket/obj1
gs://bucket/obj2
gs://bucket/obj3
gs://bucket/obj4
gs://bucket/dir1/obj5
gs://bucket/dir2/obj6
If you run the command:
gsutil ls gs://bucket/*/obj5
gsutil will perform a /-delimited top-level bucket listing and then one bucket listing for each subdirectory, for a total of 3 bucket listings:
GET /bucket/?delimiter=/
GET /bucket/?prefix=dir1/obj5&delimiter=/
GET /bucket/?prefix=dir2/obj5&delimiter=/
The more bucket listings your wildcard requires, the slower and more expensive it will be. The number of bucket listings required grows as:
- the number of wildcard components (e.g., "gs://bucket/a??b/c*/*/d" has 3 wildcard components);
- the number of subdirectories that match each component; and
- the number of results (pagination is implemented using one GET request per 1000 results, specifying markers for each).
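The listing count for the "gs://bucket/*/obj5" example can be reproduced with a small in-memory simulation. This is a sketch under stated assumptions rather than gsutil's code: FakeBucket and expand are hypothetical names, only a single wildcard directory component followed by a literal object name is handled, and pagination is ignored.

```python
from fnmatch import fnmatchcase

class FakeBucket:
    """In-memory stand-in for a bucket that records each listing request."""

    def __init__(self, names):
        self.names = sorted(names)
        self.requests = []

    def list(self, prefix=""):
        """Simulate one '/'-delimited listing: return (objects, subdirs)."""
        self.requests.append(prefix)
        objects, subdirs = [], set()
        for name in self.names:
            if not name.startswith(prefix):
                continue
            rest = name[len(prefix):]
            if "/" in rest:
                # Names below a deeper '/' are rolled up into one subdir entry.
                subdirs.add(prefix + rest.split("/", 1)[0] + "/")
            else:
                objects.append(name)
        return objects, sorted(subdirs)

def expand(bucket, pattern):
    """Expand a pattern such as '*/obj5' (wildcard directory, literal leaf)."""
    dir_pattern, leaf = pattern.split("/", 1)
    _, subdirs = bucket.list()                       # top-level listing
    matches = []
    for subdir in subdirs:
        if fnmatchcase(subdir.rstrip("/"), dir_pattern):
            objects, _ = bucket.list(subdir + leaf)  # one listing per subdir
            matches.extend(o for o in objects if o == subdir + leaf)
    return matches

bucket = FakeBucket(["obj1", "obj2", "obj3", "obj4", "dir1/obj5", "dir2/obj6"])
found = expand(bucket, "*/obj5")
```

After the call, bucket.requests holds one entry for the top-level listing plus one per matching subdirectory, which is the same count of three listings described above.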
If you want to use a mid-path wildcard, you might try instead using a recursive wildcard, for example:
gsutil ls gs://bucket/**/obj5
This will match more objects than "gs://bucket/*/obj5" (since it spans directories), but it is implemented using a delimiter-less bucket listing request. That means fewer bucket requests, although the request lists the entire bucket and gsutil filters the results locally, which can require a non-trivial amount of network traffic.