Filename encoding and interoperability problems
To reduce the chance of filename encoding interoperability problems, gsutil uses UTF-8 character encoding when uploading and downloading files. Because UTF-8 is in widespread (and growing) use, most users don't need to do anything to use it. Users whose filenames are stored in other encodings (such as Latin 1) must convert those filenames to UTF-8 before attempting to upload the files.
The most common place where users with filenames in some other encoding encounter a gsutil error is while uploading files using the recursive (-R) option on the gsutil cp, mv, or rsync commands. When this happens, you'll get an error like this:
CommandException: Invalid Unicode path encountered ('dir1/dir2/file_name_with_\xf6n_bad_chars'). gsutil cannot proceed with such files present. Please remove or rename this file and try again.
Note that the invalid Unicode characters have been hex-encoded in this error message because otherwise trying to print them would result in another error.
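Before running a recursive upload, it can help to find the offending names up front. The following is a minimal sketch, not part of gsutil: the function name find_non_utf8_names is hypothetical, and it assumes a filesystem (such as those typical on Linux) that stores filenames as raw bytes.

```python
import os

def find_non_utf8_names(root):
    """Return paths (as bytes) under root whose names are not valid UTF-8.

    Hypothetical helper for illustration; gsutil does not provide it.
    """
    bad = []
    # Passing bytes to os.walk makes it return raw, undecoded byte names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(os.path.join(dirpath, name))
    return bad
```

Printing each result with repr() shows the invalid bytes hex-escaped, in the same spirit as the error message above.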
If you encounter such an error, you can either remove the problematic file(s) or rename them and re-run the command. If you have a modest number of such files, the simplest approach is to choose a different name for each file and rename it manually (using local filesystem tools). If you have too many files for that to be practical, you can use a tool to convert the old character encoding to UTF-8. One such tool is native2ascii.
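As an illustration of bulk conversion, here is a sketch that re-encodes names assumed to be in Latin 1. The function names and the dry_run parameter are hypothetical, the source encoding is an assumption you must verify for your own files, and it is worth reviewing the dry-run output before renaming anything.

```python
import os

def to_utf8_name(raw_name, source_encoding="latin-1"):
    """Re-encode a raw byte filename from source_encoding to UTF-8."""
    return raw_name.decode(source_encoding).encode("utf-8")

def rename_tree_to_utf8(root, source_encoding="latin-1", dry_run=True):
    """Rename entries under root whose names are not valid UTF-8.

    Returns a list of (old_path, new_path) byte pairs. With dry_run=True
    nothing on disk is changed.
    """
    renames = []
    # Walk bottom-up so children are renamed before their parent directory.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
                continue  # already valid UTF-8, leave it alone
            except UnicodeDecodeError:
                pass
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, to_utf8_name(name, source_encoding))
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames
```

The sketch cannot detect the original encoding; if the names were not actually Latin 1, the resulting UTF-8 names will be valid but wrong.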
Unicode errors for valid Unicode filepaths can be caused by lack of Python locale configuration on Linux and Mac OSes. If your file paths are Unicode and you get encoding errors, ensure the LANG environment variable is set correctly. Typically, the LANG variable should be set to something like "en_US.UTF-8" or "de_DE.UTF-8".
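A quick way to inspect the current setting is sketched below. The helper name lang_is_utf8 is hypothetical, and the check only parses the LANG string; it is not how gsutil itself decides which encoding to use.

```python
import os

def lang_is_utf8(lang):
    """Return True if a LANG value such as 'en_US.UTF-8' names a UTF-8 locale."""
    if not lang:
        return False
    # LANG has the form language_TERRITORY.codeset@modifier; isolate the codeset.
    codeset = lang.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"

current = os.environ.get("LANG")
if not lang_is_utf8(current):
    print("LANG is %r; consider setting it to a UTF-8 locale" % (current,))
```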
Note also that there's no restriction on the character encoding used in file content - it can be UTF-8, a different encoding, or non-character data (like audio or video content). The gsutil UTF-8 character encoding requirement applies only to filenames.
Using Unicode Filenames On Windows
Windows support for Unicode in the command shell (cmd.exe or powershell) is somewhat painful, because Windows uses a Windows-specific character encoding called cp1252. To use Unicode characters, you need to run this command in the command shell before the first time you use gsutil in that shell:
chcp 65001
If you neglect to do this before using gsutil, the progress messages while uploading files with Unicode names or listing buckets with Unicode object names will look garbled (i.e., with different glyphs than you expect in the output). If you simply run the chcp command and re-run the gsutil command, the output should no longer look garbled.
gsutil attempts to translate between cp1252 encoding and UTF-8 in the main places where Unicode encoding/decoding problems have been encountered to date (traversing the local file system while uploading files, and printing Unicode names while listing buckets). However, because gsutil must perform translation, it is likely that there are other erroneous edge cases when using Windows with Unicode. If you encounter problems, you might consider instead using Cygwin (on Windows) or Linux or MacOS, all of which support Unicode.
Using Unicode Filenames On MacOS
MacOS stores filenames in decomposed form (also known as NFD normalization). For example, if a filename contains an accented "e" character, that character will be converted to an "e" followed by an accent before being saved to the filesystem. As a consequence, it's possible to have different name strings for files uploaded from an operating system that doesn't enforce decomposed form (like Ubuntu) from one that does (like MacOS).
The following example shows how this behavior could lead to unexpected results. Say you create a file with non-ASCII characters on Ubuntu. Ubuntu stores that filename in its composed form. When you upload the file to the cloud, it is stored as named. But if you use gsutil rsync to bring the file to a MacOS machine and edit the file, then when you use gsutil rsync to bring this version back to the cloud, you end up with two different objects instead of overwriting the original. This is because MacOS converted the filename to a decomposed form, and Cloud Storage sees this as a different object name.
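The difference between the two forms can be seen with Python's standard unicodedata module. This sketch uses a made-up filename; the helper same_name_ignoring_normalization is hypothetical and only illustrates the comparison, since Cloud Storage itself compares object names as given.

```python
import unicodedata

# 'é' as a single code point (composed form, NFC).
composed = "caf\u00e9.txt"
# 'e' followed by a combining acute accent (decomposed form, NFD).
decomposed = unicodedata.normalize("NFD", composed)

# The two strings render the same but are different sequences of
# code points, so they encode to different UTF-8 bytes.
names_differ = composed != decomposed
bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")

def same_name_ignoring_normalization(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Normalizing both names to one form before comparing them locally is one way to spot pairs that would otherwise become two separate objects.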
Cross-Platform Encoding Problems Of Which To Be Aware
Using UTF-8 for all object names and filenames will ensure that gsutil doesn't encounter character encoding errors while operating on the files. Unfortunately, files uploaded or downloaded this way can still have interoperability problems, for a number of reasons unrelated to gsutil. For example:
- Windows filenames are case-insensitive, while Google Cloud Storage, Linux, and MacOS are not. Thus, for example, if you have two filenames on Linux differing only in case and upload both to Google Cloud Storage and then subsequently download them to Windows, you will end up with just one file whose contents came from the last of these files to be written to the filesystem.
- MacOS performs character encoding decomposition based on tables stored in the OS, and the tables change between Unicode versions. Thus the encoding used by an external library may not match that performed by the OS. As a result, two object names may translate to a single local filename.
- Windows console support for Unicode is difficult to use correctly.
For a more thorough list of such issues see this presentation
These problems mostly arise when sharing data across platforms (e.g., uploading data from a Windows machine to Google Cloud Storage, and then downloading from Google Cloud Storage to a machine running MacOS). Unfortunately these problems are a consequence of the lack of a filename encoding standard, and users need to be aware of the kinds of problems that can arise when copying filenames across platforms.
There is one precaution users can exercise to prevent some of these problems: When using the Windows console specify wildcards or folders (using the -R option) rather than explicitly named individual files.
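The case-sensitivity and decomposition issues listed above can also be checked for before a cross-platform transfer. The following sketch groups names that differ only in case or normalization form. The function name is hypothetical, and the folding rules are an approximation: Windows and MacOS apply their own rules, which may not match Python's casefold() and NFC normalization exactly.

```python
import unicodedata
from collections import defaultdict

def find_colliding_names(names):
    """Group names that differ only by case or Unicode normalization.

    Returns a list of sorted groups, each containing two or more
    distinct names that may map to one file on some platforms.
    """
    groups = defaultdict(set)
    for name in names:
        # Approximate a case-insensitive, normalization-insensitive key.
        key = unicodedata.normalize("NFC", name).casefold()
        groups[key].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, passing the names Report.txt, report.txt, and notes.txt yields one group containing the first two names.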