Cloud Storage as a File System in AI Training
Oliver Zhuang
Software Engineer
Cloud Storage is a common choice for Vertex AI and AI Platform users to store their training data, models, checkpoints and logs. Now, with Cloud Storage FUSE, training jobs on both platforms can access their data on Cloud Storage as files in the local file system.
This post introduces Cloud Storage FUSE for Vertex AI Custom Training. The feature works very similarly on AI Platform Training.
Cloud Storage FUSE provides three benefits over the traditional ways of accessing Cloud Storage:
Training jobs can start quickly without downloading any training data.
Training jobs can perform I/O easily at scale, without the friction of calling the Cloud Storage APIs, handling the responses, or integrating with client-side libraries.
Training jobs can leverage the optimized performance of Cloud Storage FUSE.
The problems
Traditionally, training jobs have two ways to use data from Cloud Storage.
They can use gsutil to download the entire dataset prior to training. This may take hours depending on the dataset size, which significantly slows down job start-up.
They can call the Cloud Storage APIs directly, or through an integrated client library. This approach adds considerable complexity to the training code, and thus to the cost of development and maintenance.
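For illustration, here is a minimal sketch of the second approach using the google-cloud-storage Python client; the bucket and object names are hypothetical:

```python
# A sketch of the client-library approach, using the google-cloud-storage
# package. The bucket and object names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-bucket")
blob = bucket.blob("data.csv")
data = blob.download_as_bytes()  # every read is an explicit API call
```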
Cloud Storage FUSE
Cloud Storage FUSE is a Filesystem in Userspace (FUSE) mounted on Vertex AI systems.
When you start a custom training job, the job sees a directory /gcs that contains all the Cloud Storage buckets as subdirectories. The job can access a subdirectory (i.e., a bucket) when the required permissions are granted.
For instance, a training job can read the file /gcs/example-bucket/data.csv to get the training data stored in the object gs://example-bucket/data.csv. Training jobs can also write to the bucket.
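Here is a minimal Python sketch of both directions; the bucket name and file paths are hypothetical:

```python
# A minimal sketch, assuming the hypothetical bucket "example-bucket".

# Read the training data as if it were a local file.
with open("/gcs/example-bucket/data.csv") as f:
    rows = f.readlines()

# Write results back to the bucket through the mounted file system.
with open("/gcs/example-bucket/output.txt", "w") as f:
    f.write(f"read {len(rows)} rows\n")
```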
Permissions
Users can assign a service account to a training job to configure its permissions for Cloud Storage buckets.
If no service account is assigned to the training job, it can access all the buckets owned by the same project.
If the training job is assigned a service account with Cloud Storage roles, it has the permissions granted by those roles.
For instance, you may create a service account that has
storage.objectAdmin on bucket A, and
storage.objectViewer on bucket B.
If you assign it to your training job, your training job will be able to
read and write in bucket A, and
read only from bucket B.
The training job will fail with a “permission denied” error if it tries to write to bucket B.
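As a sketch of how a service account is attached to a job, here is an example with the Vertex AI SDK for Python (google-cloud-aiplatform); the project, bucket, container image, and service account email are hypothetical placeholders:

```python
# A minimal sketch using the Vertex AI SDK for Python.
# The project, buckets, image, and service account email are hypothetical.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomJob(
    display_name="gcsfuse-demo",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)

# The job runs as this service account, so the account's Cloud Storage
# roles determine which buckets under /gcs the job can read or write.
job.run(service_account="trainer-sa@my-project.iam.gserviceaccount.com")
```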
Performance
I/O is often a bottleneck for training jobs with large datasets. Here are some tips to improve the read throughput of Cloud Storage FUSE:
Store data in large files to reduce the number of files used in the training. Fewer files mean less lookup overhead in locating and opening objects in Cloud Storage.
Use multiple threads. Higher concurrency utilizes the bandwidth better (see the sketch after this list).
Keep the files warm. Files that are accessed frequently (i.e., warm) are generally better cached and read with better performance.
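To illustrate the second tip, here is a minimal sketch of multi-threaded reads with Python's standard library; the shard directory is a hypothetical path:

```python
# A minimal sketch of multi-threaded reads, assuming training shards live
# under the hypothetical path /gcs/example-bucket/shards/.
import concurrent.futures
import pathlib

shard_paths = sorted(pathlib.Path("/gcs/example-bucket/shards").glob("*.csv"))

def read_shard(path):
    # Each thread issues its own read, so several Cloud Storage requests
    # are in flight at once and the available bandwidth is used better.
    return path.read_bytes()

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    shards = list(pool.map(read_shard, shard_paths))
```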
Restrictions
Cloud Storage FUSE is not a POSIX-compliant file system, so some operations that are valid on a POSIX file system produce unwanted results and should be avoided.
Directories:
The root directory `/gcs` is not readable. If you run ls /gcs, you will get an “Input/output error”. However, it is fine to read a bucket root, such as ls /gcs/example-bucket.
Renaming a directory is not atomic. If a rename is interrupted, it leaves a partial result, with some files in the new directory and others still in the old one. A directory with too many direct and indirect files cannot be renamed.
Files:
Hard links are not supported.
File metadata, such as ownership, permissions, mtime, and extended attributes, is not supported. Do not rely on file metadata for training logic.
Flushing a file pushes the entire file to Cloud Storage, which is expensive. Closing a file triggers a flush, so avoid frequent closes and flushes (see the sketch after this list).
Concurrent writes to a file lead to data corruption.
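To illustrate the flush behavior, here is a minimal Python sketch that batches writes instead of closing the file repeatedly; the log path is hypothetical:

```python
# A minimal sketch, assuming a hypothetical metrics file under /gcs.
log_path = "/gcs/example-bucket/logs/metrics.txt"

# Anti-pattern: opening, writing, and closing inside a loop would flush
# the entire object to Cloud Storage on every iteration.

# Better: collect results in memory (or in a local file) and write them
# once, so the object is flushed to Cloud Storage a single time.
lines = [f"step={step}\n" for step in range(1000)]
with open(log_path, "w") as f:
    f.writelines(lines)
```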
Logs
Cloud Storage FUSE emits logs that can help you diagnose errors in training.
First, follow the link to the Logs Explorer on the training job’s page in the Google Cloud Console. In the explorer, you can run queries to inspect the logs generated by your training job.
Second, you can view the logs that have “gcsfuse” in the resource.labels.taskName property. For instance, the task name “workerpool0-0.gcsfuse” indicates that the log comes from the Cloud Storage FUSE instance mounted for the first worker (“0”) in the first worker pool (“workerpool0”).
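For instance, a Logs Explorer query such as the following (a sketch built from the property named above; adjust the task name to your own job) narrows the view to the FUSE logs of one worker:

```
resource.labels.taskName="workerpool0-0.gcsfuse"
```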
What’s next
You can find more information on Cloud Storage FUSE in the documentation:
You can also find code samples using Cloud Storage FUSE for Vertex AI Custom Training: