Overview of caching in Cloud Storage FUSE

Cloud Storage FUSE provides three types of optional caching to help increase the performance of data retrieval: file caching, stat caching, and type caching.

File caching overview

The Cloud Storage FUSE file cache is a client-based read cache that lets repeat file reads be served from faster cache storage of your choice. The file cache is disabled by default.

Benefits of file caching

  • Improved performance: file caching improves latency and throughput by serving reads directly from the cache media. Small and random I/O operations can be significantly faster when served from the cache.

  • Use existing capacity: file caching can use existing provisioned machine capacity for your cache directory without incurring charges for additional storage. This includes Local SSDs that come bundled with Cloud GPUs machine types such as a2-ultragpu and a3-highgpu, Persistent Disk (which is the boot disk used by each VM), or in-memory /tmpfs.

  • Reduced charges: cache hits are served locally and don't incur Cloud Storage operation or network charges.

  • Improved total cost of ownership for AI and ML training: file caching increases Cloud GPUs and Cloud TPU utilization by loading data faster, which reduces time to training, and provides a greater price-performance ratio for AI and ML training workloads.

Enabling and configuring the file cache

When you enable the file caching feature, you set fields in a Cloud Storage FUSE configuration file. The following list describes the fields you can use to control file caching:

  • You can control the maximum capacity that cached data can occupy in your cache directory by setting the max-size-mb field.

    By default, the max-size-mb field is set to -1, which allows the cached data to grow until it occupies all the available capacity in the directory you specify as the value of cache-dir.

  • You can specify a directory for storing file cache data by using the cache-dir field. Note that specifying a cache directory is a prerequisite for enabling the file cache.

  • You can control the time at which cached data becomes invalidated by using the ttl-secs field. By default, the ttl-secs field is set to 60, which specifies 60 seconds. We recommend increasing the value.

    For more details about controlling the invalidation of cached data, see Configuring cache invalidation. For more information about the eviction of cached data, see Eviction.
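As an illustration of how these fields fit together, the following Python sketch renders a minimal configuration file. The key nesting (a top-level cache-dir, max-size-mb under file-cache, and ttl-secs under metadata-cache) is an assumption based on recent Cloud Storage FUSE releases. The function name, path, and values are hypothetical, so check the configuration file reference for your version before relying on it.

```python
# Sketch only: renders config text for enabling the file cache.
# ASSUMPTION: the key nesting below matches your Cloud Storage FUSE version.

def render_file_cache_config(cache_dir: str, max_size_mb: int = -1,
                             ttl_secs: int = 60) -> str:
    """Return YAML text that sets cache-dir, max-size-mb, and ttl-secs."""
    if not cache_dir:
        # Specifying a cache directory is a prerequisite for the file cache.
        raise ValueError("cache-dir is required to enable the file cache")
    lines = [
        f"cache-dir: {cache_dir}",
        "file-cache:",
        f"  max-size-mb: {max_size_mb}",  # -1 lets the cache fill the directory
        "metadata-cache:",
        f"  ttl-secs: {ttl_secs}",        # default is 60 seconds
    ]
    return "\n".join(lines) + "\n"


# Hypothetical path and values, for illustration only.
example_config = render_file_cache_config(
    "/mnt/localssd/gcsfuse-cache", max_size_mb=51200, ttl_secs=3600
)
print(example_config)
```

Writing the returned text to a file gives you a config file that you can pass to Cloud Storage FUSE at mount time.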

Random and partial reads

If a file's first read operation starts from the beginning of the file (offset 0), the Cloud Storage FUSE file cache ingests and loads the entire file into the cache, even if you're only reading a small range of it. This lets subsequent random or partial reads from the same object be served directly from the cache.

If a file's first read operation starts from anywhere other than offset 0, Cloud Storage FUSE, by default, doesn't trigger an asynchronous full file fetch. To change this behavior so that Cloud Storage FUSE ingests a file to the cache upon an initial random read, set the cache-file-for-range-read flag to true. We recommend that you enable the cache-file-for-range-read flag if many different random or partial read operations are performed on the same object.
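The rule described in this section can be summarized as a small decision function. This is a simplified model of the documented behavior, not Cloud Storage FUSE code; the function name and parameters are hypothetical.

```python
# Simplified model of when a file's first read causes a full-file ingest.

def triggers_full_file_fetch(first_read_offset: int,
                             cache_file_for_range_read: bool = False) -> bool:
    """Return True if the first read makes the whole file get cached."""
    if first_read_offset < 0:
        raise ValueError("offset must be non-negative")
    if first_read_offset == 0:
        # A read from the start of the file always ingests the entire file.
        return True
    # For any other offset, ingestion depends on cache-file-for-range-read.
    return cache_file_for_range_read


print(triggers_full_file_fetch(0))
print(triggers_full_file_fetch(4096))
print(triggers_full_file_fetch(4096, cache_file_for_range_read=True))
```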

Eviction

The eviction of cached metadata and data is based on a least recently used (LRU) algorithm that begins once the space threshold configured by the max-size-mb limit is reached. If an entry expires based on its TTL, a Get metadata call is first made to Cloud Storage and is subject to network latencies. Because data and metadata are managed separately, one might be evicted or invalidated while the other isn't.
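To make the LRU behavior concrete, the following sketch models a size-bounded cache that evicts the least recently read files first. It is a toy model under stated assumptions, not the actual Cloud Storage FUSE implementation: the class and method names are hypothetical, sizes are tracked in whole megabytes, eviction starts when the limit is exceeded, and the most recently added file is never evicted.

```python
from collections import OrderedDict


class LruFileCacheModel:
    """Toy LRU model: entries are ordered from least to most recently read."""

    def __init__(self, max_size_mb: int = -1):
        self.max_size_mb = max_size_mb  # -1 means no limit in this model
        self.entries = OrderedDict()    # file name -> size in MB
        self.used_mb = 0

    def record_read(self, name: str, size_mb: int) -> list:
        """Record a read and return the names evicted to make room."""
        if name in self.entries:
            # Cache hit: mark the file as most recently used.
            self.entries.move_to_end(name)
            return []
        self.entries[name] = size_mb
        self.used_mb += size_mb
        evicted = []
        while (self.max_size_mb >= 0
               and self.used_mb > self.max_size_mb
               and len(self.entries) > 1):
            # Evict from the least recently used end.
            oldest, oldest_size = self.entries.popitem(last=False)
            self.used_mb -= oldest_size
            evicted.append(oldest)
        return evicted


cache = LruFileCacheModel(max_size_mb=10)
cache.record_read("a", 4)
cache.record_read("b", 4)
cache.record_read("a", 4)            # "a" becomes most recently used
evicted = cache.record_read("c", 4)  # exceeds 10 MB, so "b" is evicted
print(evicted)
```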

Performance

Cloud Storage FUSE caching works with any user-specified directory that's backed by your choice of storage, such as Local SSD, Persistent Disk, in-memory tmpfs, or Filestore. Cloud Storage FUSE cache performance matches the performance of the underlying storage used by the cache, with minimal overhead. To learn more about caching performance, see Cloud Storage FUSE caching performance and best practices.

Persistence

Cloud Storage FUSE caches aren't persisted across unmounts and restarts: all metadata entries are evicted. However, data in the file cache directory isn't deleted automatically. You should either delete it yourself, or you can reuse it in subsequent mount operations once the metadata has been populated again.

Security

When you enable caching, Cloud Storage FUSE uses the directory you specify in cache-dir to store files from your Cloud Storage bucket in an unencrypted format. Any user or process that has access to this cache directory can access these files. We recommend that you restrict access to this directory.
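One way to restrict access is to create the cache directory so that only its owner can read, write, or list it. The sketch below assumes a POSIX system and uses a temporary location for demonstration; the function name is hypothetical, and in practice the directory must be owned by the user that runs Cloud Storage FUSE.

```python
import os
import stat
import tempfile


def create_private_cache_dir(path: str) -> str:
    """Create a cache directory that only its owner can access."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs applies the umask and leaves existing directories unchanged,
    # so set the permission bits explicitly as well.
    os.chmod(path, 0o700)
    return path


# Demonstration in a temporary location (hypothetical directory name).
demo_dir = create_private_cache_dir(
    os.path.join(tempfile.mkdtemp(), "gcsfuse-cache")
)
print(oct(stat.S_IMODE(os.stat(demo_dir).st_mode)))
```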

Direct or multiple access to the file cache

Using a process other than Cloud Storage FUSE to access or modify a file in the cache directory can lead to data corruption. Cloud Storage FUSE caches are specific to each running Cloud Storage FUSE process, with no awareness across different Cloud Storage FUSE processes running on the same or different machines. Consequently, the same cache directory shouldn't be used by different Cloud Storage FUSE processes.

If multiple Cloud Storage FUSE processes need to run on the same machine, each Cloud Storage FUSE process should get its own cache directory, or you should use one of the following methods to ensure your data doesn't get corrupted:

  • Mount all buckets with a shared cache: use dynamic mounting to mount all buckets you have access to in a single process with a shared cache. To learn more, see Cloud Storage FUSE dynamic mounting.

  • Enable caching on a specific bucket: you can enable caching on only a specified bucket using static mounting. To learn more, see Cloud Storage FUSE static mounting.

  • Cache only a specific folder or directory: rather than mounting an entire bucket, you can use the --only-dir option to mount and cache only a specific bucket-level folder. To learn more, see Mount a directory within a bucket.

Stat caching overview

The Cloud Storage FUSE stat cache is a cache for object metadata that improves performance for operations specific to file attributes such as size, modification time, or permissions. Using the stat cache improves latency by using cached data to perform operations instead of sending a stat object request to Cloud Storage. By default, the stat cache is enabled with a stat-cache-max-size-mb value of 32 MB and a ttl-secs value of 60 seconds. We recommend increasing both values. To learn more about stat caching, see the Semantics documentation on GitHub.

Type caching overview

The Cloud Storage FUSE type cache is a metadata cache that accelerates performance for metadata operations specific to file or directory existence. Using the type cache improves latency by storing this information locally, which reduces the number of requests made to Cloud Storage to check if a file or directory exists. By default, the type cache is enabled with a type-cache-max-size-mb value of 4 MB and a ttl-secs value of 60 seconds. We recommend increasing both values. To learn more about type caching, see the Semantics documentation on GitHub.

Configuring cache invalidation

The ttl-secs field specifies how long metadata remains valid in the file cache. The field represents the time to live (TTL) of cached data, where the cached data gets invalidated once the TTL expires. When a metadata entry becomes invalid, subsequent reads are queried from Cloud Storage.

You can configure ttl-secs in a Cloud Storage FUSE config file.

When you specify a value for ttl-secs that's greater than 0, the metadata for the file cache remains valid only for the amount of time you specified. A default value of 60 is set for all cache types. For file caching, we recommend that you increase the ttl-secs value based on the expected time between repeat reads while you balance consistency needs. Depending on how important it is to read current data and how often the data changes, we recommend setting the ttl-secs value as high as your workload allows.

Apart from a positive number of seconds, the ttl-secs field also supports the values 0 and -1.

  • ttl-secs value of 0: when you enter a value of 0, Cloud Storage FUSE ensures that the most up-to-date file is read by issuing a Get metadata call to Cloud Storage to check that the cached file is consistent with the object. If the file is in the cache and hasn't changed, it's served from the cache after the Get metadata call. Performance is lower than with a ttl-secs value other than 0 because a call must always be made to Cloud Storage to check the metadata first.

  • ttl-secs value of -1: when you enter a value of -1, the file is always read from the cache, if it's available, without checking for consistency. Serving files without checking for consistency can serve inconsistent data, and should only be used temporarily for workloads that run in jobs with non-changing data. For example, using a value of -1 is useful for machine learning training, where the same data is read across multiple epochs without changes.
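The three kinds of ttl-secs values can be summarized in a short function that decides whether a read has to check metadata with Cloud Storage first. This is a simplified model of the semantics described above, not Cloud Storage FUSE code. The function name is hypothetical, and treating an entry as expired exactly when its age reaches the TTL is an assumption.

```python
# Simplified model of how ttl-secs controls metadata checks before a read.

def needs_metadata_check(entry_age_secs: float, ttl_secs: int) -> bool:
    """Return True if a Get metadata call is needed before serving a read."""
    if ttl_secs < -1:
        raise ValueError("ttl-secs must be -1, 0, or a positive number")
    if ttl_secs == -1:
        # Never check: always serve from the cache if the file is available.
        return False
    if ttl_secs == 0:
        # Always check: every read is validated against Cloud Storage.
        return True
    # Positive TTL: check only after the cached entry has expired.
    return entry_age_secs >= ttl_secs


print(needs_metadata_check(entry_age_secs=30, ttl_secs=60))
print(needs_metadata_check(entry_age_secs=90, ttl_secs=60))
```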

Read path for cached data

The Cloud Storage FUSE cache accelerates repeat reads of files after the files have been ingested into the cache. Both first-time reads and cache misses go directly to Cloud Storage and are subject to normal Cloud Storage network latencies.

Considerations

  • Using Cloud Storage FUSE with file caching, stat caching, or type caching enabled can increase performance but reduce consistency. To learn more, see the Semantics documentation on GitHub.

  • If a file cache entry hasn't yet expired based on its TTL and the file is in the cache, the entire operation is served from the local client cache without any request being issued to Cloud Storage.

  • If a file cache entry has expired based on its TTL, a Get metadata call is first made to Cloud Storage, and if the file isn't in the cache, the file is retrieved from Cloud Storage. Both operations are subject to network latencies. If the metadata entry has been invalidated, but the file is in the cache, and its object generation has not changed, the file is served from the cache only after the Get metadata call is made to check if the data is valid.

  • If a Cloud Storage FUSE client modifies a cached file or its metadata, then the file is immediately invalidated and consistency is ensured in the following read by the same client. However, if different clients access the same file or its metadata, and its entries are cached, then the cached version of the file or metadata is read and not the updated version until the file is invalidated by that specific client's TTL setting.
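The read-path cases in the preceding list can be expressed as a small planning function that returns the steps a read goes through. This is an illustrative model only: the function and step names are hypothetical, and the case where the object generation has changed is assumed to fall back to reading from Cloud Storage.

```python
from typing import List


def plan_read(entry_expired: bool, file_in_cache: bool,
              generation_unchanged: bool = True) -> List[str]:
    """Return the ordered steps for a read under the file cache model."""
    steps = []
    if entry_expired:
        # An expired entry always costs a Get metadata call first.
        steps.append("get_metadata")
    if file_in_cache and (not entry_expired or generation_unchanged):
        steps.append("serve_from_cache")
    else:
        # Cache miss, or (assumed) the cached copy is stale.
        steps.append("read_from_cloud_storage")
    return steps


print(plan_read(entry_expired=False, file_in_cache=True))
print(plan_read(entry_expired=True, file_in_cache=True))
print(plan_read(entry_expired=True, file_in_cache=False))
```

Only the first case avoids network latency entirely. Every other path includes at least one request to Cloud Storage.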

What's next