Storage services

Storage services support several use cases in artificial intelligence (AI) and machine learning (ML) workloads.

Storage use cases

Use cases where storage services might be used include the following:

High durability, medium I/O operations | Low durability, high I/O operations | Other operations
  • Loading model binaries for training
  • Loading model variables for inference
  • Storing model checkpoints during training
  • Storing scratch or temporary data
  • Loading VM images
  • Loading data for training
  • Loading model weights
  • Logging data
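
As a concrete illustration of the checkpoint use case, the following minimal Python sketch writes checkpoints with ordinary file operations. It assumes the chosen storage service is already mounted as a directory on the VM. The `save_checkpoint` and `latest_checkpoint` helpers, the JSON format, and the file naming are hypothetical choices for illustration, not part of any Google Cloud API. The write-then-rename step is atomic on local POSIX file systems, but that guarantee might not hold on every mounted backend.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(state: dict, directory: Path, step: int) -> Path:
    """Write `state` as JSON to <directory>/checkpoint-<step>.json."""
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers keep lexicographic and numeric order aligned.
    final_path = directory / f"checkpoint-{step:08d}.json"
    tmp_path = final_path.with_name(final_path.name + ".tmp")

    # Write to a temporary file first so readers never see a partial checkpoint.
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())

    # Rename into place (assumption: the backend supports rename semantics).
    os.replace(tmp_path, final_path)
    return final_path


def latest_checkpoint(directory: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None."""
    paths = sorted(directory.glob("checkpoint-*.json"))
    return paths[-1] if paths else None


if __name__ == "__main__":
    # A temporary directory stands in for the mounted storage path.
    with tempfile.TemporaryDirectory() as mount_point:
        ckpt_dir = Path(mount_point) / "checkpoints"
        save_checkpoint({"step": 100, "loss": 0.42}, ckpt_dir, step=100)
        print(latest_checkpoint(ckpt_dir).name)
```

Because the helpers use only standard file I/O, the same code can point at any mounted directory without changes.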

Storage recommendations

To optimize ML system performance, a combination of storage services from both our first-party and third-party catalogs might be appropriate.

The following storage services are recommended for each use case:

Storage service: Filestore (Zonal tier)

Features:
  • Provides an NFS mount for Google Kubernetes Engine (GKE) and Compute Engine VM instances
  • Scales to dozens of clients and 100 TB of capacity

Use cases:
  • Often used for /home directories
  • Small-scale AI and ML training and serving

Storage service: Cloud Storage

Features:
  • Provides a highly scalable, highly durable, low-cost object store.

    Through integration with Cloud Storage FUSE, Cloud Storage buckets can be mounted as a file system.

  • Supports large-scale (terabytes to exabytes) training data for GPU and TPU clusters
  • Supports high-throughput (1.2 TB/s of bandwidth or greater) training and inference. To reach this throughput, you need to tune Cloud Storage FUSE, use a Cloud Storage FUSE file cache, and plan for Cloud Storage bandwidth.

Use cases:
  • Load binaries
  • Load data for training
  • Store checkpoints
  • Serve models

Storage service: Parallelstore

Features:
  • Provides a parallel file system (PFS) scratch storage solution
  • Scales to 100 TiB datasets and 125 GB/s of throughput
  • Delivers ultra-low (sub-millisecond) latency and high per-VM read/write throughput (20+ GB/s)
  • Has full POSIX support, which enables out-of-the-box migration of on-premises AI workloads to Google Cloud

Use cases:
  • Store scratch or temporary data
  • Load data for training
  • Serve models

Storage service: Sycomp Storage Scale (GPFS)

Features:
  • Provides a parallel file system (PFS) persistent storage solution.

    This storage solution is a third-party storage system that is available on the Google Cloud Marketplace.

  • Scales beyond 1 PB of training data and to 1.2 TB/s of throughput
  • Delivers ultra-low (sub-millisecond) latency and high per-VM read/write throughput (20+ GB/s)
  • Has full POSIX support with direct access to datasets that use Cloud Storage, Sycomp Storage, or NFS
  • Supports hybrid deployments by using on-demand caching of on-premises data

Use cases:
  • Store persistent data
  • Load data for training
  • Serve models
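
The headline figures in these recommendations lend themselves to rough capacity planning. The following Python sketch is illustrative only: it copies the quoted peak throughput numbers and use cases into a hypothetical `StorageService` record and derives a best-case read time and the number of VMs needed to reach the aggregate figure. The class, field, and function names are assumptions made for this example and are not part of any Google Cloud API. Real throughput depends on tuning, machine types, and workload, so treat the results as optimistic bounds.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

TIB = 2**40  # bytes in one tebibyte
GB = 10**9   # bytes in one (decimal) gigabyte


@dataclass(frozen=True)
class StorageService:
    """Hypothetical record holding figures quoted in the recommendations."""
    name: str
    use_cases: Tuple[str, ...]
    aggregate_gb_per_s: Optional[float] = None  # quoted peak aggregate throughput
    per_vm_gb_per_s: Optional[float] = None     # quoted peak per-VM throughput


SERVICES = (
    StorageService(
        "Filestore (Zonal tier)",
        ("Small-scale AI and ML training and serving",),
    ),
    StorageService(
        "Cloud Storage",
        ("Load binaries", "Load data for training", "Store checkpoints", "Serve models"),
        aggregate_gb_per_s=1200.0,
    ),
    StorageService(
        "Parallelstore",
        ("Store scratch or temporary data", "Load data for training", "Serve models"),
        aggregate_gb_per_s=125.0,
        per_vm_gb_per_s=20.0,
    ),
    StorageService(
        "Sycomp Storage Scale (GPFS)",
        ("Store persistent data", "Load data for training", "Serve models"),
        aggregate_gb_per_s=1200.0,
        per_vm_gb_per_s=20.0,
    ),
)


def services_for(use_case: str) -> List[str]:
    """Names of the services whose listed use cases include `use_case`."""
    return [s.name for s in SERVICES if use_case in s.use_cases]


def min_read_seconds(dataset_bytes: int, service: StorageService) -> float:
    """Best-case time to read the dataset once at the quoted aggregate rate."""
    if service.aggregate_gb_per_s is None:
        raise ValueError(f"no aggregate throughput quoted for {service.name}")
    return dataset_bytes / (service.aggregate_gb_per_s * GB)


def min_clients(service: StorageService) -> int:
    """Fewest VMs needed to reach the aggregate rate at 20 GB/s per VM."""
    if service.aggregate_gb_per_s is None or service.per_vm_gb_per_s is None:
        raise ValueError(f"throughput figures missing for {service.name}")
    return math.ceil(service.aggregate_gb_per_s / service.per_vm_gb_per_s)
```

For example, reading a 100 TiB dataset once from Parallelstore at the quoted 125 GB/s takes at least about 880 seconds (roughly 15 minutes), and at 20 GB/s per VM it takes at least 7 VMs to reach that aggregate rate.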