Design storage for AI and ML workloads in Google Cloud

Last reviewed 2024-03-20 UTC

When you choose Google Cloud storage services for your artificial intelligence (AI) and machine learning (ML) workloads, you must be careful to select the correct combination of storage options for each specific job. This need for careful selection applies when you upload your dataset, train and tune your model, place the model into production, or store the dataset and model in an archive. In short, you need to select the best storage services that provide the proper latency, scale, and cost for each stage of your AI and ML workloads.

To help you make well-informed choices, this document provides design guidance on how to use and integrate the variety of storage options offered by Google Cloud for key AI and ML workloads.

Figure 1 shows a summary of the primary storage choices. As shown in the diagram, you typically choose Cloud Storage when you have larger file sizes, lower input and output operations per second (IOPS), or higher latency. However, when you require higher IOPS, smaller file sizes, or lower latency, choose Filestore instead.

Figure 1: Primary AI and ML storage considerations

Overview of AI and ML workload stages

AI and ML workloads consist of four primary stages: prepare, train, serve, and archive. At each of these stages, you need to decide which storage options to use. In most cases, we recommend that you continue to use the same storage choice that you select in the prepare stage for the remaining stages. Following this recommendation helps you to reduce the copying of datasets between storage services. However, there are some exceptions to this general rule, which are described later in this guide.

Some storage solutions work better than others at each stage and might need to be combined with additional storage choices for the best results. The effectiveness of the storage choice depends on the dataset properties, scale of the required compute and storage resources, latency, and other factors. The following table describes the stages and a brief summary of the recommended storage choices for each stage. For a visual representation of this table and additional details, see the decision tree.

Table 1: Storage recommendations for the stages and steps in AI and ML workloads
Stages Steps Storage recommendations

Prepare

Data preparation

  • Upload and ingest your data.
  • Transform the data into the correct format before training the model.

Cloud Storage

  • Large files (50 MB or larger) that can tolerate higher storage latency (tens of milliseconds).

Filestore Zonal

  • Smaller datasets with smaller files (less than 50 MB) and lower storage latency (~ 1 millisecond).

Train

  1. Model development
    • Develop your model by using notebooks and applying iterative trial and error.
  2. Model training
    • Use small-to-large scale numbers of graphics processing units (Cloud GPUs) or Tensor Processing Units (Cloud TPUs) to repeatedly read the training dataset.
    • Apply an iterative process to model development and training.

Cloud Storage

  • If you select Cloud Storage in the prepare stage, it's best to keep your data in Cloud Storage when you train your model.

Cloud Storage with Local SSD or Filestore

  • If you select Cloud Storage in the prepare stage but need to support small I/O requests or small files, you can supplement your training tasks. To do so, move some of your data from Cloud Storage to Local SSD or Filestore Zonal.

Filestore

  • If you select Filestore in the prepare stage, it's best to keep your data in Filestore when you train your model.
  • Create a Local SSD cache to supplement your Filestore training tasks.
  3. Checkpointing and restart
    • Save state periodically during model training by creating a checkpoint so that the training can restart after a node failure.
    • Make this selection based on the I/O pattern and the amount of data that needs to be saved at the checkpoint.

Cloud Storage

  • If you select Cloud Storage in the prepare stage, it's best to use Cloud Storage for checkpointing and restart.
  • Good for throughput, and workloads that need large numbers of threads.

Filestore Zonal

  • If you select Filestore in the prepare stage, it's best to use Filestore for checkpointing and restart.
  • Good for latency, high per-client throughput, and low numbers of threads.

Serve

  • Store the model.
  • Load the model into an instance running Cloud GPUs or Cloud TPUs at startup.
  • Store results of model inference, such as generated images.
  • Optionally, store and load the dataset used for model inference.

Cloud Storage

  • If you train your model in Cloud Storage, it's best to use Cloud Storage to serve your model.
  • Save the content generated by your model in Cloud Storage.

Filestore

  • If you train your model in Filestore, it's best to use Filestore for serving your model.
  • If you need durability and low latency when generating small files, choose Filestore Zonal (zonal) or Filestore Enterprise (regional).

Archive

  • Retain the training data and the model for extended time periods.

Cloud Storage

  • Optimize storage costs with multiple storage classes, Autoclass, or object lifecycle management.
  • If you use Filestore, you can use Filestore snapshots and backups, or copy the data to Cloud Storage.

For more details about the underlying assumptions for this table, see the following sections:

Criteria

To narrow your choices of which storage options to use for your AI and ML workloads, start by answering these questions:

  • Are your AI and ML I/O request sizes and file sizes small, medium, or large in size?
  • Are your AI and ML workloads sensitive to I/O latency and time to first byte (TTFB)?
  • Do you require high read and write throughput for single clients, aggregated clients, or both?
  • What is the largest number of Cloud GPUs or Cloud TPUs that your single largest AI and ML training workload requires?

In addition to answering the previous questions, you also need to be aware of the compute options and accelerators that you can choose to help optimize your AI and ML workloads.

Compute platform considerations

Google Cloud supports three primary methods for running AI and ML workloads:

For both Compute Engine and GKE, we recommend using the HPC Toolkit to deploy repeatable and turnkey clusters that follow Google Cloud best practices.

Accelerator considerations

When you select storage choices for AI and ML workloads, you also need to select the accelerator processing options that are appropriate for your task. Google Cloud supports two accelerator choices: NVIDIA Cloud GPUs and the custom-developed Google Cloud TPUs. Both types of accelerator are application-specific integrated circuits (ASICs) that are used to process machine learning workloads more efficiently than standard processors.

There are some important storage differences between Cloud GPUs and Cloud TPU accelerators. Instances that use Cloud GPUs support Local SSD with up to 200 GBps remote storage throughput available. Cloud TPU nodes and VMs don't support Local SSD, and rely exclusively on remote storage access.

For more information about accelerator-optimized machine types, see Accelerator-optimized machine family. For more information about Cloud GPUs, see Cloud GPUs platforms. For more information about Cloud TPUs, see Introduction to Cloud TPU. For more information about choosing between Cloud TPUs and Cloud GPUs, see When to use Cloud TPUs.

Storage options

As summarized previously in Table 1, use object storage or file storage with your AI and ML workloads and then supplement this storage option with block storage. Figure 2 shows three typical options that you can consider when selecting the initial storage choice for your AI and ML workload: Cloud Storage, Filestore, and Google Cloud NetApp Volumes.

Figure 2: AI and ML appropriate storage services offered by Google Cloud

If you need object storage, choose Cloud Storage. Cloud Storage provides the following:

  • A storage location for unstructured data and objects.
  • APIs, such as the Cloud Storage JSON API, to access your storage buckets.
  • Persistent storage to save your data.
  • Throughput of terabytes per second, but with higher storage latency.

If you need file storage, you have two choices, Filestore and NetApp Volumes, which offer the following:

  • Filestore
    • Enterprise, high-performance file storage based on NFS.
    • Persistent storage to save your data.
    • Low storage latency, and throughput of 26 GBps.
  • NetApp Volumes
    • File storage compatible with NFS and Server Message Block (SMB).
    • Optional management with the NetApp ONTAP storage-software tool.
    • Persistent storage to save your data.
    • Throughput of 4.5 GBps.

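Use the following storage options as your first choice for AI and ML workloads:

  • Cloud Storage
  • Filestore
  • Google Cloud NetApp Volumes

Use the following storage options to supplement your AI and ML workloads:

  • Block storage (Local SSD and Persistent Disk)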

If you need to transfer data between these storage options, you can use the data transfer tools.

Cloud Storage

Cloud Storage is a fully managed object storage service that focuses on data preparation, AI model training, data serving, backup, and archiving for unstructured data. Some of the benefits of Cloud Storage include the following:

  • Unlimited storage capacity that scales to exabytes on a global basis
  • Ultra-high throughput performance
  • Regional and dual-region storage options for AI and ML workloads

Cloud Storage scales throughput to terabytes per second and beyond, but it has relatively higher latency (tens of milliseconds) than Filestore or a local file system. Individual thread throughput is limited to approximately 100-200 MB per second, which means that high throughput can only be achieved by using hundreds to thousands of individual threads. Additionally, high throughput also requires the use of large files and large I/O requests.
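
As a rough sizing aid, the relationship between per-thread and aggregate throughput can be sketched as follows. The function name and the 100 GB per second target are illustrative assumptions; the 100-200 MB per second per-thread range is the approximate figure given above.

```python
import math


def threads_needed(target_mb_per_s: float, per_thread_mb_per_s: float) -> int:
    """Return the minimum number of concurrent threads for a target throughput."""
    if target_mb_per_s <= 0 or per_thread_mb_per_s <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_s / per_thread_mb_per_s)


# Hypothetical target: 100 GB/s, expressed as 100,000 MB/s (decimal units).
optimistic = threads_needed(100_000, 200)    # each thread sustains ~200 MB/s
conservative = threads_needed(100_000, 100)  # each thread sustains ~100 MB/s
print(f"Between {optimistic} and {conservative} threads")
```

Under these assumptions, the result lands in the hundreds-to-thousands range. Real per-thread rates depend on file size and request size, so treat the output as a starting point for testing rather than a guarantee.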

Cloud Storage supports client libraries in a variety of programming languages, but it also supports Cloud Storage FUSE. Cloud Storage FUSE lets you mount Cloud Storage buckets to your local file system. Cloud Storage FUSE enables your applications to use standard file system APIs to read from a bucket or write to a bucket. You can store and access your training data, models, and checkpoints with the scale, affordability, and performance of Cloud Storage.
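
To illustrate what "standard file system APIs" means in practice, the following minimal sketch reads a file in large sequential chunks using only the Python standard library. It assumes that a bucket is already mounted with Cloud Storage FUSE; the function name, the mount path, and the object name in the usage comment are hypothetical.

```python
from pathlib import Path


def read_object_bytes(mount_root: str, object_name: str,
                      chunk_size: int = 16 * 1024 * 1024) -> int:
    """Stream one file under a mounted bucket and return the bytes read."""
    total = 0
    # Ordinary file API calls: nothing here is specific to Cloud Storage.
    with open(Path(mount_root) / object_name, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # large requests favor throughput
            if not chunk:
                break
            total += len(chunk)
    return total


# Hypothetical usage, assuming a bucket is mounted at /mnt/gcs/training-data:
# read_object_bytes("/mnt/gcs/training-data", "shards/shard-00000.tar")
```

The same function works unchanged against a Filestore mount or a Local SSD path, which is the main appeal of accessing a bucket through a file system interface.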

To learn more about Cloud Storage, use the following resources:

Filestore

Filestore is a fully managed NFS file-based storage service. The Filestore service tiers used for AI and ML workloads include the following:

  • Enterprise tier: Used for mission-critical workloads requiring regional availability.
  • Zonal tier: Used for high-performance applications that require zonal availability with high IOPS and throughput performance requirements.
  • Basic tier: Used for file sharing, software development, web hosting, and basic AI and ML workloads.

Filestore delivers low latency I/O performance. It's a good choice for datasets with either small I/O access requirements or small files. However, Filestore can also handle large I/O or large file use cases as needed. Filestore can scale up to approximately 100 TB in size. For AI training workloads that read data repeatedly, you can improve read throughput by using FS-Cache with Local SSD.

For more information about Filestore, see the Filestore overview. For more information about Filestore service tiers, see Service tiers. For more information about Filestore performance, see Optimize and test instance performance.

Google Cloud NetApp Volumes

NetApp Volumes is a fully managed service with advanced data management features that support NFS, SMB, and multiprotocol environments. NetApp Volumes supports low latency, multi-tebibyte volumes, and gigabytes per second of throughput.

For more information about NetApp Volumes, see What is Google Cloud NetApp Volumes? For more information about NetApp Volumes performance, see Expected performance.

Block storage

After you select your primary storage choice, you can use block storage to supplement performance, transfer data between storage options, and take advantage of low latency operations. You have two storage options with block storage: Local SSD and Persistent Disk.

Local SSD

Local SSD provides local storage directly to a VM or a container. Most Google Cloud machine types that contain Cloud GPUs include some amount of Local SSD. Because Local SSD disks are attached physically to the Cloud GPUs, they provide low latency access with potentially millions of IOPS. In contrast, Cloud TPU-based instances don't include Local SSD.

Although Local SSD delivers high performance, each storage instance is ephemeral. Thus, the data stored on a Local SSD drive is lost when you stop or delete the instance. Because of the ephemeral nature of Local SSD, consider other types of storage when your data requires better durability.

However, when the amount of training data is very small, it's common to copy the training data from Cloud Storage to the Local SSD of a GPU instance, because Local SSD provides lower I/O latency and can reduce training time.

For more information about Local SSD, see About Local SSDs. For more information about the amount of Local SSD capacity available with Cloud GPUs instance types, see GPU platforms.

Persistent Disk

Persistent Disk is a network block storage service with a comprehensive suite of data persistence and management capabilities. In addition to its use as a boot disk, you can use Persistent Disk with AI workloads, such as scratch storage. Persistent Disk is available in the following options:

  • Standard, which provides efficient and reliable block storage.
  • Balanced, which provides cost-effective and reliable block storage.
  • SSD, which provides fast and reliable block storage.
  • Extreme, which provides the highest performance block storage option with customizable IOPS.

For more information about Persistent Disk, see Persistent Disk.

Data transfer tools

When you perform AI and ML tasks, there are times when you need to copy your data from one location to another. For example, if your data starts in Cloud Storage, you might move it elsewhere to train the model, then copy the checkpoint snapshots or trained model back to Cloud Storage. You could also perform most of your tasks in Filestore, then move your data and model into Cloud Storage for archive purposes. This section discusses your options for moving data between storage services in Google Cloud.

Storage Transfer Service

With the Storage Transfer Service, you can transfer your data between Cloud Storage, Filestore, and NetApp Volumes. This fully managed service also lets you copy data between your on-premises file storage and object storage repositories, your Google Cloud storage, and other cloud providers. The Storage Transfer Service lets you copy your data securely from the source location to the target location, as well as perform periodic transfers of changed data. It also provides data integrity validation, automatic retries, and load balancing.

For more information about Storage Transfer Service, see What is Storage Transfer Service?

Command-line interface options

When you move data between Filestore and Cloud Storage, you can use the following tools:

  • gcloud storage (recommended): Create and manage Cloud Storage buckets and objects with optimal throughput and a full suite of gcloud CLI commands.
  • gsutil: Manage and maintain Cloud Storage components. Requires fine-tuning to achieve better throughput.
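
As one possible way to script such a copy, the following sketch builds and runs a `gcloud storage cp --recursive` command from Python. It assumes that the gcloud CLI is installed and authenticated on the machine that runs it; the function names, the Filestore mount path, and the bucket name are hypothetical.

```python
import subprocess
from typing import List


def build_copy_command(source: str, destination: str) -> List[str]:
    """Return the argument list for a recursive gcloud storage copy."""
    return ["gcloud", "storage", "cp", "--recursive", source, destination]


def copy_dataset(source: str, destination: str) -> None:
    """Run the copy and raise CalledProcessError if gcloud reports a failure."""
    subprocess.run(build_copy_command(source, destination), check=True)


# Hypothetical usage: archive a dataset from a Filestore mount to a bucket.
# copy_dataset("/mnt/filestore/dataset", "gs://example-bucket/dataset")
```

Keeping the command construction separate from its execution makes the script easy to check without touching any storage service.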

Map your storage choices to the AI and ML stages

This section expands upon the summary provided in Table 1 to explore the specific recommendations and guidance for each stage of an AI and ML workload. The goal is to help you understand the rationale for these choices and select the best storage options for each AI and ML stage. This analysis results in three primary recommendations that are explored in the section, Storage recommendations for AI and ML.

The following figure provides a decision tree that shows the recommended storage options for the four main stages of an AI and ML workload. The diagram is followed by a detailed explanation of each stage and the choices that you can make at each stage.

Figure 3: Storage choices for each AI and ML stage

Prepare

At this initial stage, you need to select whether you want to use Cloud Storage or Filestore as your persistent source of truth for your data. You can also select potential optimizations for data-intensive training. Know that different teams in your organization can have varying workload and dataset types that might result in those teams making different storage decisions. To accommodate these varied needs, you can mix and match your storage choices between Cloud Storage and Filestore accordingly.

Cloud Storage for the prepare stage

You should select Cloud Storage to prepare your data if any of the following conditions apply:

  • Your workload contains large files of 50 MB or more.
  • Your workload requires lower IOPS.
  • Your workload can tolerate higher storage latency in the tens of milliseconds.
  • You need to gain access to the dataset through Cloud Storage APIs, or Cloud Storage FUSE and a subset of file APIs.

To optimize your workload in Cloud Storage, you can select regional storage and place your bucket in the same region as your compute resources. However, if you need higher reliability, or if you use accelerators located in two different regions, you'll want to select dual-region storage.

Filestore for the prepare stage

You should select Filestore to prepare your data if any of the following conditions apply:

  • Your workload contains smaller file sizes of less than 50 MB.
  • Your workload requires higher IOPS.
  • Your workload needs lower latency of less than 1 millisecond to meet storage requirements for random I/O and metadata access.
  • Your users need a desktop-like experience with full POSIX support to view and manage the data.
  • Your users need to perform other tasks, such as software development.

Other considerations for the prepare stage

If you find it hard to choose an option at this stage, consider the following points to help you make your decision:

  • If you want to use other AI and ML frameworks, such as Dataflow, Spark, or BigQuery on the dataset, then Cloud Storage is a logical choice because of the custom integration it has with these types of frameworks.
  • Filestore has a maximum capacity of approximately 100 TB. If you need to train your model with datasets larger than this, or if you can't break the set into multiple 100 TB instances, then Cloud Storage is a better option.
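
The prepare-stage criteria above can be condensed into a small decision helper. This is a simplified sketch, not an official selection tool: the class and function names are hypothetical, the 50 MB and 100 TB thresholds are the approximate figures from this guide, and the code assumes that a dataset larger than 100 TB can't be split across multiple Filestore instances.

```python
from dataclasses import dataclass

FILESTORE_MAX_CAPACITY_TB = 100  # approximate limit stated in this guide
LARGE_FILE_THRESHOLD_MB = 50     # file-size boundary used in this guide


@dataclass
class PrepareWorkload:
    typical_file_size_mb: float
    dataset_size_tb: float
    needs_high_iops: bool = False
    needs_sub_ms_latency: bool = False
    needs_full_posix: bool = False


def choose_prepare_storage(w: PrepareWorkload) -> str:
    """Return the primary storage suggestion for the prepare stage."""
    # Capacity comes first: a dataset that doesn't fit in one Filestore
    # instance (and is assumed not to be splittable) points to Cloud Storage.
    if w.dataset_size_tb > FILESTORE_MAX_CAPACITY_TB:
        return "Cloud Storage"
    # Any one of the Filestore conditions is enough to suggest Filestore.
    if (w.typical_file_size_mb < LARGE_FILE_THRESHOLD_MB
            or w.needs_high_iops
            or w.needs_sub_ms_latency
            or w.needs_full_posix):
        return "Filestore"
    return "Cloud Storage"


print(choose_prepare_storage(PrepareWorkload(typical_file_size_mb=200, dataset_size_tb=40)))
print(choose_prepare_storage(PrepareWorkload(typical_file_size_mb=2, dataset_size_tb=5)))
```

Integration needs, such as running Dataflow, Spark, or BigQuery on the same dataset, aren't modeled here and can override the result.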

During the data preparation phase, many users reorganize their data into large chunks to improve access efficiency and avoid random read requests. To further reduce the I/O performance requirements on the storage system, many users use pipelining, training optimization to increase the number of I/O threads, or both.
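
One common way to reorganize many small files into large chunks is to pack them into tar shards. The following is a minimal sketch of that idea that uses only the Python standard library; the function name, the shard naming scheme, and the 256 MiB default shard size are assumptions for illustration, not recommendations from this guide.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def pack_into_shards(files: Iterable[Path], out_dir: Path,
                     target_shard_bytes: int = 256 * 1024 * 1024) -> List[Path]:
    """Pack files into uncompressed tar shards of roughly the target size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: List[Path] = []
    current = None
    current_bytes = 0
    for path in files:
        size = path.stat().st_size
        # Open a new shard when none is open, or when adding this file would
        # push a non-empty shard past the target size.
        if current is None or (current_bytes and current_bytes + size > target_shard_bytes):
            if current is not None:
                current.close()
            shard_path = out_dir / f"shard-{len(shards):05d}.tar"
            current = tarfile.open(shard_path, "w")
            shards.append(shard_path)
            current_bytes = 0
        # Only the base name is stored, so input file names must be unique.
        current.add(path, arcname=path.name)
        current_bytes += size
    if current is not None:
        current.close()
    return shards
```

A training job can then read each shard sequentially with large requests instead of issuing one small random read per sample.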

Train

At the train stage, you typically reuse the primary storage option that you selected for the prepare stage. If your primary storage choice can't handle the training workload alone, you might need to supplement the primary storage. You can add supplemental storage as needed, such as Local SSDs, to balance the workload.

In addition to providing recommendations for using either Cloud Storage or Filestore at this stage, this section also provides you with more details about these recommendations. The details include the following:

  • Guidance for file sizes and request sizes
  • Suggestions on when to supplement your primary storage choice
  • An explanation of the implementation details for the two key workloads at this stage: data loading, and checkpointing and restart

Cloud Storage for the train stage

The main reasons to select Cloud Storage when training your model include the following:

  • If you use Cloud Storage when you prepare your data, it's best to keep your data in Cloud Storage when you train your model.
  • Cloud Storage is a good choice for throughput, workloads that don't require high single-VM throughput, or workloads that use many threads to increase throughput as needed.

Cloud Storage with Local SSD or Filestore for the train stage

The main reason to select Cloud Storage with Local SSD or Filestore for training is the need to support small I/O requests or small files. In this case, you can supplement your Cloud Storage training task by moving some of the data to Local SSD or Filestore Zonal.

Filestore for the train stage

The main reasons to select Filestore when training your model include the following:

  • If you use Filestore when you prepare your data, in most cases, you should keep your data in Filestore when you train your model.
  • Filestore is a good choice for low latency, high per-client throughput, and applications that use a low number of threads but still require high performance.
  • If you need to supplement your training tasks in Filestore, consider creating a Local SSD cache as needed.

File sizes and request sizes

Once the dataset is ready for training, there are two main options that can help you evaluate the different storage options.

Datasets containing large files and accessed with large request sizes

In this option, the training job consists primarily of larger files of 50 MB or more. The training job ingests the files with 1 MB to 16 MB per request. We generally recommend Cloud Storage with Cloud Storage FUSE for this option because the files are large enough that Cloud Storage should be able to keep the accelerators supplied. Keep in mind that you might need hundreds to thousands of threads to achieve maximum performance with this option.

However, if you require full POSIX APIs for other applications, or your workload isn't appropriate for the high number of required threads, then Filestore is a good alternative.

Datasets containing small-to-medium sized files, or accessed with small request sizes

With this option, you can classify your training job in one of two ways:

  • Many small-to-medium sized files of less than 50 MB.
  • A dataset with larger files, but the data is read sequentially or randomly with relatively small read request sizes (for example, less than 1 MB). An example of this use case is when the system reads less than 100 KB at a time from a multi-gigabyte or multi-terabyte file.

If you already use Filestore for its POSIX capabilities, then we recommend keeping your data in Filestore for training. Filestore offers low I/O latency access to the data. This lower latency can reduce the overall training time and might lower the cost of training your model.

If you use Cloud Storage to store your data, then we recommend that you copy your data to Local SSD or Filestore prior to training.
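
The two options above can be summarized in a small helper. This is a simplified sketch: the function name, its parameters, and the returned strings are hypothetical, and the 50 MB and 1 MB boundaries are the approximate figures used in this section.

```python
def recommend_training_storage(file_size_mb: float, request_size_mb: float,
                               primary_storage: str,
                               needs_full_posix: bool = False) -> str:
    """Suggest where to read training data from, based on I/O profile.

    primary_storage is "cloud_storage" or "filestore", meaning the choice
    that was made in the prepare stage.
    """
    if primary_storage not in ("cloud_storage", "filestore"):
        raise ValueError("primary_storage must be 'cloud_storage' or 'filestore'")
    # Data that already lives in Filestore, or needs full POSIX, stays there.
    if primary_storage == "filestore" or needs_full_posix:
        return "Filestore"
    large_files = file_size_mb >= 50
    large_requests = request_size_mb >= 1
    if large_files and large_requests:
        return "Cloud Storage with Cloud Storage FUSE"
    # Small files or small requests: stage the data on faster storage first.
    return "Copy from Cloud Storage to Local SSD or Filestore before training"


print(recommend_training_storage(500, 8, "cloud_storage"))
print(recommend_training_storage(4000, 0.1, "cloud_storage"))
```

The helper doesn't model thread counts, so a workload that can't run hundreds to thousands of threads might still be better served by Filestore even when its files and requests are large.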

Data loading

During data loading, Cloud GPUs or Cloud TPUs import batches of data repeatedly to train the model. This phase can be cache friendly, depending on the size of the batches and the order in which you request them. Your goal at this point is to train the model with maximum efficiency but at the lowest cost.

If the size of your training data scales to petabytes, the data might need to be re-read multiple times. Such a scale requires intensive processing by a GPU or TPU accelerator. However, you need to ensure that your Cloud GPUs and Cloud TPUs aren't idle, but process your data actively. Otherwise, you pay for an expensive, idle accelerator while you copy the data from one location to another.

For data loading, consider the following:

  • Parallelism: There are numerous ways to parallelize training, and each can have an impact on the overall storage performance required and the necessity of caching data locally on each instance.
  • Maximum number of Cloud GPUs or Cloud TPUs for a single training job: As the number of accelerators and VMs increases, the impact on the storage system can be significant and might result in increased costs if the Cloud GPUs or Cloud TPUs are idle. However, there are ways to minimize costs as you increase the number of accelerators. Depending on the type of parallelism that you use, you can minimize costs by increasing the aggregate read throughput requirements that are needed to avoid idle accelerators.

To support these improvements in either Cloud Storage or Filestore, you need to add Local SSD to each instance so that you can offload I/O from the overloaded storage system.

However, preloading data into each instance's Local SSD from Cloud Storage has its own challenges. You risk incurring increased costs for the idle accelerators while the data is being transferred. If your data transfer times and accelerator idle costs are high, you might be able to lower costs by using Filestore with Local SSD instead.

  • Number of Cloud GPUs per instance: When you deploy more Cloud GPUs to each instance, you can increase the inter-Cloud GPUs throughput with NVLink. However, the available Local SSD and storage networking throughput doesn't always increase linearly.
  • Storage and application optimizations: Storage options and applications have specific performance requirements to be able to run optimally. Be sure to balance these storage and application system requirements with your data loading optimizations, such as keeping your Cloud GPUs or Cloud TPUs busy and operating efficiently.
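
To make the data loading trade-offs concrete, the following back-of-the-envelope sketch estimates the aggregate read throughput that keeps accelerators busy and the cost of accelerators that sit idle while data is preloaded to Local SSD. The function names and every number in the example are hypothetical inputs, not measurements or Google Cloud prices.

```python
def required_read_gb_per_s(num_accelerators: int,
                           samples_per_s_per_accelerator: float,
                           avg_sample_bytes: float) -> float:
    """Aggregate read throughput (GB/s) needed to avoid starving accelerators."""
    return num_accelerators * samples_per_s_per_accelerator * avg_sample_bytes / 1e9


def preload_idle_cost(dataset_gb: float, copy_gb_per_s: float,
                      num_instances: int, hourly_cost_per_instance: float) -> float:
    """Cost of instances that wait while each one copies the dataset locally."""
    if copy_gb_per_s <= 0:
        raise ValueError("copy throughput must be positive")
    idle_hours = dataset_gb / copy_gb_per_s / 3600
    return idle_hours * num_instances * hourly_cost_per_instance


# Hypothetical job: 64 accelerators, 2,000 samples/s each, 150 KB per sample.
print(required_read_gb_per_s(64, 2000, 150_000))
# Hypothetical preload: 1,800 GB at 1 GB/s per instance, 8 instances, 10 per hour.
print(preload_idle_cost(1800, 1.0, 8, 10.0))
```

If the idle cost is large relative to the job's total cost, that points toward shared storage such as Filestore, where the data is transferred once before the accelerators are provisioned.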

Checkpointing and restart

For checkpointing and restart, training jobs need to periodically save their state so they can recover quickly from instance failures. When the failure happens, jobs must restart, ingest the latest checkpoint, and then resume training. The exact mechanism used to create and ingest checkpoints is typically specific to a framework, such as TensorFlow or PyTorch. Some users have built complex frameworks to increase the efficiency of checkpointing. These complex frameworks allow them to perform a checkpoint more frequently.

However, most users typically use shared storage, such as Cloud Storage or Filestore. When saving checkpoints, you only need to save three to five checkpoints at any one point in time. Checkpoint workloads tend to consist of mostly writes, several deletes, and, ideally, infrequent reads when failures occur. During recovery, the I/O pattern includes intensive and frequent writes, frequent deletes, and frequent reads of the checkpoint.
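
Keeping only the most recent checkpoints can be sketched as a simple pruning step that runs after each save. The example assumes checkpoints are written as files on a shared file system path, such as a Filestore mount, and that file names embed a zero-padded step number so that lexical order matches creation order. The function name and the `ckpt-*.bin` naming pattern are hypothetical.

```python
from pathlib import Path
from typing import List


def prune_checkpoints(checkpoint_dir: Path, keep: int = 3) -> List[Path]:
    """Delete all but the `keep` newest checkpoints and return the deleted paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Zero-padded step numbers make the sorted order chronological.
    checkpoints = sorted(checkpoint_dir.glob("ckpt-*.bin"))
    stale = checkpoints[:-keep]
    for path in stale:
        path.unlink()
    return stale
```

This produces the "mostly writes, several deletes" pattern described above: each new checkpoint is one large write, followed by the deletion of at most a few older files.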

You also need to consider the size of the checkpoint that each GPU or TPU needs to create. The checkpoint size determines the write throughput that is required to complete the training job in a cost-effective and timely manner.

To minimize costs, consider increasing the following items:

  • The frequency of checkpoints
  • The aggregate write throughput that is required for checkpoints
  • Restart efficiency
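
The link between checkpoint size and required write throughput can be estimated with simple arithmetic. In this sketch, the function name and the example values are hypothetical; it assumes that every accelerator writes its own checkpoint shard and that all shards must be written within the same time window.

```python
def required_checkpoint_write_gb_per_s(num_accelerators: int,
                                       checkpoint_gb_per_accelerator: float,
                                       max_write_seconds: float) -> float:
    """Aggregate write throughput (GB/s) needed to finish a checkpoint in time."""
    if max_write_seconds <= 0:
        raise ValueError("max_write_seconds must be positive")
    return num_accelerators * checkpoint_gb_per_accelerator / max_write_seconds


# Hypothetical job: 256 accelerators, 4 GB of state each, 64-second budget.
print(required_checkpoint_write_gb_per_s(256, 4, 64))
```

Halving the time budget doubles the required throughput, which is why frequent checkpoints and storage write performance have to be sized together.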

Serve

When you serve your model, which is also known as AI inference, the primary I/O pattern is read-only to load the model into Cloud GPUs or Cloud TPU memory. Your goal at this stage is to run your model in production. The model is much smaller than the training data, so you can replicate and scale the model across multiple instances. High availability and protection against zonal and regional failures are important at this stage, so you must ensure that your model is available for a variety of failure scenarios.

For many generative AI use cases, input data to the model might be quite small and might not need to be stored persistently. In other cases, you might need to run large volumes of data over the model (for example, scientific datasets). In this case, you need to select an option that can keep the Cloud GPUs or Cloud TPUs supplied during the analysis of the dataset, as well as select a persistent location to store the inference results.

There are two primary choices when you serve your model.

Cloud Storage for the serve stage

The main reasons to select Cloud Storage when serving your model include the following:

  • When you train your model in Cloud Storage, you can save on migration costs by leaving the model in Cloud Storage when you serve it.
  • You can save your generated content in Cloud Storage.
  • Cloud Storage is a good choice when AI inferencing occurs in multiple regions.
  • You can use dual-region and multi-region buckets to provide model availability across regional failures.

Filestore for the serve stage

The main reasons to select Filestore when serving your model include the following:

  • When you train your model in Filestore, you can save on migration costs by leaving the model in Filestore when you serve it.
  • Because its service level agreement (SLA) provides 99.99% availability, the Filestore Enterprise service tier is a good choice for high availability when you want to serve your model between multiple zones in a region.
  • The Filestore Zonal service tiers might be a reasonable lower-cost choice, but only if high availability is not a requirement for your AI and ML workload.
  • If you require cross-region recovery, you can store the model in a remote backup location or a remote Cloud Storage bucket, and then restore the model as needed.
  • Filestore offers a durable and highly available option that gives low latency access to your model when you generate small files or require file APIs.

Archive

The archive stage has an I/O pattern of "write once, read rarely." Your goal is to store the different sets of training data and the different versions of models that you generated. You can use these incremental versions of data and models for backup and disaster recovery purposes. You must also store these items in a durable location for a long period of time. Although you might not require access to the data and models very often, you do want these items to be available when you need them.

Because of its extreme durability, expansive scale, and low cost, the best option for storing object data over a long period of time is Cloud Storage. Depending on how often you access the dataset, model, and backup files, Cloud Storage offers cost optimization through different storage classes with the following approaches:

  • Place your frequently accessed data in Standard storage.
  • Keep data that you access monthly in Nearline storage.
  • Store data that you access every three months in Coldline storage.
  • Preserve data that you access once a year in Archive storage.

Using object lifecycle management, you can create policies to move data to colder storage classes or to delete data based on specific criteria. If you're not sure how often you'll access your data, you can use the Autoclass feature to move data between storage classes automatically, based on your access pattern.
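
The storage class guidance and a lifecycle policy can be sketched together as follows. The access-frequency thresholds mirror the list above. The rule layout follows the JSON structure that Cloud Storage lifecycle configuration files use, but the function names and the ages in the example are assumptions, so verify the result against the current lifecycle documentation before you apply it to a bucket.

```python
import json
from typing import Dict


def choose_storage_class(days_between_accesses: int) -> str:
    """Map an expected access interval to a storage class, per the list above."""
    if days_between_accesses < 30:
        return "STANDARD"
    if days_between_accesses < 90:
        return "NEARLINE"   # accessed about monthly
    if days_between_accesses < 365:
        return "COLDLINE"   # accessed about every three months
    return "ARCHIVE"        # accessed about once a year


def lifecycle_config(transitions: Dict[int, str]) -> dict:
    """Build a lifecycle configuration from {age_in_days: storage_class}.

    The age condition counts days since the object was created, not days
    since the object was last accessed.
    """
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": storage_class},
                "condition": {"age": age_days},
            }
            for age_days, storage_class in sorted(transitions.items())
        ]
    }


# Hypothetical policy: move objects to colder classes as they age.
config = lifecycle_config({30: "NEARLINE", 90: "COLDLINE", 365: "ARCHIVE"})
print(json.dumps(config, indent=2))
```

Because age-based rules don't observe access patterns, this kind of policy suits data with a predictable lifecycle; Autoclass is the better fit when access frequency is unknown.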

If your data is in Filestore, moving the data to Cloud Storage for archive purposes often makes sense. However, you can provide additional protection for your Filestore data by creating Filestore backups in another region. You can also take Filestore snapshots for local file and file system recovery. For more information about Filestore backups, see Backups overview. For more information about Filestore snapshots, see Snapshots overview.

Storage recommendations for AI and ML

This section summarizes the analysis provided in the previous section, Map your storage choices to the AI and ML stages. It provides details about the three primary storage option combinations that we recommend for most AI and ML workloads. The three options are as follows:

  • Select Cloud Storage
  • Select Cloud Storage with Local SSD or Filestore
  • Select Filestore with optional Local SSD

Select Cloud Storage

Cloud Storage provides the lowest cost-per-capacity storage offering when compared to all other storage offerings. It scales to large numbers of clients, provides regional and dual-region accessibility and availability, and can be accessed through Cloud Storage FUSE. You should select regional storage when your compute platform for training is in the same region, and choose dual-region storage if you need higher reliability or use Cloud GPUs or Cloud TPUs located in two different regions.

Cloud Storage is the best choice for long term data retention, and for workloads with lower storage performance requirements. However, other options such as Filestore and Local SSD are valuable alternatives in specific cases where you require full POSIX support or Cloud Storage becomes a performance bottleneck.

Select Cloud Storage with Local SSD or Filestore

For data-intensive training or checkpoint and restart workloads, it can make sense to use a faster storage offering during the I/O intensive training phase. Typical choices include copying the data to a Local SSD or Filestore. This action reduces the overall job runtime by keeping the Cloud GPUs or Cloud TPUs supplied with data and prevents the instances from stalling while checkpoint operations complete. In addition, the more frequently you create checkpoints, the more checkpoints you have available as backups. This increase in the number of backups also increases the overall rate at which the useful data arrives (also known as goodput). This combination of optimizing the processors and increasing goodput lowers the overall costs of training your model.
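
The effect of faster checkpoint storage on goodput can be approximated with a first-order model. This sketch is an assumption for illustration, not a formula from this guide: it treats time stalled on checkpoint writes, work lost since the last checkpoint (on average half an interval per failure), and restart time as the only sources of waste. All names and numbers are hypothetical.

```python
def estimated_goodput(checkpoint_interval_s: float, checkpoint_stall_s: float,
                      restart_s: float, mean_time_between_failures_s: float) -> float:
    """Estimate the fraction of accelerator time spent on useful training."""
    if checkpoint_interval_s <= 0 or mean_time_between_failures_s <= 0:
        raise ValueError("interval and MTBF must be positive")
    # Fraction of time the job stalls while a checkpoint is written.
    stall_fraction = checkpoint_stall_s / checkpoint_interval_s
    # Per failure: on average half an interval of lost work, plus the restart.
    failure_fraction = (checkpoint_interval_s / 2 + restart_s) / mean_time_between_failures_s
    return max(0.0, 1.0 - stall_fraction - failure_fraction)


# Hypothetical job: checkpoint every 10 minutes, one failure per day on average.
slow_storage = estimated_goodput(600, 60, 300, 86_400)  # 60 s stall per checkpoint
fast_storage = estimated_goodput(600, 6, 300, 86_400)   # 6 s stall per checkpoint
print(round(slow_storage, 4), round(fast_storage, 4))
```

In this model, shorter checkpoint stalls raise goodput directly, and they also make more frequent checkpoints affordable, which reduces the work lost on each failure.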

There are trade-offs to consider when you use Local SSD or Filestore. The following lists describe some advantages and disadvantages of each.

Local SSD advantages

  • High throughput and IOPS once the data has been transferred
  • Low to minimal extra cost

Local SSD disadvantages

  • Cloud GPUs or Cloud TPUs remain idle while the data loads.
  • Data transfer must happen on every job for every instance.
  • Is only available for some Cloud GPUs instance types.
  • Provides limited storage capacity.
  • Supports checkpointing, but you must manually transfer the checkpoints to a durable storage option such as Cloud Storage.

Filestore advantages

  • Provides shared NFS storage that enables data to be transferred once and then shared across multiple jobs and users.
  • Cloud GPUs or Cloud TPUs don't sit idle during the transfer, because the data is transferred before you pay for the accelerators.
  • Has a large storage capacity.
  • Supports fast checkpointing for thousands of VMs.
  • Supports Cloud GPUs, Cloud TPUs, and all other Compute Engine instance types.

Filestore disadvantages

  • High upfront cost, although the increased compute efficiency has the potential to reduce the overall training costs.

Select Filestore with optional Local SSD

Filestore is the best choice for AI and ML workloads that need low latency and full POSIX support. Beyond being the recommended choice for small file or small I/O training jobs, Filestore can deliver a responsive experience for AI and ML notebooks, software development, and many other applications. You can also deploy Filestore in a zone for high performance training and persistent storage of checkpoints. Deploying Filestore in a zone also offers fast restart upon failure. Alternatively, you can deploy Filestore regionally to support highly available inference jobs. The optional addition of FS-Cache to support Local SSD caching enables fast repeated reads of training data to optimize workloads.

What's next

For more information on storage options and AI and ML, see the following resources:

Contributors

Authors:

  • Dean Hildebrand | Technical Director, Office of the CTO
  • Sean Derrington | Group Outbound Product Manager, Storage
  • Richard Hendricks | Architecture Center Staff

Other contributor: Kumar Dhanagopal | Cross-Product Solution Developer