This document provides a reference architecture that shows how you can use Parallelstore to optimize performance for artificial intelligence (AI) or machine learning (ML) workloads. Parallelstore is a parallel file system storage service that helps you to reduce costs, improve resource utilization, and accelerate training times for your AI and ML workloads.
The intended audience for this document includes architects and technical practitioners who design, provision, and manage storage for their AI and ML workloads on Google Cloud. The document assumes that you have an understanding of the ML lifecycle, processes, and capabilities.
Parallelstore is a fully managed, high-performance scratch file system in Google Cloud that's built on the Distributed Asynchronous Object Storage (DAOS) architecture. Parallelstore is ideal for AI and ML workloads that use up to 100 TiB of storage capacity and that need low-latency (sub-millisecond) access with high throughput and high input/output operations per second (IOPS).
Parallelstore offers several advantages for AI and ML workloads, such as the following:
- Lower total cost of ownership (TCO) for training: Parallelstore accelerates training time by efficiently delivering data to compute nodes. This functionality helps to reduce the total cost of ownership for AI and ML model training.
- Lower TCO for serving: Parallelstore's high-performance capabilities enable faster model loading and optimized inference serving. These capabilities help to lower compute costs and improve resource utilization.
- Efficient resource utilization: Parallelstore lets you combine training, checkpointing, and serving within a single instance. Combining these workloads helps to maximize the efficient use of read and write throughput in a single, high-performance storage system.
Architecture
The following diagram shows a sample architecture for using Parallelstore to optimize the performance of model training and serving workloads:
The workloads that are shown in the preceding architecture are described in detail in later sections. The architecture includes the following components:
Component | Purpose |
---|---|
Google Kubernetes Engine (GKE) cluster | GKE manages the compute hosts on which your AI and ML model training and serving processes execute. GKE manages the underlying infrastructure of clusters, including the control plane, nodes, and all system components. |
Kubernetes Scheduler | The GKE control plane schedules workloads and manages their lifecycle, scaling, and upgrades. The Kubernetes node agent (kubelet), which isn't shown in the diagram, communicates with the control plane. The kubelet is responsible for starting and running containers scheduled on the GKE nodes. You can deploy GPUs for batch and AI workloads with Dynamic Workload Scheduler, which lets you request GPUs without a large commitment. For more information about the scheduler, see AI/ML orchestration on GKE. |
Virtual Private Cloud (VPC) network | All of the Google Cloud resources that are in the architecture use a single VPC network. Depending on your requirements, you can choose to build an architecture that uses multiple networks. For more information about how to configure a VPC network for Parallelstore, see Configure a VPC network. |
Cloud Load Balancing | In this architecture, Cloud Load Balancing efficiently distributes incoming inference requests from application users to the serving containers in the GKE cluster. The use of Cloud Load Balancing helps to ensure high availability, scalability, and optimal performance for the AI and ML application. For more information, see Understanding GKE load balancing. |
Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) | GPUs and TPUs are specialized machine accelerators that improve the performance of your AI and ML workloads. For more information about how to choose an appropriate processor type, see Accelerator options later in this document. |
Parallelstore | Parallelstore accelerates AI and ML training and serving by providing a high-performance, parallel file system that's optimized for low latency and high throughput. Compared to using Cloud Storage alone, using Parallelstore significantly reduces training time and improves the responsiveness of your models during serving. These improvements are especially noticeable in demanding workloads that require fast and consistent access to shared data. |
Cloud Storage | Cloud Storage provides persistent and cost-effective storage for your AI and ML workloads. Cloud Storage serves as the central repository for your raw training datasets, model checkpoints, and final trained models. Using Cloud Storage helps to ensure data durability, long-term availability, and cost-efficiency for data that isn't actively being used in computations. |
Training workload
In the preceding architecture, the following are the steps in the data flow during model training:
- Upload training data to Cloud Storage: You upload training data to a Cloud Storage bucket, which serves as a secure and scalable central repository and source of truth.
- Copy data to Parallelstore: The training data corpus is transferred from Cloud Storage to a Parallelstore instance through a bulk API import. Transferring the training data lets you take advantage of Parallelstore's high-performance file system capabilities to optimize data loading and processing speeds during model training.
- Run training jobs in GKE: The model training process runs on GKE nodes. By using Parallelstore as the data source instead of loading data from Cloud Storage directly, the GKE nodes can access and load training data with significantly increased speed and efficiency. Using Parallelstore helps to reduce data loading times and accelerate the overall training process, especially for large datasets and complex models. Depending on your workload requirements, you can use GPUs or TPUs. For information about how to choose an appropriate processor type, see Accelerator options later in this document.
- Save training checkpoints to Parallelstore: During the training process, checkpoints are saved to Parallelstore based on metrics or intervals that you define. The checkpoints capture the state of the model at frequent intervals.
- Save checkpoints and model to Cloud Storage: We recommend that you use a bulk API export from the Parallelstore instance to save some checkpoints and the trained model to Cloud Storage. This practice ensures fault tolerance and enables future use cases like resuming training from a specific point, deploying the model for production, and conducting further experiments. As a best practice, store checkpoints in a different bucket from your training data.
- Restore checkpoints or model: When your AI and ML workflow requires that you restore checkpoints or model data, locate the asset that you want to restore in Cloud Storage. Select the asset to restore based on a timestamp, a performance metric, or a specific version. Use API import to transfer the asset from Cloud Storage to Parallelstore, and then load the asset into your training container, as shown in the sketch that follows this list. You can then use the restored checkpoint or model to resume training, fine-tune parameters, or evaluate performance on a validation set.
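The following minimal sketch illustrates the restore step in Python. It assumes that the Parallelstore instance is mounted in the training container at a hypothetical path (`/mnt/parallelstore`) and that checkpoints were previously exported to a hypothetical `my-checkpoints` bucket; the bucket name, object prefix, and mount path are placeholders. For large assets, the bulk API import that's described earlier is the recommended transfer mechanism; this per-object copy only illustrates the selection and load flow.

```python
from pathlib import Path

from google.cloud import storage

# Placeholder names: replace with your bucket, object prefix, and mount point.
BUCKET_NAME = "my-checkpoints"                                # hypothetical bucket
CHECKPOINT_PREFIX = "run-42/ckpt-"                            # hypothetical object prefix
PARALLELSTORE_MOUNT = Path("/mnt/parallelstore/checkpoints")  # assumed mount path


def restore_latest_checkpoint() -> Path:
    """Copy the most recent checkpoint from Cloud Storage to Parallelstore."""
    client = storage.Client()
    blobs = list(client.list_blobs(BUCKET_NAME, prefix=CHECKPOINT_PREFIX))
    if not blobs:
        raise FileNotFoundError("No checkpoints found in Cloud Storage.")

    # Select the asset to restore; here, the most recently updated object.
    latest = max(blobs, key=lambda blob: blob.updated)

    PARALLELSTORE_MOUNT.mkdir(parents=True, exist_ok=True)
    local_path = PARALLELSTORE_MOUNT / Path(latest.name).name
    latest.download_to_filename(str(local_path))
    return local_path


if __name__ == "__main__":
    checkpoint_path = restore_latest_checkpoint()
    # Load the checkpoint into your training framework, for example:
    # state = torch.load(checkpoint_path)
    print(f"Restored checkpoint to {checkpoint_path}")
```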
Serving workload
In the preceding architecture, the following are the steps in the data flow during model serving:
- Load model for serving: After training is complete, your pods load the trained model to the serving nodes. If the Parallelstore instance that you used during training has sufficient IOPS capacity, you can accelerate model loading and reduce costs by using the training instance to serve the model. Reusing the training instance enables efficient resource sharing between training and serving. However, to maintain optimal performance and compatibility, use an accelerator type (GPU or TPU) for training that's consistent with the accelerator type that's available on the serving GKE nodes.
- Inference request: Application users send inference requests through the AI and ML application. These requests are directed to the Cloud Load Balancing service. Cloud Load Balancing distributes the incoming requests across the serving containers in the GKE cluster. This distribution ensures that no single container is overwhelmed and that requests are processed efficiently.
- Serving inference requests: During production, the system handles inference requests efficiently by using the model serving cache. The compute nodes first check the cache for a matching prediction. If a matching prediction is found, it's returned directly, which helps to optimize response times and resource usage. Otherwise, the model processes the request, generates a prediction, and stores it in the cache for future requests, as shown in the sketch that follows this list.
- Response delivery: The serving containers send the responses back through Cloud Load Balancing. Cloud Load Balancing routes the responses back to the appropriate application users, which completes the inference request cycle.
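The following minimal sketch shows how a serving container might implement the cache check in step 3. It assumes that the trained model is available on a hypothetical Parallelstore mount path and uses a simple in-process dictionary as the prediction cache; the mount path, model format, and cache implementation are illustrative assumptions rather than part of the reference architecture.

```python
import hashlib
import json
from pathlib import Path

# Assumed mount point for the Parallelstore instance on the serving node.
MODEL_PATH = Path("/mnt/parallelstore/models/model.pt")  # placeholder path

# Simple in-process cache: request fingerprint -> prediction.
_prediction_cache: dict[str, dict] = {}


def _fingerprint(request: dict) -> str:
    """Build a stable cache key from the request payload."""
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()


def load_model() -> bytes:
    """Load the trained model from the Parallelstore mount at startup.

    Loading from Parallelstore rather than downloading from Cloud Storage
    helps to keep model load times short. The read below is a placeholder;
    substitute the loader for your framework (for example, torch.load).
    """
    if not MODEL_PATH.exists():
        raise FileNotFoundError(f"Model not found at {MODEL_PATH}")
    return MODEL_PATH.read_bytes()


def serve(request: dict, model: bytes) -> dict:
    """Return a cached prediction when available; otherwise run inference."""
    key = _fingerprint(request)
    if key in _prediction_cache:
        return _prediction_cache[key]  # cache hit: return directly

    # Placeholder inference step; replace with your framework's prediction call.
    prediction = {"result": f"inference over {len(model)} model bytes"}
    _prediction_cache[key] = prediction  # store for future requests
    return prediction
```

In practice, you would replace the placeholder inference step with your framework's prediction call and, if needed, use a shared cache service instead of an in-process dictionary.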
Products used
This reference architecture uses the following Google Cloud products:
- Virtual Private Cloud (VPC): A virtual system that provides global, scalable networking functionality for your Google Cloud workloads. VPC includes VPC Network Peering, Private Service Connect, private services access, and Shared VPC.
- Google Kubernetes Engine (GKE): A Kubernetes service that you can use to deploy and operate containerized applications at scale using Google's infrastructure.
- Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
- Parallelstore: A fully managed parallel file system for AI, high performance computing (HPC), and data-intensive applications.
Use cases
Parallelstore is ideal for AI and ML workloads that use up to 100 TiB of storage capacity and that need low-latency (sub-millisecond) access with high throughput and high IOPS. The following sections provide examples of use cases for which you can use Parallelstore.
Text-based processing and text generation
Large language models (LLMs) are specialized AI models that are designed specifically for understanding and processing text-based data. LLMs are trained on massive text datasets, which enables them to perform a variety of tasks, including machine translation, question answering, and text summarization. Training LLMs demands low-latency access to datasets for efficient request processing and text generation. Parallelstore excels in data-intensive applications by providing the high throughput and low latency that's needed for both training and inference, which leads to more responsive LLM-powered applications.
High-resolution image or video processing
Traditional AI and ML applications or multi-modal generative models that process high-resolution images or videos, such as medical imaging analysis or autonomous driving systems, require large storage capacity and rapid data access. Parallelstore's high-performance scratch file system allows for fast data loading to accelerate application performance. For example, Parallelstore can temporarily hold and process large volumes of patient data, such as MRI and CT scans, that are pulled from Cloud Storage. This functionality enables AI and ML models to quickly analyze the data for diagnosis and treatment.
Design alternatives
The following sections present alternative design approaches that you can consider for your AI and ML application in Google Cloud.
Platform alternative
Instead of hosting your model training and serving workflow on GKE, you can consider Compute Engine with Slurm. Slurm is a highly configurable and open source workload and resource manager. Using Compute Engine with Slurm is particularly well-suited for large-scale model training and simulations. We recommend using Compute Engine with Slurm if you need to integrate proprietary AI and ML intellectual property (IP) into a scalable environment with the flexibility and control to optimize performance for specialized workloads.
On Compute Engine, you provision and manage your virtual machines (VMs), which gives you granular control over instance types, storage, and networking. You can tailor your infrastructure to your exact needs, including the selection of specific VM machine types. You can also use the accelerator-optimized machine family for enhanced performance with your AI and ML workloads. For more information about machine type families that are available on Compute Engine, see Machine families resource and comparison guide.
Slurm offers a powerful option for managing AI and ML workloads, and it lets you control the configuration and management of the compute resources. To use this approach, you need expertise in Slurm administration and Linux system management.
Accelerator options
Machine accelerators are specialized processors that are designed to speed up the computations required for AI and ML workloads. You can choose either Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs).
- GPU accelerators provide excellent performance for a wide range of tasks, including graphic rendering, deep learning training, and scientific computing. Google Cloud has a wide selection of GPUs to match a range of performance and price points. For information about GPU models and pricing, see GPU pricing.
- TPUs are custom-designed AI accelerators, which are optimized for training and inference of large AI models. They are ideal for a variety of use cases, such as chatbots, code generation, media content generation, synthetic speech, vision services, recommendation engines, and personalization models. For more information about TPU models and pricing, see TPU pricing.
Serving storage alternatives
Cloud Storage FUSE with a multi-region or dual-region bucket provides the highest level of availability because your trained AI and ML models are stored in Cloud Storage across multiple regions. Although Cloud Storage FUSE achieves a lower throughput per VM than Parallelstore, Cloud Storage FUSE lets you take advantage of the scalability and cost-effectiveness of Cloud Storage. To accelerate model loading and improve performance, especially for demanding workloads, you can use existing or new Parallelstore instances in each region. For information about how to improve performance with Cloud Storage FUSE, see Optimize Cloud Storage FUSE CSI driver for GKE performance.
Google Cloud Hyperdisk ML is a high-performance block storage solution that's designed to accelerate large-scale AI and ML workloads that require read-only access to large datasets. Hyperdisk ML can be provisioned with higher aggregate throughput, but it achieves a lower throughput per VM compared to Parallelstore.
Additionally, Hyperdisk ML volumes can only be accessed by GPU or TPU VMs in the same zone. Therefore, for regional GKE clusters that serve from multiple zones, you must provision separate Hyperdisk ML volumes in each zone. This placement differs from Parallelstore, where you need only one instance per region. It's also important to note that Hyperdisk ML is read-only. For more information about using Hyperdisk ML in AI and ML workloads, see Accelerate AI/ML data loading with Hyperdisk ML.
Design considerations
To design a Parallelstore deployment that optimizes the performance and cost-efficiency of your AI and ML workloads on Google Cloud, use the guidelines in the following sections. The guidelines describe recommendations to consider when you use Parallelstore as part of a hybrid solution that combines multiple storage options for specific tasks within your workflow.
Training
AI and ML model training requires that you iteratively feed data to your model, adjust its parameters, and evaluate its performance with each iteration. This process can be computationally intensive and it generates a high volume of I/O requests due to the constant need to read training data and write updated model parameters.
To maximize the performance benefits during training, we recommend the following:
- Caching: Use Parallelstore as a high-performance cache on top of Cloud Storage.
- Prefetching: Import data to Parallelstore from Cloud Storage to minimize latency during training, as shown in the sketch that follows this list. You can also use GKE Volume Populator to pre-populate PersistentVolumeClaims with data from Cloud Storage.
- Cost optimization: Export your data to a lower-cost Cloud Storage class after training in order to minimize long-term storage expenses. Because your persistent data is stored in Cloud Storage, you can destroy and recreate Parallelstore instances as needed for your training jobs.
- GKE integration: Integrate with the GKE container storage interface (CSI) driver for simplified management. For information about how to connect a GKE cluster to a Parallelstore instance, see Google Kubernetes Engine Parallelstore CSI driver.
- A3 VM performance: For optimal data delivery on A3 VM variants, Parallelstore can deliver more than 20 GB/s per VM (approximately 2.5 GB/s per GPU).
- Concurrent access: Use the Parallelstore instance to accommodate concurrent, full-duplex reads and writes.
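As a sketch of the prefetching recommendation in the preceding list, the following code copies a training dataset from Cloud Storage into an assumed Parallelstore mount path before the training job starts. The bucket name, object prefix, and mount path are placeholders, and the parallel per-object copy is only an illustration; for large corpora, prefer the bulk API import or GKE Volume Populator.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from google.cloud import storage

BUCKET_NAME = "my-training-data"                        # hypothetical bucket
DATASET_PREFIX = "datasets/train/"                      # hypothetical object prefix
PARALLELSTORE_MOUNT = Path("/mnt/parallelstore/train")  # assumed mount path


def _copy_blob(blob) -> None:
    """Download one object into the Parallelstore file system."""
    destination = PARALLELSTORE_MOUNT / blob.name
    destination.parent.mkdir(parents=True, exist_ok=True)
    blob.download_to_filename(str(destination))


def prefetch_dataset(max_workers: int = 32) -> int:
    """Copy all objects under DATASET_PREFIX into Parallelstore in parallel."""
    client = storage.Client()
    blobs = [
        blob
        for blob in client.list_blobs(BUCKET_NAME, prefix=DATASET_PREFIX)
        if not blob.name.endswith("/")  # skip folder placeholder objects
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(_copy_blob, blobs))
    return len(blobs)


if __name__ == "__main__":
    count = prefetch_dataset()
    print(f"Prefetched {count} objects to {PARALLELSTORE_MOUNT}")
```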
When you deploy Parallelstore for training, consider the following:
- Scratch file system: Configure checkpointing intervals throughout the training process. Parallelstore is a scratch file system, which means that data is stored temporarily. At the 100 TiB range, the estimated mean time to data loss is two months. At the 23 TiB range, the estimated mean time to data loss is twelve months or more.
- File and directory striping: Optimize file and directory striping for your predominant file size to maximize performance.
- Cost optimization: Optimize costs by appropriately staging data in Cloud Storage instead of in Parallelstore.
- Zone selection: Optimize cost and performance by locating GPU or TPU compute clients and storage nodes in the same zone.
For more information about how to configure your Parallelstore environment to optimize performance, see Performance considerations.
Checkpointing
Checkpointing is a critical aspect of AI and ML model training. Checkpointing lets you save the state of your model at various points during the process, so that you can resume training from a saved checkpoint after interruptions or system failures, or explore different hyperparameter configurations. When you use Parallelstore for training, it's crucial to also use it for checkpointing in order to take advantage of its high write throughput and to minimize training time. This approach ensures efficient utilization of resources and helps lower the TCO for your GPU resources by keeping both training and checkpointing as fast as possible.
To optimize your checkpointing workflow with Parallelstore, consider these best practices:
- Fast checkpointing: Take advantage of fast checkpoint writes with Parallelstore. You can achieve a throughput of 0.5 GB/s per TiB of capacity and more than 12 GB/s per A3 VM.
- Selective checkpoint storage: Export selected checkpoints from Parallelstore to Cloud Storage for long-term storage and disaster recovery, as shown in the sketch that follows this list.
- Concurrent operations: Take advantage of full-duplex reads and writes by using Parallelstore simultaneously for training and for checkpoint writes.
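The following sketch combines the preceding practices: it writes checkpoints to an assumed Parallelstore mount path during training and exports only selected checkpoints to Cloud Storage for long-term storage. The PyTorch calls, mount path, bucket name, and export interval are illustrative assumptions; for large exports, the bulk API export is the recommended mechanism.

```python
from pathlib import Path

import torch
from google.cloud import storage

CHECKPOINT_DIR = Path("/mnt/parallelstore/checkpoints")  # assumed mount path
EXPORT_BUCKET = "my-checkpoints"                         # hypothetical bucket
EXPORT_EVERY_N = 5  # export every fifth checkpoint for long-term storage


def save_checkpoint(model: torch.nn.Module, optimizer, step: int) -> Path:
    """Write a checkpoint to Parallelstore to benefit from fast write throughput."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"ckpt-{step:07d}.pt"
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )
    return path


def maybe_export_checkpoint(path: Path, step: int) -> None:
    """Selectively copy checkpoints to Cloud Storage for durability."""
    if step % EXPORT_EVERY_N != 0:
        return
    bucket = storage.Client().bucket(EXPORT_BUCKET)
    bucket.blob(f"checkpoints/{path.name}").upload_from_filename(str(path))
```

In your training loop, call `save_checkpoint` at the interval that your recovery objectives require, and let `maybe_export_checkpoint` decide which checkpoints leave the scratch file system.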
Serving
Serving involves deploying your trained AI and ML models to handle inference requests. To achieve optimal performance, it's crucial to minimize the time that it takes to load these models into memory. Although Parallelstore is primarily designed for training workloads, you can use Parallelstore's high throughput per VM (more than 20 GB/s) and aggregate cluster throughput in order to minimize model load times across thousands of VMs. To track key metrics that enable you to identify bottlenecks and ensure optimal efficiency, use Cloud Monitoring.
When you deploy Parallelstore for serving, consider the following:
- High throughput: Maximize Parallelstore performance by using Cloud Monitoring to help ensure that you deploy sufficient capacity to achieve up to 125 GB/s throughput at 100 TiB.
- Potential for service interruptions: Because Parallelstore is a scratch file system, it can experience occasional service interruptions. The mean time to data loss is approximately 2 months for a 100 TiB cluster.
- Restore data: If a service interruption occurs, you need to restore Parallelstore data from your latest Cloud Storage backup. Data is transferred at a speed of approximately 16 GB/s.
- Shared instances: Using one Parallelstore instance for training and serving maximizes resource utilization and can be cost-efficient. However, resource contention can occur if both workloads have high throughput demands. If spare IOPS are available after training, using the same instance can accelerate model loading for serving. Use Cloud Monitoring to help ensure that you allocate sufficient resources to meet your throughput demands, as shown in the sketch that follows this list.
- Separate instances: Using separate instances provides performance isolation, enhances security by isolating training data, and improves data protection. Although access control lists can manage security within a single instance, separate instances offer a more robust security boundary.
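To illustrate the monitoring recommendation in the preceding list, the following sketch reads recent time-series data with the Cloud Monitoring Python client. The project ID is a placeholder, and the metric type is deliberately left as a placeholder for the Parallelstore throughput metric that you identify in Metrics Explorer; this is a minimal sketch, not a complete monitoring setup.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder project ID
# Placeholder: set this to the Parallelstore throughput metric type that you
# identify in Metrics Explorer for your instance.
METRIC_TYPE = "REPLACE_WITH_PARALLELSTORE_THROUGHPUT_METRIC_TYPE"


def read_recent_throughput(hours: int = 1) -> None:
    """Print recent data points for the selected metric."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {
            "end_time": {"seconds": now},
            "start_time": {"seconds": now - hours * 3600},
        }
    )
    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            "filter": f'metric.type = "{METRIC_TYPE}"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        for point in series.points:
            print(series.resource.labels, point.interval.end_time, point.value)


if __name__ == "__main__":
    read_recent_throughput()
```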
Placement options
To minimize latency and maximize performance, create your Parallelstore instance in a region that's geographically close to your GPU or TPU compute clients.
- For training and checkpointing: For optimal results, ensure that the clients and Parallelstore instances are in the same zone. This colocation minimizes data transfer times and maximizes the utilization of Parallelstore's write throughput.
- For serving: Although colocating with compute clients in the same zone is ideal, having one Parallelstore instance per region is sufficient. This approach avoids extra costs that are associated with deploying multiple instances and helps to maximize compute performance. However, if you require additional capacity or throughput, you might consider deploying more than one instance per region.
Deploying Parallelstore in two regions can significantly improve performance by keeping data geographically closer to the GPUs or TPUs that are used for serving in each region. This placement reduces latency and allows for faster data access during inference. However, if your entire deployment is in a single region and a regional outage occurs, both the training and serving applications become unavailable to users.
To ensure high availability and reliability, instantiate a replica of this architecture in a different region. When you create a geographically redundant architecture, your AI and ML application can continue operating even if one region experiences an outage. To back up your cluster data and Cloud Storage data, and to restore them in a different region as needed, you can use Backup for GKE.
For information about the supported locations for Parallelstore instances, see Supported locations.
Deployment
To create and deploy this reference architecture, we recommend that you use Cluster Toolkit. Cluster Toolkit is a modular, Terraform-based toolkit that's designed for deployment of repeatable AI and ML environments on Google Cloud. To define your environment, use the GKE and Parallelstore training blueprint. To provision and manage Parallelstore instances for your clusters, reference the Parallelstore module.
For information about how to manually deploy Parallelstore, see Create a Parallelstore instance. To further improve scalability and enhance performance with dynamic provisioning, you can create and use a volume backed by a Parallelstore instance in GKE.
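As a minimal sketch of dynamic provisioning, the following code creates a PersistentVolumeClaim with the Kubernetes Python client. The storage class name, namespace, and requested capacity are placeholder assumptions; use the values that the Parallelstore CSI driver documentation specifies for your cluster.

```python
from kubernetes import client, config


def create_parallelstore_pvc() -> None:
    """Request a dynamically provisioned volume backed by Parallelstore."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="training-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],
            # Placeholder: set to the storage class that your Parallelstore
            # CSI driver installation provides.
            storage_class_name="parallelstore-storage-class",
            resources=client.V1ResourceRequirements(
                requests={"storage": "12Ti"}  # assumed capacity; adjust as needed
            ),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc
    )


if __name__ == "__main__":
    create_parallelstore_pvc()
```

After the claim is bound, mount it in your training or serving pods like any other PersistentVolumeClaim.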
What's next
- Learn more about how to use parallel file systems for HPC workloads.
- Learn more about best practices for implementing machine learning on Google Cloud.
- Learn more about how to design storage for AI and ML workloads in Google Cloud.
- Learn more about how to train a TensorFlow model with Keras on GKE.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.
Contributors
Author: Samantha He | Technical Writer
Other contributors:
- Dean Hildebrand | Technical Director, Office of the CTO
- Kumar Dhanagopal | Cross-Product Solution Developer
- Sean Derrington | Group Outbound Product Manager, Storage