Choose a deployment strategy

This document provides recommendations about which accelerator, consumption type, storage service, and deployment tool are best suited for different artificial intelligence (AI), machine learning (ML), and high performance computing (HPC) workloads. Use this document to help you identify the best deployment strategy for your workload.

Workloads overview

AI Hypercomputer architecture supports the following use cases:

AI and ML

The following three broad types of AI and ML workloads are supported:

  • Pre-training foundation models: This involves building a language model by using a large dataset. The result of pre-training is a new model that is good at performing general tasks.

    Models are categorized based on their size as follows:

    • Frontier model: These are ML models that span hundreds of billions to trillions of parameters or more. These include large language models (LLMs) such as Gemini.
    • Large model: These are models that span tens to hundreds of billions of parameters or more.
  • Fine-tuning: This involves taking a trained model and adapting it to perform specific tasks by using specialized datasets or other techniques. Fine-tuning is generally performed on large models.

  • Inference or serving: This involves taking a trained or fine-tuned model and making it available for consumption by users or applications.

    Inference workloads are categorized based on the size of the models as follows:

    • Multi-host foundation model inference: performing inference with trained ML models that span hundreds of billions to trillions of parameters or more. For these inference workloads, the computational load is shared across multiple host machines.
    • Single-host foundation model inference: performing inference with trained ML models that span tens to hundreds of billions of parameters. For these inference workloads, the computational load is confined to a single host machine.
    • Large model inference: performing inference with trained or fine-tuned ML models that span tens to hundreds of billions of parameters.

HPC

High performance computing (HPC) is the practice of aggregating computing resources to gain performance greater than that of a single workstation, server, or computer. HPC is used to solve problems in academic research, science, design, simulation, and business intelligence.

Recommendations for pre-training models

Pre-training foundation models requires bringing up large clusters of accelerators, continuously reading large volumes of data, and learning from the data by adjusting weights through a series of forward and backward passes. These training jobs run for weeks, if not months, at a time.
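
To make the forward and backward passes concrete, the following minimal sketch shows a single training step in Python with JAX. The linear model, synthetic batch, and learning rate are illustrative placeholders, not part of AI Hypercomputer: a forward pass computes the loss, a backward pass computes the gradients, and the weights are then adjusted.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    inputs, targets = batch
    predictions = inputs @ params["w"] + params["b"]  # forward pass
    return jnp.mean((predictions - targets) ** 2)     # scalar loss

@jax.jit
def train_step(params, batch, learning_rate=1e-3):
    # Backward pass: compute the loss and its gradients with respect to the weights.
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Weight adjustment: a plain gradient-descent update.
    params = jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, grads)
    return params, loss

# Placeholder parameters and a synthetic batch, purely for illustration.
params = {"w": jnp.zeros((128, 1)), "b": jnp.zeros((1,))}
batch = (jnp.ones((32, 128)), jnp.zeros((32, 1)))
params, loss = train_step(params, batch)
```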

The following sections outline the accelerators, recommended consumption type, and storage service to use when pre-training models.

Recommended accelerators

To pre-train foundation models on Google Cloud, we recommend using the A3 accelerator-optimized machine series and deploying these machines using an orchestrator. To deploy these large clusters of accelerators, we also recommend using Cluster Toolkit. To get you started with these clusters, a link to a cluster deployment guide for each recommended machine type is provided.

| Workloads | Machine type | Orchestrator | Cluster deployment guide |
|---|---|---|---|
| Frontier model training, large model training | A3 Ultra | GKE | Deploy an A3 Ultra cluster with GKE |
| Frontier model training, large model training | A3 Ultra | Slurm | Deploy an A3 Ultra Slurm cluster |
| Frontier model training, large model training | A3 Mega | GKE | Deploy an A3 Mega cluster with GKE |
| Frontier model training, large model training | A3 Mega | Slurm | Deploy an A3 Mega Slurm cluster |
| Large model training | A3 High | GKE | Deploy an A3 High cluster with GKE |
| Large model training | A3 High | Slurm | Deploy an A3 High Slurm cluster |
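
The linked deployment guides describe the full workflow for each machine type. As a rough sketch only, a Cluster Toolkit deployment typically expands a cluster blueprint into a deployment folder and then provisions it; the blueprint file and deployment folder names below are hypothetical, and the exact blueprint contents and commands are given in each guide.

```python
import subprocess

# Hypothetical blueprint file and deployment folder names -- replace with the
# blueprint from the deployment guide for your machine type and orchestrator.
BLUEPRINT = "a3-mega-gke-blueprint.yaml"
DEPLOYMENT = "a3-mega-gke"

# Expand the blueprint into a deployment folder (assumes the Cluster Toolkit
# gcluster binary is installed and on your PATH).
subprocess.run(["gcluster", "create", BLUEPRINT], check=True)

# Provision the cluster described by the deployment folder.
subprocess.run(["gcluster", "deploy", DEPLOYMENT], check=True)
```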

Recommended consumption type

For a high level of assurance in obtaining large clusters of accelerators at minimum cost, we recommend using a reservation and requesting that reservation for a long duration. If you use A3 Ultra, you must use block reservations. For more information about consumption options, see Consumption options.
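
A reservation can be created in the Google Cloud console or with the gcloud CLI. As an illustration only, the following Python sketch uses the google-cloud-compute client library to request a specifically targeted reservation of A3 Mega VMs; the project, zone, VM count, and machine type are placeholder assumptions, and block reservations for A3 Ultra are arranged separately.

```python
from google.cloud import compute_v1

# Placeholder values -- substitute your own project, zone, and sizing.
PROJECT_ID = "my-project"
ZONE = "us-central1-a"

reservation = compute_v1.Reservation(
    name="a3-mega-pretraining",
    # Require VMs to explicitly target this reservation when they are created.
    specific_reservation_required=True,
    specific_reservation=compute_v1.AllocationSpecificSKUReservation(
        count=16,  # number of A3 Mega VMs to reserve (assumption)
        instance_properties=compute_v1.AllocationSpecificSKUAllocationReservedInstanceProperties(
            machine_type="a3-megagpu-8g",
        ),
    ),
)

client = compute_v1.ReservationsClient()
operation = client.insert(
    project=PROJECT_ID, zone=ZONE, reservation_resource=reservation
)
operation.result()  # block until the reservation request completes
```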

Recommended storage services

For pre-training models, training data needs to be read continuously and quickly. We also recommend frequent and fast checkpointing of the model being trained. For most of these needs, we recommend Parallelstore and Cloud Storage. For more information about storage options, see Storage options.
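
For example, a training job that checkpoints periodically can persist each checkpoint to a Cloud Storage bucket. The following minimal Python sketch uses the google-cloud-storage client library; the bucket name and local checkpoint path are hypothetical.

```python
from google.cloud import storage

def upload_checkpoint(bucket_name: str, local_path: str, step: int) -> None:
    """Uploads a local checkpoint file to Cloud Storage."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Keep checkpoints addressable by training step.
    blob = bucket.blob(f"checkpoints/step-{step:08d}/state")
    blob.upload_from_filename(local_path)

# Hypothetical bucket and checkpoint path -- replace with your own.
upload_checkpoint("my-training-bucket", "/tmp/ckpt/state", step=1000)
```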

Recommendations for fine-tuning models

Fine-tuning large foundation models involves smaller clusters of accelerators, reading moderate volumes of data, and adjusting the model to perform specific tasks. These fine-tuning jobs run for days, if not weeks.

The following sections outline the accelerators, recommended consumption type, and storage service to use when fine-tuning models.

Recommended accelerators

To fine-tune models on Google Cloud, we recommend using the A3 accelerator-optimized machine series and deploying these machines using an orchestrator. To deploy these clusters of accelerators, we also recommend using Cluster Toolkit. To get you started with these clusters, a link to a cluster deployment guide for each recommended machine type is provided.

| Workloads | Machine type | Orchestrator | Cluster deployment guide |
|---|---|---|---|
| Fine-tuning large models | A3 Mega | GKE | Deploy an A3 Mega cluster with GKE |
| Fine-tuning large models | A3 Mega | Slurm | Deploy an A3 Mega Slurm cluster |
| Fine-tuning large models | A3 High | GKE | Deploy an A3 High cluster with GKE |
| Fine-tuning large models | A3 High | Slurm | Deploy an A3 High Slurm cluster |

Recommended consumption type

For fine-tuning workloads, we recommend using a reservation or Dynamic Workload Scheduler (DWS) to provision resources. For more information about consumption options, see Consumption options.

Recommended storage services

For fine-tuning models, the amount of data needed can be significant, and read speed matters for fine-tuning performance. We also recommend frequent and fast checkpointing of the model being fine-tuned. As with pre-training, for most of these needs we recommend Parallelstore and Cloud Storage. For more information about storage options, see Storage options.

Recommendations for inference

The following sections outline the accelerators, recommended consumption type, and storage service to use when performing inference.

Recommended accelerators

The recommended accelerators for inference depend on whether you're performing multi-host frontier or large model inference, or single-host frontier inference.

Recommended accelerators (multi-host)

To perform multi-host frontier or large model inference on Google Cloud, we recommend using the A3 accelerator-optimized machine series and deploying these machines using an orchestrator. To deploy these clusters of accelerators, we also recommend using Cluster Toolkit. To get you started with these clusters, a link to a cluster deployment guide for each recommended machine type is provided.

| Workloads | Machine type | Orchestrator | Cluster deployment guide |
|---|---|---|---|
| Multi-host frontier inference | A3 Ultra | GKE | Deploy an A3 Ultra cluster with GKE |
| Multi-host frontier inference | A3 Ultra | Slurm | Deploy an A3 Ultra Slurm cluster |
| Multi-host frontier inference | A3 Mega | GKE | Deploy an A3 Mega cluster with GKE |
| Multi-host frontier inference | A3 Mega | Slurm | Deploy an A3 Mega Slurm cluster |
| Large model inference | A3 High | GKE | Deploy an A3 High cluster with GKE |
| Large model inference | A3 High | Slurm | Deploy an A3 High Slurm cluster |

Recommended accelerators (single host)

The following table outlines the recommended accelerators to use when performing single-host frontier inference. To get you started with these VMs, a link to a VM deployment guide for each recommended machine type is provided.

| Workloads | Machine type | Orchestrator | VM deployment guide |
|---|---|---|---|
| Single-host frontier inference | A3 Ultra | N/A | Create an A3 Ultra VM |
| Single-host frontier inference | A3 Mega | N/A | |
| Single-host frontier inference | A3 High | N/A | Create an A3 High VM |
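
The linked guides describe how to create these VMs in the console or with the gcloud CLI. As an illustration only, the following Python sketch uses the google-cloud-compute client library to create a single A3 High VM; the project, zone, boot image, and network are placeholder assumptions, and the VM deployment guides list the recommended GPU-ready images and additional configuration for A3 machine types.

```python
from google.cloud import compute_v1

def create_a3_high_vm(project_id: str, zone: str, name: str) -> None:
    """Creates a single A3 High VM for single-host inference (sketch)."""
    boot_disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            # Placeholder image; use a GPU-ready image recommended in the guides.
            source_image="projects/debian-cloud/global/images/family/debian-12",
            disk_size_gb=200,
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/a3-highgpu-8g",
        disks=[boot_disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default")
        ],
        # GPU VMs must terminate, rather than live-migrate, on host maintenance.
        scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
    )
    operation = compute_v1.InstancesClient().insert(
        project=project_id, zone=zone, instance_resource=instance
    )
    operation.result()  # wait for the create operation to finish

# Placeholder project, zone, and VM name.
create_a3_high_vm("my-project", "us-central1-a", "a3-inference-vm")
```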

Recommended consumption type

For inference workloads, we recommend using a reservation or Dynamic Workload Scheduler (DWS) to provision resources. If you use A3 Ultra, you must use block reservations. For more information about consumption options, see Consumption options.

Recommended storage services

For inference, quickly loading the inference binaries and model weights across many servers requires fast data reads. We recommend Cloud Storage for loading inference binaries and weights. For more information about storage options, see Storage options.
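
For example, an inference server can pull its weights from a bucket at startup. The following minimal Python sketch uses the google-cloud-storage client library to download every object under a weights prefix; the bucket name, prefix, and local directory are hypothetical.

```python
from google.cloud import storage

def download_weights(bucket_name: str, prefix: str, local_dir: str) -> None:
    """Downloads all model weight objects under a prefix to local disk."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        # Flatten each object name to a file in the local directory.
        destination = f"{local_dir}/{blob.name.rsplit('/', 1)[-1]}"
        blob.download_to_filename(destination)

# Hypothetical bucket, prefix, and destination -- replace with your own.
download_weights("my-models-bucket", "model-weights/", "/srv/model")
```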

Recommendations for HPC

For HPC workloads, any accelerator-optimized machine series or compute-optimized machine series works well. If using an accelerator-optimized machine series, the best fit depends on the amount of computation that must be offloaded to the GPU. To get a detailed list of recommendations for HPC workloads, see Best practices for running HPC workloads.

To deploy HPC environments, a wider array of cluster blueprints is available. To get started, see Cluster blueprint catalog.

Summary of recommendations

The following table summarizes which accelerator, consumption type, and storage service we recommend for different workloads.


| Workload | Resource | Recommendation |
|---|---|---|
| Model pre-training | Machine family | Use one of the following A3 accelerator-optimized machine types: A3 Ultra, A3 Mega, or A3 High |
| Model pre-training | Consumption type | Use reservations |
| Model pre-training | Storage | Use a Google Cloud managed service such as Parallelstore or Cloud Storage |
| Model fine-tuning | Machine family | Use one of the following A3 accelerator-optimized machine types: A3 Mega or A3 High |
| Model fine-tuning | Consumption type | Use reservations or Dynamic Workload Scheduler |
| Model fine-tuning | Storage | Use a Google Cloud managed service such as Parallelstore or Cloud Storage |
| Inference | Machine family | Use one of the following A3 accelerator-optimized machine types: A3 Ultra, A3 Mega, or A3 High |
| Inference | Consumption type | Use reservations or Dynamic Workload Scheduler |
| Inference | Storage | Use a Google Cloud managed service such as Cloud Storage |
| HPC | | See the summary section of Best practices for running HPC workloads |