Deploy generative AI models

This page provides guidance for deploying a generative AI model to an endpoint for online prediction.

Check the Model Garden

If the model is in Model Garden, you can deploy it by clicking Deploy (available for some models) or Open Notebook.


Otherwise, you can use one of the approaches described in the following sections.

Serving predictions with NVIDIA NIM

NVIDIA Inference Microservices (NIM) are pre-trained and optimized AI models that are packaged as microservices. They're designed to simplify the deployment of high-performance, production-ready AI into applications.

NVIDIA NIM can be used together with Artifact Registry and Vertex AI Prediction to deploy generative AI models for online prediction.
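One common workflow is to mirror a NIM container image into Artifact Registry so that Vertex AI can pull it. The following sketch assumes you have Docker installed, an NGC API key in `NGC_API_KEY`, and an existing Artifact Registry repository; the image name, repository name, and `PROJECT_ID` are placeholders, not values from this page.

```shell
# Authenticate to NVIDIA's NGC registry ("$oauthtoken" is the literal
# username NGC expects) and to Artifact Registry.
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"
gcloud auth configure-docker us-central1-docker.pkg.dev

# Pull a NIM image (hypothetical example image), retag it for your
# Artifact Registry repository, and push it.
docker pull nvcr.io/nim/meta/llama3-8b-instruct:latest
docker tag nvcr.io/nim/meta/llama3-8b-instruct:latest \
  us-central1-docker.pkg.dev/PROJECT_ID/nim-repo/llama3-8b-instruct:latest
docker push us-central1-docker.pkg.dev/PROJECT_ID/nim-repo/llama3-8b-instruct:latest
```

After the image is in Artifact Registry, you can reference it as the container image when uploading a model to Vertex AI.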

Settings for custom containers

This section describes fields in your model's containerSpec that you may need to specify when importing generative AI models.

You can specify these fields by using the Vertex AI REST API or the gcloud ai models upload command. For more information, see Container-related API fields.
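As a sketch, a minimal `gcloud ai models upload` invocation with container-related flags might look like the following. The region, display name, image URI, routes, and port are illustrative placeholders for your own container.

```shell
# Upload a custom-container model. Replace PROJECT_ID, the image URI,
# and the routes/port with values your model server actually exposes.
gcloud ai models upload \
  --region=us-central1 \
  --display-name=my-generative-model \
  --container-image-uri=us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-server:latest \
  --container-predict-route=/v1/predict \
  --container-health-route=/v1/health \
  --container-ports=8080
```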

sharedMemorySizeMb

Some generative AI models require more shared memory. Shared memory is an inter-process communication (IPC) mechanism that lets multiple processes access and manipulate a common block of memory. The default shared memory size is 64 MB.

Some model servers, such as vLLM or NVIDIA Triton, use shared memory to cache internal data during model inference. Without enough shared memory, some model servers can't serve predictions for generative models. The amount of shared memory needed, if any, is an implementation detail of your container and model. Consult your model server's documentation for guidelines.

Also, because shared memory can be used for cross-GPU communication, allocating more shared memory can improve performance on accelerators without NVLink capability (for example, L4) if the model container requires communication across GPUs.

For information on how to specify a custom value for shared memory, see Container-related API fields.
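For illustration, the following REST request sketch sets `sharedMemorySizeMb` in the `containerSpec` when uploading a model. `PROJECT_ID`, the region, the display name, and the image URI are placeholders; the value is in megabytes (16384 MB = 16 GiB here, chosen arbitrarily).

```shell
# Upload a model with 16 GiB of shared memory via the REST API.
# int64 fields such as sharedMemorySizeMb are passed as JSON strings.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/models:upload" \
  -d '{
    "model": {
      "displayName": "my-generative-model",
      "containerSpec": {
        "imageUri": "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-server:latest",
        "sharedMemorySizeMb": "16384"
      }
    }
  }'
```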

startupProbe

A startup probe is an optional probe that detects when the container has started. Vertex AI delays the health probe and liveness checks until the startup probe succeeds, which helps prevent slow-starting containers from being shut down prematurely.
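As a sketch, a startup probe can be declared in the `containerSpec` when uploading the model. The health route, port, and timing values below are illustrative assumptions, not requirements; tune them to how long your model server takes to start.

```shell
# Upload a model whose startup probe runs a command inside the
# container every 10 seconds until the server answers on its health
# route. PROJECT_ID and the image URI are placeholders.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/models:upload" \
  -d '{
    "model": {
      "displayName": "my-generative-model",
      "containerSpec": {
        "imageUri": "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-server:latest",
        "startupProbe": {
          "exec": {"command": ["curl", "-f", "http://localhost:8080/v1/health"]},
          "periodSeconds": 10,
          "timeoutSeconds": 5
        }
      }
    }
  }'
```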

For more information, see Health checks.

healthProbe

The health probe checks whether a container is ready to accept traffic. If you don't provide a health probe, Vertex AI uses its default health check, which issues an HTTP request to the container's port and looks for a 200 OK response from the model server.

If your model server responds with 200 OK before the model is fully loaded (which is possible, especially for large models), the health check succeeds prematurely, and Vertex AI routes traffic to the container before it's ready.

In these cases, specify a custom health probe that succeeds only after the model is fully loaded and ready to accept traffic.
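One way to sketch such a probe is to have your server write a marker file only after the model finishes loading, and have the health probe test for that file. The marker path `/tmp/model_loaded` is a hypothetical convention your own server code would need to implement; it is not something Vertex AI or the model server provides.

```shell
# Upload a model with a custom health probe that succeeds only once
# the server has written /tmp/model_loaded after the model is loaded.
# PROJECT_ID and the image URI are placeholders.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/models:upload" \
  -d '{
    "model": {
      "displayName": "my-generative-model",
      "containerSpec": {
        "imageUri": "us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-server:latest",
        "healthProbe": {
          "exec": {"command": ["test", "-f", "/tmp/model_loaded"]},
          "periodSeconds": 10,
          "timeoutSeconds": 5
        }
      }
    }
  }'
```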

For more information, see Health checks.

Limitations

Consider the following limitations when deploying generative AI models:

  • Generative AI models can only be deployed to a single machine. Multi-host deployment isn't supported.
  • For very large models that don't fit in the GPU memory of the largest supported machine configuration, such as Llama 3.1 405B, we recommend quantizing them to fit.