This page provides guidance for deploying a generative AI model to an endpoint for online prediction.
Check the Model Garden
If the model is in Model Garden, you can deploy it by clicking Deploy (available for some models) or Open Notebook.
Otherwise, you can do one of the following:
- If your model is similar to one in Model Garden, you might be able to directly reuse one of the Model Garden containers.
- Build your own custom container that adheres to Custom container requirements for prediction before importing your model into the Vertex AI Model Registry. After it's imported, it becomes a model resource that you can deploy to an endpoint.
You can use the Dockerfiles and scripts that we use to build our Model Garden containers as a reference or starting point to build your own custom containers.
Serving predictions with NVIDIA NIM
NVIDIA Inference Microservices (NIM) are pre-trained and optimized AI models that are packaged as microservices. They're designed to simplify the deployment of high-performance, production-ready AI into applications.
NVIDIA NIM can be used together with Artifact Registry and Vertex AI Prediction to deploy generative AI models for online prediction.
Settings for custom containers
This section describes fields in your model's containerSpec that you may need to specify when importing generative AI models.
You can specify these fields by using the Vertex AI REST API or the gcloud ai models upload command. For more information, see Container-related API fields.
sharedMemorySizeMb
Some generative AI models require more shared memory. Shared memory is an inter-process communication (IPC) mechanism that allows multiple processes to access and manipulate a common block of memory. The default shared memory size is 64 MB.
Some model servers, such as vLLM or NVIDIA Triton, use shared memory to cache internal data during model inference. Without enough shared memory, some model servers cannot serve predictions for generative models. The amount of shared memory needed, if any, is an implementation detail of your container and model. Consult your model server documentation for guidelines.
Also, because shared memory can be used for cross-GPU communication, using more shared memory can improve performance for accelerators without NVLink capabilities (for example, L4), if the model container requires communication across GPUs.
For information on how to specify a custom value for shared memory, see Container-related API fields.
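As a minimal sketch of what such a request fragment might look like, the following Python builds a containerSpec dictionary with a custom shared memory size and prints it as JSON. The image URI is a hypothetical placeholder and the 16 GiB value is only illustrative; the size your model server actually needs depends on the server and model.

```python
import json

# Hypothetical image URI; substitute your own container image.
IMAGE_URI = "us-docker.pkg.dev/my-project/my-repo/my-model-server:latest"


def build_container_spec(image_uri: str, shared_memory_mb: int) -> dict:
    """Return a containerSpec fragment with a custom shared memory size."""
    if shared_memory_mb <= 0:
        raise ValueError("shared_memory_mb must be positive")
    return {
        "imageUri": image_uri,
        # 64-bit integer fields are written as strings in protobuf-style JSON.
        "sharedMemorySizeMb": str(shared_memory_mb),
    }


# 16 GiB expressed in MB; an illustrative value, not a recommendation.
spec = build_container_spec(IMAGE_URI, 16 * 1024)
print(json.dumps({"containerSpec": spec}, indent=2))
```

The resulting fragment would be merged into the model import request body alongside the other container fields your model server requires.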
startupProbe
A startup probe is an optional probe that is used to detect when the container has started. This probe delays the health probe and liveness checks until the container has started, which helps prevent slow-starting containers from being shut down prematurely.
For more information, see Health checks.
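One possible shape for a startup probe is sketched below, assuming exec-style probe semantics in which a command that exits with status 0 counts as success. The marker file, the script path /app/startup_probe.py, and the timing values are all hypothetical; your model server may expose a different signal that it has started.

```python
import os
import tempfile

# Hypothetical marker file that the model server writes once it has started.
STARTED_MARKER = os.path.join(tempfile.gettempdir(), "model_server_started")


def probe_exit_code(marker_path: str) -> int:
    """Return 0 (success) if the marker file exists, otherwise 1 (failure)."""
    return 0 if os.path.exists(marker_path) else 1


# Illustrative startupProbe fragment for containerSpec.
# The command path and timing values are assumptions for this sketch.
startup_probe = {
    "exec": {"command": ["python3", "/app/startup_probe.py"]},
    "periodSeconds": 10,
    "timeoutSeconds": 5,
}

# A real probe script would pass this value to sys.exit().
print(probe_exit_code(STARTED_MARKER))
```

Keeping the check to a single file lookup means the probe itself stays cheap even while the container is busy loading a large model.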
healthProbe
The health probe checks whether a container is ready to accept traffic. If a health probe is not provided, Vertex AI uses the default health checks, which issue an HTTP request to the container's port and look for a 200 OK response from the model server.
If your model server responds with 200 OK before the model is fully loaded, which is possible, especially for large models, then the health check succeeds prematurely and Vertex AI routes traffic to the container before it is ready.
In these cases, specify a custom health probe that succeeds only after the model is fully loaded and ready to accept traffic.
For more information, see Health checks.
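The readiness behavior described above can be illustrated with a small, self-contained sketch: a health route that returns 503 while the model is still loading and 200 OK only afterward. The /health path, the model_ready flag, and the handler are hypothetical stand-ins for whatever your model server provides; this is not how any particular server implements it.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once the (hypothetical) model load finishes.
model_ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report 200 only after the model is loaded; 503 while loading.
        if self.path != "/health":
            self.send_response(404)
        elif model_ready.is_set():
            self.send_response(200)
        else:
            self.send_response(503)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def health_status(port: int) -> int:
    """Return the HTTP status code of GET /health on localhost."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

before = health_status(port)  # model still loading
model_ready.set()             # stand-in for "weights finished loading"
after = health_status(port)

server.shutdown()
server.server_close()
print(before, after)
```

A probe pointed at a route like this succeeds only once loading has completed, so traffic is not routed to a container that cannot yet serve predictions.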
Limitations
Consider the following limitations when deploying generative AI models:
- Generative AI models can only be deployed to a single machine. Multi-host deployment isn't supported.
- For very large models, such as Llama 3.1 405B, that don't fit in the largest supported VRAM, we recommend quantizing them to fit.
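To see why quantization matters for a model of this size, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at different precisions. The 8 x 80 GB machine size is an assumption chosen for illustration, not a statement of what Vertex AI supports, and the estimate ignores the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_405B = 405e9  # parameter count of a 405B model

# Assumption for illustration: one machine with 8 accelerators of 80 GB each.
ASSUMED_MACHINE_VRAM_GB = 8 * 80

for bits in (16, 8, 4):
    need = weight_memory_gb(PARAMS_405B, bits)
    fits = need < ASSUMED_MACHINE_VRAM_GB
    print(f"{bits}-bit weights: {need:.1f} GB, fits in {ASSUMED_MACHINE_VRAM_GB} GB: {fits}")
```

Under these assumptions, 16-bit weights (about 810 GB) exceed the machine's memory, while 8-bit (about 405 GB) and 4-bit (about 202.5 GB) weights leave headroom for the rest of the serving workload.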