Deploy generative AI models

This guide shows you how to deploy generative AI models to a Vertex AI endpoint for online predictions.

Some generative AI models, such as Gemini, have managed APIs and are ready to accept prompts without deployment. For a list of models with managed APIs, see Foundation model APIs.

Other generative AI models must be deployed to an endpoint before they can accept prompts. The following table compares the types of models that require deployment.

| Model type | Description | Deployment process | Use case |
| --- | --- | --- | --- |
| Tuned model | A foundation model that you fine-tune with your own data. | Automatic deployment to a shared public endpoint after the tuning job completes. | Serving a customized model trained on your specific data. |
| Model without a managed API | A pre-trained model from Model Garden (for example, Llama 2) that you deploy yourself. | Manual deployment with the Deploy button or a Jupyter Notebook. | Serving open or third-party models that don't have a ready-to-use API. |

When you deploy a model to an endpoint, Vertex AI associates compute resources and a URI with the model so that it can serve prompt requests.

Deploy a tuned model

Tuned models are automatically uploaded to the Vertex AI Model Registry and deployed to a Vertex AI shared public endpoint. Tuned models don't appear in the Model Garden because you tune them with your data. For more information, see Overview of model tuning.

After the endpoint becomes active, it can accept prompt requests at its URI. The API call format for a tuned model is the same as the foundation model that you used for tuning. For example, if you tune your model on Gemini, your prompt request should follow the Gemini API.

Send prompt requests to your tuned model's endpoint, not the managed API. The tuned model's endpoint has the following format:

https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID

To get the endpoint ID, see View or manage an endpoint. For more information on how to format prompt requests, see the Model API reference.
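
For example, the following minimal Python sketch sends a request to a tuned Gemini model's endpoint. It assumes that Application Default Credentials are configured; the PROJECT_ID, LOCATION, and ENDPOINT_ID values are placeholders that you replace with your own.

```python
# Minimal sketch: send a prompt to a tuned Gemini model's endpoint.
# Assumes Application Default Credentials; all IDs below are placeholders.
import google.auth
import google.auth.transport.requests
import requests

PROJECT_ID = "your-project-id"   # placeholder
LOCATION = "us-central1"         # placeholder
ENDPOINT_ID = "1234567890"       # placeholder: see "View or manage an endpoint"

# Obtain an access token from Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# The request body follows the Gemini API format, but the request goes to
# the tuned model's endpoint rather than the managed API.
url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:generateContent"
)
body = {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=body,
)
print(response.json())
```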

Deploy a model without a managed API

To use a model from the Model Garden that doesn't have a managed API, you must upload the model to the Model Registry and deploy it to an endpoint. This process is similar to deploying a custom-trained model for online prediction.

To deploy one of these models, go to the Model Garden and select the model that you want to deploy.

Go to Model Garden

Each model card displays one or more of the following deployment options:

  • Deploy button: A guided, UI-based workflow in the Google Cloud console.

    • Pros: Simple, no code required, good for standard configurations.
    • Cons: Less flexible for complex or automated setups.
    • Details:
      • For deployment on Vertex AI, you can use the suggested settings or customize them, including advanced options like selecting a Compute Engine reservation.
      • Some models also support deployment to Google Kubernetes Engine, an unmanaged solution for greater control. For more information, see Serve a model with a single GPU in GKE.
      • If you don't see a Deploy button, use the Open Notebook option.
  • Open Notebook button: A Jupyter Notebook with sample code for deployment.

    • Pros: Highly customizable, good for automation (CI/CD), provides code transparency.
    • Cons: Requires familiarity with Python and the Vertex AI SDK.
    • Details: The notebook contains sample code and instructions to upload the model to Model Registry, deploy it to an endpoint, and send a prompt request. Every model card in the Model Garden has this option.
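
The following is a minimal sketch of the workflow that such a notebook typically walks through, using the Vertex AI SDK for Python. The artifact path, serving container image, machine settings, and instance schema are illustrative assumptions; the actual values depend on the model that you deploy.

```python
# Minimal sketch: upload a model to the Model Registry, deploy it to an
# endpoint, and send a prompt request. All paths and settings are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")  # placeholders

# Upload the model artifacts and serving container to the Model Registry.
model = aiplatform.Model.upload(
    display_name="my-open-model",
    artifact_uri="gs://your-bucket/model-artifacts",  # placeholder
    serving_container_image_uri=(
        "us-docker.pkg.dev/your-project/your-repo/serving-image:latest"  # placeholder
    ),
)

# Deploy the model to a new endpoint on accelerator-backed machines.
endpoint = model.deploy(
    machine_type="g2-standard-12",   # example machine type
    accelerator_type="NVIDIA_L4",    # example accelerator
    accelerator_count=1,
)

# Send a prompt request. The instance schema depends on the model.
prediction = endpoint.predict(instances=[{"prompt": "Why is the sky blue?"}])
print(prediction.predictions)
```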

After deployment, the endpoint becomes active and can accept prompt requests at its URI. The API format is predict, and the structure of each instance in the request body depends on the model.

Before you deploy, verify that you have sufficient machine quota. To view your current quota or request an increase, go to the Quotas page in the Google Cloud console.

Then, filter by the quota name Custom Model Serving to see the quotas for online prediction. To learn more, see View and manage quotas.

Go to Quotas

Reserve capacity with Compute Engine reservations

You can deploy Model Garden models on VM resources that you allocate through Compute Engine reservations. Reservations help ensure that capacity is available when your deployment needs it. For more information, see Use reservations with prediction.
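
As a sketch of what this can look like in code: recent versions of the Vertex AI SDK for Python accept reservation affinity parameters at deployment time. The reservation path and machine shape below are placeholders, and parameter availability depends on your SDK version.

```python
# Sketch: deploy a model onto VMs from a specific Compute Engine reservation.
# The reservation path and machine shape are placeholders and must match
# the reservation that you created.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")  # placeholders

model = aiplatform.Model("MODEL_ID")  # placeholder: a Model Registry model

endpoint = model.deploy(
    machine_type="a2-highgpu-1g",          # must match the reserved VM shape
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_affinity_key="compute.googleapis.com/reservation-name",
    reservation_affinity_values=[
        "projects/your-project-id/zones/us-central1-a/reservations/your-reservation"
    ],
)
```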

View or manage a model

You can view and manage all models that you've uploaded in the Model Registry.

Go to Model Registry

For tuned models, you can also view the model and its tuning job on the Tune and Distill page.

Go to Tune and Distill

In the Model Registry, tuned models are categorized as Large Models and have labels that specify the foundation model and the tuning job. For models deployed with the Deploy button, the Source is Model Garden. Updates to models in the Model Garden don't apply to the models that you've uploaded to the Model Registry.

For more information, see Introduction to Vertex AI Model Registry.

View or manage an endpoint

To view and manage your endpoint, go to the Vertex AI Online prediction page. By default, the endpoint's name is the same as the model's name.

Go to Online prediction

For more information, see Deploy a model to an endpoint.
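
If you prefer to look up endpoint IDs programmatically, the following minimal sketch, assuming the Vertex AI SDK for Python, lists the endpoints in a project:

```python
# Minimal sketch: list endpoints and their IDs with the Vertex AI SDK.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")  # placeholders

for endpoint in aiplatform.Endpoint.list():
    # endpoint.name holds the numeric endpoint ID used in the endpoint URI.
    print(endpoint.display_name, endpoint.name)
```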

Monitor model endpoint traffic

To monitor traffic to your endpoint in Metrics Explorer, do the following:

  1. In the Google Cloud console, go to the Metrics Explorer page.

    Go to Metrics Explorer

  2. Select your project.

  3. In the Select a metric field, enter Vertex AI Endpoint.

  4. Select the Vertex AI Endpoint > Prediction metric category. Under Active metrics, select one or more of the following metrics:

    • prediction/online/error_count
    • prediction/online/prediction_count
    • prediction/online/prediction_latencies
    • prediction/online/response_count
  5. Click Apply.

  6. To refine your view, you can filter or aggregate the metrics:

    • Filter: To view a subset of your data, use the Filter drop-down menu. For example, filter by endpoint_id = gemini-2p0-flash-001. In a model name, replace each decimal point with p (for example, gemini-2.0-flash-001 becomes gemini-2p0-flash-001).
    • Aggregation: To combine data points, use the Aggregation drop-down menu. For example, you can view the sum of the response_count metric.
  7. Optional: To set up alerts for your endpoint, see Manage alerting policies.

To view the metrics that you add to your project on a dashboard, see Dashboards overview.
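
You can also read the same metrics programmatically. The following minimal sketch assumes the Cloud Monitoring client library for Python (google-cloud-monitoring) and queries the last hour of prediction counts:

```python
# Minimal sketch: query prediction_count for the last hour with the
# Cloud Monitoring API. The project ID is a placeholder.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = '
            '"aiplatform.googleapis.com/prediction/online/prediction_count"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    endpoint_id = series.resource.labels.get("endpoint_id", "unknown")
    latest = series.points[0].value.int64_value if series.points else 0
    print(endpoint_id, latest)
```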

Limitations

  • You can only deploy a tuned Gemini model to a shared public endpoint. You can't deploy it to dedicated public endpoints, Private Service Connect endpoints, or private endpoints.

Pricing

  • Tuned models: You are billed per token at the same rate as the foundation model that you used for tuning. There is no cost for the endpoint because Vertex AI implements tuning as a small adapter on top of the foundation model. For more information, see pricing for Generative AI on Vertex AI.

  • Models without managed APIs: You are billed for the machine hours that your endpoint uses at the same rate as Vertex AI online predictions. You are not billed per token. For more information, see pricing for predictions in Vertex AI.

What's next