Deploy generative AI models

This guide shows you how to deploy generative AI models to a Vertex AI endpoint for online predictions.

Some generative AI models, such as Gemini, have managed APIs and are ready to accept prompts without deployment. For a list of models with managed APIs, see Foundational model APIs. Other generative AI models must be deployed to an endpoint before they can accept prompts. The following table compares the types of models that require deployment.

| Model type | Description | Deployment process | Use case |
| --- | --- | --- | --- |
| Tuned model | A foundation model that you fine-tune with your data. | Automatic deployment to a shared public endpoint after the tuning job completes. | Serving a customized model trained on your specific data. |
| Model without a managed API | A pre-trained model from Model Garden (for example, Llama 2) that you deploy yourself. | Manual deployment with the Deploy button or a Jupyter notebook. | Serving open or third-party models that don't have a ready-to-use API. |

When you deploy a model to an endpoint, Vertex AI associates compute resources and a URI with the model so that it can serve prompt requests. The following diagram summarizes the workflow for deploying a model:

[Diagram: the workflow for deploying a model]
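As a concrete illustration of this workflow, the following sketch uploads a model to the Model Registry and deploys it to an endpoint with the Vertex AI SDK for Python. The project ID, bucket, and serving container are hypothetical placeholders, and the exact upload arguments depend on the model that you deploy.

```python
from google.cloud import aiplatform

# Placeholder values; substitute your own project, bucket, and container.
aiplatform.init(project="your-project-id", location="us-central1")

# Upload the model artifacts to the Vertex AI Model Registry.
model = aiplatform.Model.upload(
    display_name="my-open-model",
    artifact_uri="gs://your-bucket/model-artifacts",
    serving_container_image_uri="us-docker.pkg.dev/your-project-id/your-repo/serving:latest",
)

# Deploying associates compute resources and a URI with the model so that
# it can serve prompt requests.
endpoint = model.deploy(machine_type="n1-standard-4")
print(endpoint.resource_name)  # projects/.../locations/.../endpoints/ENDPOINT_ID
```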
Deploy a tuned model

Tuned models are automatically uploaded to the Vertex AI Model Registry and deployed to a Vertex AI shared public endpoint. Tuned models don't appear in the Model Garden because you tune them with your data. For more information, see Overview of model tuning.

After the endpoint becomes active, it can accept prompt requests at its URI. The API call format for a tuned model is the same as the foundation model that you used for tuning. For example, if you tune your model on Gemini, your prompt request should follow the Gemini API.

Send prompt requests to your tuned model's endpoint, not the managed API. The tuned model's endpoint has the following format:

https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID

To get the endpoint ID, see View or manage an endpoint. For more information on how to format prompt requests, see the Model API reference.
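For example, if your tuned model is based on Gemini, you can send a prompt request to its endpoint with the Vertex AI SDK for Python. This is a minimal sketch; the project, location, and endpoint ID are placeholders, and it assumes a Gemini-format tuned model.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder values.
vertexai.init(project="PROJECT_ID", location="us-central1")

# Address the tuned model by its endpoint resource name, not by the
# foundation model's name.
tuned_model = GenerativeModel(
    "projects/PROJECT_ID/locations/us-central1/endpoints/ENDPOINT_ID"
)

response = tuned_model.generate_content("Summarize this quarter's sales report.")
print(response.text)
```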
Deploy a model without a managed API

To use a model from the Model Garden that doesn't have a managed API, you must upload the model to the Model Registry and deploy it to an endpoint. This process is similar to deploying a custom-trained model for online prediction. To deploy one of these models, go to the Model Garden and select the model that you want to deploy. Each model card displays one or more of the following deployment options:

- Deploy button: A guided, UI-based workflow in the Google Cloud console.
- Open Notebook button: A Jupyter notebook with sample code for deployment.

After deployment, the endpoint becomes active and can accept prompt requests at its URI. The API format is predict, and the structure of each instance in the request body depends on the model.
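As an illustration of a predict call, the following sketch sends a request with the Vertex AI SDK for Python. The instance shape shown here (a prompt field with a sampling parameter) is an assumption that matches many text models from the Model Garden; check your model's card for the schema it actually expects.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Reference the endpoint that was created when you deployed the model.
endpoint = aiplatform.Endpoint(
    "projects/your-project-id/locations/us-central1/endpoints/ENDPOINT_ID"
)

# The instance schema is model-specific; this shape is only an example.
instances = [{"prompt": "Why is the sky blue?", "max_tokens": 256}]

response = endpoint.predict(instances=instances)
print(response.predictions)
```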
Before you deploy, verify that you have sufficient machine quota. To view your current quota or request an increase, go to the Quotas page in the Google Cloud console. Then, filter by the quota name Custom Model Serving to see the quotas for online prediction. To learn more, see View and manage quotas.

Reserve capacity with Compute Engine reservations

You can deploy Model Garden models on VM resources that you allocate through Compute Engine reservations. Reservations help make capacity available when you need it. For more information, see Use reservations with prediction.
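The sketch below deploys a model onto a specific reservation. The reservation affinity parameters are an assumption based on recent google-cloud-aiplatform releases; verify that your installed SDK version exposes them, and substitute your own resource names.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model(
    "projects/your-project-id/locations/us-central1/models/MODEL_ID"
)

# Assumed parameters: reservation affinity on Model.deploy is available in
# recent SDK releases; all resource names below are placeholders.
endpoint = model.deploy(
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    reservation_affinity_type="SPECIFIC_RESERVATION",
    reservation_affinity_key="compute.googleapis.com/reservation-name",
    reservation_affinity_values=[
        "projects/your-project-id/zones/us-central1-a/reservations/my-reservation"
    ],
)
```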
View or manage a model

You can view and manage all models that you've uploaded in the Model Registry. For tuned models, you can also view the model and its tuning job on the Tune and Distill page. In the Model Registry, tuned models are categorized as Large Models and have labels that specify the foundation model and the tuning job. For models deployed with the Deploy button, the Source is Model Garden. Updates to models in the Model Garden don't apply to the models that you've uploaded to the Model Registry. For more information, see Introduction to Vertex AI Model Registry.

View or manage an endpoint

To view and manage your endpoint, go to the Vertex AI Online prediction page. By default, the endpoint's name is the same as the model's name. For more information, see Deploy a model to an endpoint.
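If you prefer to inspect models and endpoints programmatically, the Vertex AI SDK for Python exposes list methods. A brief sketch, with the project and location as placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Models that you've uploaded to the Model Registry.
for model in aiplatform.Model.list():
    print(model.display_name, model.resource_name)

# Endpoints; by default, an endpoint's name matches its model's name.
for endpoint in aiplatform.Endpoint.list():
    print(endpoint.display_name, endpoint.resource_name)
```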
Monitor model endpoint traffic

To monitor traffic to your endpoint in Metrics Explorer, do the following:

1. In the Google Cloud console, go to the Metrics Explorer page.
2. Select your project.
3. In the Select a metric field, enter Vertex AI Endpoint.
4. Select the Vertex AI Endpoint > Prediction metric category.
5. Under Active metrics, select one or more of the following metrics:
   - prediction/online/error_count
   - prediction/online/prediction_count
   - prediction/online/prediction_latencies
   - prediction/online/response_count
6. Click Apply.

To refine your view, you can filter or aggregate the metrics:

- To filter, use an expression such as endpoint_id = gemini-2p0-flash-001. In a model name, replace decimal points with p.
- To aggregate, group the metrics by a label such as response_code.

Optional: To set up alerts for your endpoint, see Manage alerting policies.

To view the metrics that you add to your project using a dashboard, see Dashboards overview.
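You can also read these metrics programmatically with the Cloud Monitoring API. A minimal sketch, assuming the google-cloud-monitoring client library and a placeholder project ID:

```python
import time

from google.cloud import monitoring_v3

project = "your-project-id"  # placeholder
client = monitoring_v3.MetricServiceClient()

# Look at the last hour of online prediction counts.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project}",
        "filter": 'metric.type = "aiplatform.googleapis.com/prediction/online/prediction_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Print the total prediction count per endpoint over the window.
for series in results:
    endpoint_id = series.resource.labels.get("endpoint_id")
    total = sum(point.value.int64_value for point in series.points)
    print(endpoint_id, total)
```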
Limitations
Pricing

- Tuned models: You are billed per token at the same rate as the foundation model that you used for tuning. There is no cost for the endpoint because Vertex AI implements tuning as a small adapter on top of the foundation model. For more information, see pricing for Generative AI on Vertex AI.
- Models without managed APIs: You are billed for the machine hours that your endpoint uses at the same rate as Vertex AI online predictions. You are not billed per token. For more information, see pricing for predictions in Vertex AI.
What's next