Deploy generative AI models

Some generative AI models, such as Gemini, have managed APIs and are ready to accept prompts without deployment. For a list of models with managed APIs, see Foundational model APIs.

Other generative AI models must be deployed to an endpoint before they're ready to accept prompts. There are two types of generative models that must be deployed:

- Tuned models, which you create by tuning a supported foundation model with your own data.
- Generative models that don't have managed APIs. In the Model Garden, these are models that aren't labeled as API available or Vertex AI Studio, for example, Llama 2.

When you deploy a model to an endpoint, Vertex AI associates compute resources and a URI with the model so that it can serve prompt requests.

Deploy a tuned model

Tuned models are automatically uploaded to the Vertex AI Model Registry and deployed to a Vertex AI shared public endpoint. Tuned models don't appear in the Model Garden because they are tuned with your data. For more information, see Overview of model tuning.
Once the endpoint is active, it is ready to accept prompt requests at its URI.
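The endpoint URI follows a fixed pattern. As a minimal sketch of assembling it (the project, location, and endpoint ID values below are placeholders, not real resources):

```python
# Assemble a Vertex AI endpoint URI from its components.
# The values passed in below are illustrative placeholders.
def endpoint_uri(project_id: str, location: str, endpoint_id: str) -> str:
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}"
    )

uri = endpoint_uri("my-project", "us-central1", "1234567890")
print(uri)
```

Note that the location appears twice: once in the hostname and once in the resource path.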
The format of the API call for a tuned model is the same as the foundation model
it was tuned from. For example, if your model is tuned on Gemini, then your
prompt request should follow the Gemini API. Make sure you send prompt requests to your tuned model's endpoint instead of the
managed API. The tuned model's endpoint is in the format:

https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID

To get the endpoint ID, see View or manage an endpoint. For more information on formatting prompt requests, see the
Model API reference.

Deploy a generative model that doesn't have a managed API

To use a model from the Model Garden that doesn't have a managed
API, you must upload the model to Model Registry and
deploy it to an endpoint before you can send prompt requests. This is similar to
uploading and deploying a custom trained model for online prediction
in Vertex AI. To deploy one of these models, go to the Model Garden and select
the model you'd like to deploy. Each model card displays one or more of the following deployment options: Deploy button: Most of the generative models in
the Model Garden have a Deploy button that walks you
through deploying to Vertex AI. If you don't see a Deploy
button, go to the next bullet. For deployment on Vertex AI, you can use the
suggested settings or modify them. You can also set Advanced deployment
settings to, for example, select a Compute Engine
reservation. Open Notebook button: This option opens a Jupyter notebook. Every model
card displays this option. The Jupyter notebook includes instructions and
sample code for uploading the model to Model Registry,
deploying the model to an endpoint, and sending a prompt request. Once deployment is complete and the endpoint is active, it is ready to accept
prompt requests at its URI. The format of the API is predict, and the format of each instance in the request body depends on the model.
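As an illustration of the general request-body shape only, a predict request wraps one or more instances and optional parameters. This is a hedged sketch: the "prompt" field name and the parameter names are assumptions, so consult your model's documentation for the actual instance schema:

```python
import json

# Sketch of a predict request body. The "prompt" field name and the
# parameter names below are assumptions; the actual instance schema
# depends on the model you deployed.
request_body = {
    "instances": [
        {"prompt": "Write a haiku about the ocean."},
    ],
    "parameters": {"temperature": 0.2, "maxOutputTokens": 256},
}

# Serialize to JSON before sending it to the endpoint URI.
payload = json.dumps(request_body)
print(payload)
```

You would POST this payload to the endpoint's predict URI with an authenticated HTTP client.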
Make sure you have enough machine quota to deploy your model. To view your
current quota or request more quota, in the Google Cloud console, go to the
Quotas page. Then, filter by the quota name Custom Model Serving to see the quotas for online prediction. To learn more, see View and manage quotas.

Ensure capacity for deployed models with Compute Engine reservations

You can deploy Model Garden models on VM resources that have been
allocated through Compute Engine reservations. Reservations help ensure
that capacity is available when your model's prediction requests need it. For
more information, see Use reservations with prediction.

View or manage a model

For tuned models, you can view the model and its tuning job on the Tune and
Distill page in the Google Cloud console. You can also view and manage all of your uploaded models in
Model Registry. In Model Registry, a tuned model is categorized as a
Large Model, and has labels that specify the foundation model and the pipeline
or tuning job that was used for tuning.

Models deployed with the Deploy button indicate Model Garden as their Source. Note that, if the model is updated in the Model Garden, your uploaded model in Model Registry is not updated.

For more information, see Introduction to Vertex AI Model Registry.

View or manage an endpoint

To view and manage your endpoint, go to the Vertex AI
Online prediction page. By default, the endpoint's name is the same as the
model's name. For more information, see Deploy a model to an endpoint.

Monitor model endpoint traffic

Use the following instructions to monitor traffic to your endpoint in the Metrics Explorer:

1. In the Google Cloud console, go to the Metrics Explorer page.
2. Select the project you want to view metrics for.
3. From the Metric drop-down menu, click Select a metric.
4. In the Filter by resource or metric name search bar, enter Vertex AI Endpoint.
5. Select the Vertex AI Endpoint > Prediction metric category.
6. Under Active metrics, select any of the following metrics:
   - prediction/online/error_count
   - prediction/online/prediction_count
   - prediction/online/prediction_latencies
   - prediction/online/response_count
7. Click Apply. To add more than one metric, click Add query.

You can filter or aggregate your metrics using the following drop-down menus:

- To select and view a subset of your data based on specified criteria, use the Filter drop-down menu. For example, endpoint_id = gemini-2p0-flash-001 (decimal points in a model name should be replaced with p).
- To combine multiple data points into a single value and see a summarized view of your metrics, use the Aggregation drop-down menu. For example, you can aggregate the Sum of response_code.

Optionally, you can set up alerts for your endpoint. For more information, see Manage alerting policies.

To view the metrics you add to your project using a dashboard, see Dashboards overview.

Pricing

For tuned models, you are billed per token at the same rate as the foundation
model your model was tuned from. There is no cost for the endpoint because
tuning is implemented as a small adapter on top of the foundation model. For
more information, see pricing for Generative AI on Vertex AI. For models without managed APIs, you are billed for the machine hours that are
used by your endpoint at the same rate as Vertex AI online
predictions. You are not billed per token. For more information, see
pricing for predictions in Vertex AI.
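To make the two billing models concrete, here is a back-of-the-envelope sketch. The rates below are invented placeholders, not actual Vertex AI prices; substitute current values from the pricing pages:

```python
# Back-of-the-envelope cost comparison. Rates are INVENTED placeholders,
# not actual Vertex AI pricing; look up current prices before estimating.

# Tuned model: billed per token, at the same rate as its foundation model.
tokens_processed = 2_000_000
price_per_1k_tokens = 0.0005          # hypothetical rate, USD
tuned_model_cost = tokens_processed / 1000 * price_per_1k_tokens

# Model without a managed API: billed for the machine hours used by its
# endpoint, independent of how many tokens it processes.
hours_deployed = 24 * 30              # one month, endpoint kept deployed
price_per_machine_hour = 2.50         # hypothetical rate, USD
self_deployed_cost = hours_deployed * price_per_machine_hour

print(f"tuned (per-token):   ${tuned_model_cost:.2f}")
print(f"deployed (per-hour): ${self_deployed_cost:.2f}")
```

The point of the sketch is the shape of the formulas: per-token cost scales with traffic, while machine-hour cost scales with how long the endpoint stays deployed.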
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-21 UTC.