Last updated 2025-09-04 UTC.

# Deploy generative AI models

Some generative AI models, such as [Gemini](/vertex-ai/generative-ai/docs/overview), have managed APIs and are ready to accept prompts without deployment. For a list of models with managed APIs, see [Foundational model APIs](/vertex-ai/generative-ai/docs/learn/models#foundation_model_apis).

Other generative AI models must be deployed to an endpoint before they're ready to accept prompts. There are two types of generative models that must be deployed:

- [Tuned models](#deploy_a_tuned_model), which you create by tuning a supported foundation model with your own data.

- [Generative models that don't have managed APIs](#not-managed). In the Model Garden, these are models that aren't labeled as **API available** or **Vertex AI Studio**, such as Llama 2.

When you deploy a model to an endpoint, Vertex AI associates compute resources and a URI with the model so that it can serve prompt requests.

Deploy a tuned model
--------------------

Tuned models are automatically uploaded to the [Vertex AI Model Registry](/vertex-ai/docs/model-registry/introduction) and deployed to a Vertex AI shared public [`endpoint`](/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints).
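Once the tuned model is on its endpoint, prompt requests go to that endpoint's URI rather than to the managed API. As a minimal sketch of assembling such a request (the helper names and the project, location, and endpoint IDs below are placeholders; a Gemini-tuned model is assumed, so the body follows the Gemini API's `contents` format):

```python
# Sketch: build a prompt request for a tuned model's endpoint.
# The helper names and IDs are placeholders, not part of any SDK.

def tuned_endpoint_url(project_id: str, location: str, endpoint_id: str) -> str:
    """Assemble the tuned model's endpoint URL from its parts."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}"
    )

def gemini_prompt_body(prompt: str) -> dict:
    """For a model tuned from Gemini, the request body follows the
    Gemini API (a generateContent-style `contents` list)."""
    return {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}

url = tuned_endpoint_url("my-project", "us-central1", "1234567890")
body = gemini_prompt_body("Summarize this paragraph.")
```

You would then POST `body` to `url` with your usual authenticated HTTP client; the exact method suffix and response shape follow the foundation model's API reference.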
Tuned models don't appear in the Model Garden because they are tuned with your data. For more information, see [Overview of model tuning](/vertex-ai/generative-ai/docs/models/tune-models).

Once the endpoint is *active*, it is ready to accept prompt requests at its URI. The format of the API call for a tuned model is the same as for the foundation model it was tuned from. For example, if your model is tuned on Gemini, then your prompt request should follow the [Gemini API](/vertex-ai/generative-ai/docs/model-reference/gemini).

Make sure you send prompt requests to your tuned model's endpoint instead of the managed API. The tuned model's endpoint is in the following format:

```
https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID
```

To get the endpoint ID, see [View or manage an endpoint](#view_or_manage_an_endpoint).

For more information on formatting prompt requests, see the [Model API reference](/vertex-ai/generative-ai/docs/model-reference/overview).

Deploy a generative model that doesn't have a managed API
---------------------------------------------------------

To use a model from the Model Garden that doesn't have a managed API, you must upload the model to Model Registry and deploy it to an endpoint before you can send prompt requests. This is similar to uploading and [deploying a custom-trained model for online prediction](/vertex-ai/docs/general/deployment) in Vertex AI.

To deploy one of these models, go to the Model Garden and select the model you'd like to deploy.

[Go to Model Garden](https://console.cloud.google.com/vertex-ai/model-garden)

Each model card displays one or more of the following deployment options:

- **Deploy** button: Most of the generative models in the Model Garden have a **Deploy** button that walks you through deploying to Vertex AI.
  If you don't see a **Deploy** button, go to the next bullet.

  For deployment on Vertex AI, you can use the suggested settings or modify them. You can also set **Advanced** deployment settings to, for example, select a Compute Engine [reservation](#reservation).

  | **Note:** Some models also support deployment to Google Kubernetes Engine, which is an unmanaged solution that gives you even more control. For more information, see [Serve a model with a single GPU in GKE](/kubernetes-engine/docs/tutorials/online-ml-inference).

- **Open Notebook** button: This option opens a Jupyter notebook. Every model card displays this option. The Jupyter notebook includes instructions and sample code for uploading the model to Model Registry, deploying the model to an endpoint, and sending a prompt request.

Once deployment is complete and the endpoint is *active*, it is ready to accept prompt requests at its URI. The format of the API is [`predict`](/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/predict), and the format of each [`instance`](/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/predict#body.request_body.FIELDS.instances) in the request body depends on the model. For more information, see the following resources:

- [Request body for online prediction](/vertex-ai/docs/predictions/get-online-predictions#request-body-details)
- [Format your input for online prediction](/vertex-ai/docs/predictions/get-online-predictions#formatting-prediction-input)

Make sure you have enough machine quota to deploy your model. To view your current quota or request more quota, in the Google Cloud console, go to the **Quotas** page.

[Go to Quotas](https://console.cloud.google.com/iam-admin/quotas)

Then, filter by the quota name `Custom Model Serving` to see the quotas for online prediction.
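To make the `predict` request shape concrete, here is a minimal sketch. The instance fields below (`prompt`, `max_tokens`) are hypothetical; the real instance schema depends on the model you deployed, so check its documentation or the sample notebook:

```python
import json

# Sketch: the shape of an online prediction request for a deployed
# Model Garden model. The instance fields are hypothetical; each
# model defines its own instance schema.
instances = [{"prompt": "Why is the sky blue?", "max_tokens": 256}]

# REST request body for POST .../endpoints/ENDPOINT_ID:predict
request_body = json.dumps({"instances": instances})

# With the google-cloud-aiplatform SDK, the equivalent call is roughly:
#   from google.cloud import aiplatform
#   aiplatform.init(project="my-project", location="us-central1")
#   endpoint = aiplatform.Endpoint("ENDPOINT_ID")
#   response = endpoint.predict(instances=instances)
```

The SDK lines are commented out because they require credentials and a live endpoint; they are shown only to connect the REST body to the client-library call.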
To learn more, see [View and manage quotas](/docs/quotas/view-manage).

### Ensure capacity for deployed models with Compute Engine reservations

You can deploy Model Garden models on VM resources that have been allocated through Compute Engine reservations. Reservations help ensure that capacity is available when your model prediction requests need it. For more information, see [Use reservations with prediction](/vertex-ai/docs/predictions/use-reservations).

View or manage a model
----------------------

For tuned models, you can view the model and its tuning job on the **Tune and Distill** page in the Google Cloud console.

[Go to Tune and Distill](https://console.cloud.google.com/vertex-ai/generative/language/tuning)

You can also view and manage all of your uploaded models in Model Registry.

[Go to Model Registry](https://console.cloud.google.com/vertex-ai/models)

In Model Registry, a tuned model is categorized as a *Large Model* and has labels that specify the foundation model and the pipeline or tuning job that was used for tuning.

Models that are deployed with the **Deploy** button indicate **Model Garden** as their [`Source`](/vertex-ai/docs/reference/rest/v1/projects.locations.models#Model.FIELDS.model_source_info). Note that if the model is updated in the Model Garden, your uploaded model in Model Registry is not updated.

For more information, see [Introduction to Vertex AI Model Registry](/vertex-ai/docs/model-registry/introduction).

View or manage an endpoint
--------------------------

To view and manage your endpoint, go to the Vertex AI **Online prediction** page.
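The console and API responses identify an endpoint by its full resource name, while prompt requests need only the trailing endpoint ID. A small helper (hypothetical, for illustration only) can pull the ID out:

```python
# Sketch: extract the endpoint ID from a full endpoint resource name
# of the form projects/PROJECT/locations/LOCATION/endpoints/ENDPOINT_ID.
# The helper and sample value are illustrative, not part of the API.

def endpoint_id_from_resource_name(resource_name: str) -> str:
    """Return the trailing ID of an endpoint resource name."""
    _, sep, endpoint_id = resource_name.rpartition("/endpoints/")
    if not sep:
        raise ValueError(f"not an endpoint resource name: {resource_name!r}")
    return endpoint_id

eid = endpoint_id_from_resource_name(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)
# eid == "1234567890"
```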
By default, the endpoint's name is the same as the model's name.

[Go to Online prediction](https://console.cloud.google.com/vertex-ai/online-prediction/endpoints)

For more information, see [Deploy a model to an endpoint](/vertex-ai/docs/general/deployment).

Monitor model endpoint traffic
------------------------------

To learn how to monitor model endpoint traffic, see [Monitor models](/vertex-ai/generative-ai/docs/learn/model-observability#monitor-traffic).

Limitations
-----------

- A tuned Gemini model can only be deployed to a shared public endpoint. Deployment to dedicated public endpoints, Private Service Connect endpoints, and private endpoints isn't supported.

Pricing
-------

For tuned models, you are billed per token at the same rate as the foundation model your model was tuned from. There is no cost for the endpoint because tuning is implemented as a small adapter on top of the foundation model. For more information, see [pricing for Generative AI on Vertex AI](/vertex-ai/generative-ai/pricing).

For models without managed APIs, you are billed for the machine hours that are used by your endpoint at the same rate as Vertex AI online predictions. You are not billed per token. For more information, see [pricing for predictions in Vertex AI](/vertex-ai/pricing#prediction-prices).

What's next
-----------

- [Overview of model tuning](/vertex-ai/generative-ai/docs/models/tune-models)
- [Model API reference](/vertex-ai/generative-ai/docs/model-reference/overview)
- [Deploy a model to an endpoint](/vertex-ai/docs/general/deployment)