Preview: This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

This guide describes the benefits and limitations of using Flex-start VMs with Vertex AI inference. It also describes how to deploy a model that uses Flex-start VMs.
Overview
You can reduce the cost of running your inference jobs by using Flex-start VMs, which are powered by Dynamic Workload Scheduler. Flex-start VMs offer significant discounts and are well suited for short-duration workloads.
You can specify how long you need a Flex-start VM, for any duration up to seven days. After the requested time expires, your deployed model is automatically undeployed. You can also manually undeploy the model before the time expires.
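One way to undeploy early is the standard endpoints.undeployModel REST call. The following is a minimal Python sketch using the google-auth library; the region, endpoint ID, and deployed-model ID are placeholders, not values from this guide:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholders: substitute your own region, endpoint ID, and deployed-model ID.
LOCATION = "us-central1"
ENDPOINT_ID = "1234567890"
DEPLOYED_MODEL_ID = "1122334455"

# Uses Application Default Credentials; the returned project ID may be None
# outside of Google Cloud environments, in which case set it explicitly.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/{project_id}"
    f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:undeployModel"
)
# Undeploying before the requested duration expires stops billing early.
response = session.post(url, json={"deployedModelId": DEPLOYED_MODEL_ID})
response.raise_for_status()
print(response.json())  # long-running operation for the undeploy
```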
Automatic undeployment
If you request a Flex-start VM for a specific duration, your model is automatically undeployed after that time period. For example, if you request a Flex-start VM for five hours, the model is automatically undeployed five hours after submission. You are charged only for the time your workload is running.
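Later in this guide, that duration is passed to the deployment request as maxRuntimeDuration, a string of whole seconds ending in s (for example, 3600s for one hour, up to a maximum of 604800s). A small sketch of the conversion, assuming you think in hours:

```python
# Convert a requested flex-start lifetime in hours into the "<seconds>s"
# string format used by maxRuntimeDuration in the deploy request.
MAX_SECONDS = 604_800  # 7 days, the flex-start maximum


def max_runtime_duration(hours: float) -> str:
    seconds = int(hours * 3600)
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError("flex-start duration must be between 1 second and 7 days")
    return f"{seconds}s"


print(max_runtime_duration(5))  # "18000s": the model undeploys 5 hours after submission
```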
Limitations and requirements
Consider the following limitations and requirements when you use Flex-start VMs:
Maximum duration: Flex-start VMs have a maximum usage duration of seven days. Any deployment request for a longer duration is rejected.
TPU support: Using Flex-start VMs with TPU Pods isn't supported.
Quota: Make sure that you have sufficient Vertex AI preemptible quota before you launch your job. To learn more, see Rate quotas.
Queued provisioning: Using Flex-start VMs with queued provisioning isn't supported.
Node recycling: Node recycling isn't supported.
Billing
If your workload runs for less than seven days, using Flex-start VMs can reduce your costs.
When you use Flex-start VMs, you're billed based on the duration of your job and the machine type that you select. You're charged only for the time that your workload is actively running. You don't pay for the time that the job is in a queue or for any time after the requested duration has expired.
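As a rough illustration of that billing model (the hourly rate below is a made-up placeholder, not a real price; see the pricing pages referenced next):

```python
# Back-of-envelope sketch of flex-start billing. HOURLY_RATE is a
# hypothetical placeholder; look up real rates on the Dynamic Workload
# Scheduler and Vertex AI pricing pages.
HOURLY_RATE = 2.50   # assumed $/hour for the chosen machine type
queued_hours = 1.5   # time spent waiting in the queue: not billed
active_hours = 5.0   # time the workload actually runs: billed

estimated_cost = active_hours * HOURLY_RATE  # queue time contributes nothing
print(f"Estimated compute cost: ${estimated_cost:.2f}")
```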
Billing is distributed across two SKUs:
The Compute Engine SKU, with the label vertex-ai-online-prediction. See Dynamic Workload Scheduler pricing.
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-04(UTC)"],[],[],null,["| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nThis guide describes the benefits and limitations of using\nFlex-start VMs with Vertex AI inference. This guide also\ndescribes how to deploy a model that uses Flex-start VMs.\n\nOverview\n\nYou can reduce the cost of running your inference jobs by using\nFlex-start VMs, which are powered by\n[Dynamic Workload Scheduler](/blog/products/compute/introducing-dynamic-workload-scheduler).\nFlex-start VMs offer significant discounts and are well-suited for\nshort-duration workloads.\n\nYou can specify how long you need a Flex-start VM, for\nany duration up to seven days. After the requested time expires, your\ndeployed model is automatically undeployed. You can also manually undeploy\nthe model before the time expires.\n\nAutomatic undeployment\n\nIf you request a Flex-start VM for a specific duration,\nyour model is automatically undeployed after that time period. For example,\nif you request a Flex-start VM for five hours, the model\nis automatically undeployed five hours after submission. You are only charged\nfor the amount of time your workload is running.\n\nLimitations and requirements\n\nConsider the following limitations and requirements when you use\nFlex-start VMs:\n\n- **Maximum duration**: Flex-start VMs have a maximum usage duration of seven days. Any deployment request for a longer duration will be rejected.\n- **TPU support**: Using Flex-start VMs with TPU Pods isn't supported.\n- **Quota** : Make sure you have sufficient Vertex AI preemptible quota before launching your job. To learn more, see [Rate quotas](/vertex-ai/docs/quotas#serving).\n- **Queued provisioning**: Using Flex-start VMs with queued provisioning isn't supported.\n- **Node recycling**: Node recycling isn't supported.\n\nBilling\n\nIf your workload runs for less than seven days, using Flex-start VMs\ncan reduce your costs.\n\nWhen you use Flex-start VMs, you're billed based on the duration\nof your job and the machine type that you select. You are only charged for\nthe time your workload is actively running. You don't pay for the time that\nthe job is in a queue or for any time after the requested duration has expired.\n\nBilling is distributed across two SKUs:\n\n- The Compute Engine SKU, with the label `vertex-ai-online-prediction`. See\n [Dynamic Workload Scheduler pricing](https://cloud.google.com/products/dws/pricing).\n\n- The Vertex AI management fee SKU. 
See\n [Vertex AI pricing](/vertex-ai/pricing#prediction).\n\nGet inferences by using Flex-start VMs\n\nTo use Flex-start VMs when you deploy a model to get inferences,\nyou can use the REST API.\n\n\nBefore using any of the request data,\nmake the following replacements:\n\n- \u003cvar translate=\"no\"\u003eLOCATION_ID\u003c/var\u003e: The region where you are using Vertex AI.\n- \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: Your [project ID](/resource-manager/docs/creating-managing-projects#identifiers).\n- \u003cvar translate=\"no\"\u003eENDPOINT_ID\u003c/var\u003e: The ID for the endpoint.\n- \u003cvar translate=\"no\"\u003eMODEL_ID\u003c/var\u003e: The ID for the model to be deployed.\n- \u003cvar translate=\"no\"\u003eDEPLOYED_MODEL_NAME\u003c/var\u003e: A name for the `DeployedModel`. You can use the display name of the `Model` for the `DeployedModel` as well.\n- \u003cvar translate=\"no\"\u003eMACHINE_TYPE\u003c/var\u003e: Optional. The machine resources used for each node of this deployment. Its default setting is `n1-standard-2`. [Learn more about machine types.](/vertex-ai/docs/predictions/configure-compute)\n- \u003cvar translate=\"no\"\u003eACCELERATOR_TYPE\u003c/var\u003e: Optional. The type of accelerator to attach to the machine. [Learn more](/vertex-ai/docs/predictions/configure-compute#gpus).\n- \u003cvar translate=\"no\"\u003eACCELERATOR_COUNT\u003c/var\u003e: Optional. The number of accelerators for each replica to use.\n- \u003cvar translate=\"no\"\u003eMAX_RUNTIME_DURATION\u003c/var\u003e: The maximum duration for the flex-start deployment. The deployed model is automatically undeployed after this duration. Specify the duration in seconds, ending with an `s`. For example, `3600s` for one hour. The maximum value is `604800s` (7 days).\n- \u003cvar translate=\"no\"\u003ePROJECT_NUMBER\u003c/var\u003e: Your project's automatically generated [project number](/resource-manager/docs/creating-managing-projects#identifiers).\n\n\nHTTP method and URL:\n\n```\nPOST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:deployModel\n```\n\n\nRequest JSON body:\n\n```\n{\n \"deployedModel\": {\n \"model\": \"projects/PROJECT/locations/LOCATION/models/MODEL_ID\",\n \"displayName\": \"DEPLOYED_MODEL_NAME\",\n \"enableContainerLogging\": true,\n \"dedicatedResources\": {\n \"machineSpec\": {\n \"machineType\": \"MACHINE_TYPE\",\n \"acceleratorType\": \"ACCELERATOR_TYPE\",\n \"acceleratorCount\": ACCELERATOR_COUNT\n },\n \"flexStart\": {\n \"maxRuntimeDuration\": \"MAX_RUNTIME_DURATION\"\n },\n \"minReplicaCount\": 2,\n \"maxReplicaCount\": 2\n },\n },\n}\n```\n\nTo send your request, expand one of these options:\n\ncurl (Linux, macOS, or Cloud Shell) **Note:** The following command assumes that you have logged in to the `gcloud` CLI with your user account by running [`gcloud init`](/sdk/gcloud/reference/init) or [`gcloud auth login`](/sdk/gcloud/reference/auth/login) , or by using [Cloud Shell](/shell/docs), which automatically logs you into the `gcloud` CLI . 
You can check the currently active account by running [`gcloud auth list`](/sdk/gcloud/reference/auth/list).\n\n\nSave the request body in a file named `request.json`,\nand execute the following command:\n\n```\ncurl -X POST \\\n -H \"Authorization: Bearer $(gcloud auth print-access-token)\" \\\n -H \"Content-Type: application/json; charset=utf-8\" \\\n -d @request.json \\\n \"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:deployModel\"\n```\n\nPowerShell (Windows) **Note:** The following command assumes that you have logged in to the `gcloud` CLI with your user account by running [`gcloud init`](/sdk/gcloud/reference/init) or [`gcloud auth login`](/sdk/gcloud/reference/auth/login) . You can check the currently active account by running [`gcloud auth list`](/sdk/gcloud/reference/auth/list).\n\n\nSave the request body in a file named `request.json`,\nand execute the following command:\n\n```\n$cred = gcloud auth print-access-token\n$headers = @{ \"Authorization\" = \"Bearer $cred\" }\n\nInvoke-WebRequest `\n -Method POST `\n -Headers $headers `\n -ContentType: \"application/json; charset=utf-8\" `\n -InFile request.json `\n -Uri \"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID:deployModel\" | Select-Object -Expand Content\n```\n\nYou should receive a JSON response similar to the following:\n\n```\n{\n \"name\": \"projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID/operations/OPERATION_ID\",\n \"metadata\": {\n \"@type\": \"type.googleapis.com/google.cloud.aiplatform.v1beta1.DeployModelOperationMetadata\",\n \"genericMetadata\": {\n \"createTime\": \"2020-10-19T17:53:16.502088Z\",\n \"updateTime\": \"2020-10-19T17:53:16.502088Z\"\n }\n }\n}\n```\n\nWhat's next\n\n- [Use Spot VMs with Vertex AI\n inference](/vertex-ai/docs/predictions/use-spot-vms).\n\n- [Use reservations with Vertex AI\n inference](/vertex-ai/docs/predictions/use-reservations)."]]
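If you prefer to script the deployModel call from Python rather than curl or PowerShell, the following minimal sketch issues the same request with the google-auth library. The machine type, accelerator, duration, and IDs are illustrative placeholders; substitute values that match your model and quota:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholders: substitute your own region, endpoint ID, and model ID.
LOCATION = "us-central1"
ENDPOINT_ID = "1234567890"
MODEL_ID = "9876543210"

# Uses Application Default Credentials; the returned project ID may be None
# outside of Google Cloud environments, in which case set it explicitly.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/projects/{project_id}"
    f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:deployModel"
)
body = {
    "deployedModel": {
        "model": f"projects/{project_id}/locations/{LOCATION}/models/{MODEL_ID}",
        "displayName": "flex-start-deployment",
        "enableContainerLogging": True,
        "dedicatedResources": {
            # Example machine and accelerator; pick a combination you have quota for.
            "machineSpec": {
                "machineType": "g2-standard-12",
                "acceleratorType": "NVIDIA_L4",
                "acceleratorCount": 1,
            },
            # Undeploys automatically after 5 hours (18000 seconds).
            "flexStart": {"maxRuntimeDuration": "18000s"},
            "minReplicaCount": 2,
            "maxReplicaCount": 2,
        },
    }
}

response = session.post(url, json=body)
response.raise_for_status()
print(response.json())  # returns a long-running deploy operation
```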