[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Perform roll out operations for GKE Inference Gateway\n\n[Autopilot](/kubernetes-engine/docs/concepts/autopilot-overview) [Standard](/kubernetes-engine/docs/concepts/choose-cluster-mode)\n\n*** ** * ** ***\n\n|\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nThis page shows you how to perform incremental roll out operations, which\ngradually deploy new versions of your inference infrastructure, for\nGKE Inference Gateway. This gateway lets you perform safe and controlled\nupdates to your inference infrastructure. You can update nodes, base models, and\nLoRA adapters with minimal service disruption. This page also provides guidance\non traffic splitting and rollbacks to help ensure reliable deployments.\n\nThis page is for GKE Identity and account admins and\nDevelopers who want to perform roll out operations for\nGKE Inference Gateway.\n\nThe following use cases are supported:\n\n- [Node (compute, accelerator) update roll out](#node-update-rollout)\n- [Base model update roll out](#basemodel-rollout)\n- [LoRA adapter update roll out](#lora-adapter-rollout)\n\nUpdate a node roll out\n----------------------\n\nNode update roll outs safely migrate inference workloads to new node hardware or\naccelerator configurations. This process happens in a controlled manner without\ninterrupting model service. Use node update roll outs to minimize service\ndisruption during hardware upgrades, driver updates, or security issue\nresolution.\n\n1. **Create a new `InferencePool`** : deploy an `InferencePool` configured with the\n updated node or hardware specifications.\n\n2. **Split traffic using an `HTTPRoute`** : configure an `HTTPRoute` to distribute\n traffic between the existing and new `InferencePool` resources. Use the `weight`\n field in `backendRefs` to manage the traffic percentage directed to the new\n nodes.\n\n3. **Maintain a consistent `InferenceModel`** : retain the existing\n `InferenceModel` configuration to ensure uniform model behavior across both\n node configurations.\n\n4. **Retain original resources** : keep the original `InferencePool` and nodes\n active during the roll out to enable rollbacks if needed.\n\nFor example, you can create a new `InferencePool` named `llm-new`. Configure\nthis pool with the same model configuration as your existing `llm`\n`InferencePool`. Deploy the pool on a new set of nodes within your cluster. Use\nan `HTTPRoute` object to split traffic between the original `llm` and the new\n`llm-new` `InferencePool`. This technique lets you incrementally update your\nmodel nodes.\n\nThe following diagram illustrates how GKE Inference Gateway\nperforms a node update roll out.\n**Figure:**Node update roll out process\n\nTo perform a node update roll out, perform the following steps:\n\n1. 
To perform a node update roll out, perform the following steps:

1. Save the following sample manifest as `routes-to-llm.yaml`:

        apiVersion: gateway.networking.k8s.io/v1
        kind: HTTPRoute
        metadata:
          name: routes-to-llm
        spec:
          parentRefs:
          - name: my-inference-gateway
          rules:
          - backendRefs:
            - name: llm
              kind: InferencePool
              weight: 90
            - name: llm-new
              kind: InferencePool
              weight: 10

2. Apply the sample manifest to your cluster:

        kubectl apply -f routes-to-llm.yaml

The original `llm` `InferencePool` receives most of the traffic, while the `llm-new` `InferencePool` receives the rest. Increase the traffic weight gradually for the `llm-new` `InferencePool` to complete the node update roll out.

Roll out a base model
---------------------

A base model update rolls out a new base LLM in phases while retaining compatibility with existing LoRA adapters. Use base model update roll outs to upgrade to improved model architectures or to address model-specific issues.

To roll out a base model update:

1. **Deploy new infrastructure**: create new nodes and a new `InferencePool` configured with the new base model that you chose.
2. **Configure traffic distribution**: use an `HTTPRoute` to split traffic between the existing `InferencePool` (which uses the old base model) and the new `InferencePool` (which uses the new base model). The `weight` field in `backendRefs` controls the traffic percentage allocated to each pool.
3. **Maintain `InferenceModel` integrity**: keep your `InferenceModel` configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
4. **Preserve rollback capability**: retain the original nodes and `InferencePool` during the roll out to facilitate a rollback if necessary.

For example, you can create a new `InferencePool` named `llm-pool-version-2`. This pool deploys a new version of the base model on a new set of nodes. By configuring an `HTTPRoute`, as shown in the following example, you can incrementally split traffic between the original `llm-pool` and `llm-pool-version-2`. This lets you control base model updates in your cluster.

To perform a base model update roll out, perform the following steps:

1. Save the following sample manifest as `routes-to-llm.yaml`:

        apiVersion: gateway.networking.k8s.io/v1
        kind: HTTPRoute
        metadata:
          name: routes-to-llm
        spec:
          parentRefs:
          - name: my-inference-gateway
          rules:
          - backendRefs:
            - name: llm-pool
              kind: InferencePool
              weight: 90
            - name: llm-pool-version-2
              kind: InferencePool
              weight: 10

2. Apply the sample manifest to your cluster:

        kubectl apply -f routes-to-llm.yaml

The original `llm-pool` `InferencePool` receives most of the traffic, while the `llm-pool-version-2` `InferencePool` receives the rest. Increase the traffic weight gradually for the `llm-pool-version-2` `InferencePool` to complete the base model update roll out.
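When the new base model performs as expected, you can finish the roll out by shifting all traffic to the new pool. The following is a minimal sketch of the same `HTTPRoute` with updated weights; keeping `llm-pool` in the route with a weight of `0` is optional, but it lets you roll back later by adjusting weights only.

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: routes-to-llm
    spec:
      parentRefs:
      - name: my-inference-gateway
      rules:
      - backendRefs:
        # The original pool stays in the route with zero weight so that a
        # rollback only requires changing weights, not backendRefs.
        - name: llm-pool
          kind: InferencePool
          weight: 0
        # All traffic now goes to the pool that runs the new base model.
        - name: llm-pool-version-2
          kind: InferencePool
          weight: 100

After the new pool has served traffic long enough to meet your rollback window, you can remove the old backend from the route and delete the original `llm-pool` and its nodes.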
Roll out LoRA adapter updates
-----------------------------

LoRA adapter update roll outs let you deploy new versions of fine-tuned models in phases, without altering the underlying base model or infrastructure. Use LoRA adapter update roll outs to test improvements, bug fixes, or new features in your LoRA adapters.

To update a LoRA adapter, follow these steps:

1. **Make adapters available**: ensure that the new LoRA adapter versions are available on the model servers. For more information, see [Adapter roll out](https://gateway-api-inference-extension.sigs.k8s.io/guides/adapter-rollout/).

2. **Modify the `InferenceModel` configuration**: in your existing `InferenceModel` configuration, define multiple versions of your LoRA adapter. Assign a unique `modelName` to each version (for example, `llm-v1`, `llm-v2`).

3. **Distribute traffic**: use the `weight` field in the `InferenceModel` specification to control the traffic distribution among the different LoRA adapter versions.

4. **Maintain a consistent `poolRef`**: ensure that all LoRA adapter versions reference the same `InferencePool`. This prevents node or `InferencePool` redeployments. Retain previous LoRA adapter versions in the `InferenceModel` configuration to enable rollbacks.

The following example shows two LoRA adapter versions, `llm-v1` and `llm-v2`. Both versions use the same base model. You define `llm-v1` and `llm-v2` within the same `InferenceModel`. You assign weights to incrementally shift traffic from `llm-v1` to `llm-v2`. This approach gives you a controlled roll out without requiring any changes to your nodes or `InferencePool` configuration.

To roll out LoRA adapter updates, perform the following steps:

1. Save the following sample manifest as `inferencemodel-sample.yaml`:

        apiVersion: inference.networking.x-k8s.io/v1alpha2
        kind: InferenceModel
        metadata:
          name: inferencemodel-sample
        spec:
          versions:
          - modelName: llm-v1
            criticality: Critical
            weight: 90
            poolRef:
              name: llm-pool
          - modelName: llm-v2
            criticality: Critical
            weight: 10
            poolRef:
              name: llm-pool

2. Apply the sample manifest to your cluster:

        kubectl apply -f inferencemodel-sample.yaml

The `llm-v1` version receives most of the traffic, while the `llm-v2` version receives the rest. Increase the traffic weight gradually for the `llm-v2` version to complete the LoRA adapter update roll out. If you need to roll back, shift the weight back toward `llm-v1`; a minimal rollback sketch appears at the end of this page.

What's next
-----------

- [Customize GKE Inference Gateway configuration](/kubernetes-engine/docs/how-to/customize-gke-inference-gateway-configurations)
- [Serve an LLM with GKE Inference Gateway](/kubernetes-engine/docs/tutorials/serve-with-gke-inference-gateway)
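If a new LoRA adapter version causes problems during the roll out, you can roll back without changing your nodes or `InferencePool`. The following is a minimal sketch that reverses the traffic shift by editing the same `InferenceModel` manifest; setting the new version's weight to `0` rather than deleting it is an assumption of this sketch, made so the version stays defined and you can retry the roll out later.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      versions:
      # Shift all traffic back to the previous adapter version.
      - modelName: llm-v1
        criticality: Critical
        weight: 100
        poolRef:
          name: llm-pool
      # Keep the new version defined with zero weight so you can retry the
      # roll out later by increasing its weight again.
      - modelName: llm-v2
        criticality: Critical
        weight: 0
        poolRef:
          name: llm-pool

Apply the updated manifest with `kubectl apply -f inferencemodel-sample.yaml`, as in the earlier procedure.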