Serve LLMs such as DeepSeek-R1 671B or Llama 3.1 405B on GKE

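Overview

This guide shows you how to serve state-of-the-art large language models (LLMs) such as DeepSeek-R1 671B or Llama 3.1 405B on Google Kubernetes Engine (GKE) using graphics processing units (GPUs) across multiple nodes.

The guide uses portable open source technologies (Kubernetes, vLLM, and the LeaderWorkerSet (LWS) API) to deploy and serve AI/ML workloads on GKE, so that you can take advantage of GKE's granular control, scalability, resilience, portability, and cost-effectiveness.

Before reading this page, make sure that you're familiar with Kubernetes and GKE. The Background section points to introductory material for both.

Background

This section describes the key technologies used in this guide, including the two LLMs used as examples: DeepSeek-R1 and Llama 3.1 405B.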

DeepSeek-R1

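DeepSeek-R1 is a 671-billion-parameter large language model developed by DeepSeek. It is designed for logical inference, mathematical reasoning, and real-time problem-solving in a variety of text-based tasks. GKE handles the computational demands of DeepSeek-R1 with scalable resources, distributed computing, and efficient networking.

To learn more, see the DeepSeek documentation.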

Llama 3.1 405B

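Llama 3.1 405B is a large language model from Meta that's designed for a wide range of natural language processing tasks, including text generation, translation, and question answering. GKE provides the robust infrastructure needed for distributed training and serving of models at this scale.

To learn more, see the Llama documentation.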

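GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the infrastructure needed to handle the computational demands of LLMs, including scalable resources, distributed computing, and efficient networking.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.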

GPU

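Graphics processing units (GPUs) let you accelerate specific workloads such as machine learning and data processing. GKE offers nodes equipped with these GPUs, so you can configure your cluster for optimal performance in machine learning and data processing tasks. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

To learn more, see About GPUs in GKE.

LeaderWorkerSet (LWS)

LeaderWorkerSet (LWS) is a Kubernetes deployment API that addresses common deployment patterns of AI/ML multi-node inference workloads. Multi-node serving uses multiple Pods, each potentially running on a different node, to handle the distributed inference workload. LWS lets you treat multiple Pods as a group, which simplifies the management of distributed model serving.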

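vLLM and multi-host serving

When serving computationally intensive LLMs, we recommend using vLLM and running the workloads across GPUs.

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:

Optimized transformer implementation with PagedAttention
Continuous batching to improve the overall serving throughput
Distributed serving on multiple GPUs

For especially computationally intensive LLMs that can't fit on a single GPU node, you can use multiple GPU nodes to serve the model. vLLM supports running workloads across GPUs with two strategies:

Tensor parallelism splits the matrix multiplications in the transformer layer across multiple GPUs. However, this strategy requires a fast network because of the communication needed between the GPUs, which makes it less suitable for running workloads across nodes.
Pipeline parallelism splits the model by layer, or vertically. This strategy does not require constant communication between GPUs, which makes it a better option when running models across nodes.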
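To see why a single node is not enough for these models, a rough back-of-the-envelope estimate helps. The sketch below is only an approximation: it counts model weights alone (no KV cache, activations, or runtime overhead), and the bytes-per-parameter values are assumptions that this guide does not state (2 bytes for BF16 weights, 1 byte for FP8 weights). The function names are hypothetical.

```python
import math

GPU_MEMORY_GB = 80   # NVIDIA H100 80GB
GPUS_PER_NODE = 8    # a3-highgpu-8g


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (billions of params x bytes per param)."""
    return params_billions * bytes_per_param


def min_nodes_for_weights(params_billions: float, bytes_per_param: float) -> int:
    """Smallest number of nodes whose combined GPU memory holds the weights."""
    node_memory_gb = GPU_MEMORY_GB * GPUS_PER_NODE
    return math.ceil(weights_gb(params_billions, bytes_per_param) / node_memory_gb)


# Assumed precisions (not stated in this guide): BF16 for Llama, FP8 for DeepSeek-R1.
print("Llama 3.1 405B nodes:", min_nodes_for_weights(405, 2))
print("DeepSeek-R1 671B nodes:", min_nodes_for_weights(671, 1))
```

Under these assumptions the weights alone exceed the 640 GB of GPU memory on one eight-GPU node, which is why the guide spreads each model over two nodes.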

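You can use both strategies together in multi-node serving. For example, with two nodes that each have eight H100 GPUs, you can combine the following, which is the layout that the manifests in this guide use:

Two-way pipeline parallelism to shard the model across the two nodes
Eight-way tensor parallelism to shard the model across the eight GPUs on each node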
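The mapping from cluster shape to vLLM flags can be sketched as follows. This is a minimal illustration, assuming tensor parallelism stays inside a node and pipeline parallelism spans nodes, as described above. The helper names are hypothetical; the two flag names are the ones used in the manifests later in this guide.

```python
def vllm_parallel_flags(num_nodes: int, gpus_per_node: int) -> list[str]:
    """Return vLLM flags: tensor parallel within a node, pipeline parallel across nodes."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return [
        "--tensor-parallel-size", str(gpus_per_node),
        "--pipeline-parallel-size", str(num_nodes),
    ]


def total_gpus(num_nodes: int, gpus_per_node: int) -> int:
    """Total GPUs used: tensor-parallel size x pipeline-parallel size."""
    return num_nodes * gpus_per_node


# Two a3-highgpu-8g nodes, as in this guide.
print(" ".join(vllm_parallel_flags(2, 8)))
print(total_gpus(2, 8))
```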

To learn more, see the vLLM documentation.

Objectives

Prepare your environment with a GKE cluster in Autopilot or Standard mode.
Deploy vLLM across multiple nodes in your cluster.
Use vLLM to serve the model, and query it with curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/iam.securityAdmin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
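Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.

Create a Hugging Face account, if you don't already have one.
Make sure that your project has sufficient quota for NVIDIA H100 80GB GPUs. This tutorial uses the a3-highgpu-8g machine type, which has eight NVIDIA H100 80GB GPUs per node, on two nodes (16 GPUs in total). To learn more about GPUs and how to manage quota, see About GPUs and Allocation quotas.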

Get access to the model

You can use the Llama 3.1 405B or DeepSeek-R1 models.

Generate an access token

The steps are the same for DeepSeek-R1 and Llama 3.1 405B. If you don't already have a Hugging Face token, generate a new one:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a name of your choice and a role of at least Read.
Select Generate a token.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.
Set the default environment variables:
```
gcloud config set project PROJECT_ID
gcloud config set billing/quota_project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION
export ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
```
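Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION: the region of your GKE cluster.
- ZONE: a zone that supports NVIDIA H100 Tensor Core GPUs.
- HUGGING_FACE_TOKEN: the Hugging Face access token that you generated earlier.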

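Create a GKE cluster

You can serve models using vLLM across multiple GPU nodes in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.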

Autopilot

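In Cloud Shell, run the following command. The environment setup earlier in this guide does not set CLUSTER_VERSION, so either export it to the GKE version that you want or remove the --cluster-version flag to use the default version.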

  gcloud container clusters create-auto ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --cluster-version=${CLUSTER_VERSION}

Standard

Create a GKE Standard cluster with two CPU nodes:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=2 \
    --location=REGION \
    --machine-type=e2-standard-16

Create an A3 node pool with two nodes, each with eight H100 GPUs:

gcloud container node-pools create gpu-nodepool \
    --node-locations=ZONE \
    --num-nodes=2 \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8,gpu-driver-version=LATEST \
    --placement-type=COMPACT \
    --cluster=CLUSTER_NAME \
    --location=REGION

Configure `kubectl` to communicate with your cluster

Configure kubectl to communicate with your cluster by using the following command:

gcloud container clusters get-credentials CLUSTER_NAME --location=REGION

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
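It can help to know what the Secret looks like once it is stored. Kubernetes keeps Secret values under `data` as base64-encoded strings, which is an encoding and not encryption. The following sketch builds an equivalent manifest as a plain dictionary; the helper name and the sample token value are hypothetical, and the sketch does not talk to a cluster.

```python
import base64
import json


def build_secret_manifest(name: str, key: str, token: str) -> dict:
    """Build a Secret manifest with the token base64-encoded under `data`."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},
    }


# "hf_example_token" is a placeholder, not a real token.
manifest = build_secret_manifest("hf-secret", "hf_api_token", "hf_example_token")
print(json.dumps(manifest, indent=2))
```

Because anyone who can read the Secret can decode the value, access to it should be restricted in the same way as access to the token itself.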

Install LeaderWorkerSet

To install LWS, run the following command:

kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

Verify that the LeaderWorkerSet controller is running in the lws-system namespace by using the following command:

kubectl get pod -n lws-system

The output is similar to the following:

NAME                                     READY   STATUS    RESTARTS   AGE
lws-controller-manager-546585777-crkpt   1/1     Running   0          4d21h
lws-controller-manager-546585777-zbt2l   1/1     Running   0          4d21h
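If you want to check this output in a script rather than by eye, a small parser is enough. The sketch below assumes the default column layout shown above (NAME, READY, STATUS as the first three columns); the function name is hypothetical.

```python
def all_pods_ready(kubectl_output: str) -> bool:
    """Return True if every listed Pod is Running with all containers ready."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return False  # header only, or no output at all
    for line in lines[1:]:  # skip the header row
        _name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            return False
    return True


sample = """
NAME                                     READY   STATUS    RESTARTS   AGE
lws-controller-manager-546585777-crkpt   1/1     Running   0          4d21h
lws-controller-manager-546585777-zbt2l   1/1     Running   0          4d21h
"""
print(all_pods_ready(sample))
```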

Deploy the vLLM model server

To deploy the vLLM model server, follow these steps:

Apply the manifest for the LLM that you want to deploy.

DeepSeek-R1

Inspect the vllm-deepseek-r1-A3.yaml manifest:


apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:v0.8.5
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:v0.8.5
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:
```
kubectl apply -f vllm-deepseek-r1-A3.yaml
```

Llama 3.1 405B

Inspect the vllm-llama3-405b-A3.yaml manifest:


apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:v0.8.5
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:v0.8.5
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:
```
kubectl apply -f vllm-llama3-405b-A3.yaml
```

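Wait for the model checkpoint download to finish. For models of this size, the download, load, and warm-up can take a long time (about 90 minutes for Llama 3.1 405B, as noted in the Hyperdisk ML section of this guide).

View the logs from the running model server by using the following command: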

kubectl logs vllm-0 -c vllm-leader

The output is similar to the following:

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
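The Pod name `vllm-0` in the log command comes from how LWS names the Pods in each group. As a rough sketch, assuming the naming pattern `<name>-<group>` for the leader and `<name>-<group>-<worker>` for the workers (consistent with `vllm-0` above, but worth confirming with `kubectl get pods` in your cluster), the Pods for a given `replicas` and `size` would be named as follows. The function name is hypothetical.

```python
def lws_pod_names(name: str, replicas: int, size: int) -> list[list[str]]:
    """List assumed Pod names per group: the leader first, then size - 1 workers."""
    groups = []
    for group_index in range(replicas):
        leader = f"{name}-{group_index}"
        workers = [f"{name}-{group_index}-{w}" for w in range(1, size)]
        groups.append([leader, *workers])
    return groups


# The manifests in this guide use name "vllm", replicas 1, and size 2.
for group in lws_pod_names("vllm", replicas=1, size=2):
    print(group)
```

Only the leader runs the API server in these manifests, which is why the logs of interest are in the leader Pod's `vllm-leader` container.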

Serve the model

Set up port forwarding to the model by running the following command:

kubectl port-forward svc/vllm-leader 8080:8080

Interact with the model using curl

To interact with the model using curl, follow these instructions:

DeepSeek-R1

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief",
    "max_tokens": 1024,
    "temperature": 0
}'

The output should be similar to the following:

{
"id": "cmpl-f2222b5589d947419f59f6e9fe24c5bd",
"object": "text_completion",
"created": 1738269669,
"model": "deepseek-ai/DeepSeek-R1",
"choices": [
  {
    "index": 0,
    "text": ".\n\nOkay, let's see. The user has four boxes and is moving them around. Let me try to visualize each step. \n\nFirst, the red box is placed on the bottom. So the stack starts with red. Then the blue box is put on top of red. Now the order is red (bottom), blue. Next, the yellow box is added on top of blue. So now it's red, blue, yellow. \n\nThen the user takes the blue box out. Wait, blue is in the middle. If they remove blue, the stack would be red and yellow. But where do they put the blue box? The instruction says to put it on top. So after removing blue, the stack is red, yellow. Then blue is placed on top, making it red, yellow, blue. \n\nFinally, the green box is added on the top. So the final order should be red (bottom), yellow, blue, green. Let me double-check each step to make sure I didn't mix up any steps. Starting with red, then blue, then yellow. Remove blue from the middle, so yellow is now on top of red. Then place blue on top of that, so red, yellow, blue. Then green on top. Yes, that seems right. The key step is removing the blue box from the middle, which leaves yellow on red, then blue goes back on top, followed by green. So the final order from bottom to top is red, yellow, blue, green.\n\n**Final Answer**\nThe final order from bottom to top is \\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}.\n</think>\n\n1. Start with the red box at the bottom.\n2. Place the blue box on top of the red box. Order: red (bottom), blue.\n3. Place the yellow box on top of the blue box. Order: red, blue, yellow.\n4. Remove the blue box (from the middle) and place it on top. Order: red, yellow, blue.\n5. Place the green box on top. Final order: red, yellow, blue, green.\n\n\\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}",
    "logprobs": null,
    "finish_reason": "stop",
    "stop_reason": null,
    "prompt_logprobs": null
  }
],
"usage": {
  "prompt_tokens": 76,
  "total_tokens": 544,
  "completion_tokens": 468,
  "prompt_tokens_details": null
}
}
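In the response above, the model's reasoning and its final answer share one `text` field, separated by a `</think>` marker. If you only need the answer, you can split on that marker. The following is a minimal sketch that assumes the response shape shown above; the function names are hypothetical, and the sample payload in the usage lines is shortened for illustration.

```python
import json


def split_reasoning(text: str, marker: str = "</think>") -> tuple[str, str]:
    """Split completion text into (reasoning, answer) at the first marker."""
    reasoning, sep, answer = text.partition(marker)
    if not sep:
        return "", text.strip()  # no marker: treat everything as the answer
    return reasoning.strip(), answer.strip()


def summarize_completion(raw_json: str) -> dict:
    """Extract the answer, reasoning, and token count from a completions response."""
    body = json.loads(raw_json)
    choice = body["choices"][0]
    reasoning, answer = split_reasoning(choice["text"])
    return {
        "answer": answer,
        "reasoning": reasoning,
        "finish_reason": choice["finish_reason"],
        "completion_tokens": body["usage"]["completion_tokens"],
    }


# Shortened, illustrative payload in the same shape as the response above.
sample_response = json.dumps({
    "choices": [{"text": "step one\n</think>\n\nred, yellow, blue, green",
                 "finish_reason": "stop"}],
    "usage": {"completion_tokens": 468},
})
print(summarize_completion(sample_response)["answer"])
```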

Llama 3.1 405B

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

The output should be similar to the following:

{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c",
"object":"text_completion","created":1723238375,"model":"meta-llama/Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
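The same request can be built from Python with only the standard library, which is convenient when you want to script several prompts. This is a sketch rather than an official client: the helper name is hypothetical, and it assumes the port forwarding from the previous step is still running on `localhost:8080`. The block only constructs the request; sending it is left as a commented line.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int, temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8080",
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "San Francisco is a",
    max_tokens=7,
)
print(req.full_url)
# To send it (requires the port-forward to be active):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```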

Set up the custom autoscaler

In this section, you set up horizontal Pod autoscaling to use custom Prometheus metrics. You use the Google Cloud Managed Service for Prometheus metrics from the vLLM server.

To learn more, see Google Cloud Managed Service for Prometheus, which is enabled by default on GKE clusters.

Set up the Custom Metrics Stackdriver Adapter on your cluster:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

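Add the Monitoring Viewer role to the service account that the Custom Metrics Stackdriver Adapter uses. Replace PROJECT_ID and PROJECT_NUMBER with your project ID and project number.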
```
gcloud projects add-iam-policy-binding projects/PROJECT_ID \
    --role roles/monitoring.viewer \
    --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter
```
Note: Make sure that the service account used by your GKE cluster has the Monitoring Metric Writer role. This tutorial uses the Compute Engine default service account.

Save the following manifest as vllm_pod_monitor.yaml:


apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
 name: vllm-pod-monitoring
spec:
 selector:
   matchLabels:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
 endpoints:
 - path: /metrics
   port: 8080
   interval: 15s

Apply the manifest to the cluster:
```
kubectl apply -f vllm_pod_monitor.yaml
```
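The PodMonitoring resource scrapes the `/metrics` path, which serves metrics in the Prometheus text format. To get a feel for what is collected, the following sketch parses such text and reads one gauge. The sample text is made up for illustration (real output has many more series and different label sets), the function name is hypothetical, and the parser assumes lines without trailing timestamps.

```python
def read_gauge(metrics_text: str, metric_name: str) -> list[float]:
    """Return the values of every series whose metric name matches exactly."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        bare_name = series.split("{", 1)[0]  # drop the {label="..."} part
        if bare_name == metric_name:
            values.append(float(value))
    return values


# Illustrative sample only; not captured from a real server.
sample_metrics = """
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="deepseek-ai/DeepSeek-R1"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="deepseek-ai/DeepSeek-R1"} 0.42
"""
print(read_gauge(sample_metrics, "vllm:num_requests_waiting"))
```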

Create load on the vLLM endpoint

Create load on the vLLM server to test how GKE autoscales with a custom vLLM metric.

Set up port forwarding to the model:

kubectl port-forward svc/vllm-leader 8080:8080

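Save the following bash script as load.sh. It sends N parallel request loops to the vLLM endpoint:

#!/bin/bash
# Number of parallel request loops to run.
N=PARALLEL_PROCESSES
# The vllm-leader Service is of type ClusterIP, so requests go through the
# port forwarding that you set up in the previous step.
VLLM_ENDPOINT="http://localhost:8080"
# Must match the --model value in the manifest that you deployed.
# For Llama, use meta-llama/Meta-Llama-3.1-405B-Instruct instead.
MODEL="deepseek-ai/DeepSeek-R1"
for i in $(seq 1 "$N"); do
  while true; do
    curl -s "$VLLM_ENDPOINT/v1/completions" -H "Content-Type: application/json" -d "{\"model\": \"$MODEL\", \"prompt\": \"Write a story about san francisco\", \"max_tokens\": 100, \"temperature\": 0}"
  done &  # Run each loop in the background
done
wait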

Replace PARALLEL_PROCESSES with the number of parallel processes that you want to run.

Make the script executable and run it:
```
chmod +x load.sh
nohup ./load.sh &
```

Verify that Google Cloud Managed Service for Prometheus ingests the metrics

After Google Cloud Managed Service for Prometheus scrapes the metrics and you add load to the vLLM endpoint, you can view the metrics in Cloud Monitoring.

In the Google Cloud console, go to the Metrics explorer page.

Go to Metrics explorer
Click < > PromQL.
Enter the following query to observe traffic metrics:
```
vllm:gpu_cache_usage_perc{cluster='CLUSTER_NAME'}
```

After the load script runs, the graph shows that Google Cloud Managed Service for Prometheus is ingesting the traffic metrics in response to the load added to the vLLM endpoint.

[Figure: Traffic metrics captured by the vLLM server]

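Deploy the HorizontalPodAutoscaler configuration

When deciding which metric to autoscale on, we recommend the following metrics for vLLM:

num_requests_waiting: This metric relates to the number of requests waiting in the model server's queue. This number starts to grow noticeably when the KV cache is full.
gpu_cache_usage_perc: This metric relates to KV cache utilization, which directly correlates to the number of requests being processed for a given inference cycle on the model server.

We recommend that you use num_requests_waiting when optimizing for throughput and cost, and when your latency targets are achievable with your model server's maximum throughput.

We recommend that you use gpu_cache_usage_perc when you have latency-sensitive workloads where queue-based scaling isn't fast enough to meet your requirements.

For further explanation, see Best practices for autoscaling large language model (LLM) inference workloads with GPUs.

When selecting an averageValue target for your HPA configuration, you need to determine the value experimentally for the metric that you autoscale on. For ideas on how to optimize your experiments, see the blog post Save on GPUs: Smarter autoscaling for your GKE inferencing workloads. The profile-generator used in that blog post also works for vLLM.

To deploy the HorizontalPodAutoscaler configuration using num_requests_waiting, follow these steps:

Save the following manifest as vllm-hpa.yaml: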


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lws-hpa
spec:
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|vllm:num_requests_waiting|gauge
      target:
        type: AverageValue
        averageValue: 5
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: vllm

The vLLM metrics in Google Cloud Managed Service for Prometheus follow the vllm:metric_name format.

Best practice:

Use num_requests_waiting to scale for throughput. Use gpu_cache_usage_perc for latency-sensitive GPU use cases.

Deploy the HorizontalPodAutoscaler configuration:
```
kubectl apply -f vllm-hpa.yaml
```
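GKE schedules the additional Pods to deploy. This triggers the node pool autoscaler to add the nodes that the second vLLM replica needs before the replica is deployed (each replica in this tutorial spans two GPU nodes).

Watch the progress of the Pod autoscaling: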

kubectl get hpa --watch

The output is similar to the following:

NAME      REFERENCE              TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
lws-hpa   LeaderWorkerSet/vllm   0/1       1         2         1          6d1h
lws-hpa   LeaderWorkerSet/vllm   1/1       1         2         1          6d1h
lws-hpa   LeaderWorkerSet/vllm   0/1       1         2         1          6d1h
lws-hpa   LeaderWorkerSet/vllm   4/1       1         2         1          6d1h
lws-hpa   LeaderWorkerSet/vllm   0/1       1         2         2          6d1h
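The jump from one replica to two in the sample output follows from the standard HorizontalPodAutoscaler calculation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the configured minimum and maximum. The following sketch reproduces that arithmetic. It is a simplification: it assumes the default 10% tolerance and ignores stabilization windows, scaling policies, and Pods that are not yet ready. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to the target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# The "4/1" row above: 4 waiting requests on average against a target of 1.
print(desired_replicas(1, current_avg=4, target_avg=1, min_replicas=1, max_replicas=2))
```

The clamp explains why the replica count stops at 2 even when the ratio alone would ask for more: `maxReplicas` in the manifest is 2.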

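Speed up model load times with Hyperdisk ML

With these types of LLMs, vLLM can take a significant amount of time to download, load, and warm up on each new replica. For example, that process takes around 90 minutes with Llama 3.1 405B. You can reduce this time (to 20 minutes with Llama 3.1 405B) by downloading the model directly to a Hyperdisk ML volume and mounting that volume on each Pod. To do this, the tutorial uses a Hyperdisk ML volume and a Kubernetes Job. A Job controller in Kubernetes creates one or more Pods and makes sure that they successfully run a specific task.

To speed up model load times, perform the following steps:

Save the following example manifest as producer-pvc.yaml: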

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: producer-pvc
spec:
  storageClassName: hyperdisk-ml
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 800Gi

Save the following example manifest as producer-job.yaml:

DeepSeek-R1


apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
spec:
  template:  # Template for the Pods the Job will create
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values:
                - "c3"
              # Kept in the same term as machine-family so that both conditions must hold.
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - "ZONE"
      containers:
      - name: copy
        resources:
          requests:
            cpu: "32"
          limits:
            cpu: "32"
        image: python:3.11-alpine
        command:
        - sh
        - -c
        - "pip install 'huggingface_hub==0.24.6' && \
          huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir-use-symlinks=False --local-dir=/data/DeepSeek-R1 --include *.safetensors *.json *.py"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
          - mountPath: "/data"
            name: volume
      restartPolicy: Never
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: producer-pvc
  parallelism: 1         # Run 1 Pods concurrently
  completions: 1         # Once 1 Pods complete successfully, the Job is done
  backoffLimit: 4        # Max retries on failure

Llama 3.1 405B


apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
spec:
  template:  # Template for the Pods the Job will create
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values:
                - "c3"
              # Kept in the same term as machine-family so that both conditions must hold.
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - "ZONE"
      containers:
      - name: copy
        resources:
          requests:
            cpu: "32"
          limits:
            cpu: "32"
        image: python:3.11-alpine
        command:
        - sh
        - -c
        - "pip install 'huggingface_hub==0.24.6' && \
          huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct --local-dir-use-symlinks=False --local-dir=/data/Meta-Llama-3.1-405B-Instruct --include *.safetensors *.json"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
          - mountPath: "/data"
            name: volume
      restartPolicy: Never
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: producer-pvc
  parallelism: 1         # Run 1 Pods concurrently
  completions: 1         # Once 1 Pods complete successfully, the Job is done
  backoffLimit: 4        # Max retries on failure
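The `nodeAffinity` block in these Jobs is easy to misread, because Kubernetes combines the two levels differently: a node qualifies if it satisfies any one entry in `nodeSelectorTerms`, and within a single term it must satisfy every entry in `matchExpressions`. The following sketch models that rule for the `In` operator only. The function names are hypothetical and the zone names are examples.

```python
def expression_matches(labels: dict, expr: dict) -> bool:
    """Evaluate one match expression; only the In operator is modeled here."""
    if expr["operator"] != "In":
        raise NotImplementedError("this sketch only handles the In operator")
    return labels.get(expr["key"]) in expr["values"]


def node_matches(labels: dict, node_selector_terms: list) -> bool:
    """Terms are ORed; expressions inside one term are ANDed."""
    return any(
        all(expression_matches(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )


family = {"key": "cloud.google.com/machine-family", "operator": "In", "values": ["c3"]}
zone = {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-central1-a"]}

two_terms = [{"matchExpressions": [family]}, {"matchExpressions": [zone]}]  # family OR zone
one_term = [{"matchExpressions": [family, zone]}]                           # family AND zone

# A c3 node in a different zone.
node_labels = {"cloud.google.com/machine-family": "c3",
               "topology.kubernetes.io/zone": "us-east4-b"}
print(node_matches(node_labels, two_terms), node_matches(node_labels, one_term))
```

Because the volume lives in one zone, requiring both the machine family and the zone in a single term is the stricter and usually intended form.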

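Using the two files that you created in the previous steps, follow the instructions in Accelerate AI/ML data loading with Hyperdisk ML.

After this step, you have created the Hyperdisk ML volume and populated it with the model data.

Deploy the vLLM multi-node GPU server deployment, which uses the newly created Hyperdisk ML volume for the model data.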

DeepSeek-R1



apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:v0.8.5
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model /models/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /models
                name: deepseek-r1
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: deepseek-r1
          persistentVolumeClaim:
            claimName: hdml-static-pvc
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:v0.8.5
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /models
                name: deepseek-r1
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: deepseek-r1
          persistentVolumeClaim:
            claimName: hdml-static-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Llama 3.1 405B



apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:v0.8.5
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model /models/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /models
                name: llama3-405b
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: llama3-405b
          persistentVolumeClaim:
            claimName: hdml-static-pvc
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:v0.8.5
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /models
                name: llama3-405b
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: llama3-405b
          persistentVolumeClaim:
            claimName: hdml-static-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

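Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, stop the load script and delete the cluster. The cluster was created with --location=REGION, so delete it with the same location:

pkill -f load.sh

gcloud container clusters delete CLUSTER_NAME \
  --location=REGION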

What's next

Learn more about GPUs in GKE.
Explore the vLLM GitHub repository and documentation.
Explore the LWS GitHub repository.
