このページは Cloud Translation API によって翻訳されました。

ベアメタルで DeepSeek-R1 671B や Llama 3.1 405B などの LLM のサービスを提供する

概要

このガイドでは、複数のノードで画像処理装置（GPU）を使用して、ベアメタル上の Google Distributed Cloud（ソフトウェアのみ）で DeepSeek-R1 671B や Llama 3.1 405B などの最先端の大規模言語モデル（LLM）のサービスを提供する方法について説明します。

このガイドでは、移植可能なオープンソーステクノロジー（Kubernetes、vLLM、LeaderWorkerSet（LWS）API）を使用して、ベアメタルクラスタに AI / ML ワークロードをデプロイしてサービスを提供する方法について説明します。Google Distributed Cloud は、GKE を拡張してオンプレミス環境での利用を実現し、GKE のきめ細かい制御、スケーラビリティ、復元力、移植性、費用対効果のメリットを提供します。

背景

このセクションでは、このガイドで使用されている重要なテクノロジーについて説明します。このガイドの例として使用されている 2 つの LLM（DeepSeek-R1 と Llama 3.1 405B）も含まれます。

DeepSeek-R1

DeepSeek-R1 は、DeepSeek が開発した 6,710 億個のパラメータを備えた大規模言語モデルで、さまざまなテキストベースのタスクにおける論理的推論、数学的推論、リアルタイムの問題解決を目的として設計されています。Google Distributed Cloud は、DeepSeek-R1 のコンピューティング需要を処理し、スケーラブルなリソース、分散コンピューティング、効率的なネットワーキングによってその機能をサポートしています。

詳細については、DeepSeek のドキュメントをご覧ください。

Llama 3.1 405B

Llama 3.1 405B は、テキスト生成、翻訳、質問応答など、さまざまな自然言語処理タスク用に設計された Meta の大規模言語モデルです。Google Distributed Cloud は、この規模のモデルの分散トレーニングとサービス提供の実現に必要な堅牢なインフラストラクチャを提供します。

詳細については、Llama のドキュメントをご覧ください。

Google Distributed Cloud マネージド Kubernetes サービス

Google Distributed Cloud には、独自のデータセンターでの AI / ML ワークロードのデプロイと管理に適したベアメタル用 Google Distributed Cloud（ソフトウェアのみ）をはじめとする幅広いサービスが用意されています。Google Distributed Cloud は、コンテナ化されたアプリケーションのデプロイ、スケーリング、管理を簡素化するマネージド Kubernetes Service です。Google Distributed Cloud は、LLM のコンピューティング需要を処理するために必要なインフラストラクチャ（スケーラブルなリソース、分散コンピューティング、効率的なネットワーキングなど）を提供します。

Kubernetes の主なコンセプトについて詳しくは、Kubernetes の学習を開始するをご覧ください。Google Distributed Cloud の詳細と、それが Kubernetes のスケーリング、自動化、管理にどのように役立つかについては、ベアメタル向け Google Distributed Cloud（ソフトウェアのみ）の概要をご覧ください。

GPU

画像処理装置（GPU）を使用すると、ML やデータ処理などの特定のワークロードを高速化できます。Google Distributed Cloud は、これらの強力な GPU を搭載したノードをサポートしているため、ML タスクとデータ処理タスクで最適なパフォーマンスが得られるようにクラスタを構成できます。Google Distributed Cloud には、NVIDIA H100、L4、A100 GPU を搭載したマシンタイプをはじめとして、ノード構成用にさまざまなマシンタイプオプションが用意されています。

詳細については、NVIDIA GPU を設定して使用するをご覧ください。

LeaderWorkerSet（LWS）

LeaderWorkerSet（LWS）は、AI / ML マルチノード推論ワークロードの一般的なデプロイパターンに対応する Kubernetes deployment API です。マルチノードサービスの提供は、分散推論ワークロードを処理するために、それぞれが異なるノードで実行される可能性のある複数の Pod を活用します。LWS を使用すると、複数の Pod をグループとして扱うことができるため、分散モデル提供の管理が簡素化されます。

vLLM とマルチホストサービング

コンピューティング負荷の高い LLM を提供する場合は、vLLM を使用して、GPU 間でワークロードを実行することをおすすめします。

vLLM は、GPU のサービングスループットを向上できる、高度に最適化されたオープンソースの LLM サービングフレームワークであり、次のような機能を備えています。

PagedAttention による Transformer の実装の最適化
サービングスループットを全体的に向上させる連続的なバッチ処理
複数の GPU での分散サービス提供

1 つの GPU ノードに収まらない特に計算負荷の高い LLM では、複数の GPU ノードを使用してモデルのサービスを提供できます。vLLM は、次の 2 つの方法による複数の GPU 間でのワークロードの実行をサポートしています。

テンソル並列処理では、Transformer レイヤの行列乗算を複数の GPU に分割します。ただし、この方法では GPU 間の通信が必要になるため、高速なネットワークが必要であり、ノード間でワークロードを実行する場合は適していません。
パイプライン並列処理では、モデルをレイヤ（垂直方向）で分割します。この方法では、GPU 間の通信を行う必要がないため、ノードをまたいでモデルを実行する場合に適しています。

マルチノードサービングでは、どちらの戦略も使用できます。たとえば、それぞれ 8 個の H100 GPU が割り当てられた 2 つのノードを使用する場合、次の方法はどちらでも使用できます。

2 つのノード間でモデルをシャーディングする 2 方向パイプライン並列処理
各ノードの 8 個の GPU 間でモデルをシャーディングする 8 方向テンソル並列処理

詳細については、vLLM のドキュメントをご覧ください。

Hugging Face の認証情報用の Kubernetes Secret を作成する

次のコマンドを使用して、Hugging Face トークンを含む Kubernetes Secret を作成します。

kubectl create secret generic hf-secret \
    --kubeconfig KUBECONFIG \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

KUBECONFIG は、LLM をホストするクラスタの kubeconfig ファイルのパスに置き換えます。

独自の vLLM マルチノードイメージを作成する

vLLM のクロスノード通信を容易にするには、Ray を使用します。LeaderWorkerSet リポジトリには、vLLM で Ray を構成するための bash スクリプトを含む Dockerfile が用意されています。

独自の vLLM マルチノードイメージを作成するには、LeaderWorkerSet リポジトリのクローンを作成し、提供された Dockerfile（クロスノード通信用に Ray を構成）を使用して Docker イメージをビルドし、そのイメージを Artifact Registry に push して Google Distributed Cloud にデプロイする必要があります。

コンテナをビルドする

コンテナをビルドする手順は次のとおりです。

LeaderWorkerSet リポジトリのクローンを作成します。
```
git clone https://github.com/kubernetes-sigs/lws.git
```

イメージを構築します。

cd lws/docs/examples/vllm/build/ && docker build -f Dockerfile.GPU . -t vllm-multihost

イメージを Artifact Registry に push する

Kubernetes Deployment でイメージにアクセスできるようにするには、 Google Cloud プロジェクト内の Artifact Registry にイメージを保存します。

docker image tag vllm-multihost ${IMAGE_NAME}
docker push ${IMAGE_NAME}

LeaderWorkerSet をインストールする

LWS をインストールするには、次のコマンドを実行します。

kubectl apply --server-side \
    --kubeconfig KUBECONFIG \
    -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

次のコマンドを使用して、LeaderWorkerSet コントローラが lws-system Namespace で実行されていることを確認します。

kubectl get pod -n lws-system --kubeconfig KUBECONFIG

出力は次のようになります。

NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h

vLLM モデルサーバーをデプロイする

vLLM モデルサーバーをデプロイする手順は次のとおりです。

デプロイする LLM に応じて、マニフェストを作成して適用します。

DeepSeek-R1

vLLM モデルサーバーの YAML マニフェスト vllm-deepseek-r1-A3.yaml を作成します。

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

次のコマンドを実行してマニフェストを適用します。

kubectl apply -f vllm-deepseek-r1-A3.yaml \
    --kubeconfig KUBECONFIG

Llama 3.1 405B

vLLM モデルサーバーの YAML マニフェスト vllm-llama3-405b-A3.yaml を作成します。

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

次のコマンドを実行してマニフェストを適用します。

kubectl apply -f vllm-llama3-405b-A3.yaml \
    --kubeconfig KUBECONFIG

次のコマンドを使用して、実行中のモデルサーバーのログを表示します。

kubectl logs vllm-0 -c vllm-leader \
    --kubeconfig KUBECONFIG

出力は次のようになります。

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

モデルをサービングする

次のコマンドを実行して、モデルへのポート転送を設定します。

kubectl port-forward svc/vllm-leader 8080:8080 \
    --kubeconfig KUBECONFIG

curl を使用してモデルを操作する

curl を使用してモデルを操作する手順は次のとおりです。

DeepSeek-R1

新しいターミナルで、サーバーにリクエストを送信します。

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief",
    "max_tokens": 1024,
    "temperature": 0
}'

出力例を以下に示します。

{
  "id": "cmpl-f2222b5589d947419f59f6e9fe24c5bd",
  "object": "text_completion",
  "created": 1738269669,
  "model": "deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": ".\n\nOkay, let's see. The user has four boxes and is moving them around. Let me try to visualize each step. \n\nFirst, the red box is placed on the bottom. So the stack starts with red. Then the blue box is put on top of red. Now the order is red (bottom), blue. Next, the yellow box is added on top of blue. So now it's red, blue, yellow. \n\nThen the user takes the blue box out. Wait, blue is in the middle. If they remove blue, the stack would be red and yellow. But where do they put the blue box? The instruction says to put it on top. So after removing blue, the stack is red, yellow. Then blue is placed on top, making it red, yellow, blue. \n\nFinally, the green box is added on the top. So the final order should be red (bottom), yellow, blue, green. Let me double-check each step to make sure I didn't mix up any steps. Starting with red, then blue, then yellow. Remove blue from the middle, so yellow is now on top of red. Then place blue on top of that, so red, yellow, blue. Then green on top. Yes, that seems right. The key step is removing the blue box from the middle, which leaves yellow on red, then blue goes back on top, followed by green. So the final order from bottom to top is red, yellow, blue, green.\n\n**Final Answer**\nThe final order from bottom to top is \\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}.\n</think>\n\n1. Start with the red box at the bottom.\n2. Place the blue box on top of the red box. Order: red (bottom), blue.\n3. Place the yellow box on top of the blue box. Order: red, blue, yellow.\n4. Remove the blue box (from the middle) and place it on top. Order: red, yellow, blue.\n5. Place the green box on top. Final order: red, yellow, blue, green.\n\n\\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 76,
    "total_tokens": 544,
    "completion_tokens": 468,
    "prompt_tokens_details": null
  }
}

Llama 3.1 405B

新しいターミナルで、サーバーにリクエストを送信します。

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

出力例を以下に示します。

{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c",
"object":"text_completion","created":1723238375,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}

ベアメタルで DeepSeek-R1 671B や Llama 3.1 405B などの LLM のサービスを提供する

概要

背景

DeepSeek-R1

Llama 3.1 405B

Google Distributed Cloud マネージド Kubernetes サービス

GPU

LeaderWorkerSet（LWS）

vLLM とマルチホスト サービング

Hugging Face の認証情報用の Kubernetes Secret を作成する

独自の vLLM マルチノード イメージを作成する

コンテナをビルドする

イメージを Artifact Registry に push する

LeaderWorkerSet をインストールする

vLLM モデルサーバーをデプロイする

DeepSeek-R1

Llama 3.1 405B

モデルをサービングする

curl を使用してモデルを操作する

DeepSeek-R1

Llama 3.1 405B

vLLM とマルチホストサービング

独自の vLLM マルチノードイメージを作成する