Serve an LLM on TPUs in GKE with KubeRay

This tutorial shows you how to serve large language models (LLMs) on Tensor Processing Units (TPUs) in Google Kubernetes Engine (GKE) by using the Ray Operator add-on and the vLLM serving framework.

In this tutorial, you can serve LLMs on TPU v5e or TPU Trillium (v6e) as follows:

- Llama 3 8B instruct on a single-host TPU v5e.
- Mistral 7B instruct v0.3 on a single-host TPU v5e.
- Llama 3.1 70B on a single-host TPU Trillium (v6e).

This guide is for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who want to use Kubernetes container orchestration capabilities to serve models by using Ray on TPUs with vLLM.

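Background

This section describes the key technologies used in this guide.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the infrastructure needed to handle the computational demands of LLMs, including scalable resources, distributed computing, and efficient networking.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.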

Ray Operator

The Ray Operator add-on in GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning ML workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.

TPU

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) that accelerate ML and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers how to serve LLMs on TPU v5e or TPU Trillium (v6e) nodes, with TPU topologies configured based on each model's requirements for serving prompts with low latency.

vLLM

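vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as the following:

- Optimized transformer implementation with PagedAttention
- Continuous batching to improve overall serving throughput
- Tensor parallelism and distributed serving on multiple GPUs

To learn more, see the vLLM documentation.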

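Objectives

This tutorial covers the following steps:

1. Create a GKE cluster with a TPU node pool.
2. Deploy a RayCluster custom resource with a single-host TPU slice. GKE deploys the RayCluster custom resource as Kubernetes Pods.
3. Serve an LLM.
4. Interact with the models.

Optionally, you can configure the following model serving resources and techniques that the Ray Serve framework supports:

- Deploy a RayService custom resource.
- Compose multiple models with model composition.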

Before you begin

Before you start, make sure that you have performed the following tasks:

- Enable the Google Kubernetes Engine API.

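- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set compute/zone instead. Setting a default location helps you avoid gcloud CLI errors such as the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

- Create a Hugging Face account, if you don't already have one.
- Ensure that you have a Hugging Face token.
- Ensure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
- Ensure that you have the following IAM roles:
  - roles/container.admin
  - roles/iam.serviceAccountAdmin
  - roles/container.clusterAdmin
  - roles/artifactregistry.writer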

Prepare the environment

Check that your Google Cloud project has enough quota for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see TPU quotas.
In the Google Cloud console, start a Cloud Shell instance.

Clone the sample repository:

```
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples
```

Go to the working directory:
```
cd ai-ml/gke-ray/rayserve/llm
```
Set the default environment variables for GKE cluster creation:
Llama-3-8B-Instruct
```
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc
```
Replace the following:
- HUGGING_FACE_TOKEN: your Hugging Face access token.
- REGION: the region where you have TPU quota. Make sure that the TPU version you want to use is available in this region. To learn more, see TPU availability in GKE.
- ZONE: the zone with available TPU quota.
- VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.
Mistral-7B
```
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export TOKENIZER_MODE=mistral
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc
```
Replace the following:
- HUGGING_FACE_TOKEN: your Hugging Face access token.
- REGION: the region where you have TPU quota. Make sure that the TPU version you want to use is available in this region. To learn more, see TPU availability in GKE.
- ZONE: the zone with available TPU quota.
- VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.
Llama 3.1 70B
```
export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Llama-3.1-70B"
export MAX_MODEL_LEN=8192
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc
```
Replace the following:
- HUGGING_FACE_TOKEN: your Hugging Face access token.
- REGION: the region where you have TPU quota. Make sure that the TPU version you want to use is available in this region. To learn more, see TPU availability in GKE.
- ZONE: the zone with available TPU quota.
- VLLM_IMAGE: the vLLM TPU image. You can use the public docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 image or build your own TPU image.
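
Before continuing, it can help to confirm that every variable exported above is set and that no placeholder (REGION, ZONE, HUGGING_FACE_TOKEN) was left in place. The following is a minimal sketch, not part of the sample repository; require_env and has_placeholder are hypothetical helper names introduced here for illustration.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
# Returns 0 when all are set, 1 otherwise. Uses only POSIX shell features.
require_env() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Hypothetical helper: succeed when the value is still one of the tutorial's
# placeholder strings rather than a real value.
has_placeholder() {
  case "$1" in
    REGION|ZONE|HUGGING_FACE_TOKEN) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `require_env PROJECT_ID CLUSTER_NAME COMPUTE_REGION COMPUTE_ZONE HF_TOKEN GSBUCKET KSA_NAME NAMESPACE MODEL_ID VLLM_IMAGE SERVICE_NAME` prints each missing name and returns a non-zero status, and `has_placeholder "${COMPUTE_REGION}"` succeeds if the region was never replaced.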

Pull the vLLM container image:

```
sudo usermod -aG docker ${USER}
newgrp docker
docker pull ${VLLM_IMAGE}
```

Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on.

Best practice:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

Create a GKE Autopilot cluster with the Ray Operator add-on enabled:

```
gcloud container clusters create-auto ${CLUSTER_NAME} \
    --enable-ray-operator \
    --release-channel=rapid \
    --location=${COMPUTE_REGION}
```

Standard

Create a Standard cluster with the Ray Operator add-on enabled:

```
gcloud container clusters create ${CLUSTER_NAME} \
    --release-channel=rapid \
    --location=${COMPUTE_ZONE} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --machine-type="n1-standard-4" \
    --addons=RayOperator,GcsFuseCsiDriver
```

Create a single-host TPU slice node pool:

Llama-3-8B-Instruct

```
gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1
```

GKE creates a TPU v5e node pool with the ct5lp-hightpu-8t machine type.

Mistral-7B

```
gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1
```

GKE creates a TPU v5e node pool with the ct5lp-hightpu-8t machine type.

Llama 3.1 70B

```
gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct6e-standard-8t \
    --num-nodes=1
```

GKE creates a TPU v6e node pool with the ct6e-standard-8t machine type.
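
The three node pool commands above differ only in the machine type. As a sketch, that choice can be derived from the MODEL_ID exported earlier. The function name tpu_machine_type is hypothetical and introduced here for illustration; the mapping itself only restates the values shown in the tabs above.

```shell
# Hypothetical helper: map the MODEL_ID values used in this tutorial to the
# node pool machine type shown in the corresponding tab.
tpu_machine_type() {
  case "$1" in
    meta-llama/Meta-Llama-3-8B-Instruct|mistralai/Mistral-7B-Instruct-v0.3)
      echo "ct5lp-hightpu-8t" ;;   # single-host TPU v5e
    meta-llama/Llama-3.1-70B)
      echo "ct6e-standard-8t" ;;   # single-host TPU Trillium (v6e)
    *)
      echo "unknown model: $1" >&2
      return 1 ;;
  esac
}
```

With this helper, the flag could be written as `--machine-type="$(tpu_machine_type "${MODEL_ID}")"`. Any model not covered in this tutorial is rejected instead of guessed.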

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command:

Autopilot

```
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}
```

Standard

```
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}
```
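
In this tutorial, the Autopilot commands pass the region to --location, while the Standard commands pass the zone. The sketch below captures that rule in one place. The function name cluster_location and the mode labels autopilot and standard are hypothetical and introduced here for illustration.

```shell
# Hypothetical helper: print the --location value that this tutorial uses for
# the given cluster mode. Relies on COMPUTE_REGION and COMPUTE_ZONE being set.
cluster_location() {
  case "$1" in
    autopilot) echo "${COMPUTE_REGION}" ;;  # Autopilot clusters here are regional
    standard)  echo "${COMPUTE_ZONE}" ;;    # Standard cluster here is zonal
    *)
      echo "unknown cluster mode: $1" >&2
      return 1 ;;
  esac
}
```

A call such as `gcloud container clusters get-credentials ${CLUSTER_NAME} --location="$(cluster_location standard)"` then matches the Standard command above.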

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

```
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -
```

Create a Cloud Storage bucket

To reduce vLLM deployment startup time and minimize the disk space required per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and compilation cache on the Ray nodes.

In Cloud Shell, run the following command:

```
gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access
```

This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Set up a Kubernetes ServiceAccount to access the bucket

Create the Kubernetes ServiceAccount:

```
kubectl create serviceaccount ${KSA_NAME} \
    --namespace ${NAMESPACE}
```

Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:
```
gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
    --role "roles/storage.objectUser"
```
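
The --member value in the command above is assembled from four of the variables set earlier. As a small sketch, the function below builds the same principal string so that it can be printed and inspected before it is granted a role. The function name workload_identity_principal is hypothetical and introduced here for illustration.

```shell
# Hypothetical helper: build the principal identifier used in --member above.
# Arguments: project number, project ID, namespace, Kubernetes ServiceAccount name.
workload_identity_principal() {
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$1" "$2" "$3" "$4"
}
```

For example, `workload_identity_principal "${PROJECT_NUMBER}" "${PROJECT_ID}" "${NAMESPACE}" "${KSA_NAME}"` prints the value passed to --member. An empty segment in the printed string points to a variable that was not set.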
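GKE creates the following resources for the LLM:
1. A Cloud Storage bucket to store the downloaded model and the compilation cache. The Cloud Storage FUSE CSI driver reads the content of the bucket.
2. Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.
Best practice:
Use a file cache backed by tmpfs or by Hyperdisk / Persistent Disk, depending on the expected size of the model contents (for example, weight files). In this tutorial, you use a Cloud Storage FUSE file cache backed by RAM.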

Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one system Pod and multiple worker Pods.

Llama-3-8B-Instruct

To create the RayCluster custom resource that deploys the Llama 3 8B instruction-tuned model, complete the following steps:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single host in a 2x4 topology.
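
Because envsubst replaces a variable that is not set with an empty string, a missing export shows up only later as an invalid manifest. The following sketch lists the variables that a manifest references and checks that each one has a value before rendering. The function names manifest_vars and check_manifest_vars are hypothetical and introduced here for illustration, and the pattern assumes that variable names use uppercase letters, digits, and underscores, as they do in the manifests above.

```shell
# Hypothetical helper: print each distinct $NAME or ${NAME} reference found on stdin.
manifest_vars() {
  grep -o -E '\$\{?[A-Z_][A-Z0-9_]*\}?' | tr -d '${}' | sort -u
}

# Hypothetical helper: fail if the manifest file references a variable that
# is unset or empty in the current shell.
check_manifest_vars() {
  status=0
  for name in $(manifest_vars < "$1"); do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "unset variable referenced in $1: ${name}" >&2
      status=1
    fi
  done
  return "${status}"
}
```

For example, `check_manifest_vars tpu/ray-cluster.tpu-v5e-singlehost.yaml` can run before the envsubst command. The check looks at shell variables, so the variables still need to be exported, as in the export commands earlier, for envsubst to see them.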

Mistral-7B

To create the RayCluster custom resource that deploys the Mistral-7B model, complete the following steps:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v5e single host in a 2x4 topology.

Llama 3.1 70B

To create the RayCluster custom resource that deploys the Llama 3.1 70B model, complete the following steps:

Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
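The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a TPU v6e single host in a 2x4 topology.

Connect to the RayCluster custom resource

After the RayCluster custom resource is created, you can connect to it and start serving the model.

Verify that GKE created the RayCluster Service:

```
kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
    --output wide
```

The output is similar to the following: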

NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###

Wait until STATUS is ready and the HEAD POD IP and HEAD SERVICE IP columns show an IP address.
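
The wait can also be scripted instead of re-running the command by hand. The following is a minimal polling sketch. The function names wait_until and raycluster_ready are hypothetical and introduced here for illustration, and the sketch assumes that the STATUS column is backed by the .status.state field of the RayCluster resource.

```shell
# Hypothetical helper: retry a command until it succeeds or the attempts run out.
# Arguments: number of attempts, seconds between attempts, then the command.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    if "$@"; then
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical check: succeed when the RayCluster reports the ready state.
# Assumption: .status.state holds the value shown in the STATUS column.
raycluster_ready() {
  state="$(kubectl --namespace "${NAMESPACE}" get raycluster/vllm-tpu \
    --output jsonpath='{.status.state}' 2>/dev/null)"
  [ "${state}" = "ready" ]
}
```

For example, `wait_until 60 10 raycluster_ready` checks every 10 seconds for up to 60 attempts and returns a non-zero status if the cluster never becomes ready.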

Establish port-forwarding sessions to the Ray head:

```
pkill -f "kubectl .* port-forward .* 8265:8265"
pkill -f "kubectl .* port-forward .* 10001:10001"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
```

Verify that the Ray client can connect to the remote RayCluster custom resource:

```
docker run --net=host -it ${VLLM_IMAGE} \
ray list nodes --address http://localhost:8265
```

The output is similar to the following:

======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                           memory: #.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           node:__internal_head__: 1.0
                                                                                           object_store_memory: #.### GiB
1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                           TPU: 8.0
                                                                                           TPU-v#e-8-head: 1.0
                                                                                           accelerator_type:TPU-V#E: 1.0
                                                                                           memory: ###.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           object_store_memory: ##.### GiB
                                                                                           tpu-group-0: 1.0

Deploy the model with vLLM

Deploy the model with vLLM:

Llama-3-8B-Instruct

```
docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'
```

Mistral-7B

```
docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'
```

Llama 3.1 70B

```
docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Llama-3.1-70B"}}'
```
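
The --runtime-env-json values above repeat settings that were already exported as environment variables, so the two can drift apart. As a sketch, the JSON can be generated from the exported values instead. The function name runtime_env_json is hypothetical and introduced here for illustration, and the sketch assumes that the values contain no double quotes or backslashes, so it does no JSON escaping.

```shell
# Hypothetical helper: build the --runtime-env-json argument from NAME=value pairs.
# Assumption: values contain no double quotes or backslashes (true for the
# model IDs and settings used in this tutorial), so nothing is escaped.
runtime_env_json() {
  body=""
  for pair in "$@"; do
    name="${pair%%=*}"    # text before the first '='
    value="${pair#*=}"    # text after the first '='
    if [ -n "${body}" ]; then
      body="${body}, "
    fi
    body="${body}\"${name}\": \"${value}\""
  done
  printf '{"env_vars": {%s}}\n' "${body}"
}
```

For example, the Mistral-7B command could pass `--runtime-env-json="$(runtime_env_json "MODEL_ID=${MODEL_ID}" "TOKENIZER_MODE=${TOKENIZER_MODE}")"`, which yields the same JSON as the literal string shown in that tab.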

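View the Ray Dashboard

You can view your Ray Serve deployment and its related logs in the Ray Dashboard.

1. Click the Web Preview button at the top right of the Cloud Shell taskbar.
2. Click Change port and set the port number to 8265.
3. Click Change and Preview.
4. In the Ray Dashboard, click the Serve tab.

After the Serve deployment has a HEALTHY status, the model is ready to begin processing inputs.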

Serve the model

This guide covers models that support text generation, a technique that creates text content from a prompt.

Llama-3-8B-Instruct

Set up port forwarding to the server:

```
pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
```

Send a prompt to the Serve endpoint:

```
curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
```

The output is similar to the following:

{"prompt": "What
are the top 5 most popular programming languages? Be brief.", "text": " (Note:
This answer may change over time.)\n\nAccording to the TIOBE Index, a widely
followed measure of programming language popularity, the top 5 languages
are:\n\n1. JavaScript\n2. Python\n3. Java\n4. C++\n5. C#\n\nThese rankings are
based on a combination of search engine queries, web traffic, and online
courses. Keep in mind that other sources may have slightly different rankings.
(Source: TIOBE Index, August 2022)", "token_ids": [320, 9290, 25, 1115, 4320,
1253, 2349, 927, 892, 9456, 11439, 311, 279, 350, 3895, 11855, 8167, 11, 264,
13882, 8272, 6767, 315, 15840, 4221, 23354, 11, 279, 1948, 220, 20, 15823,
527, 1473, 16, 13, 13210, 198, 17, 13, 13325, 198, 18, 13, 8102, 198, 19, 13,
356, 23792, 20, 13, 356, 27585, 9673, 33407, 527, 3196, 389, 264, 10824, 315,
2778, 4817, 20126, 11, 3566, 9629, 11, 323, 2930, 14307, 13, 13969, 304, 4059,
430, 1023, 8336, 1253, 617, 10284, 2204, 33407, 13, 320, 3692, 25, 350, 3895,
11855, 8167, 11, 6287, 220, 2366, 17, 8, 128009]}
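
The request body in the curl command above is a JSON object with a prompt and a max_tokens field. To try other prompts without editing the JSON by hand, the body can be generated. The following is a minimal sketch. The function name generate_request is hypothetical and introduced here for illustration, and the sketch assumes that the prompt contains no double quotes, backslashes, or newlines, so it does no JSON escaping.

```shell
# Hypothetical helper: build the JSON body for the /v1/generate endpoint.
# Arguments: prompt text, maximum number of tokens to generate.
# Assumption: the prompt has no double quotes, backslashes, or newlines.
generate_request() {
  printf '{"prompt": "%s", "max_tokens": %d}\n' "$1" "$2"
}
```

For example, `curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d "$(generate_request "What are the top 5 most popular programming languages? Be brief." 1024)"` sends the same request as the command above.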

Mistral-7B

Set up port forwarding to the server:

```
pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
```

Send a prompt to the Serve endpoint:

```
curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
```

The output is similar to the following:

{"prompt": "What are the top 5 most popular programming languages? Be brief.",
"text": "\n\n1. JavaScript: Widely used for web development, particularly for
client-side scripting and building dynamic web page content.\n\n2. Python:
Known for its simplicity and readability, it's widely used for web
development, machine learning, data analysis, and scientific computing.\n\n3.
Java: A general-purpose programming language used in a wide range of
applications, including Android app development, web services, and
enterprise-level applications.\n\n4. C#: Developed by Microsoft, it's often
used for Windows desktop apps, game development (Unity), and web development
(ASP.NET).\n\n5. TypeScript: A superset of JavaScript that adds optional
static typing and other features for large-scale, maintainable JavaScript
applications.", "token_ids": [781, 781, 29508, 29491, 27049, 29515, 1162,
1081, 1491, 2075, 1122, 5454, 4867, 29493, 7079, 1122, 4466, 29501, 2973,
7535, 1056, 1072, 4435, 11384, 5454, 3652, 3804, 29491, 781, 781, 29518,
29491, 22134, 29515, 1292, 4444, 1122, 1639, 26001, 1072, 1988, 3205, 29493,
1146, 29510, 29481, 13343, 2075, 1122, 5454, 4867, 29493, 6367, 5936, 29493,
1946, 6411, 29493, 1072, 11237, 22031, 29491, 781, 781, 29538, 29491, 12407,
29515, 1098, 3720, 29501, 15460, 4664, 17060, 4610, 2075, 1065, 1032, 6103,
3587, 1070, 9197, 29493, 3258, 13422, 1722, 4867, 29493, 5454, 4113, 29493,
1072, 19123, 29501, 5172, 9197, 29491, 781, 781, 29549, 29491, 1102, 29539,
29515, 9355, 1054, 1254, 8670, 29493, 1146, 29510, 29481, 3376, 2075, 1122,
9723, 25470, 14189, 29493, 2807, 4867, 1093, 2501, 1240, 1325, 1072, 5454,
4867, 1093, 2877, 29521, 29491, 12466, 1377, 781, 781, 29550, 29491, 6475,
7554, 29515, 1098, 26434, 1067, 1070, 27049, 1137, 14401, 12052, 1830, 25460,
1072, 1567, 4958, 1122, 3243, 29501, 6473, 29493, 9855, 1290, 27049, 9197,
29491, 2]}

Llama 3.1 70B

Set up port forwarding to the server:

```
pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
```

Send a prompt to the Serve endpoint:

```
curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
```

The output is similar to the following:

{"prompt": "What are
the top 5 most popular programming languages? Be brief.", "text": " This is a
very subjective question, but there are some general guidelines to follow when
selecting a language. For example, if you\u2019re looking for a language
that\u2019s easy to learn, you might want to consider Python. It\u2019s one of
the most popular languages in the world, and it\u2019s also relatively easy to
learn. If you\u2019re looking for a language that\u2019s more powerful, you
might want to consider Java. It\u2019s a more complex language, but it\u2019s
also very popular. Whichever language you choose, make sure you do your
research and pick one that\u2019s right for you.\nThe most popular programming
languages are:\nWhy is C++ so popular?\nC++ is a powerful and versatile
language that is used in many different types of software. It is also one of
the most popular programming languages, with a large community of developers
who are always creating new and innovative ways to use it. One of the reasons
why C++ is so popular is because it is a very efficient language. It allows
developers to write code that is both fast and reliable, which is essential
for many types of software. Additionally, C++ is very flexible, meaning that
it can be used for a wide range of different purposes. Finally, C++ is also
very popular because it is easy to learn. There are many resources available
online and in books that can help anyone get started with learning the
language.\nJava is a versatile language that can be used for a variety of
purposes. It is one of the most popular programming languages in the world and
is used by millions of people around the globe. Java is used for everything
from developing desktop applications to creating mobile apps and games. It is
also a popular choice for web development. One of the reasons why Java is so
popular is because it is a platform-independent language. This means that it
can be used on any type of computer or device, regardless of the operating
system. Java is also very versatile and can be used for a variety of different
purposes.", "token_ids": [1115, 374, 264, 1633, 44122, 3488, 11, 719, 1070,
527, 1063, 4689, 17959, 311, 1833, 994, 27397, 264, 4221, 13, 1789, 3187, 11,
422, 499, 3207, 3411, 369, 264, 4221, 430, 753, 4228, 311, 4048, 11, 499,
2643, 1390, 311, 2980, 13325, 13, 1102, 753, 832, 315, 279, 1455, 5526, 15823,
304, 279, 1917, 11, 323, 433, 753, 1101, 12309, 4228, 311, 4048, 13, 1442,
499, 3207, 3411, 369, 264, 4221, 430, 753, 810, 8147, 11, 499, 2643, 1390,
311, 2980, 8102, 13, 1102, 753, 264, 810, 6485, 4221, 11, 719, 433, 753, 1101,
1633, 5526, 13, 1254, 46669, 4221, 499, 5268, 11, 1304, 2771, 499, 656, 701,
3495, 323, 3820, 832, 430, 753, 1314, 369, 499, 627, 791, 1455, 5526, 15840,
15823, 527, 512, 10445, 374, 356, 1044, 779, 5526, 5380, 34, 1044, 374, 264,
8147, 323, 33045, 4221, 430, 374, 1511, 304, 1690, 2204, 4595, 315, 3241, 13,
1102, 374, 1101, 832, 315, 279, 1455, 5526, 15840, 15823, 11, 449, 264, 3544,
4029, 315, 13707, 889, 527, 2744, 6968, 502, 323, 18699, 5627, 311, 1005, 433,
13, 3861, 315, 279, 8125, 3249, 356, 1044, 374, 779, 5526, 374, 1606, 433,
374, 264, 1633, 11297, 4221, 13, 1102, 6276, 13707, 311, 3350, 2082, 430, 374,
2225, 5043, 323, 15062, 11, 902, 374, 7718, 369, 1690, 4595, 315, 3241, 13,
23212, 11, 356, 1044, 374, 1633, 19303, 11, 7438, 430, 433, 649, 387, 1511,
369, 264, 7029, 2134, 315, 2204, 10096, 13, 17830, 11, 356, 1044, 374, 1101,
1633, 5526, 1606, 433, 374, 4228, 311, 4048, 13, 2684, 527, 1690, 5070, 2561,
2930, 323, 304, 6603, 430, 649, 1520, 5606, 636, 3940, 449, 6975, 279, 4221,
627, 15391, 374, 264, 33045, 4221, 430, 649, 387, 1511, 369, 264, 8205, 315,
10096, 13, 1102, 374, 832, 315, 279, 1455, 5526, 15840, 15823, 304, 279, 1917,
323, 374, 1511, 555, 11990, 315, 1274, 2212, 279, 24867, 13, 8102, 374, 1511,
369, 4395, 505, 11469, 17963, 8522, 311, 6968, 6505, 10721, 323, 3953, 13,
1102, 374, 1101, 264, 5526, 5873, 369, 3566, 4500, 13, 3861, 315, 279, 8125,
3249, 8102, 374, 779, 5526, 374, 1606, 433, 374, 264, 5452, 98885, 4221, 13,
1115, 3445, 430, 433, 649, 387, 1511, 389, 904, 955, 315, 6500, 477, 3756, 11,
15851, 315, 279, 10565, 1887, 13, 8102, 374, 1101, 1633, 33045, 323, 649, 387,
1511, 369, 264, 8205, 315, 2204, 10096, 13, 128001]}
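The same request can also be sent from a script. The following is a minimal Python sketch, assuming that the port-forwarding session from the previous step is still active on localhost:8000 and that the response has the shape shown in the example output. The helper names (build_payload, extract_text, generate) are hypothetical and are not part of the tutorial's sample code.

```python
import json
import urllib.request

# Assumption: the kubectl port-forward from the previous step is listening here.
GENERATE_URL = "http://localhost:8000/v1/generate"


def build_payload(prompt, max_tokens=1024):
    """Encode the same JSON body that the curl command sends."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")


def extract_text(response_body):
    """Return the "text" field from a response shaped like the example output."""
    return json.loads(response_body)["text"]


def generate(prompt, max_tokens=1024, url=GENERATE_URL):
    """POST the prompt to the Serve endpoint and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(prompt, max_tokens),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return extract_text(response.read())
```

Calling generate("What are the top 5 most popular programming languages? Be brief.") sends the same prompt as the curl command. Only the generated text is returned, so the long token_ids list is not printed.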

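Additional configuration

Optionally, you can configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService custom resource. The preceding steps of this tutorial use RayCluster instead of RayService. We recommend RayService for production environments.
Compose multiple models with model composition. Configure the model multiplexing and model composition that the Ray Serve framework supports. With model composition, you can chain the inputs and outputs of multiple LLMs and scale your models as a single application.
Build and deploy your own TPU image. We recommend this option if you need fine-grained control over the contents of your Docker image.

Deploy a RayService

You can deploy the models from this tutorial by using a RayService custom resource.

Delete the RayCluster custom resource that you created in this tutorial: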
```
kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
```

Create the RayService custom resource to deploy a model:

Llama-3-8B-Instruct

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.
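Because envsubst substitutes an empty string for any variable that is not set, a forgotten export can produce a manifest with blank fields. The following Python sketch shows one way to list the variables a manifest references and to find the ones that are missing before you apply it. The function names are hypothetical, and the pattern only recognizes the $NAME and ${NAME} forms used in this tutorial's manifests.

```python
import re

# Matches ${NAME} (group 1) or $NAME (group 2), the forms used in the manifests.
_VAR_PATTERN = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def referenced_variables(manifest):
    """Return the set of variable names that the manifest text references."""
    return {braced or bare for braced, bare in _VAR_PATTERN.findall(manifest)}


def missing_variables(manifest, env):
    """Return the referenced names that are unset or empty in env, sorted."""
    return sorted(name for name in referenced_variables(manifest) if not env.get(name))


def substitute(manifest, env):
    """Mimic envsubst: replace each reference, using "" for unset variables."""
    return _VAR_PATTERN.sub(
        lambda match: env.get(match.group(1) or match.group(2), ""), manifest
    )
```

For example, passing the manifest text and dict(os.environ) to missing_variables returns an empty list only when every variable that the manifest references, such as MODEL_ID, KSA_NAME, VLLM_IMAGE, and GSBUCKET, is set.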

GKE creates a RayService with a worker group that contains a single-host TPU v5e in a 2x4 topology.

Mistral-7B

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayService with a worker group that contains a single-host TPU v5e in a 2x4 topology.

Llama 3.1 70B

Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
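The envsubst command replaces the environment variables in the manifest.

GKE creates the RayService custom resource, which in turn creates a RayCluster custom resource and deploys the Ray Serve application on it.

Check the status of the RayService resource: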

kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu

Wait for the Service status to change to Running:

NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          1
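If you prefer to script the wait instead of rerunning the command by hand, the following Python sketch polls the same kubectl output until the status column reads Running. It assumes the table layout shown above. The function names, the 15-second interval, and the poll limit are illustrative choices, not values from the tutorial.

```python
import subprocess
import time


def kubectl_table(namespace, name="vllm-tpu"):
    """Run the status command from this step and return its text output."""
    result = subprocess.run(
        ["kubectl", "--namespace", namespace, "get", "rayservices/" + name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def service_status(table, name="vllm-tpu"):
    """Return the SERVICE STATUS value for the named row, or None if absent."""
    for line in table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == name:
            # The status column can be empty before a status is reported.
            if len(fields) > 1 and not fields[1].isdigit():
                return fields[1]
            return None
    return None


def wait_until_running(fetch_table, name="vllm-tpu", max_polls=120,
                       interval_s=15.0, sleep=time.sleep):
    """Poll fetch_table() until the status is Running; return False on timeout."""
    for attempt in range(max_polls):
        if service_status(fetch_table(), name) == "Running":
            return True
        if attempt < max_polls - 1:
            sleep(interval_s)
    return False
```

A typical call is wait_until_running(lambda: kubectl_table(os.environ["NAMESPACE"])), after importing os. Passing the fetch function as an argument keeps the polling logic independent of kubectl, which makes it easy to exercise with canned output.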

Get the name of the RayCluster head service:
```
SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
    --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
```
Note: If the RayCluster head service value is not retrieved, run the kubectl get services --namespace ${NAMESPACE} command and set the SERVICE_NAME value manually.

Establish a port-forwarding session to the Ray head to view the Ray Dashboard:

pkill -f "kubectl .* port-forward .* 8265:8265"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &

View the Ray Dashboard.
Serve the model.

Clean up the RayService resource:

kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu

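Compose multiple models with model composition

Model composition is a technique for composing multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

The first model is the assistant model, which answers the questions provided in the prompt.
The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is a summarized version of the response from the assistant model.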
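The chaining described above can be sketched in plain Python. This is only an illustration of the data flow, not the sample's Ray Serve code: each model is represented by a hypothetical callable that maps a prompt to text, and the wording of the summarizer prompt is an assumption.

```python
def summarizer_prompt(assistant_text):
    """Wrap the assistant's answer in an instruction for the summarizer.

    Assumption: illustrative wording, not the prompt used by the sample.
    """
    return "Summarize the following text in a single sentence:\n\n" + assistant_text


def compose(assistant, summarizer):
    """Chain two models so the assistant's output feeds the summarizer."""

    def pipeline(prompt):
        answer = assistant(prompt)  # first model answers the question
        return summarizer(summarizer_prompt(answer))  # second model condenses it

    return pipeline
```

The caller sees a single application: one prompt goes in and one summarized response comes out, while the intermediate answer stays internal to the pipeline.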

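To get access to the Gemma model, do the following:
1. Sign in to the Kaggle platform, sign the license consent agreement, and get a Kaggle API token. In this tutorial, you use a Kubernetes Secret for the Kaggle credentials.
2. Go to the model consent page on Kaggle.com.
3. Sign in to Kaggle if you haven't already done so.
4. Click Request Access.
5. In the Choose Account for Consent section, select Verify via Kaggle Account to use your Kaggle account for consent.
6. Accept the model Terms and Conditions.

Set up your environment: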

export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
export SUMMARIZER_MODEL_ID=google/gemma-7b-it

For Standard clusters, create an additional single-host TPU slice node pool:
```
gcloud container node-pools create tpu-2 \
  --location=${COMPUTE_ZONE} \
  --cluster=${CLUSTER_NAME} \
  --machine-type=MACHINE_TYPE \
  --num-nodes=1
```
Replace MACHINE_TYPE with one of the following machine types:
- ct5lp-hightpu-8t: provisions TPU v5e.
- ct6e-standard-8t: provisions TPU v6e.

Autopilot clusters automatically provision the required nodes.

Deploy the RayService resource based on the TPU version that you want to use:

TPU v5e

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

TPU v6e

Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
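The TPU_CHIPS value of "16" in the model composition manifests appears to match the chips that the worker group requests: two replicas, each on a single host with a 2x4 topology. The following Python sketch works through that arithmetic. The function names are hypothetical, and the calculation assumes single-host slices (numOfHosts: 1), as in this tutorial.

```python
from math import prod


def chips_in_topology(topology):
    """Multiply the dimensions of a topology string such as "2x4"."""
    return prod(int(dim) for dim in topology.split("x"))


def total_chips(topology, replicas):
    """Chips across all worker replicas, assuming one single-host slice each."""
    return replicas * chips_in_topology(topology)
```

With these helpers, total_chips("2x4", 2) gives 16 for the model composition manifests, and total_chips("2x4", 1) gives 8, which matches the TPU_CHIPS value in the single-model manifests.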

Wait for the status of the RayService resource to change to Running:
```
kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
```
The output is similar to the following:
```
NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          2
```
In this output, the Running status indicates that the RayService resource is ready.

Verify that GKE created the Service for the Ray Serve application:

kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc

The output is similar to the following:

NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###

Establish port-forwarding sessions to the Ray head:

pkill -f "kubectl .* port-forward .* 8265:8265"
pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8265:8265 2>&1 >/dev/null &
kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &

Send a request to the model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

The output is similar to the following:

  {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
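In this example output the "text" field is a list, whereas the single-model example output earlier in this tutorial returns "text" as a string. A client that talks to both endpoints can normalize the two shapes. The following Python sketch shows one way to do that. The function name is hypothetical, and the two shapes are taken only from the example outputs in this tutorial.

```python
import json


def generated_texts(response_body):
    """Return the generated text as a list, whether "text" is a string or a list."""
    text = json.loads(response_body)["text"]
    if isinstance(text, str):
        return [text]
    return list(text)
```

Downstream code can then iterate over the result without checking which endpoint produced the response.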

Build and deploy the TPU image

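This tutorial uses the hosted TPU image from vLLM. vLLM provides a Dockerfile.tpu file that builds vLLM on top of the required PyTorch XLA image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of your Docker image.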

Create a Docker repository to store the container images for this guide:

gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev

Clone the vLLM repository:

git clone https://github.com/vllm-project/vllm.git
cd vllm

Build the image:

docker build -f ./docker/Dockerfile.tpu . -t vllm-tpu

Tag the TPU image with your Artifact Registry name:
```
export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
docker tag vllm-tpu ${VLLM_IMAGE}
```
Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default tag, latest.
Push the image to Artifact Registry:
```
docker push ${VLLM_IMAGE}
```
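
The image path used above follows the pattern REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. As a small sketch that is not part of the original tutorial, a hypothetical helper `image_uri` can compose that path and fall back to latest when no tag is given, which mirrors Docker's default behavior.

```shell
# Hypothetical helper: compose an Artifact Registry image path.
# $1: region, $2: project ID, $3: repository, $4: image name, $5: tag (optional)
image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "${5:-latest}"
}

# Example (v1 is an illustrative tag name):
# export VLLM_IMAGE="$(image_uri "${COMPUTE_REGION}" "${PROJECT_ID}" vllm-tpu vllm-tpu v1)"
```

Pinning an explicit tag rather than relying on latest makes it easier to tell which build a manifest refers to.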

Delete the individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

Delete the RayCluster custom resource:

kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu

Delete the Cloud Storage bucket:
```
gcloud storage rm -r gs://${GSBUCKET}
```

Delete the Artifact Registry repository:

gcloud artifacts repositories delete vllm-tpu \
    --location=${COMPUTE_REGION}

Delete the cluster:
```
gcloud container clusters delete ${CLUSTER_NAME} \
    --location=LOCATION
```
Replace LOCATION with either of the following environment variables:
- For Autopilot clusters, use COMPUTE_REGION.
- For Standard clusters, use COMPUTE_ZONE.
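
The choice of LOCATION can also be made in a script. The following sketch is not part of the original tutorial: `cluster_location` is a hypothetical helper, and the mode names autopilot and standard are illustrative labels rather than values that gcloud defines. It assumes that COMPUTE_REGION and COMPUTE_ZONE are already exported, as they are elsewhere in this guide.

```shell
# Hypothetical helper: pick the --location value for a cluster mode.
# Assumes COMPUTE_REGION and COMPUTE_ZONE are set in the environment.
cluster_location() {
  case "$1" in
    autopilot) printf '%s\n' "${COMPUTE_REGION}" ;;
    standard)  printf '%s\n' "${COMPUTE_ZONE}" ;;
    *) echo "unknown cluster mode: $1" >&2; return 1 ;;
  esac
}

# Example:
# gcloud container clusters delete "${CLUSTER_NAME}" \
#     --location="$(cluster_location standard)"
```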

Delete the project

If you deployed the tutorial in a new Google Cloud project and you no longer need the project, delete it by completing the following steps:

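Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.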

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

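Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Learn how to use Ray Serve on GKE by viewing the sample code on GitHub.
Learn how to collect and view metrics for Ray clusters running on GKE by completing the steps in Collect and view logs and metrics for Ray clusters on GKE.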
