Menyajikan LLM menggunakan TPU di GKE dengan KubeRay


Tutorial ini menunjukkan cara menayangkan model bahasa besar (LLM) menggunakan Tensor Processing Unit (TPU) di Google Kubernetes Engine (GKE) dengan add-on Ray Operator, dan framework penayangan vLLM.

Dalam tutorial ini, Anda dapat menayangkan model LLM di TPU v5e atau TPU Trillium (v6e) sebagai berikut:

Panduan ini ditujukan untuk pelanggan AI generatif, pengguna GKE baru dan lama, engineer ML, engineer MLOps (DevOps), atau administrator platform yang tertarik menggunakan kemampuan orkestrasi penampung Kubernetes untuk menayangkan model menggunakan Ray, di TPU dengan vLLM.

Latar belakang

Bagian ini menjelaskan teknologi utama yang digunakan dalam panduan ini.

Layanan Kubernetes terkelola GKE

Google Cloud menawarkan berbagai layanan, termasuk GKE, yang sangat cocok untuk men-deploy dan mengelola workload AI/ML. GKE adalah layanan Kubernetes terkelola yang menyederhanakan deployment, penskalaan, dan pengelolaan aplikasi dalam container. GKE menyediakan infrastruktur yang diperlukan, termasuk resource yang skalabel, komputasi terdistribusi, dan jaringan yang efisien, untuk menangani permintaan komputasi LLM.

Untuk mempelajari konsep utama Kubernetes lebih lanjut, lihat Mulai mempelajari Kubernetes. Untuk mempelajari GKE lebih lanjut dan cara GKE membantu Anda menskalakan, mengotomatiskan, dan mengelola Kubernetes, lihat ringkasan GKE.

Operator sinar

Add-on Ray Operator di GKE menyediakan platform AI/ML menyeluruh untuk menayangkan, melatih, dan menyesuaikan beban kerja machine learning. Dalam tutorial ini, Anda akan menggunakan Ray Serve, framework di Ray, untuk menayangkan LLM populer dari Hugging Face.

TPU

TPU adalah sirkuit terintegrasi khusus aplikasi (ASIC) Google yang dikembangkan secara khusus dan digunakan untuk mempercepat machine learning dan model AI yang dibuat menggunakan framework seperti TensorFlow, PyTorch, dan JAX.

Tutorial ini membahas penayangan model LLM di node TPU v5e atau TPU Trillium (v6e) dengan topologi TPU yang dikonfigurasi berdasarkan setiap persyaratan model untuk menayangkan perintah dengan latensi rendah.

vLLM

vLLM adalah framework penayangan LLM open source yang dioptimalkan secara maksimal dan dapat meningkatkan throughput penayangan di TPU, dengan fitur seperti:

  • Penerapan transformer yang dioptimalkan dengan PagedAttention
  • Pengelompokan berkelanjutan untuk meningkatkan throughput penayangan secara keseluruhan
  • Paralelisme tensor dan penayangan terdistribusi di beberapa GPU

Untuk mempelajari lebih lanjut, lihat dokumentasi vLLM.

Tujuan

Tutorial ini membahas langkah-langkah berikut:

  1. Buat cluster GKE dengan node pool TPU.
  2. Deploy resource kustom RayCluster dengan slice TPU host tunggal. GKE men-deploy resource kustom RayCluster sebagai Pod Kubernetes.
  3. Menyajikan LLM.
  4. Berinteraksi dengan model.

Secara opsional, Anda dapat mengonfigurasi resource dan teknik penyaluran model berikut yang didukung framework Ray Serve:

  • Men-deploy resource kustom RayService.
  • Buat beberapa model dengan komposisi model.

Sebelum memulai

Sebelum memulai, pastikan Anda telah menjalankan tugas berikut:

  • Aktifkan Google Kubernetes Engine API.
  • Aktifkan Google Kubernetes Engine API
  • Jika ingin menggunakan Google Cloud CLI untuk tugas ini, instal lalu lakukan inisialisasi gcloud CLI. Jika sebelumnya Anda telah menginstal gcloud CLI, dapatkan versi terbaru dengan menjalankan gcloud components update.
  • Buat akun Hugging Face, jika Anda belum memilikinya.
  • Pastikan Anda memiliki token Hugging Face.
  • Pastikan Anda memiliki akses ke model Hugging Face yang ingin digunakan. Anda biasanya mendapatkan akses ini dengan menandatangani perjanjian dan meminta akses dari pemilik model di halaman model Hugging Face.

Menyiapkan lingkungan Anda

  1. Pastikan Anda memiliki cukup kuota di project Google Cloud untuk TPU v5e host tunggal atau TPU Trillium host tunggal (v6e). Untuk mengelola kuota, lihat Kuota TPU.

  2. Di konsol Google Cloud, mulai instance Cloud Shell:
    Buka Cloud Shell

  3. Gandakan repositori sampel

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
    cd kubernetes-engine-samples
    
  4. Buka direktori kerja:

    cd ai-ml/gke-ray/rayserve/llm
    
  5. Tetapkan variabel lingkungan default untuk pembuatan cluster GKE:

    Llama-3-8B-Instruct

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Ganti kode berikut:

    • HUGGING_FACE_TOKEN: token akses Hugging Face Anda.
    • REGION: region tempat Anda memiliki kuota TPU. Pastikan versi TPU yang ingin Anda gunakan tersedia di region ini. Untuk mempelajari lebih lanjut, lihat Ketersediaan TPU di GKE.
    • ZONE: zona dengan kuota TPU yang tersedia.
    • VLLM_IMAGE: image TPU vLLM. Anda dapat menggunakan image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 publik atau mem-build image TPU Anda sendiri.

    Mistral-7B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
    export TOKENIZER_MODE=mistral
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Ganti kode berikut:

    • HUGGING_FACE_TOKEN: token akses Hugging Face Anda.
    • REGION: region tempat Anda memiliki kuota TPU. Pastikan versi TPU yang ingin Anda gunakan tersedia di region ini. Untuk mempelajari lebih lanjut, lihat Ketersediaan TPU di GKE.
    • ZONE: zona dengan kuota TPU yang tersedia.
    • VLLM_IMAGE: image TPU vLLM. Anda dapat menggunakan image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 publik atau mem-build image TPU Anda sendiri.

    Llava-1.5-13b-hf

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="llava-hf/llava-1.5-13b-hf"
    export DTYPE=bfloat16
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Ganti kode berikut:

    • HUGGING_FACE_TOKEN: token akses Hugging Face Anda.
    • REGION: region tempat Anda memiliki kuota TPU. Pastikan versi TPU yang ingin Anda gunakan tersedia di region ini. Untuk mempelajari lebih lanjut, lihat Ketersediaan TPU di GKE.
    • ZONE: zona dengan kuota TPU yang tersedia.
    • VLLM_IMAGE: image TPU vLLM. Anda dapat menggunakan image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 publik atau mem-build image TPU Anda sendiri.

    Llama 3.1 70B

    export PROJECT_ID=$(gcloud config get project)
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
    export CLUSTER_NAME=vllm-tpu
    export COMPUTE_REGION=REGION
    export COMPUTE_ZONE=ZONE
    export HF_TOKEN=HUGGING_FACE_TOKEN
    export GSBUCKET=vllm-tpu-bucket
    export KSA_NAME=vllm-sa
    export NAMESPACE=default
    export MODEL_ID="meta-llama/Llama-3.1-70B"
    export MAX_MODEL_LEN=8192
    export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
    export SERVICE_NAME=vllm-tpu-head-svc
    

    Ganti kode berikut:

    • HUGGING_FACE_TOKEN: token akses Hugging Face Anda.
    • REGION: region tempat Anda memiliki kuota TPU. Pastikan versi TPU yang ingin Anda gunakan tersedia di region ini. Untuk mempelajari lebih lanjut, lihat Ketersediaan TPU di GKE.
    • ZONE: zona dengan kuota TPU yang tersedia.
    • VLLM_IMAGE: image TPU vLLM. Anda dapat menggunakan image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 publik atau mem-build image TPU Anda sendiri.
  6. Download image container vLLM:

    docker pull ${VLLM_IMAGE}
    

Membuat cluster

Anda dapat menayangkan LLM di TPU dengan Ray di cluster GKE Autopilot atau Standard menggunakan add-on Ray Operator.

Praktik terbaik:

Gunakan cluster Autopilot untuk pengalaman Kubernetes yang dikelola sepenuhnya. Untuk memilih mode operasi GKE yang paling sesuai untuk workload Anda, lihat Memilih mode operasi GKE.

Gunakan Cloud Shell untuk membuat cluster Autopilot atau Standard:

Autopilot

  1. Buat cluster GKE Autopilot dengan add-on Ray Operator yang diaktifkan:

    gcloud container clusters create-auto ${CLUSTER_NAME}  \
        --enable-ray-operator \
        --release-channel=rapid \
        --location=${COMPUTE_REGION}
    

Standar

  1. Buat cluster Standard dengan add-on Ray Operator yang diaktifkan:

    gcloud container clusters create ${CLUSTER_NAME} \
        --release-channel=rapid \
        --location=${COMPUTE_ZONE} \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --machine-type="n1-standard-4" \
        --addons=RayOperator,GcsFuseCsiDriver
    
  2. Buat node pool slice TPU host tunggal:

    Llama-3-8B-Instruct

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE membuat node pool TPU v5e dengan jenis mesin ct5lp-hightpu-8t.

    Mistral-7B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE membuat node pool TPU v5e dengan jenis mesin ct5lp-hightpu-8t.

    Llava-1.5-13b-hf

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct5lp-hightpu-8t \
        --num-nodes=1
    

    GKE membuat node pool TPU v5e dengan jenis mesin ct5lp-hightpu-8t.

    Llama 3.1 70B

    gcloud container node-pools create tpu-1 \
        --location=${COMPUTE_ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct6e-standard-8t \
        --num-nodes=1
    

    GKE membuat node pool TPU v6e dengan jenis mesin ct6e-standard-8t.

Mengonfigurasi kubectl untuk berkomunikasi dengan cluster Anda

Untuk mengonfigurasi kubectl agar dapat berkomunikasi dengan cluster Anda, jalankan perintah berikut:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standar

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

Membuat Secret Kubernetes untuk kredensial Hugging Face

Untuk membuat Secret Kubernetes yang berisi token Hugging Face, jalankan perintah berikut:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -

Membuat bucket Cloud Storage

Untuk mempercepat waktu startup deployment vLLM dan meminimalkan ruang disk yang diperlukan per node, gunakan driver CSI Cloud Storage FUSE untuk memasang model yang didownload dan cache kompilasi ke node Ray.

Jalankan perintah berikut di Cloud Shell:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

Perintah ini akan membuat bucket Cloud Storage untuk menyimpan file model yang Anda download dari Hugging Face.

Menyiapkan Akun Layanan Kubernetes untuk mengakses bucket

  1. Buat Akun Layanan Kubernetes:

    kubectl create serviceaccount ${KSA_NAME} \
        --namespace ${NAMESPACE}
    
  2. Berikan akses baca-tulis Akun Layanan Kubernetes ke bucket Cloud Storage:

    gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
        --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
        --role "roles/storage.objectUser"
    

    GKE membuat resource berikut untuk LLM:

    1. Bucket Cloud Storage untuk menyimpan model yang didownload dan cache kompilasi. Driver CSI Cloud Storage FUSE membaca konten bucket.
    2. Volume dengan caching file diaktifkan dan fitur download paralel Cloud Storage FUSE.
    Praktik terbaik:

    Gunakan cache file yang didukung oleh tmpfs atau Hyperdisk / Persistent Disk bergantung pada ukuran konten model yang diharapkan, misalnya, file bobot. Dalam tutorial ini, Anda akan menggunakan cache file Cloud Storage FUSE yang didukung oleh RAM.

Men-deploy resource kustom RayCluster

Deploy resource kustom RayCluster, yang biasanya terdiri dari satu Pod sistem dan beberapa Pod pekerja.

Llama-3-8B-Instruct

Buat resource kustom RayCluster untuk men-deploy model yang disesuaikan dengan petunjuk Llama 3 8B dengan menyelesaikan langkah-langkah berikut:

  1. Periksa manifes ray-cluster.tpu-v5e-singlehost.yaml:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 80G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 80G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Terapkan manifes:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

GKE membuat resource kustom RayCluster dengan workergroup yang berisi host tunggal TPU v5e dalam topologi 2x4.

Mistral-7B

Buat resource kustom RayCluster untuk men-deploy model Mistral-7B dengan menyelesaikan langkah-langkah berikut:

  1. Periksa manifes ray-cluster.tpu-v5e-singlehost.yaml:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 80G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 80G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Terapkan manifes:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

GKE membuat resource kustom RayCluster dengan workergroup yang berisi host tunggal TPU v5e dalam topologi 2x4.

Llava-1.5-13b-hf

Buat resource kustom RayCluster untuk men-deploy model Llava-1.5-13b-hf dengan menyelesaikan langkah-langkah berikut:

  1. Periksa manifes ray-cluster.tpu-v5e-singlehost.yaml:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 80G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 80G
                    memory: 200G
                env:
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Terapkan manifes:

    envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

GKE membuat resource kustom RayCluster dengan workergroup yang berisi host tunggal TPU v5e dalam topologi 2x4.

Llama 3.1 70B

Buat resource kustom RayCluster untuk men-deploy model Llama 3.1 70B dengan menyelesaikan langkah-langkah berikut:

  1. Periksa manifes ray-cluster.tpu-v6e-singlehost.yaml:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: vllm-tpu
    spec:
      headGroupSpec:
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-head
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "2"
                    memory: 8G
                  requests:
                    cpu: "2"
                    memory: 8G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  - containerPort: 8471
                    name: slicebuilder
                  - containerPort: 8081
                    name: mxla
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
      workerGroupSpecs:
      - groupName: tpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 1
        rayStartParams: {}
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
              gke-gcsfuse/cpu-limit: "0"
              gke-gcsfuse/memory-limit: "0"
              gke-gcsfuse/ephemeral-storage-limit: "0"
          spec:
            serviceAccountName: $KSA_NAME
            containers:
              - name: ray-worker
                image: $VLLM_IMAGE
                imagePullPolicy: IfNotPresent
                resources:
                  limits:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                  requests:
                    cpu: "100"
                    google.com/tpu: "8"
                    ephemeral-storage: 40G
                    memory: 200G
                env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                volumeMounts:
                - name: gcs-fuse-csi-ephemeral
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
            volumes:
            - name: gke-gcsfuse-cache
              emptyDir:
                medium: Memory
            - name: dshm
              emptyDir:
                medium: Memory
            - name: gcs-fuse-csi-ephemeral
              csi:
                driver: gcsfuse.csi.storage.gke.io
                volumeAttributes:
                  bucketName: $GSBUCKET
                  mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 2x4
  2. Terapkan manifes:

    envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
    

    Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

GKE membuat resource kustom RayCluster dengan workergroup yang berisi host tunggal TPU v6e dalam topologi 2x4.

Menghubungkan ke resource kustom RayCluster

Setelah resource kustom RayCluster dibuat, Anda dapat terhubung ke resource RayCluster dan mulai menayangkan model.

  1. Pastikan GKE membuat Layanan RayCluster:

    kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
        --output wide
    

    Outputnya mirip dengan hal berikut ini:

    NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
    vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###
    

    Tunggu hingga STATUS menjadi ready dan kolom HEAD POD IP dan HEAD SERVICE IP memiliki alamat IP.

  2. Buat sesi port-forwarding ke head Ray:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    pkill -f "kubectl .* port-forward .* 10001:10001"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
    
  3. Pastikan klien Ray dapat terhubung ke resource kustom RayCluster jarak jauh:

    docker run --net=host -it ${VLLM_IMAGE} \
    ray list nodes --address http://localhost:8265
    

    Outputnya mirip dengan hal berikut ini:

    ======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
    Stats:
    ------------------------------
    Total: 2
    
    Table:
    ------------------------------
        NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
    0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                               memory: #.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               node:__internal_head__: 1.0
                                                                                               object_store_memory: #.### GiB
    1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                               TPU: 8.0
                                                                                               TPU-v#e-8-head: 1.0
                                                                                               accelerator_type:TPU-V#E: 1.0
                                                                                               memory: ###.### GiB
                                                                                               node:###.###.###.###: 1.0
                                                                                               object_store_memory: ##.### GiB
                                                                                               tpu-group-0: 1.0
    

Men-deploy model dengan vLLM

Deploy model dengan vLLM:

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llava-1.5-13b-hf

docker run \
    --env DTYPE=${DTYPE} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"DTYPE": "bfloat16", "MODEL_ID": "llava-hf/llava-1.5-13b-hf"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

Melihat Dasbor Ray

Anda dapat melihat deployment Ray Serve dan log yang relevan dari Dasbor Ray.

  1. Klik tombol Ikon Web Preview Pratinjau Web, yang dapat ditemukan di kanan atas taskbar Cloud Shell.
  2. Klik Ubah port dan tetapkan nomor port ke 8265.
  3. Klik Ubah dan Lihat Pratinjau.
  4. Di Dasbor Ray, klik tab Serve.

Setelah deployment Serve memiliki status HEALTHY, model siap untuk mulai memproses input.

Menayangkan model

Panduan ini menyoroti model yang mendukung pembuatan teks, yaitu teknik yang memungkinkan pembuatan konten teks dari perintah.

Llama-3-8B-Instruct

  1. Siapkan penerusan port ke server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Kirim perintah ke endpoint Serve:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Mistral-7B

  1. Siapkan penerusan port ke server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Kirim perintah ke endpoint Serve:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Llava-1.5-13b-hf

  1. Siapkan penerusan port ke server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Kirim perintah ke endpoint Serve:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Llama 3.1 70B

  1. Siapkan penerusan port ke server:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &
    
  2. Kirim perintah ke endpoint Serve:

    curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
    

Konfigurasi tambahan

Secara opsional, Anda dapat mengonfigurasi resource dan teknik penyaluran model berikut yang didukung framework Ray Serve:

Men-deploy RayService

Anda dapat men-deploy model yang sama dari tutorial ini menggunakan resource kustom RayService.

  1. Hapus resource kustom RayCluster yang Anda buat dalam tutorial ini:

    kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
    
  2. Buat resource kustom RayService untuk men-deploy model:

    Llama-3-8B-Instruct

    1. Periksa manifes ray-service.tpu-v5e-singlehost.yaml:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Terapkan manifes:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

      GKE membuat RayService dengan workergroup yang berisi host tunggal TPU v5e dalam topologi 2x4.

    Mistral-7B

    1. Periksa manifes ray-service.tpu-v5e-singlehost.yaml:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Terapkan manifes:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

      GKE membuat RayService dengan workergroup yang berisi host tunggal TPU v5e dalam topologi 2x4.

    Llava-1.5-13b-hf

    1. Periksa manifes ray-service.tpu-v5e-singlehost.yaml:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Terapkan manifes:

      envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

      GKE membuat RayService dengan workergroup yang berisi host tunggal TPU v5e dalam topologi 2x4.

    Llama 3.1 70B

    1. Periksa manifes ray-service.tpu-v6e-singlehost.yaml:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
            - name: llm
              import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
              deployments:
              - name: VLLMDeployment
                num_replicas: 1
              runtime_env:
                working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
                env_vars:
                  MODEL_ID: "$MODEL_ID"
                  MAX_MODEL_LEN: "$MAX_MODEL_LEN"
                  DTYPE: "$DTYPE"
                  TOKENIZER_MODE: "$TOKENIZER_MODE"
                  TPU_CHIPS: "8"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  imagePullPolicy: IfNotPresent
                  ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                  - name: HUGGING_FACE_HUB_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-secret
                        key: hf_api_token
                  - name: VLLM_XLA_CACHE_PATH
                    value: "/data"
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - groupName: tpu-group
            replicas: 1
            minReplicas: 1
            maxReplicas: 1
            numOfHosts: 1
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                  - name: ray-worker
                    image: $VLLM_IMAGE
                    imagePullPolicy: IfNotPresent
                    resources:
                      limits:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                      requests:
                        cpu: "100"
                        google.com/tpu: "8"
                        ephemeral-storage: 40G
                        memory: 200G
                    env:
                      - name: JAX_PLATFORMS
                        value: "tpu"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                      - name: VLLM_XLA_CACHE_PATH
                        value: "/data"
                    volumeMounts:
                    - name: gcs-fuse-csi-ephemeral
                      mountPath: /data
                    - name: dshm
                      mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Terapkan manifes:

      envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

      Perintah envsubst akan mengganti variabel lingkungan dalam manifes.

    GKE membuat resource kustom RayCluster tempat aplikasi Ray Serve di-deploy dan resource kustom RayService berikutnya dibuat.

  3. Verifikasi status resource RayService:

    kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu
    

    Tunggu hingga status Layanan berubah menjadi Running:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          1
    
  4. Ambil nama layanan head RayCluster:

    SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
        --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})
    
  5. Buat sesi port-forwarding ke head Ray untuk melihat dasbor Ray:

    pkill -f "kubectl .* port-forward .* 8265:8265"
    kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
    
  6. Lihat Dasbor Ray.

  7. Menayangkan model.

  8. Bersihkan resource RayService:

    kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu
    

Membuat beberapa model dengan komposisi model

Komposisi model adalah teknik untuk menyusun beberapa model menjadi satu aplikasi.

Di bagian ini, Anda akan menggunakan cluster GKE untuk menyusun dua model, Llama 3 8B IT dan Gemma 7B IT, menjadi satu aplikasi:

  • Model pertama adalah model asisten yang menjawab pertanyaan yang diajukan dalam perintah.
  • Model kedua adalah model ringkasan. Output model asisten dirantai ke input model peringkas. Hasil akhirnya adalah versi ringkasan respons dari model asisten.
  1. Siapkan lingkungan Anda:

    export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
    export SUMMARIZER_MODEL_ID=google/gemma-7b-it
    
  2. Untuk cluster Standard, buat node pool slice TPU host tunggal tambahan:

    gcloud container node-pools create tpu-2 \
      --location=${COMPUTE_ZONE} \
      --cluster=${CLUSTER_NAME} \
      --machine-type=MACHINE_TYPE \
      --num-nodes=1
    

    Ganti MACHINE_TYPE dengan salah satu jenis mesin berikut:

    • ct5lp-hightpu-8t untuk menyediakan TPU v5e.
    • ct6e-standard-8t untuk menyediakan TPU v6e.

    Cluster Autopilot otomatis menyediakan node yang diperlukan.

  3. Deploy resource RayService berdasarkan versi TPU yang ingin Anda gunakan:

    TPU v5e

    1. Periksa manifes ray-service.tpu-v5e-singlehost.yaml:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Terapkan manifes:

      envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      

    TPU v6e

    1. Periksa manifes ray-service.tpu-v6e-singlehost.yaml:

      apiVersion: ray.io/v1
      kind: RayService
      metadata:
        name: vllm-tpu
      spec:
        serveConfigV2: |
          applications:
          - name: llm
            route_prefix: /
            import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
            deployments:
            - name: MultiModelDeployment
              num_replicas: 1
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
              env_vars:
                ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
                SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
                TPU_CHIPS: "16"
                TPU_HEADS: "2"
        rayClusterConfig:
          headGroupSpec:
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: ray-head
                  image: $VLLM_IMAGE
                  resources:
                    limits:
                      cpu: "2"
                      memory: 8G
                    requests:
                      cpu: "2"
                      memory: 8G
                  ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          workerGroupSpecs:
          - replicas: 2
            minReplicas: 1
            maxReplicas: 2
            numOfHosts: 1
            groupName: tpu-group
            rayStartParams: {}
            template:
              metadata:
                annotations:
                  gke-gcsfuse/volumes: "true"
                  gke-gcsfuse/cpu-limit: "0"
                  gke-gcsfuse/memory-limit: "0"
                  gke-gcsfuse/ephemeral-storage-limit: "0"
              spec:
                serviceAccountName: $KSA_NAME
                containers:
                - name: llm
                  image: $VLLM_IMAGE
                  env:
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
                    - name: VLLM_XLA_CACHE_PATH
                      value: "/data"
                  resources:
                    limits:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                    requests:
                      cpu: "100"
                      google.com/tpu: "8"
                      ephemeral-storage: 40G
                      memory: 200G
                  volumeMounts:
                  - name: gcs-fuse-csi-ephemeral
                    mountPath: /data
                  - name: dshm
                    mountPath: /dev/shm
                volumes:
                - name: gke-gcsfuse-cache
                  emptyDir:
                    medium: Memory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: gcs-fuse-csi-ephemeral
                  csi:
                    driver: gcsfuse.csi.storage.gke.io
                    volumeAttributes:
                      bucketName: $GSBUCKET
                      mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                  cloud.google.com/gke-tpu-topology: 2x4
    2. Terapkan manifes:

      envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
      
  4. Tunggu hingga status resource RayService berubah menjadi Running:

    kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
    

    Outputnya mirip dengan hal berikut ini:

    NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
    vllm-tpu   Running          2
    

    Dalam output ini, status RUNNING menunjukkan resource RayService sudah siap.

  5. Pastikan GKE membuat Layanan untuk aplikasi Ray Serve:

    kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc
    

    Outputnya mirip dengan hal berikut ini:

    NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
    vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###
    
  6. Buat sesi port-forwarding ke head Ray:

    pkill -f "kubectl .* port-forward .* 8000:8000"
    kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &
    
  7. Kirim permintaan ke model:

    curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
    

    Outputnya mirip dengan hal berikut ini:

      {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
    

Mem-build dan men-deploy image TPU

Tutorial ini menggunakan image TPU yang dihosting dari vLLM. vLLM menyediakan image Dockerfile.tpu yang mem-build vLLM di atas image PyTorch XLA yang diperlukan yang menyertakan dependensi TPU. Namun, Anda juga dapat mem-build dan men-deploy image TPU Anda sendiri untuk kontrol yang lebih terperinci atas konten image Docker Anda.

  1. Buat repositori Docker untuk menyimpan image container untuk panduan ini:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
    gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev
    
  2. Clone repositori vLLM:

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    
  3. Buat gambar:

    docker build -f Dockerfile.tpu . -t vllm-tpu
    
  4. Beri tag pada image TPU dengan nama Artifact Registry Anda:

    export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
    docker tag vllm-tpu ${VLLM_IMAGE}
    

    Ganti TAG dengan nama tag yang ingin Anda tentukan. Jika Anda tidak menentukan tag, Docker akan menerapkan tag terbaru default.

  5. Mengirim image ke Artifact Registry:

    docker push ${VLLM_IMAGE}
    

Menghapus resource satu per satu

Jika Anda telah menggunakan project yang sudah ada dan tidak ingin menghapusnya, Anda dapat menghapus resource individual tersebut.

  1. Hapus resource kustom RayCluster:

    kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu
    
  2. Hapus bucket Cloud Storage:

    gcloud storage rm -r gs://${GSBUCKET}
    
  3. Hapus repositori Artifact Registry:

    gcloud artifacts repositories delete vllm-tpu \
        --location=${COMPUTE_REGION}
    
  4. Hapus cluster:

    gcloud container clusters delete ${CLUSTER_NAME} \
        --location=LOCATION
    

    Ganti LOCATION dengan salah satu variabel lingkungan berikut:

    • Untuk cluster Autopilot, gunakan COMPUTE_REGION.
    • Untuk cluster Standard, gunakan COMPUTE_ZONE.

Menghapus project

Jika Anda men-deploy tutorial di project Google Cloud baru, dan tidak lagi memerlukan project tersebut, hapus dengan melakukan langkah-langkah berikut:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Langkah berikutnya