이 페이지는 Cloud Translation API를 통해 번역되었습니다.

KubeRay를 사용하여 GKE에서 TPU를 사용하는 LLM 제공

Autopilot Standard

이 튜토리얼에서는 Ray Operator 부가기능 및 vLLM 제공 프레임워크를 사용하여 Google Kubernetes Engine (GKE)에서 Tensor Processing Unit (TPU)을 사용하여 대규모 언어 모델 (LLM)을 제공하는 방법을 보여줍니다.

이 튜토리얼에서는 다음과 같이 TPU v5e 또는 TPU Trillium (v6e)에서 LLM 모델을 제공할 수 있습니다.

단일 호스트 TPU v5e에서 Llama 3 8B 안내
단일 호스트 TPU v5e에서 Mistral 7B instruct v0.3
단일 호스트 TPU Trillium (v6e)의 Llama 3.1 70B

이 가이드는 vLLM이 있는 TPU에서 Kubernetes 컨테이너 조정 기능을 사용하여 Ray를 통해 모델을 제공하는 데 관심이 있는 생성형 AI 고객, 신규 및 기존 GKE 사용자, ML 엔지니어, MLOps (DevOps) 엔지니어, 플랫폼 관리자를 대상으로 합니다.

배경

이 섹션에서는 이 가이드에서 사용되는 주요 기술을 설명합니다.

GKE 관리형 Kubernetes 서비스

Google Cloud 는 AI/ML 워크로드의 배포 및 관리에 적합한 GKE를 비롯한 다양한 서비스를 제공합니다. GKE는 컨테이너화된 애플리케이션의 배포, 확장, 관리를 간소화하는 관리형 Kubernetes 서비스입니다. GKE는 확장 가능한 리소스, 분산 컴퓨팅, 효율적인 네트워킹을 비롯하여 LLM의 계산 요구사항을 처리하는 데 필요한 인프라를 제공합니다.

주요 Kubernetes 개념에 대해 자세히 알아보려면 Kubernetes 학습 시작하기를 참고하세요. GKE 및 GKE가 Kubernetes를 확장, 자동화, 관리하는 데 어떤 도움이 되는지 자세히 알아보려면 GKE 개요를 참고하세요.

Ray 연산자

GKE의 Ray Operator 부가기능은 머신러닝 워크로드의 제공, 학습, 미세 조정을 위한 엔드 투 엔드 AI/ML 플랫폼을 제공합니다. 이 튜토리얼에서는 Ray의 프레임워크인 Ray Serve를 사용하여 Hugging Face의 인기 LLM을 제공합니다.

TPU

TPU는 Google이 TensorFlow, PyTorch 및 JAX와 같은 프레임워크를 활용하여 빌드된 머신러닝 및 AI 모델을 가속화하는 데 사용하기 위해 커스텀 개발한 ASIC(application-specific integrated circuits)입니다.

이 튜토리얼에서는 지연 시간이 짧은 프롬프트를 제공하기 위한 각 모델 요구사항에 따라 구성된 TPU 토폴로지를 사용하여 TPU v5e 또는 TPU Trillium (v6e) 노드에서 LLM 모델을 제공하는 방법을 설명합니다.

vLLM

vLLM은 TPU의 제공 처리량을 늘릴 수 있는 고도로 최적화된 오픈소스 LLM 제공 프레임워크로 다음과 같은 기능을 제공합니다.

PagedAttention으로 최적화된 Transformer구현
전체 제공 처리량을 개선하기 위한 연속적인 작업 일괄 처리
여러 GPU에서 텐서 동시 로드 및 분산 제공

자세한 내용은 vLLM 문서를 참고하세요.

목표

이 가이드는 다음 과정을 다룹니다.

TPU 노드 풀이 있는 GKE 클러스터를 만듭니다.
단일 호스트 TPU 슬라이스로 RayCluster 커스텀 리소스를 배포합니다. GKE는 RayCluster 커스텀 리소스를 Kubernetes 포드로 배포합니다.
LLM을 제공합니다.
모델과 상호작용합니다.

원하는 경우 Ray Serve 프레임워크에서 지원하는 다음과 같은 모델 서빙 리소스와 기법을 구성할 수 있습니다.

RayService 커스텀 리소스를 배포합니다.
모델 컴포지션으로 여러 모델을 컴포지션합니다.

시작하기 전에

시작하기 전에 다음 태스크를 수행했는지 확인합니다.

Google Kubernetes Engine API를 사용 설정합니다.

Google Kubernetes Engine API 사용 설정

이 태스크에 Google Cloud CLI를 사용하려면 gcloud CLI를 설치한 후 초기화하세요. 이전에 gcloud CLI를 설치한 경우 gcloud components update를 실행하여 최신 버전을 가져옵니다.
참고: 기존 gcloud CLI 설치의 경우 compute/region 및 compute/zone 속성을 설정해야 합니다. 기본 위치를 설정하면 gcloud CLI에서 One of [--zone, --region] must be supplied: Please specify location과 같은 오류를 방지할 수 있습니다.

아직 계정이 없다면 Hugging Face 계정을 만듭니다.
Hugging Face 토큰이 있는지 확인합니다.
사용하려는 Hugging Face 모델에 대한 액세스 권한이 있어야 합니다. 일반적으로 계약에 서명하고 Hugging Face 모델 페이지에서 모델 소유자에게 액세스 권한을 요청하여 이 액세스 권한을 얻습니다.

개발 환경 준비

프로젝트에 단일 호스트 TPU v5e 또는 단일 호스트 TPU Trillium (v6e)에 Google Cloud 대한 충분한 할당량이 있는지 확인합니다. 할당량을 관리하려면 TPU 할당량을 참고하세요.
Google Cloud 콘솔에서 Cloud Shell 인스턴스를 시작합니다.
Cloud Shell 열기

샘플 저장소를 클론합니다.

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples

작업 디렉터리로 이동합니다.
```
cd ai-ml/gke-ray/rayserve/llm
```

GKE 클러스터 생성을 위한 기본 환경 변수를 설정합니다.

Llama-3-8B-Instruct

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

다음을 바꿉니다.

HUGGING_FACE_TOKEN: Hugging Face 액세스 토큰입니다.
REGION: TPU 할당량이 있는 리전입니다. 이 지역에서 사용할 TPU 버전을 사용할 수 있는지 확인합니다. 자세한 내용은 GKE의 TPU 가용성을 참고하세요.
ZONE: 사용 가능한 TPU 할당량이 있는 영역입니다.
VLLM_IMAGE: vLLM TPU 이미지입니다. 공개 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 이미지를 사용하거나 자체 TPU 이미지를 빌드할 수 있습니다.

Mistral-7B

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export TOKENIZER_MODE=mistral
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

다음을 바꿉니다.

HUGGING_FACE_TOKEN: Hugging Face 액세스 토큰입니다.
REGION: TPU 할당량이 있는 리전입니다. 이 리전에서 사용할 TPU 버전을 사용할 수 있는지 확인합니다. 자세한 내용은 GKE의 TPU 가용성을 참고하세요.
ZONE: 사용 가능한 TPU 할당량이 있는 영역입니다.
VLLM_IMAGE: vLLM TPU 이미지입니다. 공개 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 이미지를 사용하거나 자체 TPU 이미지를 빌드할 수 있습니다.

Llama 3.1 70B

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Llama-3.1-70B"
export MAX_MODEL_LEN=8192
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

다음을 바꿉니다.

HUGGING_FACE_TOKEN: Hugging Face 액세스 토큰입니다.
REGION: TPU 할당량이 있는 리전입니다. 이 리전에서 사용할 TPU 버전을 사용할 수 있는지 확인합니다. 자세한 내용은 GKE의 TPU 가용성을 참고하세요.
ZONE: 사용 가능한 TPU 할당량이 있는 영역입니다.
VLLM_IMAGE: vLLM TPU 이미지입니다. 공개 docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 이미지를 사용하거나 자체 TPU 이미지를 빌드할 수 있습니다.

vLLM 컨테이너 이미지를 가져옵니다.
```
docker pull ${VLLM_IMAGE}
```

클러스터 만들기

Ray Operator 부가기능을 사용하여 GKE Autopilot 또는 Standard 클러스터에서 Ray를 사용하여 TPU로 LLM을 제공할 수 있습니다.

권장 사항:

완전 관리형 Kubernetes 환경을 위해서는 Autopilot 클러스터를 사용하는 것이 좋습니다. 워크로드에 가장 적합한 GKE 작업 모드를 선택하려면 GKE 작업 모드 선택을 참조하세요.

Cloud Shell을 사용하여 Autopilot 또는 Standard 클러스터를 만듭니다.

Autopilot

Ray Operator 부가기능이 사용 설정된 GKE Autopilot 클러스터를 만듭니다.

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --release-channel=rapid \
    --location=${COMPUTE_REGION}

표준

다음과 같이 Ray 연산자 부가기능이 사용 설정된 표준 클러스터를 만듭니다.

gcloud container clusters create ${CLUSTER_NAME} \
    --release-channel=rapid \
    --location=${COMPUTE_ZONE} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --machine-type="n1-standard-4" \
    --addons=RayOperator,GcsFuseCsiDriver

다음과 같이 단일 호스트 TPU 슬라이스 노드 풀을 만듭니다.

Llama-3-8B-Instruct

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE는 ct5lp-hightpu-8t 머신 유형으로 TPU v5e 노드 풀을 만듭니다.

Mistral-7B

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE는 ct5lp-hightpu-8t 머신 유형으로 TPU v5e 노드 풀을 만듭니다.

Llama 3.1 70B

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct6e-standard-8t \
    --num-nodes=1

GKE는 ct6e-standard-8t 머신 유형으로 TPU v6e 노드 풀을 만듭니다.

클러스터와 통신하도록 kubectl 구성

클러스터와 통신하도록 kubectl을 구성하려면 다음 명령어를 실행합니다.

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

표준

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

Hugging Face 사용자 인증 정보용 Kubernetes 보안 비밀 만들기

Hugging Face 토큰이 포함된 Kubernetes 보안 비밀을 만들려면 다음 명령어를 실행합니다.

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -

Cloud Storage 버킷 만들기

vLLM 배포 시작 시간을 가속화하고 노드당 필요한 디스크 공간을 최소화하려면 Cloud Storage FUSE CSI 드라이버를 사용하여 다운로드한 모델과 컴파일 캐시를 Ray 노드에 마운트합니다.

Cloud Shell에서 다음 명령어를 실행합니다.

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

이 명령어는 Hugging Face에서 다운로드한 모델 파일을 저장할 Cloud Storage 버킷을 만듭니다.

버킷에 액세스할 Kubernetes ServiceAccount 설정

Kubernetes ServiceAccount를 만듭니다.

kubectl create serviceaccount ${KSA_NAME} \
    --namespace ${NAMESPACE}

Kubernetes ServiceAccount에 Cloud Storage 버킷에 대한 읽기/쓰기 액세스 권한을 부여합니다.
```
gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
    --role "roles/storage.objectUser"
```
GKE는 LLM에 다음 리소스를 만듭니다.
1. 다운로드된 모델과 컴파일 캐시를 저장할 Cloud Storage 버킷 Cloud Storage FUSE CSI 드라이버가 버킷의 콘텐츠를 읽습니다.
2. 파일 캐싱이 사용 설정된 볼륨 및 Cloud Storage FUSE의 병렬 다운로드 기능
권장사항:
가중치 파일과 같이 예상되는 모델 콘텐츠 크기에 따라 tmpfs 또는 Hyperdisk / Persistent Disk로 지원되는 파일 캐시를 사용합니다. 이 튜토리얼에서는 RAM으로 지원되는 Cloud Storage FUSE 파일 캐시를 사용합니다.

RayCluster 커스텀 리소스 배포

일반적으로 하나의 시스템 포드와 여러 작업자 포드로 구성된 RayCluster 커스텀 리소스를 배포합니다.

Llama-3-8B-Instruct

다음 단계를 완료하여 Llama 3 8B 명령어 조정 모델을 배포할 RayCluster 커스텀 리소스를 만듭니다.

ray-cluster.tpu-v5e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.
```
envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
envsubst 명령어는 매니페스트의 환경 변수를 대체합니다.

GKE는 2x4 토폴로지에 TPU v5e 단일 호스트가 포함된 workergroup를 사용하여 RayCluster 커스텀 리소스를 만듭니다.

Mistral-7B

다음 단계를 완료하여 Mistral-7B 모델을 배포할 RayCluster 맞춤 리소스를 만듭니다.

ray-cluster.tpu-v5e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.
```
envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
envsubst 명령어는 매니페스트의 환경 변수를 대체합니다.

GKE는 2x4 토폴로지에 TPU v5e 단일 호스트가 포함된 workergroup를 사용하여 RayCluster 커스텀 리소스를 만듭니다.

Llama 3.1 70B

다음 단계를 완료하여 Llama 3.1 70B 모델을 배포할 RayCluster 커스텀 리소스를 만듭니다.

ray-cluster.tpu-v6e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
          cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.
```
envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
envsubst 명령어는 매니페스트의 환경 변수를 대체합니다.

GKE는 2x4 토폴로지에 TPU v6e 단일 호스트가 포함된 workergroup를 사용하여 RayCluster 커스텀 리소스를 만듭니다.

RayCluster 커스텀 리소스에 연결

RayCluster 커스텀 리소스가 생성되면 RayCluster 리소스에 연결하고 모델 서빙을 시작할 수 있습니다.

GKE에서 RayCluster 서비스가 생성되었는지 확인합니다.

kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
    --output wide

출력은 다음과 비슷합니다.

NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###

STATUS가 ready이고 HEAD POD IP 및 HEAD SERVICE IP 열에 IP 주소가 있을 때까지 기다립니다.

Ray 헤드에 대한 port-forwarding 세션을 설정합니다.

pkill -f "kubectl .* port-forward .* 8265:8265"
pkill -f "kubectl .* port-forward .* 10001:10001"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &

Ray 클라이언트가 원격 RayCluster 맞춤 리소스에 연결할 수 있는지 확인합니다.

docker run --net=host -it ${VLLM_IMAGE} \
ray list nodes --address http://localhost:8265

출력은 다음과 비슷합니다.

======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                           memory: #.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           node:__internal_head__: 1.0
                                                                                           object_store_memory: #.### GiB
1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                           TPU: 8.0
                                                                                           TPU-v#e-8-head: 1.0
                                                                                           accelerator_type:TPU-V#E: 1.0
                                                                                           memory: ###.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           object_store_memory: ##.### GiB
                                                                                           tpu-group-0: 1.0

vLLM으로 모델 배포

vLLM을 사용하여 모델을 배포합니다.

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'

Ray 대시보드 보기

Ray 대시보드에서 Ray Serve 배포 및 관련 로그를 볼 수 있습니다.

Cloud Shell 작업 표시줄의 오른쪽 상단에 있는 웹 미리보기 버튼을 클릭합니다.
포트 변경을 클릭하고 포트 번호를 8265로 설정합니다.
변경 및 미리보기를 클릭합니다.
Ray 대시보드에서 Serve 탭을 클릭합니다.

Serve 배포의 상태가 HEALTHY이면 모델이 입력 처리를 시작할 준비가 된 것입니다.

모델 제공

이 가이드에서는 프롬프트에서 텍스트 콘텐츠를 만들 수 있는 기술인 텍스트 생성을 지원하는 모델을 강조 표시합니다.

Llama-3-8B-Instruct

서버로의 포트 전달을 설정합니다.

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Serve 엔드포인트에 프롬프트를 전송합니다.

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

다음 섹션을 펼쳐 출력 예시를 확인하세요.

{"prompt": "What
are the top 5 most popular programming languages? Be brief.", "text": " (Note:
This answer may change over time.)\n\nAccording to the TIOBE Index, a widely
followed measure of programming language popularity, the top 5 languages
are:\n\n1. JavaScript\n2. Python\n3. Java\n4. C++\n5. C#\n\nThese rankings are
based on a combination of search engine queries, web traffic, and online
courses. Keep in mind that other sources may have slightly different rankings.
(Source: TIOBE Index, August 2022)", "token_ids": [320, 9290, 25, 1115, 4320,
1253, 2349, 927, 892, 9456, 11439, 311, 279, 350, 3895, 11855, 8167, 11, 264,
13882, 8272, 6767, 315, 15840, 4221, 23354, 11, 279, 1948, 220, 20, 15823,
527, 1473, 16, 13, 13210, 198, 17, 13, 13325, 198, 18, 13, 8102, 198, 19, 13,
356, 23792, 20, 13, 356, 27585, 9673, 33407, 527, 3196, 389, 264, 10824, 315,
2778, 4817, 20126, 11, 3566, 9629, 11, 323, 2930, 14307, 13, 13969, 304, 4059,
430, 1023, 8336, 1253, 617, 10284, 2204, 33407, 13, 320, 3692, 25, 350, 3895,
11855, 8167, 11, 6287, 220, 2366, 17, 8, 128009]}

Mistral-7B

서버에 대한 포트 전달을 설정합니다.

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Serve 엔드포인트에 프롬프트를 전송합니다.

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

다음 섹션을 펼쳐 출력 예시를 확인하세요.

{"prompt": "What are the top 5 most popular programming languages? Be brief.",
"text": "\n\n1. JavaScript: Widely used for web development, particularly for
client-side scripting and building dynamic web page content.\n\n2. Python:
Known for its simplicity and readability, it's widely used for web
development, machine learning, data analysis, and scientific computing.\n\n3.
Java: A general-purpose programming language used in a wide range of
applications, including Android app development, web services, and
enterprise-level applications.\n\n4. C#: Developed by Microsoft, it's often
used for Windows desktop apps, game development (Unity), and web development
(ASP.NET).\n\n5. TypeScript: A superset of JavaScript that adds optional
static typing and other features for large-scale, maintainable JavaScript
applications.", "token_ids": [781, 781, 29508, 29491, 27049, 29515, 1162,
1081, 1491, 2075, 1122, 5454, 4867, 29493, 7079, 1122, 4466, 29501, 2973,
7535, 1056, 1072, 4435, 11384, 5454, 3652, 3804, 29491, 781, 781, 29518,
29491, 22134, 29515, 1292, 4444, 1122, 1639, 26001, 1072, 1988, 3205, 29493,
1146, 29510, 29481, 13343, 2075, 1122, 5454, 4867, 29493, 6367, 5936, 29493,
1946, 6411, 29493, 1072, 11237, 22031, 29491, 781, 781, 29538, 29491, 12407,
29515, 1098, 3720, 29501, 15460, 4664, 17060, 4610, 2075, 1065, 1032, 6103,
3587, 1070, 9197, 29493, 3258, 13422, 1722, 4867, 29493, 5454, 4113, 29493,
1072, 19123, 29501, 5172, 9197, 29491, 781, 781, 29549, 29491, 1102, 29539,
29515, 9355, 1054, 1254, 8670, 29493, 1146, 29510, 29481, 3376, 2075, 1122,
9723, 25470, 14189, 29493, 2807, 4867, 1093, 2501, 1240, 1325, 1072, 5454,
4867, 1093, 2877, 29521, 29491, 12466, 1377, 781, 781, 29550, 29491, 6475,
7554, 29515, 1098, 26434, 1067, 1070, 27049, 1137, 14401, 12052, 1830, 25460,
1072, 1567, 4958, 1122, 3243, 29501, 6473, 29493, 9855, 1290, 27049, 9197,
29491, 2]}

Llama 3.1 70B

서버에 대한 포트 전달을 설정합니다.

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Serve 엔드포인트에 프롬프트를 전송합니다.

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

다음 섹션을 펼쳐 출력 예시를 확인하세요.

{"prompt": "What are
the top 5 most popular programming languages? Be brief.", "text": " This is a
very subjective question, but there are some general guidelines to follow when
selecting a language. For example, if you\u2019re looking for a language
that\u2019s easy to learn, you might want to consider Python. It\u2019s one of
the most popular languages in the world, and it\u2019s also relatively easy to
learn. If you\u2019re looking for a language that\u2019s more powerful, you
might want to consider Java. It\u2019s a more complex language, but it\u2019s
also very popular. Whichever language you choose, make sure you do your
research and pick one that\u2019s right for you.\nThe most popular programming
languages are:\nWhy is C++ so popular?\nC++ is a powerful and versatile
language that is used in many different types of software. It is also one of
the most popular programming languages, with a large community of developers
who are always creating new and innovative ways to use it. One of the reasons
why C++ is so popular is because it is a very efficient language. It allows
developers to write code that is both fast and reliable, which is essential
for many types of software. Additionally, C++ is very flexible, meaning that
it can be used for a wide range of different purposes. Finally, C++ is also
very popular because it is easy to learn. There are many resources available
online and in books that can help anyone get started with learning the
language.\nJava is a versatile language that can be used for a variety of
purposes. It is one of the most popular programming languages in the world and
is used by millions of people around the globe. Java is used for everything
from developing desktop applications to creating mobile apps and games. It is
also a popular choice for web development. One of the reasons why Java is so
popular is because it is a platform-independent language. This means that it
can be used on any type of computer or device, regardless of the operating
system. Java is also very versatile and can be used for a variety of different
purposes.", "token_ids": [1115, 374, 264, 1633, 44122, 3488, 11, 719, 1070,
527, 1063, 4689, 17959, 311, 1833, 994, 27397, 264, 4221, 13, 1789, 3187, 11,
422, 499, 3207, 3411, 369, 264, 4221, 430, 753, 4228, 311, 4048, 11, 499,
2643, 1390, 311, 2980, 13325, 13, 1102, 753, 832, 315, 279, 1455, 5526, 15823,
304, 279, 1917, 11, 323, 433, 753, 1101, 12309, 4228, 311, 4048, 13, 1442,
499, 3207, 3411, 369, 264, 4221, 430, 753, 810, 8147, 11, 499, 2643, 1390,
311, 2980, 8102, 13, 1102, 753, 264, 810, 6485, 4221, 11, 719, 433, 753, 1101,
1633, 5526, 13, 1254, 46669, 4221, 499, 5268, 11, 1304, 2771, 499, 656, 701,
3495, 323, 3820, 832, 430, 753, 1314, 369, 499, 627, 791, 1455, 5526, 15840,
15823, 527, 512, 10445, 374, 356, 1044, 779, 5526, 5380, 34, 1044, 374, 264,
8147, 323, 33045, 4221, 430, 374, 1511, 304, 1690, 2204, 4595, 315, 3241, 13,
1102, 374, 1101, 832, 315, 279, 1455, 5526, 15840, 15823, 11, 449, 264, 3544,
4029, 315, 13707, 889, 527, 2744, 6968, 502, 323, 18699, 5627, 311, 1005, 433,
13, 3861, 315, 279, 8125, 3249, 356, 1044, 374, 779, 5526, 374, 1606, 433,
374, 264, 1633, 11297, 4221, 13, 1102, 6276, 13707, 311, 3350, 2082, 430, 374,
2225, 5043, 323, 15062, 11, 902, 374, 7718, 369, 1690, 4595, 315, 3241, 13,
23212, 11, 356, 1044, 374, 1633, 19303, 11, 7438, 430, 433, 649, 387, 1511,
369, 264, 7029, 2134, 315, 2204, 10096, 13, 17830, 11, 356, 1044, 374, 1101,
1633, 5526, 1606, 433, 374, 4228, 311, 4048, 13, 2684, 527, 1690, 5070, 2561,
2930, 323, 304, 6603, 430, 649, 1520, 5606, 636, 3940, 449, 6975, 279, 4221,
627, 15391, 3S74, 264, 33045, 4221, 430, 649, 387, 1511, 369, 264, 8205, 315,
10096, 13, 1102, 374, 832, 315, 279, 1455, 5526, 15840, 15823, 304, 279, 1917,
323, 374, 1511, 555, 11990, 315, 1274, 2212, 279, 24867, 13, 8102, 374, 1511,
369, 4395, 505, 11469, 17963, 8522, 311, 6968, 6505, 10721, 323, 3953, 13,
1102, 374, 1101, 264, 5526, 5873, 369, 3566, 4500, 13, 3861, 315, 279, 8125,
3249, 8102, 374, 779, 5526, 374, 1606, 433, 374, 264, 5452, 98885, 4221, 13,
1115, 3445, 430, 433, 649, 387, 1511, 389, 904, 955, 315, 6500, 477, 3756, 11,
15851, 315, 279, 10565, 1887, 13, 8102, 374, 1101, 1633, 33045, 323, 649, 387,
1511, 369, 264, 8205, 315, 2204, 10096, 13, 128001]}

추가 구성

원하는 경우 Ray Serve 프레임워크에서 지원하는 다음과 같은 모델 서빙 리소스와 기법을 구성할 수 있습니다.

RayService 커스텀 리소스 배포 이 튜토리얼의 이전 단계에서는 RayService 대신 RayCluster를 사용했습니다. 프로덕션 환경에는 RayService를 사용하는 것이 좋습니다.
모델 구성으로 여러 모델 구성 Ray Serve 프레임워크에서 지원하는 모델 다중화 및 모델 구성을 구성합니다. 모델 컴포지션을 사용하면 여러 LLM에서 입력과 출력을 연결하고 모델을 단일 애플리케이션으로 확장할 수 있습니다.
자체 TPU 이미지를 빌드하고 배포합니다. Docker 이미지의 콘텐츠를 더 세부적으로 제어해야 하는 경우 이 옵션을 사용하는 것이 좋습니다.

RayService 배포

RayService 커스텀 리소스를 사용하여 이 튜토리얼과 동일한 모델을 배포할 수 있습니다.

이 튜토리얼에서 만든 RayCluster 맞춤 리소스를 삭제합니다.
```
kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
```

모델을 배포할 RayService 커스텀 리소스를 만듭니다.

Llama-3-8B-Instruct

ray-service.tpu-v5e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
envsubst 명령어는 매니페스트의 환경 변수를 대체합니다.

GKE는 2x4 토폴로지에 TPU v5e 단일 호스트가 포함된 workergroup로 RayService를 만듭니다.

Mistral-7B

ray-service.tpu-v5e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
envsubst 명령어는 매니페스트의 환경 변수를 대체합니다.

GKE는 2x4 토폴로지에 TPU v5e 단일 호스트가 포함된 workergroup를 사용하여 RayService를 만듭니다.

Llama 3.1 70B

ray-service.tpu-v6e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.
```
envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
envsubst 명령어는 매니페스트의 환경 변수를 대체합니다.

GKE는 Ray Serve 애플리케이션이 배포되고 후속 RayService 커스텀 리소스가 생성되는 RayCluster 커스텀 리소스를 만듭니다.

RayService 리소스의 상태를 확인합니다.

kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu

서비스 상태가 Running로 변경될 때까지 기다립니다.

NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          1

RayCluster 헤드 서비스의 이름을 가져옵니다.

SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
    --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})

Ray 헤드에 대한 port-forwarding 세션을 설정하여 Ray 대시보드를 봅니다.

pkill -f "kubectl .* port-forward .* 8265:8265"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &

Ray 대시보드 보기
모델을 제공합니다.

RayService 리소스를 삭제합니다.

kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu

모델 컴포지션으로 여러 모델 컴포지션

모델 컴포지션은 여러 모델을 단일 애플리케이션으로 구성하는 기법입니다.

이 섹션에서는 GKE 클러스터를 사용하여 Llama 3 8B IT 및 Gemma 7B IT라는 두 모델을 단일 애플리케이션으로 구성합니다.

첫 번째 모델은 프롬프트에서 질문에 답하는 어시스턴트 모델입니다.
두 번째 모델은 요약 도구 모델입니다. 어시스턴트 모델의 출력은 요약 도구 모델의 입력에 연결됩니다. 최종 결과는 어시스턴트 모델의 응답을 요약한 버전입니다.

환경 설정:

export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
export SUMMARIZER_MODEL_ID=google/gemma-7b-it

표준 클러스터의 경우 단일 호스트 TPU 슬라이스 노드 풀을 추가로 만듭니다.
```
gcloud container node-pools create tpu-2 \
  --location=${COMPUTE_ZONE} \
  --cluster=${CLUSTER_NAME} \
  --machine-type=MACHINE_TYPE \
  --num-nodes=1
```
MACHINE_TYPE를 다음 머신 유형 중 하나로 바꿉니다.
- ct5lp-hightpu-8t를 사용하여 TPU v5e를 프로비저닝합니다.
- ct6e-standard-8t를 사용하여 TPU v6e를 프로비저닝합니다.
Autopilot 클러스터는 필요한 노드를 자동으로 프로비저닝합니다.

사용할 TPU 버전에 따라 RayService 리소스를 배포합니다.

TPU v5e

ray-service.tpu-v5e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.

envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

TPU v6e

ray-service.tpu-v6e-singlehost.yaml 매니페스트를 검사합니다.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

매니페스트를 적용합니다.

envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

RayService 리소스의 상태가 Running으로 변경될 때까지 기다립니다.
```
kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
```
출력은 다음과 비슷합니다.
```
NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          2
```
이 출력에서 RUNNING 상태는 RayService 리소스가 준비되었음을 나타냅니다.

GKE가 Ray Serve 애플리케이션에 대해 서비스를 생성했는지 확인합니다.

kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc

출력은 다음과 비슷합니다.

NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###

Ray 헤드에 대한 port-forwarding 세션을 설정합니다.

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &

모델에 요청을 보냅니다.

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

출력은 다음과 비슷합니다.

  {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}

TPU 이미지 빌드 및 배포

이 튜토리얼에서는 vLLM의 호스팅된 TPU 이미지를 사용합니다. vLLM은 TPU 종속 항목이 포함된 필수 PyTorch XLA 이미지 위에 vLLM을 빌드하는 Dockerfile.tpu 이미지를 제공합니다. 하지만 Docker 이미지의 콘텐츠를 더 세부적으로 제어하기 위해 자체 TPU 이미지를 빌드하고 배포할 수도 있습니다.

이 가이드의 컨테이너 이미지를 저장할 Docker 저장소를 만듭니다.

gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev

vLLM 저장소를 클론합니다.

git clone https://github.com/vllm-project/vllm.git
cd vllm

이미지를 빌드합니다.

docker build -f Dockerfile.tpu . -t vllm-tpu

Artifact Registry 이름으로 TPU 이미지에 태그를 지정합니다.
```
export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
docker tag vllm-tpu ${VLLM_IMAGE}
```
TAG를 정의하려는 태그 이름으로 바꿉니다. 태그를 지정하지 않으면 Docker는 기본 최신 태그를 적용합니다.
이미지를 Artifact Registry로 내보냅니다.
```
docker push ${VLLM_IMAGE}
```

개별 리소스 삭제

기존 프로젝트를 사용한 경우 삭제하지 않으려면 개별 리소스를 삭제하면 됩니다.

RayCluster 커스텀 리소스를 삭제합니다.

kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu

Cloud Storage 버킷을 삭제합니다.
```
gcloud storage rm -r gs://${GSBUCKET}
```

Artifact Registry 저장소를 삭제합니다.

gcloud artifacts repositories delete vllm-tpu \
    --location=${COMPUTE_REGION}

다음과 같이 클러스터를 삭제합니다.
```
gcloud container clusters delete ${CLUSTER_NAME} \
    --location=LOCATION
```
LOCATION를 다음 환경 변수 중 하나로 바꿉니다.
- Autopilot 클러스터의 경우 COMPUTE_REGION을 사용합니다.
- 표준 클러스터의 경우 COMPUTE_ZONE을 사용합니다.

프로젝트 삭제

새 프로젝트에 튜토리얼을 배포했고 Google Cloud 프로젝트가 더 이상 필요하지 않으면 다음 단계에 따라 이를 삭제합니다.

주의: 프로젝트를 삭제하면 다음과 같은 효과가 발생합니다.

프로젝트의 모든 항목이 삭제됩니다. 이 문서의 태스크에 기존 프로젝트를 사용한 경우 프로젝트를 삭제하면 프로젝트에서 수행한 다른 작업도 삭제됩니다.
커스텀 프로젝트 ID가 손실됩니다. 이 프로젝트를 만들 때 앞으로 사용할 커스텀 프로젝트 ID를 만들었을 수 있습니다. appspot.com URL과 같이 프로젝트 ID를 사용하는 URL을 보존하려면 전체 프로젝트를 삭제하는 대신 프로젝트 내에서 선택한 리소스만 삭제합니다.

여러 아키텍처, 튜토리얼, 빠른 시작을 살펴보려는 경우 프로젝트를 재사용하면 프로젝트 할당량 한도 초과를 방지할 수 있습니다.

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

다음 단계

GKE 플랫폼 조정 기능으로 최적화된 AI/ML 워크로드를 실행하는 방법 알아보기
GitHub의 샘플 코드를 확인하여 GKE에서 Ray Serve를 사용하는 방법을 알아보세요.
GKE의 Ray 클러스터에 대한 로그 및 측정항목 수집 및 보기의 단계를 완료하여 GKE에서 실행되는 Ray 클러스터의 측정항목을 수집하고 보는 방법을 알아보세요.