Serve an LLM using TPUs on GKE with KubeRay

This tutorial shows how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Ray Operator add-on and the vLLM serving framework.

In this tutorial, you can serve LLMs on TPU v5e or TPU Trillium (v6e) as follows:

Llama 3 8B instruct on a single-host TPU v5e
Mistral 7B instruct v0.3 on a single-host TPU v5e
Llava 1.5 13b hf on a single-host TPU v5e
Llama 3.1 70B on a single-host TPU Trillium (v6e)

This guide is for generative AI customers, new and existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models with Ray on TPUs with vLLM.

Background

This section describes the key technologies used in this tutorial.

GKE managed Kubernetes service

Google Cloud offers a wide range of services, including GKE, which is well suited to deploying and managing AI/ML workloads. GKE is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. GKE provides the necessary infrastructure, including scalable resources, distributed computing, and efficient networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about GKE and how it helps you scale, automate, and manage Kubernetes, see GKE overview.

Ray Operator

The Ray Operator add-on in GKE provides an end-to-end AI/ML platform for serving, training, and fine-tuning machine learning workloads. In this tutorial, you use Ray Serve, a framework in Ray, to serve popular LLMs from Hugging Face.

TPUs

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

This tutorial covers serving LLMs on TPU v5e or TPU Trillium (v6e) nodes, with TPU topologies configured according to each model's requirements for serving prompts with low latency.

vLLM

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on TPUs, with features such as the following:

Optimized transformer implementation with PagedAttention
Continuous batching to improve the overall serving throughput
Tensor parallelism and distributed serving on multiple GPUs

To learn more, refer to the vLLM documentation.

Objectives

This tutorial covers the following steps:

Create a GKE cluster with a TPU node pool.
Deploy a RayCluster custom resource with a single-host TPU slice. GKE deploys the RayCluster custom resource as Kubernetes Pods.
Serve an LLM.
Interact with the models.

Optionally, you can configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService custom resource.
Compose multiple models with model composition.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, run gcloud components update to get the latest version.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. By setting default locations, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location.

Create a Hugging Face account, if you don't already have one.
Make sure that you have a Hugging Face token.
Make sure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.

Prepare your environment

Make sure that you have enough quota in your Google Cloud project for a single-host TPU v5e or a single-host TPU Trillium (v6e). To manage your quota, see TPU quotas.
In the Google Cloud console, start a Cloud Shell instance.

Clone the sample repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples

Navigate to the working directory:
```
cd ai-ml/gke-ray/rayserve/llm
```

Configure the default environment variables for creating the GKE cluster:

Llama-3-8B-Instruct

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 or build your own TPU image.

Mistral-7B

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export TOKENIZER_MODE=mistral
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 or build your own TPU image.

Llava-1.5-13b-hf

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="llava-hf/llava-1.5-13b-hf"
export DTYPE=bfloat16
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 or build your own TPU image.

Llama 3.1 70B

export PROJECT_ID=$(gcloud config get project)
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export CLUSTER_NAME=vllm-tpu
export COMPUTE_REGION=REGION
export COMPUTE_ZONE=ZONE
export HF_TOKEN=HUGGING_FACE_TOKEN
export GSBUCKET=vllm-tpu-bucket
export KSA_NAME=vllm-sa
export NAMESPACE=default
export MODEL_ID="meta-llama/Llama-3.1-70B"
export MAX_MODEL_LEN=8192
export VLLM_IMAGE=docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1
export SERVICE_NAME=vllm-tpu-head-svc

Replace the following:

HUGGING_FACE_TOKEN: your Hugging Face access token.
REGION: the region where you have TPU quota. Make sure that the TPU version that you want to use is available in this region. To learn more, see TPU availability in GKE.
ZONE: the zone with available TPU quota.
VLLM_IMAGE: the vLLM TPU image. You can use the public image docker.io/vllm/vllm-tpu:866fa4550d572f4ff3521ccf503e0df2e76591a1 or build your own TPU image.

Pull the vLLM container image:
```
docker pull ${VLLM_IMAGE}
```
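
The commands in the rest of this tutorial expand the environment variables exported above, so an unset variable can lead to confusing errors later. The following is a minimal sketch of a pre-flight check. The helper name `check_required_vars` is hypothetical and is not part of the sample repository.

```shell
# Hypothetical helper (not part of the sample repository): report which of the
# named variables are unset or empty in the current shell.
check_required_vars() {
  missing=""
  for name in "$@"; do
    # Look up the variable called $name in a POSIX-compatible way.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing variables:$missing" >&2
    return 1
  fi
  echo "All required variables are set."
}

# Example call with the variable names exported in this tutorial.
# "|| true" keeps the demonstration from aborting a script when something is missing.
check_required_vars PROJECT_ID PROJECT_NUMBER CLUSTER_NAME COMPUTE_REGION \
  COMPUTE_ZONE HF_TOKEN GSBUCKET KSA_NAME NAMESPACE MODEL_ID VLLM_IMAGE \
  SERVICE_NAME || true
```

Model-specific variables such as TOKENIZER_MODE, DTYPE, or MAX_MODEL_LEN can be appended to the argument list for the models that export them.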

Create a cluster

You can serve an LLM on TPUs with Ray in a GKE Autopilot or Standard cluster using the Ray Operator add-on.

Best practice:

Use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

Create a GKE Autopilot cluster with the Ray Operator add-on enabled:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --release-channel=rapid \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster with the Ray Operator add-on enabled:

gcloud container clusters create ${CLUSTER_NAME} \
    --release-channel=rapid \
    --location=${COMPUTE_ZONE} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --machine-type="n1-standard-4" \
    --addons=RayOperator,GcsFuseCsiDriver

Create a single-host TPU slice node pool:

Llama-3-8B-Instruct

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

Mistral-7B

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

Llava-1.5-13b-hf

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct5lp-hightpu-8t \
    --num-nodes=1

GKE creates a TPU v5e node pool with a ct5lp-hightpu-8t machine type.

Llama 3.1 70B

gcloud container node-pools create tpu-1 \
    --location=${COMPUTE_ZONE} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct6e-standard-8t \
    --num-nodes=1

GKE creates a TPU v6e node pool with a ct6e-standard-8t machine type.
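
The four node pool commands above differ only in the machine type. As a rough sketch, that choice can be expressed as a lookup keyed on the MODEL_ID values used in this tutorial. The function name `tpu_machine_type_for_model` is hypothetical.

```shell
# Hypothetical helper: map the MODEL_ID values used in this tutorial to the
# TPU machine type of the node pool commands shown above.
tpu_machine_type_for_model() {
  case "$1" in
    meta-llama/Meta-Llama-3-8B-Instruct|mistralai/Mistral-7B-Instruct-v0.3|llava-hf/llava-1.5-13b-hf)
      echo "ct5lp-hightpu-8t" ;;   # TPU v5e, single host
    meta-llama/Llama-3.1-70B)
      echo "ct6e-standard-8t" ;;   # TPU Trillium (v6e), single host
    *)
      echo "unknown model: $1" >&2
      return 1 ;;
  esac
}

tpu_machine_type_for_model "meta-llama/Llama-3.1-70B"
```

The printed value could then be passed to the --machine-type flag, for example as --machine-type="$(tpu_machine_type_for_model "${MODEL_ID}")".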

Configure kubectl to communicate with your cluster

To configure kubectl to communicate with your cluster, run the following command:

Autopilot

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

Standard

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${COMPUTE_ZONE}

Create a Kubernetes Secret for Hugging Face credentials

To create a Kubernetes Secret that contains the Hugging Face token, run the following command:

kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl --namespace ${NAMESPACE} apply -f -
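
Kubernetes stores Secret values base64-encoded in the `data` field of the object, which is an encoding rather than encryption. The sketch below illustrates the round trip locally. SAMPLE_TOKEN is a made-up placeholder, not a real token, and the example doesn't contact the cluster.

```shell
# Illustration only: encode and decode a placeholder value the same way a
# Secret's data field is represented. SAMPLE_TOKEN is a made-up value.
SAMPLE_TOKEN="hf_example_token"

encoded=$(printf '%s' "$SAMPLE_TOKEN" | base64)
decoded=$(printf '%s' "$encoded" | base64 --decode)

echo "encoded: $encoded"
echo "decoded: $decoded"

# To read the stored value back from the cluster, a command along these lines
# can be used (it requires kubectl access to the cluster):
#   kubectl --namespace "${NAMESPACE}" get secret hf-secret \
#     --output jsonpath='{.data.hf_api_token}' | base64 --decode
```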

Create a Cloud Storage bucket

To speed up the vLLM deployment startup time and minimize the disk space required per node, use the Cloud Storage FUSE CSI driver to mount the downloaded model and the compilation cache on the Ray nodes.

In Cloud Shell, run the following command:

gcloud storage buckets create gs://${GSBUCKET} \
    --uniform-bucket-level-access

This command creates a Cloud Storage bucket to store the model files that you download from Hugging Face.

Set up a Kubernetes ServiceAccount to access the bucket

Create the Kubernetes ServiceAccount:

kubectl create serviceaccount ${KSA_NAME} \
    --namespace ${NAMESPACE}

Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:
```
gcloud storage buckets add-iam-policy-binding gs://${GSBUCKET} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/${NAMESPACE}/sa/${KSA_NAME}" \
    --role "roles/storage.objectUser"
```
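
The --member value in the command above is assembled from four of the environment variables set earlier. The following sketch builds the same string from positional arguments so that it can be inspected before granting access. The helper name and the sample values are hypothetical.

```shell
# Hypothetical helper: assemble the principal identifier that the command above
# passes to --member.
# Arguments: project number, project ID, namespace, Kubernetes ServiceAccount name.
workload_identity_principal() {
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Placeholder values for illustration. In the tutorial they come from
# PROJECT_NUMBER, PROJECT_ID, NAMESPACE, and KSA_NAME.
workload_identity_principal "123456789012" "my-project" "default" "vllm-sa"
```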
GKE creates the following resources for the LLM:

1. A Cloud Storage bucket to store the downloaded model and the compilation cache. A Cloud Storage FUSE CSI driver reads the content of the bucket.
2. Volumes with file caching enabled and the Cloud Storage FUSE parallel download feature.

Best practice:

Use a file cache backed by tmpfs or by Hyperdisk / Persistent Disk, depending on the expected size of the model contents, for example, the weight files. In this tutorial, you use Cloud Storage FUSE file caching backed by RAM.
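
The RayCluster manifests in this tutorial pass the file cache and parallel download settings to Cloud Storage FUSE as a single comma-separated mountOptions string. As a small reading aid, the sketch below prints that string one option per line. The function name `list_mount_options` is hypothetical.

```shell
# The mountOptions value used by the RayCluster manifests in this tutorial.
MOUNT_OPTIONS="implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"

# Print one option per line so that each setting is easier to read.
list_mount_options() {
  printf '%s\n' "$1" | tr ',' '\n'
}

list_mount_options "$MOUNT_OPTIONS"
```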

Deploy a RayCluster custom resource

Deploy a RayCluster custom resource, which typically consists of one system Pod (the Ray head) and multiple worker Pods.

Llama-3-8B-Instruct

To create the RayCluster custom resource and deploy the Llama 3 8B instruction-tuned model, complete the following steps:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command substitutes the environment variables in the manifest.

GKE creates a RayCluster custom resource with a worker group that contains a single-host TPU v5e in a 2x4 topology.
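
Because envsubst replaces a reference to an unset variable with an empty string, a missing export can silently produce an invalid manifest. The following sketch lists the placeholders in a file whose variables are unset or empty. The helper name `find_unset_placeholders` is hypothetical, and the demonstration uses a three-line fragment instead of the full manifest.

```shell
# Hypothetical helper: list the $VARIABLE placeholders in a file whose
# environment variables are unset or empty. Returns non-zero if any are found.
find_unset_placeholders() {
  # Collect the unique names written as $NAME or ${NAME}.
  names=$(grep -o '\$[{]\{0,1\}[A-Za-z_][A-Za-z0-9_]*' "$1" | tr -d '${' | sort -u)
  rc=0
  for name in $names; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      rc=1
    fi
  done
  return $rc
}

# Demonstration on a small fragment that uses the same placeholders as the manifest.
fragment=$(mktemp)
cat > "$fragment" <<'EOF'
serviceAccountName: $KSA_NAME
image: $VLLM_IMAGE
bucketName: $GSBUCKET
EOF
find_unset_placeholders "$fragment" \
  || echo "Set the variables listed above before running envsubst."
rm -f "$fragment"
```

In this tutorial, the same check could be pointed at the manifest file that you pass to envsubst.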

Mistral-7B

To create the RayCluster custom resource and deploy the Mistral-7B model, complete the following steps:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command substitutes the environment variables in the manifest.

GKE creates a RayCluster custom resource with a worker group that contains a single-host TPU v5e in a 2x4 topology.

Llava-1.5-13b-hf

To create the RayCluster custom resource and deploy the Llava-1.5-13b-hf model, complete the following steps:

Inspect the ray-cluster.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 80G
                memory: 200G
            env:
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command substitutes the environment variables in the manifest.

GKE creates a RayCluster custom resource with a worker group that contains a single-host TPU v5e in a 2x4 topology.

Llama 3.1 70B

To create the RayCluster custom resource and deploy the Llama 3.1 70B model, complete the following steps:

Inspect the ray-cluster.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: vllm-tpu
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
              - containerPort: 8471
                name: slicebuilder
              - containerPort: 8081
                name: mxla
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
  workerGroupSpecs:
  - groupName: tpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    numOfHosts: 1
    rayStartParams: {}
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
          gke-gcsfuse/cpu-limit: "0"
          gke-gcsfuse/memory-limit: "0"
          gke-gcsfuse/ephemeral-storage-limit: "0"
      spec:
        serviceAccountName: $KSA_NAME
        containers:
          - name: ray-worker
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
        volumes:
        - name: gke-gcsfuse-cache
          emptyDir:
            medium: Memory
        - name: dshm
          emptyDir:
            medium: Memory
        - name: gcs-fuse-csi-ephemeral
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: $GSBUCKET
              mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
          cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-cluster.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource with a workergroup that contains a single-host TPU v6e in a 2x4 topology.
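
Because envsubst replaces any variable that is not set in the environment with an empty string, it can help to check the manifest's placeholders before applying it. The following is a minimal Python sketch of such a check, not part of the tutorial's tooling. The helper name `find_unset_variables`, the inline sample fragment, and the value `ray-ksa` are illustrative assumptions.

```python
import os
import re

# Matches $NAME and ${NAME}, the two reference forms that envsubst substitutes.
_VAR_PATTERN = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def find_unset_variables(manifest_text, environ=None):
    """Return the sorted names referenced in manifest_text that are missing from environ."""
    env = os.environ if environ is None else environ
    referenced = {braced or bare for braced, bare in _VAR_PATTERN.findall(manifest_text)}
    return sorted(name for name in referenced if name not in env)


if __name__ == "__main__":
    # Illustrative fragment only; in practice, read the manifest file that you pass to envsubst.
    sample = "serviceAccountName: $KSA_NAME\nimage: $VLLM_IMAGE\nbucketName: ${GSBUCKET}\n"
    # The environ argument is a made-up environment used only for this demonstration.
    print(find_unset_variables(sample, environ={"KSA_NAME": "ray-ksa"}))
```

An empty list means that every placeholder in the manifest has a value in the current environment.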

Connect to the RayCluster custom resource

After you create the RayCluster custom resource, you can connect to it and begin serving the model.

Verify that GKE created the RayCluster Service:

kubectl --namespace ${NAMESPACE} get raycluster/vllm-tpu \
    --output wide

The output is similar to the following:

NAME       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   TPUS   STATUS   AGE   HEAD POD IP      HEAD SERVICE IP
vllm-tpu   1                 1                   ###    ###G     0      8      ready    ###   ###.###.###.###  ###.###.###.###

Wait until STATUS is ready and the HEAD POD IP and HEAD SERVICE IP columns each show an IP address.
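
The readiness condition can also be checked from a script. The following Python sketch parses the text printed by the `kubectl get raycluster --output wide` command by column offset, so that empty cells (for example, an IP address that is not assigned yet) are handled. It assumes that the column names match the sample output and that each row is aligned with the header. The helper names `parse_kubectl_table` and `cluster_is_ready` are hypothetical.

```python
import re


def parse_kubectl_table(text):
    """Parse an aligned kubectl table into a list of {column name: cell} dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Column names can contain single spaces ("HEAD POD IP"); columns are
    # separated by two or more spaces, so record where each name starts.
    columns = [(m.group(0), m.start()) for m in re.finditer(r"\S+(?: \S+)*", header)]
    records = []
    for row in rows:
        record = {}
        for index, (name, start) in enumerate(columns):
            end = columns[index + 1][1] if index + 1 < len(columns) else len(row)
            record[name] = row[start:end].strip()
        records.append(record)
    return records


def cluster_is_ready(record):
    """Return True when STATUS is ready and both head IP columns are populated."""
    return (
        record.get("STATUS") == "ready"
        and bool(record.get("HEAD POD IP"))
        and bool(record.get("HEAD SERVICE IP"))
    )
```

To use the sketch, capture the command's standard output, pass it to `parse_kubectl_table`, and call `cluster_is_ready` on the record whose NAME is vllm-tpu.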

Establish port-forwarding sessions to the Ray head:

pkill -f "kubectl .* port-forward .* 8265:8265"
pkill -f "kubectl .* port-forward .* 10001:10001"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 10001:10001 2>&1 >/dev/null &
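
The port-forwarding commands run in the background, so a session that fails to start is easy to miss. As a minimal sketch, the following Python helpers test whether the forwarded local ports accept TCP connections. The helper names are hypothetical, and the check only confirms that something is listening on the port, not that it is the Ray head.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def closed_ports(ports, host="localhost"):
    """Return the subset of ports on host that are not accepting connections."""
    return [port for port in ports if not port_is_open(host, port)]
```

For the sessions above, `closed_ports([8265, 10001])` returns an empty list when both forwards are active.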

Verify that the Ray client can connect to the remote RayCluster custom resource:

docker run --net=host -it ${VLLM_IMAGE} \
ray list nodes --address http://localhost:8265

The output is similar to the following:

======== List: YYYY-MM-DD HH:MM:SS.NNNNNN ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    NODE_ID    NODE_IP          IS_HEAD_NODE  STATE    STATE_MESSAGE    NODE_NAME          RESOURCES_TOTAL                   LABELS
0  XXXXXXXXXX  ###.###.###.###  True          ALIVE                     ###.###.###.###    CPU: 2.0                          ray.io/node_id: XXXXXXXXXX
                                                                                           memory: #.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           node:__internal_head__: 1.0
                                                                                           object_store_memory: #.### GiB
1  XXXXXXXXXX  ###.###.###.###  False         ALIVE                     ###.###.###.###    CPU: 100.0                       ray.io/node_id: XXXXXXXXXX
                                                                                           TPU: 8.0
                                                                                           TPU-v#e-8-head: 1.0
                                                                                           accelerator_type:TPU-V#E: 1.0
                                                                                           memory: ###.### GiB
                                                                                           node:###.###.###.###: 1.0
                                                                                           object_store_memory: ##.### GiB
                                                                                           tpu-group-0: 1.0

Deploy the model with vLLM

Deploy the model with vLLM:

Llama-3-8B-Instruct

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"}}'

Mistral-7B

docker run \
    --env MODEL_ID=${MODEL_ID} \
    --env TOKENIZER_MODE=${TOKENIZER_MODE} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"}}'

Llava-1.5-13b-hf

docker run \
    --env DTYPE=${DTYPE} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"DTYPE": "bfloat16", "MODEL_ID": "llava-hf/llava-1.5-13b-hf"}}'

Llama 3.1 70B

docker run \
    --env MAX_MODEL_LEN=${MAX_MODEL_LEN} \
    --env MODEL_ID=${MODEL_ID} \
    --net=host \
    --volume=./tpu:/workspace/vllm/tpu \
    -it \
    ${VLLM_IMAGE} \
    serve run serve_tpu:model \
    --address=ray://localhost:10001 \
    --app-dir=./tpu \
    --runtime-env-json='{"env_vars": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"}}'
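
The four commands above differ only in the environment variables that they pass through the --runtime-env-json flag. The following Python sketch builds that flag with `json.dumps` and `shlex.quote` to avoid hand-written quoting mistakes. The per-model values are copied from the commands above; the names `MODEL_ENV_VARS` and `runtime_env_flag` are illustrative assumptions, not part of the sample repository.

```python
import json
import shlex

# Environment variables per model, copied from the commands above.
MODEL_ENV_VARS = {
    "Llama-3-8B-Instruct": {"MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct"},
    "Mistral-7B": {"MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3", "TOKENIZER_MODE": "mistral"},
    "Llava-1.5-13b-hf": {"DTYPE": "bfloat16", "MODEL_ID": "llava-hf/llava-1.5-13b-hf"},
    "Llama 3.1 70B": {"MAX_MODEL_LEN": "8192", "MODEL_ID": "meta-llama/Meta-Llama-3.1-70B"},
}


def runtime_env_flag(env_vars):
    """Return a shell-safe --runtime-env-json flag for the given environment variables."""
    # The commands above pass every value as a string (for example "8192"),
    # so numbers are converted to strings here as well.
    payload = {"env_vars": {key: str(value) for key, value in env_vars.items()}}
    return "--runtime-env-json=" + shlex.quote(json.dumps(payload))


if __name__ == "__main__":
    for model, env_vars in MODEL_ENV_VARS.items():
        print(model, runtime_env_flag(env_vars))
```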

View the Ray Dashboard

You can view your Ray Serve deployment and the relevant logs from the Ray Dashboard.

Click the Web Preview button in the top right of the Cloud Shell taskbar.
Click Change port and set the port number to 8265.
Click Change and Preview.
In the Ray Dashboard, click the Serve tab.

After the Serve deployment has a status of HEALTHY, the model is ready to begin processing inputs.
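
The same status can be read without the browser. The following Python sketch assumes that the Ray Dashboard exposes Serve details at `/api/serve/applications/` on the forwarded port 8265, and that the response nests each deployment's `status` under `applications` and `deployments`. Both the URL and the field names are assumptions about the Ray Serve REST API, so verify them against the Ray version in your image. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the dashboard port-forward from the earlier step is active on port 8265.
SERVE_APPLICATIONS_URL = "http://localhost:8265/api/serve/applications/"


def unhealthy_deployments(details):
    """Return (application, deployment, status) tuples for deployments that are not HEALTHY."""
    problems = []
    for app_name, app in details.get("applications", {}).items():
        for deployment_name, deployment in app.get("deployments", {}).items():
            if deployment.get("status") != "HEALTHY":
                problems.append((app_name, deployment_name, deployment.get("status")))
    return problems


def fetch_serve_details(url=SERVE_APPLICATIONS_URL, timeout=10):
    """Fetch the Serve application details from the Ray Dashboard as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

An empty result from `unhealthy_deployments(fetch_serve_details())` means that no deployment reported a status other than HEALTHY. It is also empty when no application is deployed yet, so check that the response lists your application.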

Serve the model

This guide highlights models that support text generation, a technique that creates text content from a prompt.

Llama-3-8B-Instruct

Set up port forwarding to the server:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

The following is an example of the output:

{"prompt": "What
are the top 5 most popular programming languages? Be brief.", "text": " (Note:
This answer may change over time.)\n\nAccording to the TIOBE Index, a widely
followed measure of programming language popularity, the top 5 languages
are:\n\n1. JavaScript\n2. Python\n3. Java\n4. C++\n5. C#\n\nThese rankings are
based on a combination of search engine queries, web traffic, and online
courses. Keep in mind that other sources may have slightly different rankings.
(Source: TIOBE Index, August 2022)", "token_ids": [320, 9290, 25, 1115, 4320,
1253, 2349, 927, 892, 9456, 11439, 311, 279, 350, 3895, 11855, 8167, 11, 264,
13882, 8272, 6767, 315, 15840, 4221, 23354, 11, 279, 1948, 220, 20, 15823,
527, 1473, 16, 13, 13210, 198, 17, 13, 13325, 198, 18, 13, 8102, 198, 19, 13,
356, 23792, 20, 13, 356, 27585, 9673, 33407, 527, 3196, 389, 264, 10824, 315,
2778, 4817, 20126, 11, 3566, 9629, 11, 323, 2930, 14307, 13, 13969, 304, 4059,
430, 1023, 8336, 1253, 617, 10284, 2204, 33407, 13, 320, 3692, 25, 350, 3895,
11855, 8167, 11, 6287, 220, 2366, 17, 8, 128009]}
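
The same request can be sent from Python. The following is a minimal sketch that uses only the standard library. It assumes that the port forward to port 8000 is active and that the response has the `prompt`, `text`, and `token_ids` fields shown in the sample output. The function names are illustrative.

```python
import json
import urllib.request

# Assumption: the port-forward to port 8000 from the step above is active.
GENERATE_URL = "http://localhost:8000/v1/generate"


def build_generate_request(prompt, max_tokens=1024, url=GENERATE_URL):
    """Build the same POST request that the curl command above sends."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def extract_text(raw_body):
    """Return the generated text from a response body shaped like the sample output."""
    return json.loads(raw_body)["text"]


def generate(prompt, max_tokens=1024, timeout=300):
    """Send the prompt to the Serve endpoint and return the generated text."""
    request = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(response.read())
```

For example, `generate("What are the top 5 most popular programming languages? Be brief.")` returns only the `text` field of the response.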

Mistral-7B

Set up port forwarding to the server:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

The following is an example of the output:

{"prompt": "What are the top 5 most popular programming languages? Be brief.",
"text": "\n\n1. JavaScript: Widely used for web development, particularly for
client-side scripting and building dynamic web page content.\n\n2. Python:
Known for its simplicity and readability, it's widely used for web
development, machine learning, data analysis, and scientific computing.\n\n3.
Java: A general-purpose programming language used in a wide range of
applications, including Android app development, web services, and
enterprise-level applications.\n\n4. C#: Developed by Microsoft, it's often
used for Windows desktop apps, game development (Unity), and web development
(ASP.NET).\n\n5. TypeScript: A superset of JavaScript that adds optional
static typing and other features for large-scale, maintainable JavaScript
applications.", "token_ids": [781, 781, 29508, 29491, 27049, 29515, 1162,
1081, 1491, 2075, 1122, 5454, 4867, 29493, 7079, 1122, 4466, 29501, 2973,
7535, 1056, 1072, 4435, 11384, 5454, 3652, 3804, 29491, 781, 781, 29518,
29491, 22134, 29515, 1292, 4444, 1122, 1639, 26001, 1072, 1988, 3205, 29493,
1146, 29510, 29481, 13343, 2075, 1122, 5454, 4867, 29493, 6367, 5936, 29493,
1946, 6411, 29493, 1072, 11237, 22031, 29491, 781, 781, 29538, 29491, 12407,
29515, 1098, 3720, 29501, 15460, 4664, 17060, 4610, 2075, 1065, 1032, 6103,
3587, 1070, 9197, 29493, 3258, 13422, 1722, 4867, 29493, 5454, 4113, 29493,
1072, 19123, 29501, 5172, 9197, 29491, 781, 781, 29549, 29491, 1102, 29539,
29515, 9355, 1054, 1254, 8670, 29493, 1146, 29510, 29481, 3376, 2075, 1122,
9723, 25470, 14189, 29493, 2807, 4867, 1093, 2501, 1240, 1325, 1072, 5454,
4867, 1093, 2877, 29521, 29491, 12466, 1377, 781, 781, 29550, 29491, 6475,
7554, 29515, 1098, 26434, 1067, 1070, 27049, 1137, 14401, 12052, 1830, 25460,
1072, 1567, 4958, 1122, 3243, 29501, 6473, 29493, 9855, 1290, 27049, 9197,
29491, 2]}

Llava-1.5-13b-hf

Set up port forwarding to the server:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

The following is an example of the output:

{"prompt": "What are the top 5 most popular programming languages? Be brief.",
"text": " under 100 words.\n\nThe top 5 most popular programming languages
are:\n1. Python\n2. Java\n3. C#\n4. C++\n5. JavaScript.", "token_ids": [1090,
29871, 29896, 29900, 29900, 3838, 29889, 13, 13, 1576, 2246, 29871, 29945,
1556, 5972, 8720, 10276, 526, 29901, 13, 29896, 29889, 5132, 13, 29906, 29889,
3355, 13, 29941, 29889, 315, 29937, 13, 29946, 29889, 315, 1817, 13, 29945,
29889, 8286, 29889, 2]}

Llama 3.1 70B

Set up port forwarding to the server:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8000:8000 2>&1 >/dev/null &

Send a prompt to the Serve endpoint:

curl -X POST http://localhost:8000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

The following is an example of the output:

{"prompt": "What are
the top 5 most popular programming languages? Be brief.", "text": " This is a
very subjective question, but there are some general guidelines to follow when
selecting a language. For example, if you\u2019re looking for a language
that\u2019s easy to learn, you might want to consider Python. It\u2019s one of
the most popular languages in the world, and it\u2019s also relatively easy to
learn. If you\u2019re looking for a language that\u2019s more powerful, you
might want to consider Java. It\u2019s a more complex language, but it\u2019s
also very popular. Whichever language you choose, make sure you do your
research and pick one that\u2019s right for you.\nThe most popular programming
languages are:\nWhy is C++ so popular?\nC++ is a powerful and versatile
language that is used in many different types of software. It is also one of
the most popular programming languages, with a large community of developers
who are always creating new and innovative ways to use it. One of the reasons
why C++ is so popular is because it is a very efficient language. It allows
developers to write code that is both fast and reliable, which is essential
for many types of software. Additionally, C++ is very flexible, meaning that
it can be used for a wide range of different purposes. Finally, C++ is also
very popular because it is easy to learn. There are many resources available
online and in books that can help anyone get started with learning the
language.\nJava is a versatile language that can be used for a variety of
purposes. It is one of the most popular programming languages in the world and
is used by millions of people around the globe. Java is used for everything
from developing desktop applications to creating mobile apps and games. It is
also a popular choice for web development. One of the reasons why Java is so
popular is because it is a platform-independent language. This means that it
can be used on any type of computer or device, regardless of the operating
system. Java is also very versatile and can be used for a variety of different
purposes.", "token_ids": [1115, 374, 264, 1633, 44122, 3488, 11, 719, 1070,
527, 1063, 4689, 17959, 311, 1833, 994, 27397, 264, 4221, 13, 1789, 3187, 11,
422, 499, 3207, 3411, 369, 264, 4221, 430, 753, 4228, 311, 4048, 11, 499,
2643, 1390, 311, 2980, 13325, 13, 1102, 753, 832, 315, 279, 1455, 5526, 15823,
304, 279, 1917, 11, 323, 433, 753, 1101, 12309, 4228, 311, 4048, 13, 1442,
499, 3207, 3411, 369, 264, 4221, 430, 753, 810, 8147, 11, 499, 2643, 1390,
311, 2980, 8102, 13, 1102, 753, 264, 810, 6485, 4221, 11, 719, 433, 753, 1101,
1633, 5526, 13, 1254, 46669, 4221, 499, 5268, 11, 1304, 2771, 499, 656, 701,
3495, 323, 3820, 832, 430, 753, 1314, 369, 499, 627, 791, 1455, 5526, 15840,
15823, 527, 512, 10445, 374, 356, 1044, 779, 5526, 5380, 34, 1044, 374, 264,
8147, 323, 33045, 4221, 430, 374, 1511, 304, 1690, 2204, 4595, 315, 3241, 13,
1102, 374, 1101, 832, 315, 279, 1455, 5526, 15840, 15823, 11, 449, 264, 3544,
4029, 315, 13707, 889, 527, 2744, 6968, 502, 323, 18699, 5627, 311, 1005, 433,
13, 3861, 315, 279, 8125, 3249, 356, 1044, 374, 779, 5526, 374, 1606, 433,
374, 264, 1633, 11297, 4221, 13, 1102, 6276, 13707, 311, 3350, 2082, 430, 374,
2225, 5043, 323, 15062, 11, 902, 374, 7718, 369, 1690, 4595, 315, 3241, 13,
23212, 11, 356, 1044, 374, 1633, 19303, 11, 7438, 430, 433, 649, 387, 1511,
369, 264, 7029, 2134, 315, 2204, 10096, 13, 17830, 11, 356, 1044, 374, 1101,
1633, 5526, 1606, 433, 374, 4228, 311, 4048, 13, 2684, 527, 1690, 5070, 2561,
2930, 323, 304, 6603, 430, 649, 1520, 5606, 636, 3940, 449, 6975, 279, 4221,
627, 15391, 374, 264, 33045, 4221, 430, 649, 387, 1511, 369, 264, 8205, 315,
10096, 13, 1102, 374, 832, 315, 279, 1455, 5526, 15840, 15823, 304, 279, 1917,
323, 374, 1511, 555, 11990, 315, 1274, 2212, 279, 24867, 13, 8102, 374, 1511,
369, 4395, 505, 11469, 17963, 8522, 311, 6968, 6505, 10721, 323, 3953, 13,
1102, 374, 1101, 264, 5526, 5873, 369, 3566, 4500, 13, 3861, 315, 279, 8125,
3249, 8102, 374, 779, 5526, 374, 1606, 433, 374, 264, 5452, 98885, 4221, 13,
1115, 3445, 430, 433, 649, 387, 1511, 389, 904, 955, 315, 6500, 477, 3756, 11,
15851, 315, 279, 10565, 1887, 13, 8102, 374, 1101, 1633, 33045, 323, 649, 387,
1511, 369, 264, 8205, 315, 2204, 10096, 13, 128001]}
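
The `token_ids` field shows how much of the `max_tokens` budget a response used. The following Python sketch summarizes a response body. It assumes that `token_ids` lists only the generated tokens, as the sample outputs suggest, and the function name is hypothetical.

```python
import json


def generation_summary(raw_body, max_tokens):
    """Summarize token usage for a response body shaped like the sample outputs."""
    response = json.loads(raw_body)
    generated = len(response["token_ids"])
    return {
        "generated_tokens": generated,
        # Reaching the budget suggests that the text may have been cut off.
        "hit_max_tokens": generated >= max_tokens,
    }
```

If `hit_max_tokens` is true, consider raising `max_tokens` in the request or asking for a shorter answer in the prompt.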

Additional configuration

Optionally, you can configure the following model serving resources and techniques that the Ray Serve framework supports:

Deploy a RayService custom resource. In the previous steps of this tutorial, you used RayCluster instead of RayService. We recommend RayService for production environments.
Compose multiple models with model composition. Configure model multiplexing and model composition, both of which the Ray Serve framework supports. Model composition lets you chain inputs and outputs across multiple LLMs and scale your models as a single application.
Build and deploy your own TPU image. We recommend this option if you need finer-grained control over the contents of your Docker image.

Deploy a RayService

You can deploy the same models from this tutorial by using a RayService custom resource.

Delete the RayCluster custom resource that you created in this tutorial:
```
kubectl --namespace ${NAMESPACE} delete raycluster/vllm-tpu
```

Create the RayService custom resource to deploy a model:

Llama-3-8B-Instruct

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayService with a workergroup that contains a single-host TPU v5e in a 2x4 topology.

Mistral-7B

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayService with a workergroup that contains a single-host TPU v5e in a 2x4 topology.

Llava-1.5-13b-hf

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:
```
envsubst < tpu/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -
```
The envsubst command replaces the environment variables in the manifest.

GKE creates a RayService with a workergroup that contains a single-host TPU v5e in a 2x4 topology.

Llama 3.1 70B

Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: ai-ml.gke-ray.rayserve.llm.tpu.serve_tpu:model
        deployments:
        - name: VLLMDeployment
          num_replicas: 1
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          env_vars:
            MODEL_ID: "$MODEL_ID"
            MAX_MODEL_LEN: "$MAX_MODEL_LEN"
            DTYPE: "$DTYPE"
            TOKENIZER_MODE: "$TOKENIZER_MODE"
            TPU_CHIPS: "8"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
            - name: ray-worker
              image: $VLLM_IMAGE
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
                requests:
                  cpu: "100"
                  google.com/tpu: "8"
                  ephemeral-storage: 40G
                  memory: 200G
              env:
                - name: JAX_PLATFORMS
                  value: "tpu"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_XLA_CACHE_PATH
                  value: "/data"
              volumeMounts:
              - name: gcs-fuse-csi-ephemeral
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < tpu/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

The envsubst command replaces the environment variables in the manifest.

GKE creates a RayCluster custom resource, deploys the Ray Serve application on it, and then creates the RayService custom resource.

Check the status of the RayService resource:

kubectl --namespace ${NAMESPACE} get rayservices/vllm-tpu

Wait for the service status to change to Running:

NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          1
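
Instead of re-running the command by hand, you could poll until the status column reads Running. The sketch below assumes the three-column output layout shown above; `service_status` and `wait_for_running` are hypothetical helper names, and the attempt count and delay in the usage comment are arbitrary.

```shell
#!/bin/sh
# service_status: print the SERVICE STATUS column of the RayService.
# Assumes the layout NAME / SERVICE STATUS / NUM SERVE ENDPOINTS shown above.
service_status() {
  kubectl --namespace "${NAMESPACE}" get rayservices/vllm-tpu --no-headers 2>/dev/null |
    awk '{print $2}'
}

# wait_for_running ATTEMPTS DELAY_SECONDS [STATUS_CMD]
# Calls STATUS_CMD (default: service_status) until it prints "Running"
# or the attempts are exhausted.
wait_for_running() {
  attempts=$1
  delay=$2
  status_cmd=${3:-service_status}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if [ "$("$status_cmd")" = "Running" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "RayService did not reach Running after $attempts attempts" >&2
  return 1
}

# Possible usage (not executed here): poll every 10 seconds, up to 60 times.
#   wait_for_running 60 10
```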

Retrieve the name of the RayCluster head service:

SERVICE_NAME=$(kubectl --namespace=${NAMESPACE} get rayservices/vllm-tpu \
    --template={{.status.activeServiceStatus.rayClusterStatus.head.serviceName}})

Set up port forwarding to the Ray head so that you can view the Ray Dashboard:

pkill -f "kubectl .* port-forward .* 8265:8265"
kubectl --namespace ${NAMESPACE} port-forward service/${SERVICE_NAME} 8265:8265 2>&1 >/dev/null &

View the Ray Dashboard.
Serve the model.

Clean up the RayService resource:

kubectl --namespace ${NAMESPACE} delete rayservice/vllm-tpu

Compose multiple models with model composition

Model composition is a technique for composing multiple models into a single application.

In this section, you use a GKE cluster to compose two models, Llama 3 8B IT and Gemma 7B IT, into a single application:

The first model is the assistant model, which answers the questions asked in the prompt.
The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is a summarized version of the assistant model's response.
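
The chaining described above has the same shape as a shell pipeline: one stage's output becomes the next stage's input. The following toy sketch illustrates only that data flow. The `assist` and `summarize` functions are hypothetical stand-ins that manipulate text; they do not call any model and are unrelated to the sample's Ray Serve code.

```shell
#!/bin/sh
# Stand-in for the assistant model: reads a prompt, writes a longer answer.
assist() {
  read -r prompt
  printf 'Answer to "%s": first point. second point. third point.\n' "$prompt"
}

# Stand-in for the summarizer model: keeps only the first sentence of its input.
summarize() {
  sed 's/\. .*$/./'
}

# compose PROMPT: feed the assistant's output into the summarizer.
compose() {
  printf '%s\n' "$1" | assist | summarize
}
```

In the real application, the caller likewise sees only the final, summarized text, as the sample response later in this section shows.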

Set up your environment:

export ASSIST_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
export SUMMARIZER_MODEL_ID=google/gemma-7b-it

For Standard clusters, create an additional single-host TPU slice node pool:
```
gcloud container node-pools create tpu-2 \
  --location=${COMPUTE_ZONE} \
  --cluster=${CLUSTER_NAME} \
  --machine-type=MACHINE_TYPE \
  --num-nodes=1
```
Replace MACHINE_TYPE with one of the following machine types:
- ct5lp-hightpu-8t to provision TPU v5e.
- ct6e-standard-8t to provision TPU v6e.
Autopilot clusters automatically provision the required nodes.
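
If you script the node pool creation, a small lookup keeps the machine type consistent with the TPU version you picked. This is a sketch only: `tpu_machine_type` is a hypothetical helper, and the short version names `v5e` and `v6e` are labels chosen for this example.

```shell
#!/bin/sh
# tpu_machine_type VERSION
# Prints the machine type listed above for the given TPU version.
tpu_machine_type() {
  case "$1" in
    v5e) echo "ct5lp-hightpu-8t" ;;
    v6e) echo "ct6e-standard-8t" ;;
    *)
      echo "unsupported TPU version: $1" >&2
      return 1
      ;;
  esac
}

# Possible usage (not executed here):
#   gcloud container node-pools create tpu-2 \
#     --location="${COMPUTE_ZONE}" \
#     --cluster="${CLUSTER_NAME}" \
#     --machine-type="$(tpu_machine_type v5e)" \
#     --num-nodes=1
```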

Deploy the RayService resource for the TPU version that you want to use:

TPU v5e

Inspect the ray-service.tpu-v5e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
            cloud.google.com/gke-tpu-topology: 2x4
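
The numbers in this manifest appear to be related: the worker group runs 2 replicas with 1 host each and 8 `google.com/tpu` chips per host, which matches `TPU_CHIPS: "16"`. That relationship is inferred from the values rather than stated by the sample, so treat the following consistency check as a sketch under that assumption. Both helper names are hypothetical.

```shell
#!/bin/sh
# expected_tpu_chips REPLICAS CHIPS_PER_HOST [NUM_HOSTS]
# Total chips a worker group provides, assuming replicas x hosts x chips per host.
expected_tpu_chips() {
  replicas=$1
  chips_per_host=$2
  num_hosts=${3:-1}
  echo $((replicas * num_hosts * chips_per_host))
}

# check_tpu_chips DECLARED REPLICAS CHIPS_PER_HOST [NUM_HOSTS]
# Returns non-zero when the declared TPU_CHIPS value differs from the total.
check_tpu_chips() {
  declared=$1
  shift
  expected=$(expected_tpu_chips "$@")
  if [ "$declared" -ne "$expected" ]; then
    echo "TPU_CHIPS=$declared but the worker group provides $expected chips" >&2
    return 1
  fi
}

# Possible usage with the values from this manifest:
#   check_tpu_chips 16 2 8 1
```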

Apply the manifest:

envsubst < model-composition/ray-service.tpu-v5e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

TPU v6e

Inspect the ray-service.tpu-v6e-singlehost.yaml manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-tpu
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path:  ai-ml.gke-ray.rayserve.llm.model-composition.serve_tpu:multi_model
      deployments:
      - name: MultiModelDeployment
        num_replicas: 1
      runtime_env:
        working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
        env_vars:
          ASSIST_MODEL_ID: "$ASSIST_MODEL_ID"
          SUMMARIZER_MODEL_ID: "$SUMMARIZER_MODEL_ID"
          TPU_CHIPS: "16"
          TPU_HEADS: "2"
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: ray-head
            image: $VLLM_IMAGE
            resources:
              limits:
                cpu: "2"
                memory: 8G
              requests:
                cpu: "2"
                memory: 8G
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 2
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        metadata:
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: $KSA_NAME
          containers:
          - name: llm
            image: $VLLM_IMAGE
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
              - name: VLLM_XLA_CACHE_PATH
                value: "/data"
            resources:
              limits:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
              requests:
                cpu: "100"
                google.com/tpu: "8"
                ephemeral-storage: 40G
                memory: 200G
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: $GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 2x4

Apply the manifest:

envsubst < model-composition/ray-service.tpu-v6e-singlehost.yaml | kubectl --namespace ${NAMESPACE} apply -f -

Wait for the status of the RayService resource to change to Running:
```
kubectl --namespace ${NAMESPACE} get rayservice/vllm-tpu
```
The output is similar to the following:
```
NAME       SERVICE STATUS   NUM SERVE ENDPOINTS
vllm-tpu   Running          2
```
In this output, the Running status indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl --namespace ${NAMESPACE} get service/vllm-tpu-serve-svc

The output is similar to the following:

NAME                 TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)    AGE
vllm-tpu-serve-svc   ClusterIP   ###.###.###.###   <none>        8000/TCP   ###

Establish a port-forwarding session to the Ray Serve Service:

pkill -f "kubectl .* port-forward .* 8000:8000"
kubectl --namespace ${NAMESPACE} port-forward service/vllm-tpu-serve-svc 8000:8000 2>&1 >/dev/null &

Send a request to the model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

The output is similar to the following:

  {"text": [" used in various data science projects, including building machine learning models, preprocessing data, and visualizing results.\n\nSure, here is a single sentence summarizing the text:\n\nPython is the most popular programming language for machine learning and is widely used in data science projects, encompassing model building, data preprocessing, and visualization."]}
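
The request above hard-codes its JSON body. If you want to vary the prompt from a script, the body can be built with a small helper. This is a simplified sketch: `json_escape` and `request_body` are hypothetical names, only the `prompt` and `max_tokens` fields from the request above are used, and the escaping handles backslashes and double quotes but not newlines or control characters.

```shell
#!/bin/sh
# json_escape: escape backslashes and double quotes read from stdin
# so the text can be placed inside a JSON string.
json_escape() {
  sed 's/\\/\\\\/g; s/"/\\"/g'
}

# request_body PROMPT MAX_TOKENS
# Prints a JSON body with the same two fields as the curl example above.
request_body() {
  escaped=$(printf '%s' "$1" | json_escape)
  printf '{"prompt": "%s", "max_tokens": %s}\n' "$escaped" "$2"
}

# Possible usage (not executed here):
#   curl -X POST http://localhost:8000/ -H "Content-Type: application/json" \
#     -d "$(request_body 'What is Ray Serve?' 1000)"
```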

Build and deploy the TPU image

This tutorial uses hosted TPU images from vLLM. vLLM provides a Dockerfile.tpu file that builds vLLM on top of the required PyTorch XLA image, which includes the TPU dependencies. However, you can also build and deploy your own TPU image for finer-grained control over the contents of your Docker image.

Create a Docker repository to store the container images for this guide:

gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=${COMPUTE_REGION} && \
gcloud auth configure-docker ${COMPUTE_REGION}-docker.pkg.dev

Clone the vLLM repository:

git clone https://github.com/vllm-project/vllm.git
cd vllm

Build the image:

docker build -f Dockerfile.tpu . -t vllm-tpu

Tag the TPU image with your Artifact Registry name:
```
export VLLM_IMAGE=${COMPUTE_REGION}-docker.pkg.dev/${PROJECT_ID}/vllm-tpu/vllm-tpu:TAG
docker tag vllm-tpu ${VLLM_IMAGE}
```
Replace TAG with the name of the tag that you want to define. If you don't specify a tag, Docker applies the default latest tag.
Push the image to Artifact Registry:
```
docker push ${VLLM_IMAGE}
```
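
The image path used in the tagging step follows a fixed pattern, so it can be composed from its parts. The sketch below mirrors that pattern and falls back to `latest` when no tag is given, matching Docker's default. `image_uri` is a hypothetical helper name; the repository and image names `vllm-tpu/vllm-tpu` are the ones used in this guide.

```shell
#!/bin/sh
# image_uri REGION PROJECT_ID [TAG]
# Prints the Artifact Registry image path used in this guide.
# TAG defaults to "latest" when omitted.
image_uri() {
  region=$1
  project=$2
  tag=${3:-latest}
  echo "${region}-docker.pkg.dev/${project}/vllm-tpu/vllm-tpu:${tag}"
}

# Possible usage (not executed here):
#   export VLLM_IMAGE=$(image_uri "${COMPUTE_REGION}" "${PROJECT_ID}" v1)
```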

Delete individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

Delete the RayCluster custom resource:

kubectl --namespace ${NAMESPACE} delete rayclusters vllm-tpu

Delete the Cloud Storage bucket:
```
gcloud storage rm -r gs://${GSBUCKET}
```

Delete the Artifact Registry repository:

gcloud artifacts repositories delete vllm-tpu \
    --location=${COMPUTE_REGION}

Delete the cluster:
```
gcloud container clusters delete ${CLUSTER_NAME} \
    --location=LOCATION
```
Replace LOCATION with one of the following environment variables:
- For Autopilot clusters, use COMPUTE_REGION.
- For Standard clusters, use COMPUTE_ZONE.
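
In a cleanup script, the choice between the two variables can be made explicit. The following is a sketch only: `cluster_location` is a hypothetical helper, and the mode names `autopilot` and `standard` are labels chosen for this example.

```shell
#!/bin/sh
# cluster_location MODE
# Prints the location value to pass to gcloud for the given cluster mode.
cluster_location() {
  case "$1" in
    autopilot) echo "${COMPUTE_REGION}" ;;
    standard) echo "${COMPUTE_ZONE}" ;;
    *)
      echo "unknown cluster mode: $1" >&2
      return 1
      ;;
  esac
}

# Possible usage (not executed here):
#   gcloud container clusters delete "${CLUSTER_NAME}" \
#     --location="$(cluster_location standard)"
```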

Delete the project

If you deployed the tutorial in a new Google Cloud project and you no longer need it, delete the project by completing the following steps:

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work that you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, and quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Explore the sample code on GitHub to learn how to use Ray Serve on GKE.
To learn how to collect and view metrics for Ray clusters running on GKE, complete the steps in Collect and view logs and metrics for Ray clusters on GKE.