Serve open source models using TPUs on GKE with Optimum TPU

This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download an open source model from Hugging Face and deploy it on a GKE Standard cluster using a container that runs Optimum TPU.

This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve AI/ML workloads.

This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.

As a reminder, you have several options for LLM inference on Google Cloud, including Vertex AI, GKE, and Google Compute Engine, where you can incorporate serving libraries such as JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.

Optimum TPU supports the following features:

- Continuous batching
- Token streaming
- Greedy search and multinomial sampling using transformers

Objectives

- Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
- Deploy Optimum TPU on GKE.
- Use Optimum TPU to serve the supported models through curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Create a Hugging Face account, if you don't already have one.
Make sure that your project has sufficient quota for Cloud TPU in GKE.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage bucket, and TPU nodes are located. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard cluster only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.
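
Later commands in this tutorial expand these variables, so an unset variable can lead to a confusing gcloud or kubectl error. As a minimal sketch, a small shell function can report unset variables before you continue. The function name require_vars is hypothetical and is not part of gcloud or Cloud Shell.

```shell
# Hypothetical helper (not part of gcloud): succeeds only if every named
# variable is set to a non-empty value; prints the missing names to stderr.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example check for the variables that this tutorial exports.
if require_vars PROJECT_ID CLUSTER_NAME REGION ZONE; then
  echo "environment looks complete"
fi
```

The names passed to the function are expected to be trusted variable names, because the lookup relies on eval.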

Clone the Optimum TPU repository:

git clone https://github.com/huggingface/optimum-tpu.git

Get access to the model

You can use the Gemma 2B or Llama3 8B model. This tutorial focuses on these two models, but Optimum TPU supports more models.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, and then generate a Hugging Face access token.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these instructions:

1. Access the model consent page.
2. Verify consent using your Hugging Face account.
3. Accept the model terms.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

1. Click Your Profile > Settings > Access Tokens.
2. Click New Token.
3. Specify a Name of your choice and a Role of at least Read.
4. Click Generate a token.
5. Copy the generated token to your clipboard.

Llama3 8B

You must sign the consent agreement to use Llama3 8B in the Hugging Face repo.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

1. Click Your Profile > Settings > Access Tokens.
2. Select New Token.
3. Specify a Name of your choice and a Role of at least Read.
4. Select Generate a token.
5. Copy the generated token to your clipboard.

Create a GKE cluster

Create a GKE Standard cluster with 1 CPU node:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=1 \
    --location=ZONE

Create a TPU node pool

Create a v5e TPU node pool with 1 node and 8 chips:

gcloud container node-pools create tpunodepool \
    --location=ZONE \
    --num-nodes=1 \
    --machine-type=ct5lp-hightpu-8t \
    --cluster=CLUSTER_NAME
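
The ct5lp-hightpu-8t machine type provides 8 TPU chips on a single node, which matches the 2x4 topology that the Deployment manifests in this tutorial select. As a quick arithmetic check, the chip count of a two-dimensional topology is the product of its dimensions. The function name topology_chips is hypothetical.

```shell
# Hypothetical helper: multiply the dimensions of a topology string such as
# "2x4" to get the number of TPU chips in the slice.
topology_chips() {
  chips=1
  # Split the argument on "x" and multiply the parts together.
  for dim in $(printf '%s' "$1" | tr 'x' ' '); do
    chips=$((chips * dim))
  done
  echo "$chips"
}

topology_chips 2x4   # prints 8, the value used for google.com/tpu later on
```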

Configure kubectl to communicate with your cluster. The cluster in this tutorial is zonal, so pass the same zone that you used when you created it:

gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${ZONE}

Build the container

Run the make command to build the image:

cd optimum-tpu && make tpu-tgi

Push the image to Artifact Registry

gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
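
The tag and push commands above repeat the same registry path. As a sketch, you can compose the path once from the values that this tutorial already uses. The path pattern is taken from the commands above; the function name image_uri and the sample project ID are hypothetical.

```shell
# Hypothetical helper: build a Docker image path for Artifact Registry as
# LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG.
image_uri() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# "my-project" is a placeholder project ID used only for illustration.
image_uri us-west4 my-project optimum-tpu tgi-tpu latest
```

The same string is what the image field of the Deployment manifests expects.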

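Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token. The command reads the token from the HF_TOKEN environment variable, so set that variable to your access token first: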

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
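
Kubernetes stores Secret values base64-encoded in the data field of the manifest, so the YAML that this pipeline applies contains an encoded form of the token rather than the plain text. Base64 is an encoding, not encryption, so anyone who can read the Secret can recover the token. The sketch below shows the round trip with a made-up value; hf_example_token is not a real credential.

```shell
# A made-up value for illustration only; never hard-code a real token.
sample_token='hf_example_token'

# Encode the value the way it appears under "data:" in a Secret manifest.
encoded=$(printf '%s' "$sample_token" | base64)

# Decode it again to show that the encoding is trivially reversible.
decoded=$(printf '%s' "$encoded" | base64 --decode)

echo "encoded: $encoded"
echo "decoded: $decoded"
```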

Deploy Optimum TPU

To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Gemma 2B

Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

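This manifest describes an Optimum TPU Deployment that is exposed inside the cluster by a Service on TCP port 8080. Because the Service doesn't set a type, Kubernetes creates it as a ClusterIP Service, which is reachable only from within the cluster.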
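
The --max-input-length and --max-total-tokens arguments in the manifest bound each request: in Text Generation Inference, the total token budget covers the prompt tokens plus the generated tokens. Under that reading, a prompt of a given length leaves the following room for max_new_tokens. This is an arithmetic sketch only; the function name new_token_budget is hypothetical, and the prompt lengths are illustrative rather than measured with the model's tokenizer.

```shell
# Hypothetical helper: how many new tokens fit for a prompt of a given length.
# $1 = value of --max-total-tokens, $2 = number of prompt tokens.
new_token_budget() {
  budget=$(($1 - $2))
  # A prompt that already exceeds the total leaves no room to generate.
  if [ "$budget" -lt 0 ]; then
    budget=0
  fi
  echo "$budget"
}

new_token_budget 64 32   # prints 32: a maximum-length prompt in this manifest
new_token_budget 64 8    # prints 56: a shorter, 8-token prompt
```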

Apply the manifest:

kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

This manifest describes an Optimum TPU Deployment that is exposed inside the cluster by a ClusterIP Service on TCP port 8080.

Apply the manifest:

kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml

View the logs from the running Deployment:

kubectl logs -f -l app=tgi-tpu

The output is similar to the following:

2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

Make sure that the model is fully downloaded before you proceed to the next section.

Serve the model

Set up port forwarding to the model:

kubectl port-forward svc/service 8080:8080
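
The port-forward command keeps running in the foreground, and the server answers only after the model has loaded. As a sketch, you can poll from a second terminal until a command succeeds. The /health path is the one that the liveness probe in the manifests uses; the function name retry and the attempt counts are hypothetical choices.

```shell
# Hypothetical helper: run a command until it succeeds, up to a fixed number
# of attempts. $1 = attempts, $2 = seconds between attempts, rest = command.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    if "$@"; then
      return 0
    fi
    # Stop without sleeping once the last attempt has failed.
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage sketch (needs the port-forward from this section to be running):
#   retry 30 10 curl -sf 127.0.0.1:8080/health
```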

Interact with the model server using curl

Verify the deployed model:

In a new terminal session, use curl to chat with the model:

curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'

The output is similar to the following:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}
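
The request body has two parts: inputs carries the prompt, and parameters.max_new_tokens caps the number of generated tokens. As a sketch, a small function can build that body for different prompts. The function name generate_payload is hypothetical, and the sketch assumes that the prompt contains no double quotes, backslashes, or newlines, because it doesn't perform JSON escaping.

```shell
# Hypothetical helper: build the JSON body for the /generate endpoint.
# $1 = prompt (assumed to need no JSON escaping), $2 = max_new_tokens.
generate_payload() {
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}\n' "$1" "$2"
}

generate_payload "What is Deep Learning?" 40

# Usage sketch (needs the port-forward to be running):
#   curl 127.0.0.1:8080/generate -X POST \
#     -H 'Content-Type: application/json' \
#     -d "$(generate_payload 'What is Deep Learning?' 40)"
```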

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
  --location=ZONE

What's next

- Explore the Optimum TPU documentation.
- Learn how to run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn more about TPUs in GKE.
