Serve LLMs like DeepSeek-R1 671B or Llama 3.1 405B on bare metal

Overview

This guide shows how to serve state-of-the-art large language models (LLMs) such as DeepSeek-R1 671B or Llama 3.1 405B on Google Distributed Cloud (software only) on bare metal using graphics processing units (GPUs) across multiple nodes.

It uses portable open source technologies (Kubernetes, vLLM, and the LeaderWorkerSet (LWS) API) to deploy and serve AI/ML workloads on bare metal clusters. Google Distributed Cloud extends GKE for use in on-premises environments while providing GKE's granular control, scalability, resilience, portability, and cost-effectiveness.

Background

This section describes the key technologies used in this guide, including the two LLMs used as examples: DeepSeek-R1 and Llama 3.1 405B.

DeepSeek-R1

DeepSeek-R1 is a 671-billion-parameter large language model from DeepSeek, designed for logical inference, mathematical reasoning, and real-time problem solving across a variety of text-based tasks. Google Distributed Cloud handles the computational demands of DeepSeek-R1 by providing scalable resources, distributed computing, and efficient networking.

To learn more, see the DeepSeek documentation.

Llama 3.1 405B

Llama 3.1 405B is a large language model from Meta designed for a wide range of natural language processing tasks, including text generation, translation, and question answering. Google Distributed Cloud offers the robust infrastructure required to support the distributed training and serving needs of models at this scale.

To learn more, see the Llama documentation.

Google Distributed Cloud managed Kubernetes service

Google Distributed Cloud offers a wide range of services, including Google Distributed Cloud (software only) for bare metal, which is well suited to deploying and managing AI/ML workloads in your own data center. Google Distributed Cloud is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. It provides the necessary infrastructure, including scalable resources, distributed computing, and efficient networking, to handle the computational demands of LLMs.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about Google Distributed Cloud and how it helps you scale, automate, and manage Kubernetes, see Google Distributed Cloud (software only) for bare metal overview.

GPUs

Graphics processing units (GPUs) let you accelerate specific workloads, such as machine learning and data processing. Google Distributed Cloud supports nodes equipped with these powerful GPUs, so you can configure your cluster for optimal performance in machine learning and data processing tasks. Google Distributed Cloud provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

To learn more, see Set up and use NVIDIA GPUs.

LeaderWorkerSet (LWS)

LeaderWorkerSet (LWS) is a Kubernetes deployment API that addresses common deployment patterns of AI/ML multi-node inference workloads. Multi-node serving uses multiple Pods, each potentially running on a different node, to handle the distributed inference workload. LWS lets you treat multiple Pods as a group, which simplifies the management of distributed model serving.

vLLM and multi-host serving

When serving computationally intensive LLMs, we recommend using vLLM and running the workloads across GPUs.

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:

Optimized transformer implementation with PagedAttention
Continuous batching to improve the overall serving throughput
Distributed serving on multiple GPUs

For especially computationally intensive LLMs that can't fit on a single GPU node, you can use multiple GPU nodes to serve the model. vLLM supports running workloads across GPUs with two strategies:

Tensor parallelism splits the matrix multiplications in the transformer layer across multiple GPUs. However, this strategy requires a fast network because of the communication needed between the GPUs, which makes it less suitable for running workloads across nodes.
Pipeline parallelism splits the model by layer, or vertically. This strategy doesn't require constant communication between GPUs, which makes it a better option when running models across nodes.

You can use both strategies in multi-node serving. For example, when using two nodes with eight H100 GPUs each, you can combine them as follows:

Two-way pipeline parallelism to shard the model across the two nodes
Eight-way tensor parallelism to shard the model across the eight GPUs on each node
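
The arithmetic behind this example can be sketched in shell. The helper name world_size is hypothetical and not part of vLLM. It multiplies the tensor-parallel size by the pipeline-parallel size to get the total number of GPUs the deployment needs, which in this layout should equal the number of nodes times the GPUs per node.

```shell
# Hypothetical helper: total GPUs = tensor-parallel size x pipeline-parallel size.
world_size() {
  tp="$1"   # tensor-parallel size (GPUs per node in this example)
  pp="$2"   # pipeline-parallel size (number of nodes in this example)
  echo $(( tp * pp ))
}

# Two nodes with eight GPUs each: 8-way tensor x 2-way pipeline parallelism.
world_size 8 2   # prints 16
```

These two values correspond to the --tensor-parallel-size and --pipeline-parallel-size flags passed to vLLM in the manifests later in this guide.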

To learn more, see the vLLM documentation.

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token, which the command reads from the HF_TOKEN environment variable:

kubectl create secret generic hf-secret \
    --kubeconfig KUBECONFIG \
    --from-literal=hf_api_token="${HF_TOKEN}" \
    --dry-run=client -o yaml | kubectl apply --kubeconfig KUBECONFIG -f -

Replace KUBECONFIG with the path of the kubeconfig file for the cluster where you plan to host the LLM.
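
Because the command takes the token from HF_TOKEN, an empty or unset variable would produce a Secret with an empty value. The following is a minimal sketch of a guard you could run first. The function name require_env is hypothetical and not part of kubectl.

```shell
# Hypothetical guard: fail early if the named environment variable is empty or unset.
require_env() {
  name="$1"
  # Indirectly read the value of the variable whose name was passed in.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "error: $name is not set" >&2
    return 1
  fi
}
```

For example, running require_env HF_TOKEN before the kubectl create secret command stops the workflow with an error message instead of storing an empty token.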

Create your own vLLM multi-node image

To facilitate cross-node communication for vLLM, you can use Ray. The LeaderWorkerSet repository provides a Dockerfile, which includes a bash script for configuring Ray with vLLM.

To create your own vLLM multi-node image, clone the LeaderWorkerSet repository, build a Docker image using the provided Dockerfile (which configures Ray for cross-node communication), and then push that image to Artifact Registry for deployment on Google Distributed Cloud.

Build the container

To build the container, follow these steps:

Clone the LeaderWorkerSet repository:

git clone https://github.com/kubernetes-sigs/lws.git

Build the image:

cd lws/docs/examples/vllm/build/ && docker build -f Dockerfile.GPU . -t vllm-multihost

Push the image to Artifact Registry

To ensure that your Kubernetes deployment can access the image, store it in Artifact Registry within your Google Cloud project:

docker image tag vllm-multihost ${IMAGE_NAME}
docker push ${IMAGE_NAME}
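
The commands above expect IMAGE_NAME to be set already. As a sketch, assuming the standard Artifact Registry Docker path format LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG, you could compose it as follows. The helper name artifact_registry_image and the sample region, project, and repository values are placeholders, not values from this guide.

```shell
# Hypothetical helper: compose an Artifact Registry image path of the form
# LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG.
artifact_registry_image() {
  location="$1"; project="$2"; repository="$3"; image="$4"; tag="${5:-latest}"
  echo "${location}-docker.pkg.dev/${project}/${repository}/${image}:${tag}"
}

# Placeholder values; substitute your own region, project, and repository.
IMAGE_NAME="$(artifact_registry_image us-central1 my-project my-repo vllm-multihost)"
echo "${IMAGE_NAME}"
```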

Install LeaderWorkerSet

To install LWS, run the following command:

kubectl apply --server-side \
    --kubeconfig KUBECONFIG \
    -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

Validate that the LeaderWorkerSet controller is running in the lws-system namespace using the following command:

kubectl get pod -n lws-system --kubeconfig KUBECONFIG

The output is similar to the following:

NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h
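
If you want to script this check rather than read the table by eye, one possible approach is to parse the READY and STATUS columns. The function name all_pods_ready is hypothetical, and the sketch assumes the default kubectl get pod column layout shown above.

```shell
# Hypothetical helper: read `kubectl get pod` table output on stdin and succeed
# only if at least one Pod is listed and every Pod is Running with all of its
# containers ready.
all_pods_ready() {
  awk 'NR > 1 {
         seen = 1
         split($2, ready, "/")   # READY column, for example "2/2"
         if (ready[1] != ready[2] || $3 != "Running") bad = 1
       }
       END { if (seen && !bad) exit 0; exit 1 }'
}
```

For example, you could pipe the output of the kubectl get pod command above into all_pods_ready and test its exit status.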

Deploy the vLLM model server

To deploy the vLLM model server, follow these steps:

Create and apply the manifest for the LLM that you want to deploy.

DeepSeek-R1

Create a YAML manifest, vllm-deepseek-r1-A3.yaml, for the vLLM model server:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
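
The manifest uses IMAGE_NAME as the container image in both the leader and worker templates. Assuming this is a placeholder for the image that you pushed to Artifact Registry, the following sketch substitutes it without editing the file. The function name render_manifest is hypothetical.

```shell
# Hypothetical helper: replace the IMAGE_NAME placeholder in a manifest read
# from stdin with the image path given as the first argument.
render_manifest() {
  # Use "|" as the sed delimiter because image paths contain "/".
  sed "s|IMAGE_NAME|$1|g"
}
```

For example, you could run render_manifest "${IMAGE_NAME}" with the manifest file as standard input and pipe the result to kubectl apply --kubeconfig KUBECONFIG -f - instead of applying the file directly. The same approach works for either model's manifest.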

Apply the manifest by running the following command:

kubectl apply -f vllm-deepseek-r1-A3.yaml \
    --kubeconfig KUBECONFIG

Llama 3.1 405B

Create a YAML manifest, vllm-llama3-405b-A3.yaml, for the vLLM model server:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:

kubectl apply -f vllm-llama3-405b-A3.yaml \
    --kubeconfig KUBECONFIG

View the logs from the running model server with the following command:

kubectl logs vllm-0 -c vllm-leader \
    --kubeconfig KUBECONFIG

The output should look similar to the following:

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
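
Loading a model of this size can take a while, so the startup lines might not appear right away. As a minimal sketch, you could test the log text for the startup message shown in the sample output. The function name server_started is hypothetical, and the check assumes the message text matches the sample.

```shell
# Hypothetical helper: succeed if the log text on stdin contains the
# startup message shown in the sample output.
server_started() {
  grep -q "Application startup complete"
}
```

For example, you could pipe the output of the kubectl logs command above into server_started and retry until it succeeds.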

Serve the model

Set up port forwarding to the model by running the following command:

kubectl port-forward svc/vllm-leader 8080:8080 \
    --kubeconfig KUBECONFIG

Interact with the model using curl

To interact with the model using curl, follow these instructions:

DeepSeek-R1

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief",
    "max_tokens": 1024,
    "temperature": 0
}'

The output should be similar to the following:

{
  "id": "cmpl-f2222b5589d947419f59f6e9fe24c5bd",
  "object": "text_completion",
  "created": 1738269669,
  "model": "deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": ".\n\nOkay, let's see. The user has four boxes and is moving them around. Let me try to visualize each step. \n\nFirst, the red box is placed on the bottom. So the stack starts with red. Then the blue box is put on top of red. Now the order is red (bottom), blue. Next, the yellow box is added on top of blue. So now it's red, blue, yellow. \n\nThen the user takes the blue box out. Wait, blue is in the middle. If they remove blue, the stack would be red and yellow. But where do they put the blue box? The instruction says to put it on top. So after removing blue, the stack is red, yellow. Then blue is placed on top, making it red, yellow, blue. \n\nFinally, the green box is added on the top. So the final order should be red (bottom), yellow, blue, green. Let me double-check each step to make sure I didn't mix up any steps. Starting with red, then blue, then yellow. Remove blue from the middle, so yellow is now on top of red. Then place blue on top of that, so red, yellow, blue. Then green on top. Yes, that seems right. The key step is removing the blue box from the middle, which leaves yellow on red, then blue goes back on top, followed by green. So the final order from bottom to top is red, yellow, blue, green.\n\n**Final Answer**\nThe final order from bottom to top is \\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}.\n</think>\n\n1. Start with the red box at the bottom.\n2. Place the blue box on top of the red box. Order: red (bottom), blue.\n3. Place the yellow box on top of the blue box. Order: red, blue, yellow.\n4. Remove the blue box (from the middle) and place it on top. Order: red, yellow, blue.\n5. Place the green box on top. Final order: red, yellow, blue, green.\n\n\\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 76,
    "total_tokens": 544,
    "completion_tokens": 468,
    "prompt_tokens_details": null
  }
}

Llama 3.1 405B

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

The output should be similar to the following:

{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c","object":"text_completion","created":1723238375,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}