이 페이지는 Cloud Translation API를 통해 번역되었습니다.

베어메탈에서 DeepSeek-R1 671B 또는 Llama 3.1 405B와 같은 LLM 제공

개요

이 가이드에서는 여러 노드에서 그래픽 처리 장치(GPU)를 사용하여 베어메탈 기반 Google Distributed Cloud(소프트웨어 전용)에서 DeepSeek-R1 671B 또는 Llama 3.1 405B와 같은 최신 대규모 언어 모델(LLM)을 제공하는 방법을 설명합니다.

이 가이드에서는 이동식 오픈소스 기술인 Kubernetes, vLLM, LeaderWorkerSet(LWS) API를 사용하여 베어메탈 클러스터에 AI/ML 워크로드를 배포하고 제공하는 방법을 보여줍니다. Google Distributed Cloud는 온프레미스 환경에서 사용할 수 있도록 GKE를 확장하는 동시에 GKE의 세분화된 제어, 확장성, 복원력, 이동성, 경제성의 이점을 제공합니다.

배경

이 섹션에서는 이 가이드에서 예로 사용되는 두 가지 LLM인 DeepSeek-R1 및 Llama 3.1 405B를 포함하여 이 가이드에서 사용되는 주요 기술을 설명합니다.

DeepSeek-R1

매개변수가 6710억 개 있는 DeepSeek의 대규모 언어 모델인 DeepSeek-R1은 다양한 텍스트 기반 태스크에서 논리적 추론, 수학적 추론, 실시간 문제 해결을 위해 설계되었습니다. Google Distributed Cloud는 확장 가능한 리소스, 분산 컴퓨팅, 효율적인 네트워킹을 통해 DeepSeek-R1의 계산 요구를 처리하여 기능을 지원합니다.

자세한 내용은 DeepSeek 문서를 참조하세요.

Llama 3.1 405B

Llama 3.1 405B는 텍스트 생성, 번역, 질의 응답을 포함한 다양한 자연어 처리 태스크를 위해 설계된 Meta의 대규모 언어 모델입니다. Google Distributed Cloud는 이러한 규모의 모델의 분산 학습 및 제공 니즈를 지원하는 데 필요한 강력한 인프라를 제공합니다.

자세한 내용은 Llama 문서를 참조하세요.

Google Distributed Cloud 관리형 Kubernetes 서비스

Google Distributed Cloud는 자체 데이터 센터에서 AI/ML 워크로드를 배포하고 관리하는 데 적합한 베어메탈용 Google Distributed Cloud(소프트웨어 전용)를 포함한 다양한 서비스를 제공합니다. Google Distributed Cloud는 컨테이너화된 애플리케이션 배포, 확장, 관리를 간소화하는 관리형 Kubernetes 서비스입니다. Google Distributed Cloud는 확장 가능한 리소스, 분산 컴퓨팅, 효율적인 네트워킹을 포함하여 LLM의 계산 요구를 처리하는 데 필요한 인프라를 제공합니다.

주요 Kubernetes 개념에 대한 자세한 내용은 Kubernetes 학습 시작을 참조하세요. Google Distributed Cloud에 대한 자세한 내용과 Google Distributed Cloud가 Kubernetes를 확장, 자동화, 관리하는 데 어떻게 도움이 되는지 알아보려면 베어메탈용 Google Distributed Cloud(소프트웨어 전용) 개요를 참조하세요.

GPU

그래픽 처리 장치(GPU)를 사용하면 머신러닝 및 데이터 처리와 같은 특정 워크로드를 가속화할 수 있습니다. Google Distributed Cloud는 이러한 강력한 GPU가 장착된 노드를 지원하므로 머신러닝 및 데이터 처리 태스크에서 최적의 성능을 발휘하는 클러스터를 구성할 수 있습니다. Google Distributed Cloud는 NVIDIA H100, L4, A100 GPU가 있는 머신 유형을 포함하여 노드 구성에 사용되는 다양한 머신 유형 옵션을 제공합니다.

자세한 내용은 NVIDIA GPU 설정 및 사용을 참조하세요.

LeaderWorkerSet(LWS)

LeaderWorkerSet(LWS)는 AI/ML 멀티노드 추론 워크로드의 일반적인 배포 패턴을 처리하는 Kubernetes 배포 API입니다. 멀티노드 제공은 각각 다른 노드에서 실행될 수 있는 여러 포드를 활용하여 분산 추론 워크로드를 처리합니다. LWS를 사용하면 여러 포드를 한 그룹으로 처리하여 분산 모델 서빙 관리를 간소화할 수 있습니다.

vLLM 및 멀티호스트 제공

컴퓨팅 집약적인 LLM을 제공하는 경우 vLLM을 사용하여 GPU에서 워크로드를 실행하는 것이 좋습니다.

vLLM은 GPU의 제공 처리량을 늘릴 수 있는 고도로 최적화된 오픈소스 LLM 제공 프레임워크로, 다음과 같은 기능을 제공합니다.

PagedAttention으로 최적화된 Transformer 구현
전체 제공 처리량을 개선하기 위한 연속적인 작업 일괄 처리
여러 GPU에서 분산 제공

단일 GPU 노드에 맞지 않는 특히 컴퓨팅 집약적인 LLM의 경우 여러 GPU 노드를 사용하여 모델을 제공할 수 있습니다. vLLM을 사용하면 두 가지 전략으로 GPU에서 워크로드를 실행할 수 있습니다.

텐서 동시 로드는 Transformer 레이어의 행렬 곱셈을 GPU 여러 개로 분할합니다. 그러나 GPU 간에 필요한 통신으로 인해 이 전략에는 빠른 네트워크가 필요하므로 노드 전반에서 워크로드를 실행하기에는 적합하지 않습니다.
파이프라인 병렬 처리는 레이어별로 또는 수직으로 모델을 분할합니다. 이 전략에는 GPU 간에 지속적인 통신이 필요하지 않으므로 노드 전반에서 모델을 실행할 때 적합한 옵션입니다.

멀티 노드 서빙에서 두 전략을 모두 사용할 수 있습니다. 예를 들어 H100 GPU가 각각 8개 있는 노드 2개를 사용하는 경우 두 전략 모두 사용할 수 있습니다.

두 노드에 걸쳐 모델을 샤딩하는 양방향 파이프라인 동시 로드
각 노드의 GPU 8개에 걸쳐 모델을 샤딩하는 8방향 텐서 동시 로드

자세한 내용은 vLLM 문서를 참조하세요.

Hugging Face 사용자 인증 정보용 Kubernetes 보안 비밀 만들기

다음 명령어를 사용하여 Hugging Face 토큰이 포함된 Kubernetes 보안 비밀을 만듭니다.

kubectl create secret generic hf-secret \
    --kubeconfig KUBECONFIG \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

KUBECONFIG를 LLM을 호스팅하려는 클러스터의 kubeconfig 파일 경로로 바꿉니다.

자체 vLLM 멀티노드 이미지 만들기

vLLM의 교차 노드 통신을 용이하게 하려면 Ray를 사용하면 됩니다. LeaderWorkerSet 저장소는 vLLM으로 Ray를 구성할 수 있도록 bash 스크립트가 포함된 Dockerfile을 제공합니다.

자체 vLLM 멀티노드 이미지를 만들려면 LeaderWorkerSet 저장소를 클론하고 제공된 Dockerfile(교차 노드 통신을 위해 Ray 구성)을 사용하여 Docker 이미지를 빌드한 후 Google Distributed Cloud에 배포할 수 있도록 이 이미지를 Artifact Registry에 내보내야 합니다.

컨테이너 빌드

컨테이너를 빌드하려면 다음 단계를 수행합니다.

LeaderWorkerSet 저장소를 클론합니다.

git clone https://github.com/kubernetes-sigs/lws.git

이미지 빌드

cd lws/docs/examples/vllm/build/ && docker build -f Dockerfile.GPU . -t vllm-multihost

이미지를 Artifact Registry로 푸시

Kubernetes 배포에서 이미지에 액세스할 수 있게 하려면 Google Cloud 프로젝트 내 Artifact Registry에 이미지를 저장합니다.

docker image tag vllm-multihost ${IMAGE_NAME}
docker push ${IMAGE_NAME}

LeaderWorkerSet 설치

LWS를 설치하려면 다음 명령어를 실행합니다.

kubectl apply --server-side \
    --kubeconfig KUBECONFIG \
    -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

다음 명령어를 사용하여 LeaderWorkerSet 컨트롤러가 lws-system 네임스페이스에서 실행 중인지 확인합니다.

kubectl get pod -n lws-system --kubeconfig KUBECONFIG

출력은 다음과 비슷합니다.

NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h

vLLM 모델 서버 배포

vLLM 모델 서버를 배포하려면 다음 단계를 수행합니다.

배포하려는 LLM에 따라 매니페스트를 만들고 적용합니다.

DeepSeek-R1

vLLM 모델 서버의 YAML 매니페스트 vllm-deepseek-r1-A3.yaml을 만듭니다.

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

다음 명령어를 실행하여 매니페스트를 적용합니다.

kubectl apply -f vllm-deepseek-r1-A3.yaml \
    --kubeconfig KUBECONFIG

Llama 3.1 405B

vLLM 모델 서버의 YAML 매니페스트 vllm-llama3-405b-A3.yaml을 만듭니다.

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

다음 명령어를 실행하여 매니페스트를 적용합니다.

kubectl apply -f vllm-llama3-405b-A3.yaml \
    --kubeconfig KUBECONFIG

다음 명령어를 사용하여 실행 중인 모델 서버에서 로그를 봅니다.

kubectl logs vllm-0 -c vllm-leader \
    --kubeconfig KUBECONFIG

출력은 다음과 비슷하게 표시됩니다.

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

모델 제공

다음 명령어를 실행하여 모델에 대한 포트 전달을 설정합니다.

kubectl port-forward svc/vllm-leader 8080:8080 \
    --kubeconfig KUBECONFIG

curl을 사용하여 모델과 상호작용

curl을 사용하여 모델과 상호작용하려면 다음 안내를 따르세요.

DeepSeek-R1

새 터미널에서 요청을 서버에 보냅니다.

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief",
    "max_tokens": 1024,
    "temperature": 0
}'

출력은 다음과 비슷하게 표시됩니다.

{
  "id": "cmpl-f2222b5589d947419f59f6e9fe24c5bd",
  "object": "text_completion",
  "created": 1738269669,
  "model": "deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": ".\n\nOkay, let's see. The user has four boxes and is moving them around. Let me try to visualize each step. \n\nFirst, the red box is placed on the bottom. So the stack starts with red. Then the blue box is put on top of red. Now the order is red (bottom), blue. Next, the yellow box is added on top of blue. So now it's red, blue, yellow. \n\nThen the user takes the blue box out. Wait, blue is in the middle. If they remove blue, the stack would be red and yellow. But where do they put the blue box? The instruction says to put it on top. So after removing blue, the stack is red, yellow. Then blue is placed on top, making it red, yellow, blue. \n\nFinally, the green box is added on the top. So the final order should be red (bottom), yellow, blue, green. Let me double-check each step to make sure I didn't mix up any steps. Starting with red, then blue, then yellow. Remove blue from the middle, so yellow is now on top of red. Then place blue on top of that, so red, yellow, blue. Then green on top. Yes, that seems right. The key step is removing the blue box from the middle, which leaves yellow on red, then blue goes back on top, followed by green. So the final order from bottom to top is red, yellow, blue, green.\n\n**Final Answer**\nThe final order from bottom to top is \\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}.\n</think>\n\n1. Start with the red box at the bottom.\n2. Place the blue box on top of the red box. Order: red (bottom), blue.\n3. Place the yellow box on top of the blue box. Order: red, blue, yellow.\n4. Remove the blue box (from the middle) and place it on top. Order: red, yellow, blue.\n5. Place the green box on top. Final order: red, yellow, blue, green.\n\n\\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 76,
    "total_tokens": 544,
    "completion_tokens": 468,
    "prompt_tokens_details": null
  }
}

Llama 3.1 405B

새 터미널에서 요청을 서버에 보냅니다.

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

출력은 다음과 비슷하게 표시됩니다.

{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c",
"object":"text_completion","created":1723238375,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}