Serve an LLM on L4 GPUs with Ray

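This guide shows how to serve large language models (LLMs) using Ray and the Ray Operator add-on on Google Kubernetes Engine (GKE).

In this guide, you can serve any of the following models:

Gemma 2B IT
Gemma 7B IT
Llama 2 7B
Llama 3 8B
Mistral 7B

This guide also covers model serving techniques supported by the Ray Serve framework, such as model multiplexing and model composition.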

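Background

The Ray framework provides an end-to-end AI/ML platform for training, fine-tuning, and inference of machine learning workloads. Ray Serve is a framework in Ray that you can use to serve popular LLMs from Hugging Face.

The number of GPUs depends on the data format of the model. In this guide, each model can use one or two L4 GPUs.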
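
As a rough illustration of why the data format matters, the sketch below estimates the memory that the model weights alone occupy and the minimum number of L4 GPUs (24 GB each) needed to hold them. The helper name min_l4_gpus is hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a sizing guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lower bound on L4 GPUs needed to hold model weights.
# Usage: min_l4_gpus PARAMS_IN_BILLIONS BYTES_PER_PARAM
min_l4_gpus() {
  # 1e9 parameters * bytes per parameter = size of the weights in GB.
  weights_gb=$(( $1 * $2 ))
  # Round up to whole GPUs, assuming 24 GB of memory per L4.
  echo $(( (weights_gb + 24 - 1) / 24 ))
}

min_l4_gpus 7 2   # 7B parameters in a 2-byte format (about 14 GB of weights)
min_l4_gpus 8 4   # 8B parameters in a 4-byte format (about 32 GB of weights)
```

Under these assumptions, a 7B model in a 2-byte format such as bfloat16 fits on one L4, while an 8B model in a 4-byte format has to be split across two.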

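This guide covers the following steps:

Create an Autopilot or Standard GKE cluster with the Ray Operator add-on enabled.
Deploy a RayService resource that downloads and serves a large language model (LLM) from Hugging Face.
Deploy a chat interface and converse with the LLMs.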

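Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. Setting default locations helps you avoid gcloud CLI errors such as the following: One of [--zone, --region] must be supplied: Please specify location.

Create a Hugging Face account, if you don't already have one.
Ensure that you have a Hugging Face token.
Ensure that you have access to the Hugging Face model that you want to use. This is usually granted by signing an agreement and requesting access from the model owner on the Hugging Face model page.
Ensure that you have GPU quota in the us-central1 region. To learn more, see GPU quota.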

Prepare your environment

In the Google Cloud console, start a Cloud Shell instance.

Clone the sample repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/llm
export TUTORIAL_HOME=`pwd`

Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export COMPUTE_REGION=us-central1
export CLUSTER_VERSION=CLUSTER_VERSION
export HF_TOKEN=HUGGING_FACE_TOKEN
```
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
- HUGGING_FACE_TOKEN: your Hugging Face access token.
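
Before creating the cluster, it can help to confirm that every variable is set and that the version meets the 1.30.1 minimum. The following is a minimal sketch; check_tutorial_env and version_at_least are hypothetical helper names, and the comparison assumes a sort command that supports the -V (version sort) option, as GNU coreutils in Cloud Shell does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the sample repository.

# Succeeds when version $1 (for example 1.30.2 or 1.30.2-gke.100) is >= $2.
version_at_least() {
  # Reject unreplaced placeholders such as the literal text CLUSTER_VERSION.
  case $1 in
    [0-9]*.[0-9]*) ;;
    *) return 1 ;;
  esac
  # Drop any "-gke.N" suffix, then check that $2 sorts first (or ties).
  [ "$(printf '%s\n%s\n' "${1%%-*}" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when all tutorial variables are set and the version is new enough.
check_tutorial_env() {
  for name in PROJECT_ID COMPUTE_REGION CLUSTER_VERSION HF_TOKEN; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
  if ! version_at_least "$CLUSTER_VERSION" 1.30.1; then
    echo "CLUSTER_VERSION must be 1.30.1 or later" >&2
    return 1
  fi
}

# Example: check_tutorial_env && echo "environment looks complete"
```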

Create a cluster with a GPU node pool

You can serve an LLM on L4 GPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on. We generally recommend Autopilot for a fully managed Kubernetes experience. Choose a Standard cluster instead if your use case requires high scalability or if you want more control over cluster configuration. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

Create an Autopilot cluster with the Ray Operator add-on enabled:

gcloud container clusters create-auto rayserve-cluster \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

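Standard

Create a Standard cluster with the Ray Operator add-on enabled. The command reads ${COMPUTE_ZONE}, which the environment setup earlier in this guide does not define, so export COMPUTE_ZONE first with a zone in ${COMPUTE_REGION} that offers L4 GPUs: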

gcloud container clusters create rayserve-cluster \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION} \
    --machine-type=g2-standard-24 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=2 \
    --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest

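Create a Kubernetes Secret for Hugging Face credentials

In Cloud Shell, create a Kubernetes Secret by doing the following:

Configure kubectl to communicate with your cluster. The cluster created earlier is named rayserve-cluster. If you created a zonal Standard cluster, pass its zone to --location instead of the region:

gcloud container clusters get-credentials rayserve-cluster --location=${COMPUTE_REGION}

Create a Kubernetes Secret that contains the Hugging Face token: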

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
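
To confirm that the Secret holds a non-empty token without echoing the token itself, a check along the following lines could be used. This is a sketch: hf_secret_token_length is a hypothetical helper name, and it assumes the Secret name (hf-secret) and key (hf_api_token) from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the byte length of the stored token, never the token.
hf_secret_token_length() {
  # Secret data is base64-encoded, so decode it before counting bytes.
  kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' \
    | base64 --decode \
    | wc -c \
    | tr -d '[:space:]'
}

# Example: [ "$(hf_secret_token_length)" -gt 0 ] && echo "token is present"
```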

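Deploy the LLM model

The GitHub repository that you cloned has a directory for each model that includes a RayService configuration. The configuration for each model includes the following components:

Ray Serve deployment: The Ray Serve deployment, which includes resource configuration and runtime dependencies.
Model: The Hugging Face model ID.
Ray cluster: The underlying Ray cluster and the resources required for each component, which includes head and worker Pods.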

Gemma 2B IT

Deploy the model:
```
kubectl apply -f gemma-2b-it/
```

Wait for the RayService resource to be ready:

kubectl get rayservice gemma-2b-it -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.
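
Because downloading a model can take several minutes, rerunning the command by hand gets tedious. The following sketch polls until the application reports RUNNING. The helper name wait_for_rayservice and the default timeout and interval are assumptions, and the jsonpath expression assumes the status layout shown above, where the application is named llm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a RayService until its "llm" application is RUNNING.
# Usage: wait_for_rayservice NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_rayservice() {
  name=$1
  max_wait=${2:-1800}
  interval=${3:-15}
  waited=0
  app_status=""
  while [ "$waited" -le "$max_wait" ]; do
    # Read only the application status field from the resource.
    app_status=$(kubectl get rayservice "$name" \
      -o jsonpath='{.status.activeServiceStatus.applicationStatuses.llm.status}' \
      2>/dev/null)
    if [ "$app_status" = "RUNNING" ]; then
      echo "$name is RUNNING"
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "timed out waiting for $name (last status: ${app_status:-unknown})" >&2
  return 1
}

# Example: wait_for_rayservice gemma-2b-it
```

The same helper works for the other RayService resources in this guide by passing a different name.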

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service gemma-2b-it-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-2b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Gemma 7B IT

Deploy the model:
```
kubectl apply -f gemma-7b-it/
```

Wait for the RayService resource to be ready:

kubectl get rayservice gemma-7b-it -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service gemma-7b-it-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-7b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Llama 2 7B

Deploy the model:
```
kubectl apply -f llama-2-7b/
```

Wait for the RayService resource to be ready:

kubectl get rayservice llama-2-7b -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service llama-2-7b-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-2-7b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Llama 3 8B

Deploy the model:
```
kubectl apply -f llama-3-8b/
```

Wait for the RayService resource to be ready:

kubectl get rayservice llama-3-8b -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service llama-3-8b-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-3-8b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Mistral 7B

Deploy the model:
```
kubectl apply -f mistral-7b/
```

Wait for the RayService resource to be ready:

kubectl get rayservice mistral-7b -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service mistral-7b-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
mistral-7b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

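Serve the model

The Llama 2 7B and Llama 3 8B models use the OpenAI API chat spec. The other models only support text generation, which is a technique that generates text based on a prompt.

Set up port forwarding

Set up port forwarding to the inference server: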

Gemma 2B IT

kubectl port-forward svc/gemma-2b-it-serve-svc 8000:8000

Gemma 7B IT

kubectl port-forward svc/gemma-7b-it-serve-svc 8000:8000

Llama 2 7B

kubectl port-forward svc/llama-2-7b-serve-svc 8000:8000

Llama 3 8B

kubectl port-forward svc/llama-3-8b-serve-svc 8000:8000

Mistral 7B

kubectl port-forward svc/mistral-7b-serve-svc 8000:8000

Interact with the model using curl

Use curl to chat with your model:

Gemma 2B IT

In a new terminal session:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Gemma 7B IT

In a new terminal session:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Llama 2 7B

In a new terminal session:

curl http://localhost:8000/v1/chat/completions     -H "Content-Type: application/json"     -d '{
      "model": "meta-llama/Llama-2-7b-chat-hf",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
      ],
      "temperature": 0.7
    }'

Llama 3 8B

In a new terminal session:

curl http://localhost:8000/v1/chat/completions     -H "Content-Type: application/json"     -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
      ],
      "temperature": 0.7
    }'

Mistral 7B

In a new terminal session:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

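The models that you served don't retain any history, so each message and reply must be sent back to the model to create an interactive dialogue. The following example shows how to create an interactive dialogue using the Llama 3 8B model.

Create a dialogue with the model using curl: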

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."},
        {"role": "assistant", "content": " \n1. Java\n2. Python\n3. C++\n4. C#\n5. JavaScript"},
        {"role": "user", "content": "Can you give me a brief description?"}
      ],
      "temperature": 0.7
}'

The output is similar to the following:

{
  "id": "cmpl-3cb18c16406644d291e93fff65d16e41",
  "object": "chat.completion",
  "created": 1719035491,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a brief description of each:\n\n1. **Java**: A versatile language for building enterprise-level applications, Android apps, and web applications.\n2. **Python**: A popular language for data science, machine learning, web development, and scripting, known for its simplicity and ease of use.\n3. **C++**: A high-performance language for building operating systems, games, and other high-performance applications, with a focus on efficiency and control.\n4. **C#**: A modern, object-oriented language for building Windows desktop and mobile applications, as well as web applications using .NET.\n5. **JavaScript**: A versatile language for client-side scripting on the web, commonly used for creating interactive web pages, web applications, and mobile apps.\n\nNote: These descriptions are brief and don't do justice to the full capabilities and uses of each language."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 73,
    "total_tokens": 245,
    "completion_tokens": 172
  }
}
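
The curl example above resends the whole history by hand. A small set of shell helpers can accumulate it instead. This is a sketch under stated assumptions: the helper names (json_escape, add_message, build_chat_request, send_chat) are hypothetical, the escaping handles only backslashes, double quotes, and newlines, and the endpoint is the port-forwarded address used earlier in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that keep the chat history in the MESSAGES variable.
MESSAGES=""

# Escape backslashes, double quotes, and newlines for a JSON string literal.
# Other control characters (tabs, for example) are not handled.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Append one message to the history. Usage: add_message ROLE CONTENT
add_message() {
  entry=$(printf '{"role": "%s", "content": "%s"}' "$1" "$(json_escape "$2")")
  if [ -z "$MESSAGES" ]; then
    MESSAGES=$entry
  else
    MESSAGES="$MESSAGES, $entry"
  fi
}

# Print the request body containing the full history. Usage: build_chat_request MODEL
build_chat_request() {
  printf '{"model": "%s", "messages": [%s], "temperature": 0.7}' "$1" "$MESSAGES"
}

# Send the full history to the chat completions endpoint. Usage: send_chat MODEL
send_chat() {
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(build_chat_request "$1")"
}

# Example turn:
#   add_message system "You are a helpful assistant."
#   add_message user "What are the top 5 most popular programming languages?"
#   send_chat meta-llama/Meta-Llama-3-8B-Instruct
```

After each reply, call add_message assistant with the returned content before adding the next user message. Extracting that content from the JSON response needs a JSON parser and is left out of this sketch.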

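(Optional) Connect to the chat interface

You can use Gradio to build web applications that let you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots. For Llama 2 7B and Llama 3 8B, Gradio was installed when you deployed the LLM model.

Set up port forwarding to the gradio Service: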
```
kubectl port-forward service/gradio 8080:8080 &
```
Open http://localhost:8080 in your browser to chat with the model.

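Serve multiple models with model multiplexing

Model multiplexing is a technique used to serve multiple models within the same Ray cluster. You can route traffic to specific models by using request headers or by load balancing.

In this example, you create a multiplexed Ray Serve application consisting of two models: Gemma 7B IT and Llama 3 8B.

Deploy the RayService resource: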
```
kubectl apply -f model-multiplexing/
```

Wait for the RayService resource to be ready:

kubectl get rayservice model-multiplexing -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Kubernetes Service for the Ray Serve application:

kubectl get service model-multiplexing-serve-svc

The output is similar to the following:

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-multiplexing-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Set up port forwarding to the Ray Serve application:
```
kubectl port-forward svc/model-multiplexing-serve-svc 8000:8000
```

Send a request to the Gemma 7B IT model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: google/gemma-7b-it" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'

The output is similar to the following:

{"text": ["What are the top 5 most popular programming languages? Please be brief.\n\n1. JavaScript\n2. Java\n3. C++\n4. Python\n5. C#"]}

Send a request to the Llama 3 8B model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: meta-llama/Meta-Llama-3-8B-Instruct" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'

The output is similar to the following:

{"text": ["What are the top 5 most popular programming languages? Please be brief. Here are your top 5 most popular programming languages, based on the TIOBE Index, a widely used measure of the popularity of programming languages.\r\n\r\n1. **Java**: Used in Android app development, web development, and enterprise software development.\r\n2. **Python**: A versatile language used in data science, machine learning, web development, and automation.\r\n3. **C++**: A high-performance language used in game development, system programming, and high-performance computing.\r\n4. **C#**: Used in Windows and web application development, game development, and enterprise software development.\r\n5. **JavaScript**: Used in web development, mobile app development, and server-side programming with technologies like Node.js.\r\n\r\nSource: TIOBE Index (2022).\r\n\r\nThese rankings can vary depending on the source and methodology used, but this gives you a general idea of the most popular programming languages."]}

Send a request to a random model by excluding the serve_multiplexed_model_id header:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'

The output is one of the outputs from the previous steps.
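
The three requests above differ only in whether the serve_multiplexed_model_id header is present and which model ID it carries. The following sketch wraps that choice in one function. The name ask_model is hypothetical, and the prompt is inserted into the JSON body without escaping, so it must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: send a prompt to the multiplexed application.
# Usage: ask_model PROMPT [MODEL_ID]
# With MODEL_ID, the request is routed to that model through the
# serve_multiplexed_model_id header. Without it, the header is omitted.
ask_model() {
  body=$(printf '{"prompt": "%s", "max_tokens": 200}' "$1")
  if [ -n "${2:-}" ]; then
    curl -s -X POST http://localhost:8000/ \
      -H "Content-Type: application/json" \
      -H "serve_multiplexed_model_id: $2" \
      -d "$body"
  else
    curl -s -X POST http://localhost:8000/ \
      -H "Content-Type: application/json" \
      -d "$body"
  fi
}

# Examples:
#   ask_model "Name one programming language." google/gemma-7b-it
#   ask_model "Name one programming language." meta-llama/Meta-Llama-3-8B-Instruct
#   ask_model "Name one programming language."
```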

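Compose multiple models with model composition

Model composition is a technique used to compose multiple models into a single application. Model composition lets you chain together inputs and outputs across multiple LLMs and scale your models as a single application.

In this example, you compose two models, Gemma 7B IT and Llama 3 8B, into a single application. The first model is the assistant model that answers questions provided in the prompt. The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is the summarized version of the response from the assistant model.

Deploy the RayService resource: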
```
kubectl apply -f model-composition/
```

Wait for the RayService resource to be ready:

kubectl get rayservice model-composition -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service model-composition-serve-svc

The output is similar to the following:

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-composition-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

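Set up port forwarding to the Ray Serve application. If a port forward from an earlier section is still using local port 8000, stop it first:

kubectl port-forward svc/model-composition-serve-svc 8000:8000

In a new terminal session, send a request to the model: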

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

The output is similar to the following:

{"text": ["\n\n**Sure, here is a summary in a single sentence:**\n\nThe most popular programming language for machine learning is Python due to its ease of use, extensive libraries, and growing community."]}

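Delete the project

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.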

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

If you used an existing project and you don't want to delete it, delete the individual resources.

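Delete the cluster. Pass the same location that you used to create it (the region for Autopilot, or the zone for a zonal Standard cluster):

gcloud container clusters delete rayserve-cluster --location=${COMPUTE_REGION}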

What's next

Learn how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Train a model with GPUs on GKE Standard mode.
Learn how to use RayServe on GKE by viewing the sample code in GitHub.