Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI Prediction with Saxml

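Llama 3 is an open source large language model (LLM) from Meta. This guide shows how to serve a Llama 3 LLM using multi-host Tensor Processing Units (TPUs) on Vertex AI Prediction with Saxml.

In this guide, you download the Llama 3 70B model weights and tokenizer and deploy them on Vertex AI Prediction, which runs Saxml on TPUs.

Before you begin

We recommend that you use an M2 memory-optimized VM to download the model and convert it to Saxml. The model conversion process requires significant memory and might fail if you choose a machine type that doesn't have enough memory.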

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Vertex AI and Artifact Registry APIs.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
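Install Docker by following the Artifact Registry documentation.
Make sure that you have sufficient quota for 16 TPU v5e chips for Vertex AI Prediction.

This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, perform the following additional configuration: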

Install the Google Cloud CLI.
To initialize the gcloud CLI, run the following command:
```
gcloud init
```

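If you're using a different shell instead of Cloud Shell for model deployment, make sure that the Google Cloud CLI version is 475.0.0 or later. You can update the Google Cloud CLI by running the gcloud components update command.

If you're deploying your model using the Vertex AI SDK, make sure that you have version 1.50.0 or later.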
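
The minimum-version requirements above can be checked from a script. The following is a minimal sketch, not an official tool: version_ge is a hypothetical helper name, it relies on the version sort (sort -V) provided by GNU coreutils, and it assumes that the first line of gcloud version output looks like "Google Cloud SDK 475.0.0".

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query gcloud when it is installed. The parsing assumes that the first
# line of `gcloud version` looks like "Google Cloud SDK 475.0.0".
if command -v gcloud >/dev/null 2>&1; then
  installed="$(gcloud version 2>/dev/null | awk '/^Google Cloud SDK/ {print $4; exit}')"
  if version_ge "$installed" "475.0.0"; then
    echo "gcloud CLI $installed is recent enough"
  else
    echo "gcloud CLI $installed is older than 475.0.0; run: gcloud components update"
  fi
fi
```

The same helper can compare a Vertex AI SDK version string against 1.50.0, for example version_ge "$sdk_version" "1.50.0", where sdk_version is whatever version string you obtained for the installed SDK.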

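Get access to the model and download the model weights

The following steps are for a Vertex AI Workbench instance that has an M2 memory-optimized VM. For information about changing the machine type of a Vertex AI Workbench instance, see Change the machine type of a Vertex AI Workbench instance.

Go to the Llama model consent page.
Select Llama 3, fill out the consent form, and accept the terms and conditions.
Check your inbox for an email that contains a signed URL.

Run the following commands to download the download.sh script from GitHub and make it executable: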

wget https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh
chmod +x download.sh

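To download the model weights, run the download.sh script that you downloaded from GitHub.
When prompted, enter the signed URL from the email that you received in the previous step.
When prompted for the models to download, enter 70B.

Convert the model weights to Saxml format

Run the following command to download Saxml: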
```
git clone https://github.com/google/saxml.git
```
Run the following commands to set up a Python virtual environment:
```
python -m venv .
source bin/activate
```

Run the following commands to install the dependencies:

pip install --upgrade pip

pip install paxml

pip install praxis

pip install torch

To convert the model weights to Saxml format, run the following command:
```
python3 saxml/saxml/tools/convert_llama_ckpt.py \
    --base PATH_TO_META_LLAMA3 \
    --pax PATH_TO_PAX_LLAMA3 \
    --model-size llama3_70b
```
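Replace the following:
- PATH_TO_META_LLAMA3: the path to the directory that contains the downloaded model weights
- PATH_TO_PAX_LLAMA3: the path to the directory in which to store the converted model weights
Note: You can use this command for any Llama 2 or Llama 3 model.

The converted model is saved in the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder.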
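Copy the tokenizer file from the original directory into a vocabs subfolder, creating the subfolder first if it doesn't exist:
```
mkdir -p $PATH_TO_PAX_LLAMA3/vocabs
cp $PATH_TO_META_LLAMA3/tokenizer.model $PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```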

Add an empty commit_success.txt file to the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder and to its metadata and state subfolders:

touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt

The $PATH_TO_PAX_LLAMA3 folder now contains the following folders and files:

$PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
$PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
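
Before uploading, it can help to confirm that the converted directory matches the layout listed above. The following is a minimal sketch under that assumption; check_pax_layout is a hypothetical helper name, and it checks only the marker files and the tokenizer, not the checkpoint contents.

```shell
# check_pax_layout DIR: reports any expected path that is missing under DIR.
# Returns 0 when every path exists and 1 otherwise.
check_pax_layout() {
  local root="$1" missing=0 rel
  for rel in \
    checkpoint_00000000/commit_success.txt \
    checkpoint_00000000/metadata/commit_success.txt \
    checkpoint_00000000/state/commit_success.txt \
    vocabs/tokenizer.model
  do
    if [ ! -e "$root/$rel" ]; then
      echo "missing: $root/$rel" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_pax_layout "$PATH_TO_PAX_LLAMA3" prints each missing path to standard error and returns a nonzero status if anything is absent.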

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store the converted model weights.

In Cloud Shell, run the following commands, replacing PROJECT_ID with your project ID:
```
projectid=PROJECT_ID
gcloud config set project ${projectid}
```
To create the bucket, run the following command:
```
gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME
```
Replace WEIGHTS_BUCKET_NAME with the name that you want to use for the bucket.

Copy the model weights to the Cloud Storage bucket

To copy the model weights to your bucket, run the following command:

gcloud storage cp PATH_TO_PAX_LLAMA3/* gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/ --recursive

Upload the model

A prebuilt Saxml container is available at us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest.

To upload a Model resource to Vertex AI Prediction using the prebuilt Saxml container, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
    --artifact-uri='gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b' \
    --container-args='--model_path=saxml.server.pax.lm.params.lm_cloud.LLaMA3_70BFP16x16' \
    --container-args='--platform_chip=tpuv5e' \
    --container-args='--platform_topology=4x4' \
    --container-args='--ckpt_path_suffix=checkpoint_00000000' \
    --container-deployment-timeout-seconds=2700 \
    --container-ports=8502 \
    --project=PROJECT_ID

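Replace the following:

LOCATION: the region where you are using Vertex AI. TPUs are available only in us-west1.
MODEL_DISPLAY_NAME: the display name that you want for your model
PROJECT_ID: the ID of your Google Cloud project

Create an online prediction endpoint

To create the endpoint, run the following command: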

gcloud ai endpoints create \
    --region=LOCATION \
    --display-name=ENDPOINT_DISPLAY_NAME \
    --project=PROJECT_ID

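Replace ENDPOINT_DISPLAY_NAME with the display name that you want for your endpoint.

Deploy the model to the endpoint

After the endpoint is ready, deploy the model to the endpoint.

This tutorial deploys a Llama 3 70B model that is sharded across 16 Cloud TPU v5e chips using a 4x4 topology. However, you can specify any of the following supported multi-host Cloud TPU topologies: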

Machine type	Topology	Number of TPU chips	Number of hosts
`ct5lp-hightpu-4t`	4x4	16	2
`ct5lp-hightpu-4t`	4x8	32	4
`ct5lp-hightpu-4t`	8x8	64	8
`ct5lp-hightpu-4t`	8x16	128	16
`ct5lp-hightpu-4t`	16x16	256	32
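
In the table above, the number of TPU chips equals the product of the two topology dimensions, which is useful when checking a chosen topology against your quota. The following is a minimal sketch of that arithmetic; topology_chips is a hypothetical helper name and is not part of the gcloud CLI.

```shell
# topology_chips TOPOLOGY: prints the product of the topology dimensions,
# for example 4x4 gives 16.
topology_chips() {
  local dims="$1" product=1 d
  local IFS=x            # split the topology string on the letter x
  for d in $dims; do
    product=$((product * d))
  done
  echo "$product"
}
```

For example, topology_chips 4x8 prints the chip count for the 4x8 row, which you can compare with the TPU v5e quota available to your project.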

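If you're deploying a different Llama model that's defined in the Saxml GitHub repository, make sure that the model is partitioned to match the number of devices that you're targeting and that the Cloud TPU has sufficient memory to load the model.

For information about deploying a model on single-host Cloud TPUs, see Deploy a model.

For more information about Cloud TPU v5e types, see TPU v5e.

Get the endpoint ID for the online prediction endpoint: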

ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)")

Get the model ID for your model:

MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)")
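
The value(name) format typically yields a full resource name rather than a bare numeric ID, and the gcloud commands in this tutorial are expected to accept either form. If a bare ID is needed elsewhere, the following sketch strips everything up to the last slash. It assumes that the name has the shape projects/PROJECT/locations/LOCATION/endpoints/ID (or the models equivalent); resource_id is a hypothetical helper name.

```shell
# resource_id NAME: prints the last path segment of a resource name.
# A value that contains no slash is printed unchanged.
resource_id() {
  printf '%s\n' "${1##*/}"
}
```

For example, resource_id "$ENDPOINT_ID" prints only the trailing ID segment of the endpoint name.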

Deploy the model to the endpoint:
```
gcloud ai endpoints deploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --model=$MODEL_ID \
    --display-name=DEPLOYED_MODEL_NAME \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x4 \
    --traffic-split=0=100
```
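Replace DEPLOYED_MODEL_NAME with a name for the deployed model. It can be the same as the model display name (MODEL_DISPLAY_NAME).

The deployment operation might time out.

The deploy-model command returns an operation ID that you can use to check when the operation is finished. You can poll the status of the operation until the response includes "done": true. Use the following command to poll the status: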
```
gcloud ai operations describe \
--region=LOCATION \
OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
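
Polling by hand can be replaced with a small loop. The following is a minimal sketch, not an official tool: poll_until_done is a hypothetical helper that reruns any command until its output is the single word true (case-insensitive) or the attempt limit is reached. It assumes that adding --format="value(done)" to the describe command prints True once the operation finishes and prints nothing before that.

```shell
# poll_until_done INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
# Reruns COMMAND until it prints "true" (any letter case) on a line by itself.
# Returns 0 on success and 1 when MAX_ATTEMPTS is exhausted.
poll_until_done() {
  local interval="$1" max_attempts="$2"
  shift 2
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failing command is treated as "not done yet" and retried.
    status="$("$@" 2>/dev/null || true)"
    if printf '%s' "$status" | grep -qi '^true$'; then
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical usage: check once a minute, for up to an hour.
# poll_until_done 60 60 gcloud ai operations describe OPERATION_ID \
#     --region=LOCATION --format="value(done)"
```

Passing the command as arguments keeps the loop independent of gcloud, so the interval and attempt limit can be tuned without touching the polling logic.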

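Get online predictions from the deployed model

To get online predictions from the Vertex AI Prediction endpoint, run the gcloud ai endpoints predict command.

Run the following command to create a request.json file that contains a sample prediction request: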

cat << EOF > request.json
{"instances": [{"text_batch": "the distance between Earth and Moon is "}]}
EOF
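
To send a different prompt, the same request body can be generated from a shell variable. The following is a minimal sketch that mirrors the request shape shown above; make_request is a hypothetical helper name, and it escapes only backslashes and double quotes, so prompts that contain newlines or other control characters need a real JSON encoder.

```shell
# make_request PROMPT: prints a prediction request body for a single prompt.
make_request() {
  local escaped
  # Escape backslashes first, then double quotes, so the prompt stays valid JSON.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"instances": [{"text_batch": "%s"}]}\n' "$escaped"
}

# Write a request for a custom prompt to request.json.
make_request 'the distance between Earth and Moon is ' > request.json
```

Because the prompt is passed to printf as an argument rather than as the format string, percent signs in the prompt are written through unchanged.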

To send the online prediction request to the endpoint, run the following command:

gcloud ai endpoints predict $ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --json-request=request.json

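Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial.

To undeploy the model from the endpoint and delete the endpoint, run the following commands: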

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet

To delete your model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet