Llama 3 is an open-source large language model (LLM) from Meta. This guide shows you how to serve a Llama 3 LLM using multi-host Tensor Processing Units (TPUs) on Vertex AI Prediction with Saxml.
In this guide, you download the Llama 3 70B model weights and tokenizer and deploy them to a Vertex AI Prediction endpoint that runs Saxml on TPUs.
Before you begin
We recommend that you use an M2 memory-optimized VM for downloading the model and converting it to the Saxml format. The conversion process requires significant memory and can fail on a machine type that doesn't have enough memory.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Vertex AI and Artifact Registry APIs.
- In the Google Cloud console, activate Cloud Shell.
  At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Follow the Artifact Registry documentation to Install Docker.
- Ensure that you have sufficient quotas for 16 TPU v5e chips for Vertex AI Prediction.
This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, then perform the following additional configuration:
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
If you're using a different shell instead of Cloud Shell for model deployment, make sure that the Google Cloud CLI version is later than 475.0.0. You can update the Google Cloud CLI by running the gcloud components update command.
If you're deploying your model using the Vertex AI SDK, make sure that you have version 1.50.0 or later.
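To confirm that your environment meets these version requirements, you can check the installed versions first. This is a minimal sketch, assuming the Vertex AI SDK is installed as the google-cloud-aiplatform Python package (its name on PyPI):

# Print the installed Google Cloud CLI version (must be later than 475.0.0).
gcloud version
# Update CLI components if needed.
gcloud components update
# Print the installed Vertex AI SDK version (must be 1.50.0 or later).
pip show google-cloud-aiplatform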
Get access to the model and download the model weights
The following steps are for a Vertex AI Workbench instance that has an M2 memory-optimized VM. For information on changing the machine type of a Vertex AI Workbench instance, see Change machine type of a Vertex AI Workbench instance.
Go to the Llama model consent page.
Select Llama 3, fill out the consent form, and accept the terms and conditions.
Check your inbox for an email containing a signed URL.
Download the download.sh script from GitHub by executing the following commands:
wget https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh
chmod +x download.sh
To download the model weights, run the download.sh script that you downloaded from GitHub. When prompted, enter the signed URL from the email you received in the previous section. When prompted for the models to download, enter 70B.
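The interaction looks roughly like the following. This is a sketch of the prompts rather than exact script output, and the wording may differ between versions of download.sh:

./download.sh
# Prompt: Enter the URL from email -> paste the signed URL from your email
# Prompt: Enter the list of models to download -> enter 70B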
Convert the model weights to Saxml format
Run the following command to download Saxml:
git clone https://github.com/google/saxml.git
Run the following commands to configure a Python virtual environment:
python -m venv .
source bin/activate
Run the following commands to install dependencies:
pip install --upgrade pip
pip install paxml
pip install praxis
pip install torch
To convert the model weights to Saxml format, run the following command:
python3 saxml/saxml/tools/convert_llama_ckpt.py \
  --base PATH_TO_META_LLAMA3 \
  --pax PATH_TO_PAX_LLAMA3 \
  --model-size llama3_70b
Replace the following:
- PATH_TO_META_LLAMA3: the path to the directory containing the downloaded model weights
- PATH_TO_PAX_LLAMA3: the path to the directory in which to store the converted model weights
Converted models are put into the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder.
Copy the tokenizer file from the original directory into a subfolder named vocabs as follows:
cp $PATH_TO_META_LLAMA3/tokenizer.model $PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
Add an empty commit_success.txt file in the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder and in its metadata and state subfolders, as follows:
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
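If the vocabs, metadata, or state subfolders don't exist yet (whether the conversion script creates them can vary), the cp and touch commands above fail with "No such file or directory". In that case, create the folders first:

mkdir -p $PATH_TO_PAX_LLAMA3/vocabs
mkdir -p $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata
mkdir -p $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state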
The $PATH_TO_PAX_LLAMA3 folder now contains the following folders and files:
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
$PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
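As a quick sanity check (not part of the original flow), you can list the folder and compare it against the structure above:

find $PATH_TO_PAX_LLAMA3 | sort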
Create a Cloud Storage bucket
Create a Cloud Storage bucket to store the converted model weights.
In Cloud Shell, run the following commands, replacing PROJECT_ID with your project ID:
projectid=PROJECT_ID
gcloud config set project ${projectid}
To create the bucket, run the following command:
gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME
Replace WEIGHTS_BUCKET_NAME with the name you want to use for the bucket.
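If you don't specify a location, the bucket is created in the default US multi-region. Because the model server reads the weights from this bucket at deployment time, you may want to co-locate the bucket with your endpoint's region (us-west1 in this tutorial). This is a suggestion, not a requirement of the tutorial; --location is a standard flag of the same command:

gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME --location=us-west1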
Copy the model weights to the Cloud Storage bucket
To copy the model weights to your bucket, run the following command:
gcloud storage cp $PATH_TO_PAX_LLAMA3/* gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/ --recursive
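To confirm that the upload succeeded, you can list the bucket contents recursively:

gcloud storage ls --recursive gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/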
Upload the model
A prebuilt Saxml container is available at us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest.
To upload a Model resource to Vertex AI Prediction using the prebuilt Saxml container, run the gcloud ai models upload command as follows:
gcloud ai models upload \
--region=LOCATION \
--display-name=MODEL_DISPLAY_NAME \
--container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
--artifact-uri='gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b' \
--container-args='--model_path=saxml.server.pax.lm.params.lm_cloud.LLaMA3_70BFP16x16' \
--container-args='--platform_chip=tpuv5e' \
--container-args='--platform_topology=4x4' \
--container-args='--ckpt_path_suffix=checkpoint_00000000' \
--container-deployment-timeout-seconds=2700 \
--container-ports=8502 \
--project=PROJECT_ID
Make the following replacements:
- LOCATION: the region where you are using Vertex AI. Note that TPUs are only available in us-west1.
- MODEL_DISPLAY_NAME: the display name you want for your model
- PROJECT_ID: the ID of your Google Cloud project
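After the upload completes, you can confirm that the Model resource exists. This uses the same gcloud ai models list command that the deployment steps below rely on:

gcloud ai models list \
  --region=LOCATION \
  --filter=display_name=MODEL_DISPLAY_NAME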
Create an online prediction endpoint
To create the endpoint, run the following command:
gcloud ai endpoints create \
--region=LOCATION \
--display-name=ENDPOINT_DISPLAY_NAME \
--project=PROJECT_ID
Replace ENDPOINT_DISPLAY_NAME with the display name you want for your endpoint.
Deploy the model to the endpoint
After the endpoint is ready, deploy the model to the endpoint.
In this tutorial, you deploy a Llama 3 70B model that's sharded for 16 Cloud TPU v5e chips using 4x4 topology. However, you can specify any of the following supported multi-host Cloud TPU topologies:
Machine type | Topology | Number of TPU chips | Number of hosts
---|---|---|---
ct5lp-hightpu-4t | 4x4 | 16 | 4
ct5lp-hightpu-4t | 4x8 | 32 | 8
ct5lp-hightpu-4t | 8x8 | 64 | 16
ct5lp-hightpu-4t | 8x16 | 128 | 32
ct5lp-hightpu-4t | 16x16 | 256 | 64
If you're deploying a different Llama model that's defined in the Saxml GitHub repo, make sure that it's partitioned to match the number of devices you're targeting and that Cloud TPU has sufficient memory to load the model.
For information about deploying a model on single-host Cloud TPUs, see Deploy a model.
For more information about Cloud TPU v5e types, see TPU v5e.
Get the endpoint ID for the online prediction endpoint:
ENDPOINT_ID=$(gcloud ai endpoints list \
  --region=LOCATION \
  --filter=display_name=ENDPOINT_DISPLAY_NAME \
  --format="value(name)")
Get the model ID for your model:
MODEL_ID=$(gcloud ai models list \
  --region=LOCATION \
  --filter=display_name=MODEL_DISPLAY_NAME \
  --format="value(name)")
Deploy the model to the endpoint:
gcloud ai endpoints deploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --model=$MODEL_ID \
  --display-name=DEPLOYED_MODEL_NAME \
  --machine-type=ct5lp-hightpu-4t \
  --tpu-topology=4x4 \
  --traffic-split=0=100
Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).
The deployment operation might time out. The deploy-model command returns an operation ID that you can use to check when the operation is finished. You can poll the status of the operation until the response includes "done": true. Use the following command to poll the status:
gcloud ai operations describe \
  --region=LOCATION \
  OPERATION_ID
Replace OPERATION_ID with the operation ID that was returned by the previous command.
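If you'd rather wait in a script than re-run the command by hand, a simple polling loop works. This is a sketch; it assumes that gcloud prints the operation's done field as True once the operation completes:

# Poll every 60 seconds until the deployment operation reports done.
while [ "$(gcloud ai operations describe \
    --region=LOCATION \
    OPERATION_ID \
    --format='value(done)')" != "True" ]; do
  echo "Still deploying..."
  sleep 60
done
echo "Deployment finished."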
Get online predictions from the deployed model
To get online predictions from the Vertex AI Prediction endpoint, run the gcloud ai endpoints predict command.
Run the following command to create a request.json file containing a sample prediction request:
cat << EOF > request.json
{"instances": [{"text_batch": "the distance between Earth and Moon is "}]}
EOF
To send the online prediction request to the endpoint, run the following command:
gcloud ai endpoints predict $ENDPOINT_ID \
  --project=PROJECT_ID \
  --region=LOCATION \
  --json-request=request.json
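If you prefer to call the REST API directly, the endpoints predict method accepts the same request body. This sketch assumes that $ENDPOINT_ID holds the full endpoint resource name (projects/.../locations/.../endpoints/...), which is what the earlier gcloud ai endpoints list --format="value(name)" command returns:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://LOCATION-aiplatform.googleapis.com/v1/${ENDPOINT_ID}:predict" \
  -d @request.json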
Clean up
To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:
To undeploy the model from the endpoint and delete the endpoint, run the following commands:
ENDPOINT_ID=$(gcloud ai endpoints list \
  --region=LOCATION \
  --filter=display_name=ENDPOINT_DISPLAY_NAME \
  --format="value(name)")
DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
  --region=LOCATION \
  --format="value(deployedModels.id)")
gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID
gcloud ai endpoints delete $ENDPOINT_ID \
  --region=LOCATION \
  --quiet
To delete your model, run the following commands:
MODEL_ID=$(gcloud ai models list \
  --region=LOCATION \
  --filter=display_name=MODEL_DISPLAY_NAME \
  --format="value(name)")
gcloud ai models delete $MODEL_ID \
  --region=LOCATION \
  --quiet
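This tutorial also created a Cloud Storage bucket for the converted weights, which the steps above don't remove. To avoid ongoing storage charges, delete the bucket and its contents once you no longer need the converted checkpoint:

gcloud storage rm --recursive gs://WEIGHTS_BUCKET_NAME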