vLLM inference on v6e TPUs
This tutorial shows you how to run vLLM inference on v6e TPUs. It also shows you how to run the benchmark script for the Meta Llama-3.1 8B model.
To get started with vLLM on v6e TPUs, see the vLLM quickstart.
If you are using GKE, also see the GKE tutorial.
Before you begin
You must sign the consent agreement to use the Llama 3 family of models on Hugging Face. Go to https://huggingface.co/meta-llama/Llama-3.1-8B, fill out the consent agreement, and wait until you are approved.
Prepare to provision a TPU v6e with 4 chips:
- Sign in to your Google Account. If you haven't already, sign up for a new account.
- In the Google Cloud console, select or create a Google Cloud project from the project selector page.
- Enable billing for your Google Cloud project. Billing is required for all Google Cloud usage.
- Install the gcloud alpha components.
Run the following command to install the latest version of gcloud components:
gcloud components update
Enable the TPU API with the following gcloud command in Cloud Shell. You can also enable it from the Google Cloud console.
gcloud services enable tpu.googleapis.com
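To confirm that the TPU API is enabled, you can optionally list the enabled services and filter for it:
gcloud services list --enabled --filter="name:tpu.googleapis.com"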
Create a service identity for the TPU VM.
gcloud alpha compute tpus tpu-vm service-identity create --zone=ZONE
Create a TPU service account and grant access to Google Cloud services.
Service accounts allow the Cloud TPU service to access other Google Cloud services. A user-managed service account is recommended. Follow these guides to create and grant roles. The following roles are necessary; example gcloud commands are shown after this list:
- TPU Admin: Needed to create a TPU
- Storage Admin: Needed for accessing Cloud Storage
- Logs Writer: Needed for writing logs with the Logging API
- Monitoring Metric Writer: Needed for writing metrics to Cloud Monitoring
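The following is a minimal sketch of creating a user-managed service account and granting it these roles with gcloud. The account name tpu-sa is an example; substitute your own name and project ID. The resulting email address (tpu-sa@PROJECT_ID.iam.gserviceaccount.com) is what you later pass as SERVICE_ACCOUNT when provisioning the TPU.
# Create a user-managed service account (the name "tpu-sa" is an example).
gcloud iam service-accounts create tpu-sa --project PROJECT_ID

# Grant the roles listed above to the service account.
for role in roles/tpu.admin roles/storage.admin roles/logging.logWriter roles/monitoring.metricWriter; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:tpu-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role "$role"
done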
Authenticate with Google Cloud and configure the default project and zone for Google Cloud CLI.
gcloud auth login
gcloud config set project PROJECT_ID
gcloud config set compute/zone ZONE
Secure capacity
Contact your Cloud TPU sales or account team to request TPU quota and to ask any questions about capacity.
Provision the Cloud TPU environment
You can provision v6e TPUs with GKE, with GKE and XPK, or as queued resources.
Prerequisites
- Verify that your project has enough TPUS_PER_TPU_FAMILY quota, which specifies the maximum number of chips you can access within your Google Cloud project.
- This tutorial was tested with the following configuration:
  - Python 3.10 or later
  - Nightly software versions:
    - nightly JAX 0.4.32.dev20240912
    - nightly LibTPU 0.1.dev20240912+nightly
  - Stable software versions:
    - JAX and JAX Lib v0.4.35
- Verify that your project has enough TPU quota for:
- TPU VM quota
- IP Address quota
- Hyperdisk balanced quota
- User project permissions
- If you are using GKE with XPK, see Cloud Console Permissions on the user or service account for the permissions needed to run XPK.
Provision a TPU v6e
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e \
  --service-account SERVICE_ACCOUNT
Variable | Description |
QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
TPU_NAME | The user-assigned ID of the TPU that is created when the queued resource request is allocated. |
PROJECT_ID | Your Google Cloud project name. Use an existing project or create a new one. |
ZONE | See the TPU regions and zones document for the supported zones. |
ACCELERATOR_TYPE | See the Accelerator Types documentation for the supported accelerator types. |
RUNTIME_VERSION | v2-alpha-tpuv6e |
SERVICE_ACCOUNT | The email address for your service account, which you can find in the Google Cloud console under IAM > Service Accounts. For example: tpu-service-account@<your_project_ID>.iam.gserviceaccount.com |
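For example, a request with placeholder values filled in might look like the following. The IDs, project, zone, and service account email are illustrative only; use a zone that offers v6e (see the TPU regions and zones document).
gcloud alpha compute tpus queued-resources create my-v6e-qr \
  --node-id my-v6e-node \
  --project my-project \
  --zone us-east5-b \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e \
  --service-account tpu-sa@my-project.iam.gserviceaccount.com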
Use the list or describe commands to query the status of your queued resource.
gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID} \
--project ${PROJECT_ID} --zone ${ZONE}
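To see all queued resources in a zone and their states, you can use the list command instead:
gcloud alpha compute tpus queued-resources list \
  --project ${PROJECT_ID} --zone ${ZONE}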
For a complete list of queued resource request statuses, see the Queued Resources documentation.
Connect to the TPU using SSH
gcloud compute tpus tpu-vm ssh TPU_NAME
Install dependencies
Create a directory for Miniconda:
mkdir -p ~/miniconda3
Download the Miniconda installer script:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
Install Miniconda:
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
Remove the Miniconda installer script:
rm -rf ~/miniconda3/miniconda.sh
Add Miniconda to your PATH environment variable:
export PATH="$HOME/miniconda3/bin:$PATH"
Reload ~/.bashrc to apply the changes to the PATH variable:
source ~/.bashrc
Create a Conda environment:
conda create -n vllm python=3.10 -y
conda activate vllm
Clone the vLLM repository and navigate to the vLLM directory:
git clone https://github.com/vllm-project/vllm.git && cd vllm
Clean up the existing torch and torch-xla packages:
pip uninstall torch torch-xla -y
Install other build dependencies:
pip install -r requirements-tpu.txt
VLLM_TARGET_DEVICE="tpu" python setup.py develop
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
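Optionally, verify that vLLM and the TPU backend were installed into the active vllm environment. This is a minimal sanity check; the exact version strings you see depend on the packages installed above.
# Print the installed vLLM and torch_xla versions.
python -c "import vllm; print(vllm.__version__)"
python -c "import torch_xla; print(torch_xla.__version__)"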
Get access to the model
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role with at least Read permissions.
- Select Generate a token.
Copy the generated token to your clipboard, set it as an environment variable, and authenticate with the huggingface-cli:
export TOKEN=YOUR_TOKEN
git config --global credential.helper store
huggingface-cli login --token $TOKEN
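You can confirm that the Hugging Face login succeeded before downloading the model:
huggingface-cli whoami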
Download benchmarking data
Create a ~/data directory and download the ShareGPT dataset from Hugging Face:
mkdir ~/data && cd ~/data
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
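As a quick sanity check that the download completed, you can count the records in the file. This assumes the dataset is a top-level JSON array; adjust if the format differs.
python3 -c "import json; print(len(json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json'))))"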
Launch the vLLM server
The following command downloads the model weights from the Hugging Face Model Hub to the TPU VM's /tmp directory, pre-compiles a range of input shapes, and writes the model compilation to ~/.cache/vllm/xla_cache.
For more details, refer to the vLLM docs.
cd ~/vllm
vllm serve "meta-llama/Meta-Llama-3.1-8B" \
  --download_dir /tmp \
  --num-scheduler-steps 4 \
  --swap-space 16 \
  --disable-log-requests \
  --tensor_parallel_size=4 \
  --max-model-len=2048 &> serve.log &
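Compilation can take several minutes; you can follow progress in serve.log. After the server reports that it is running, you can optionally send a test request. This sketch assumes vLLM's default OpenAI-compatible server address of http://localhost:8000; adjust the host or port if you changed them.
# Follow the server log while the model downloads and compiles.
tail -f serve.log

# Send a test completion request to the OpenAI-compatible API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B", "prompt": "San Francisco is a", "max_tokens": 32}'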
Run vLLM benchmarks
Run the vLLM benchmarking script:
python benchmarks/benchmark_serving.py \
--backend vllm \
--model "meta-llama/Meta-Llama-3.1-8B" \
--dataset-name sharegpt \
--dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000
Clean up
Delete the TPU:
gcloud compute tpus queued-resources delete QUEUED_RESOURCE_ID \
  --project PROJECT_ID \
  --zone ZONE \
  --force \
  --async