vLLM inference on v6e TPUs

This tutorial shows you how to run vLLM inference on v6e TPUs. It also shows you how to run the benchmark script for the Meta Llama-3.1-8B model.

To get started with vLLM on v6e TPUs, see the vLLM quickstart.

If you are using GKE, also see the GKE tutorial.

Before you begin

You must sign the consent agreement to use Llama3 family of models in the HuggingFace repo. Go to meta-llama/Llama-3.1-8B, fill out the consent agreement, and wait until you are approved.

Prepare to provision a TPU v6e with 4 chips:

Follow Set up the Cloud TPU environment guide to set up a Google Cloud project, configure the Google Cloud CLI, enable the Cloud TPU API, and ensure you have access to use Cloud TPUs.

Authenticate with Google Cloud and configure the default project and zone for Google Cloud CLI.

gcloud auth login
gcloud config set project PROJECT_ID
gcloud config set compute/zone ZONE

Secure capacity

When you are ready to secure TPU capacity, see Cloud TPU Quotas for more information about the Cloud TPU quotas. If you have additional questions about securing capacity, contact your Cloud TPU sales or account team.

Provision the Cloud TPU environment

You can provision TPU VMs with GKE, with GKE and XPK, or as queued resources.

Prerequisites

Verify that your project has enough TPUS_PER_TPU_FAMILY quota, which specifies the maximum number of chips you can access within your Google Cloud project.
Verify that your project has enough TPU quota for:
- TPU VM quota
- IP address quota
- Hyperdisk Balanced quota
User project permissions
- If you are using GKE with XPK, see Cloud Console Permissions on the user or service account for the permissions needed to run XPK.

Provision a TPU v6e

   gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
     --node-id TPU_NAME \
     --project PROJECT_ID \
     --zone ZONE \
     --accelerator-type v6e-4 \
     --runtime-version v2-alpha-tpuv6e \
     --service-account SERVICE_ACCOUNT

Command flag descriptions

Variable	Description
NODE_ID	The user-assigned ID of the TPU that is created when the queued resource request is allocated.
PROJECT_ID	The Google Cloud project name. Use an existing project or create a new one.
ZONE	See the TPU regions and zones document for the supported zones.
ACCELERATOR_TYPE	See the Accelerator Typesdocumentation for the supported accelerator types.
RUNTIME_VERSION	`v2-alpha-tpuv6e`
SERVICE_ACCOUNT	This is the email address for your service account that you can find in Google Cloud console > IAM > Service Accounts. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`

Use the list or describe commands to query the status of your queued resource.

gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID  \
  --project PROJECT_ID --zone ZONE

For a complete list of queued resource request statuses, see the Queued resources documentation.

Connect to the TPU using SSH

  gcloud compute tpus tpu-vm ssh TPU_NAME

Install dependencies

Create a directory for Miniconda:
```
mkdir -p ~/miniconda3
```

Download the Miniconda installer script:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh

Install Miniconda:

bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3

Remove the Miniconda installer script:
```
rm -rf ~/miniconda3/miniconda.sh
```

Add Miniconda to your PATH variable:

export PATH="$HOME/miniconda3/bin:$PATH"

Reload ~/.bashrc to apply the changes to the PATH variable:
```
source ~/.bashrc
```
Create a Conda environment:

Note: If it's the first time running Conda, you need to run conda init and reload the shell before running conda activate vllm.
```
conda create -n vllm python=3.12 -y
conda activate vllm
```

Clone the vLLM repository and navigate to the vllm directory:

git clone https://github.com/vllm-project/vllm.git && cd vllm

Clean up the existing torch and torch-xla packages:
```
pip uninstall torch torch-xla -y
```
Note: If you don't have these packages installed, you will see an error message saying they are not installed. These error messages can be safely ignored.
Install other build dependencies:
```
pip install -r requirements/tpu.txt
VLLM_TARGET_DEVICE="tpu" python -m pip install --editable .
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
Note: If you see an error similar to:

- E: Unable to locate package libopenblas-base
- E: Unable to locate package libopenmpi-dev
- E: Package 'libomp-dev' has no installation candidate

Then run the sudo apt-get update before running the install.

Get access to the model

Generate a new Hugging Face token if you don't already have one:

Go to Your Profile > Settings > Access Tokens.
Select Create new token.
Specify a Name of your choice and a Role with at least Read permissions.
Select Generate a token.
Copy the generated token to your clipboard, set it as an environment variable, and authenticate with the huggingface-cli:
```
export TOKEN=YOUR_TOKEN
git config --global credential.helper store
huggingface-cli login --token $TOKEN
```

Launch the vLLM server

The following command downloads the model weights from Hugging Face Model Hub to the TPU VM's /tmp directory, pre-compiles a range of input shapes, and writes the model compilation to ~/.cache/vllm/xla_cache.

For more details, refer to the vLLM docs.

cd ~/vllm
vllm serve "meta-llama/Llama-3.1-8B" --download_dir /tmp --swap-space 16 --disable-log-requests --tensor_parallel_size=4 --max-model-len=2048 &> serve.log &

Run vLLM benchmarks

Run the vLLM benchmarking script:

export MODEL="meta-llama/Llama-3.1-8B"
pip install pandas
pip install datasets
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model $MODEL  \
  --dataset-name random \
  --random-input-len 1820 \
  --random-output-len 128 \
  --random-prefix-len 0

Clean up

Delete the TPU:

gcloud compute tpus queued-resources delete QUEUED_RESOURCE_ID \
  --project PROJECT_ID \
  --zone ZONE \
  --force \
  --async