vLLM inference on v6e TPUs

This tutorial shows you how to run vLLM inference on v6e TPUs. It also shows you how to run the benchmark script for the Meta Llama-3.1 8B model.

To get started with vLLM on v6e TPUs, see the vLLM quickstart.

If you are using GKE, also see the GKE tutorial.

Before you begin

You must sign the consent agreement to use the Llama 3.1 family of models in the Hugging Face repository. Go to https://huggingface.co/meta-llama/Llama-3.1-8B, fill out the consent agreement, and wait until your request is approved.

Prepare to provision a TPU v6e with 4 chips:

  1. Sign in to your Google Account. If you don't already have one, sign up for a new account.
  2. In the Google Cloud console, select or create a Google Cloud project from the project selector page.
  3. Enable billing for your Google Cloud project. Billing is required for all Google Cloud usage.
  4. Install the gcloud alpha components.
  5. Run the following command to install the latest version of gcloud components.

    gcloud components update
    
  6. Enable the TPU API through the following gcloud command using Cloud Shell. You can also enable it from the Google Cloud console.

    gcloud services enable tpu.googleapis.com
    
  7. Create a service identity for the TPU VM.

    gcloud alpha compute tpus tpu-vm service-identity create --zone=ZONE
  8. Create a TPU service account and grant access to Google Cloud services.

    Service accounts allow the Google Cloud TPU service to access other Google Cloud services. A user-managed service account is recommended. Create the service account and grant it the following roles (a sketch of the corresponding gcloud commands appears after these steps):

    • TPU Admin: Needed to create a TPU
    • Storage Admin: Needed for accessing Cloud Storage
    • Logs Writer: Needed for writing logs with the Logging API
    • Monitoring Metric Writer: Needed for writing metrics to Cloud Monitoring
  9. Authenticate with Google Cloud and configure the default project and zone for the Google Cloud CLI.

    gcloud auth login
    gcloud config set project PROJECT_ID
    gcloud config set compute/zone ZONE
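
The roles listed in step 8 can be granted with gcloud. The following is a minimal sketch, not an exact prescription: it assumes a hypothetical user-managed service account named tpu-service-account that already exists in your project, and it uses the standard IAM role IDs that correspond to the roles above.

   # Hypothetical service account; substitute your own.
   export SERVICE_ACCOUNT=tpu-service-account@PROJECT_ID.iam.gserviceaccount.com

   # Grant TPU Admin, Storage Admin, Logs Writer, and Monitoring Metric Writer.
   for role in roles/tpu.admin roles/storage.admin \
               roles/logging.logWriter roles/monitoring.metricWriter; do
     gcloud projects add-iam-policy-binding PROJECT_ID \
         --member="serviceAccount:${SERVICE_ACCOUNT}" \
         --role="${role}"
   done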

Secure capacity

Contact your Cloud TPU sales or account team to request TPU quota and to ask any questions about capacity.

Provision the Cloud TPU environment

You can provision v6e TPUs with GKE, with GKE and XPK, or as queued resources.

Prerequisites

  • Verify that your project has enough TPUS_PER_TPU_FAMILY quota, which specifies the maximum number of chips you can access within your Google Cloud project.
  • This tutorial was tested with the following configuration:
    • Python 3.10 or later
    • Nightly software versions:
      • nightly JAX 0.4.32.dev20240912
      • nightly LibTPU 0.1.dev20240912+nightly
    • Stable software versions:
      • JAX and jaxlib v0.4.35
  • Verify that your project has enough TPU quota for:
    • TPU VM quota
    • IP Address quota
    • Hyperdisk balanced quota
  • User project permissions

Provision a TPU v6e

   gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
       --node-id TPU_NAME \
       --project PROJECT_ID \
       --zone ZONE \
       --accelerator-type v6e-4 \
       --runtime-version v2-alpha-tpuv6e \
       --service-account SERVICE_ACCOUNT

Command flag descriptions

Variable            Description
QUEUED_RESOURCE_ID  The user-assigned ID of the queued resource request.
TPU_NAME            The user-assigned ID of the TPU that is created when the queued resource request is allocated.
PROJECT_ID          Your Google Cloud project name. Use an existing project or create a new one.
ZONE                See the TPU regions and zones document for the supported zones.
ACCELERATOR_TYPE    See the Accelerator Types documentation for the supported accelerator types.
RUNTIME_VERSION     v2-alpha-tpuv6e
SERVICE_ACCOUNT     The email address of your service account, which you can find in the Google Cloud console under IAM > Service Accounts.

For example: tpu-service-account@<your_project_ID>.iam.gserviceaccount.com
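
As an illustration of how these variables fit together, here is a hypothetical request; every value is a placeholder, and the zone in particular should be checked against the TPU regions and zones page for current v6e availability:

   gcloud alpha compute tpus queued-resources create my-v6e-qr \
       --node-id my-v6e-tpu \
       --project my-project \
       --zone us-east5-b \
       --accelerator-type v6e-4 \
       --runtime-version v2-alpha-tpuv6e \
       --service-account tpu-service-account@my-project.iam.gserviceaccount.com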

Use the list or describe commands to query the status of your queued resource.

   gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID}  \
      --project ${PROJECT_ID} --zone ${ZONE}

For a complete list of queued resource request statuses, see the Queued Resources documentation.
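
If you want to block until the TPU is ready, you can poll the state in a loop. This is a sketch that assumes the describe output exposes the status under state.state, which is how recent gcloud versions report it; adjust the field path if yours differs.

   while true; do
     STATE=$(gcloud alpha compute tpus queued-resources describe ${QUEUED_RESOURCE_ID} \
         --project ${PROJECT_ID} --zone ${ZONE} --format="value(state.state)")
     echo "Queued resource state: ${STATE}"
     [[ "${STATE}" == "ACTIVE" ]] && break
     sleep 30
   done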

Connect to the TPU using SSH

  gcloud compute tpus tpu-vm ssh TPU_NAME

Install dependencies

  1. Create a directory for Miniconda:

    mkdir -p ~/miniconda3
  2. Download the Miniconda installer script:

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
  3. Install Miniconda:

    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
  4. Remove the Miniconda installer script:

    rm -rf ~/miniconda3/miniconda.sh
  5. Add Miniconda to your PATH variable by appending it to your ~/.bashrc file:

    echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
  6. Reload ~/.bashrc to apply the changes to the PATH variable:

    source ~/.bashrc
  7. Create a Conda environment:

    conda create -n vllm python=3.10 -y
    conda activate vllm
  8. Clone the vLLM repository and navigate to the vLLM directory:

    git clone https://github.com/vllm-project/vllm.git && cd vllm
    
  9. Clean up the existing torch and torch-xla packages:

    pip uninstall torch torch-xla -y
    
  10. Install the remaining dependencies and build vLLM for TPU:

    pip install -r requirements-tpu.txt
    VLLM_TARGET_DEVICE="tpu" python setup.py develop
    sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
    
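Optionally, as a quick sanity check, confirm that the editable vLLM install and the TPU-side PyTorch/XLA wheel are importable; the printed version strings depend on the commit and wheels you installed:

    python -c "import vllm; print(vllm.__version__)"
    python -c "import torch_xla; print(torch_xla.__version__)"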

Get access to the model

Generate a new Hugging Face token if you don't already have one:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a Name of your choice and a Role with at least Read permissions.
  4. Select Generate a token.
  5. Copy the generated token to your clipboard, set it as an environment variable, and authenticate with the huggingface-cli:

    export TOKEN=YOUR_TOKEN
    git config --global credential.helper store
    huggingface-cli login --token $TOKEN
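
Optionally, verify that the token is active and that your access request for the gated repository was approved. The second command assumes the huggingface_hub package, which is pulled in by vLLM's requirements, is available in the active Conda environment:

    huggingface-cli whoami
    python -c "from huggingface_hub import model_info; print(model_info('meta-llama/Meta-Llama-3.1-8B').id)"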

Download benchmarking data

  1. Create a ~/data directory and download the ShareGPT dataset from Hugging Face.

    mkdir ~/data && cd ~/data
    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
    
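Optionally, confirm that the file downloaded intact. The check below assumes the file is a single JSON array of conversation records, which is how this ShareGPT export is typically structured:

    python -c "import json; data = json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json')); print(len(data), 'records')"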

Launch the vLLM server

The following command downloads the model weights from the Hugging Face Model Hub to the TPU VM's /tmp directory, precompiles the model for a range of input shapes, and writes the compilation cache to ~/.cache/vllm/xla_cache.

For more details, refer to the vLLM docs.

   cd ~/vllm
   vllm serve "meta-llama/Meta-Llama-3.1-8B" \
       --download-dir /tmp \
       --num-scheduler-steps 4 \
       --swap-space 16 \
       --disable-log-requests \
       --tensor-parallel-size 4 \
       --max-model-len 2048 &> serve.log &
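
The server runs in the background and logs to serve.log; startup can take several minutes while the weights download and the XLA compilation runs. Once the log shows the server is up, you can send a test request to its OpenAI-compatible completions endpoint, which listens on port 8000 by default. A minimal sketch:

   # Follow the log until the server reports that it is serving requests.
   tail -f ~/vllm/serve.log

   # From another shell on the TPU VM, send a small test completion.
   curl http://localhost:8000/v1/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "meta-llama/Meta-Llama-3.1-8B", "prompt": "San Francisco is a", "max_tokens": 16, "temperature": 0}'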

Run vLLM benchmarks

Run the vLLM benchmarking script:

   python benchmarks/benchmark_serving.py \
       --backend vllm \
       --model "meta-llama/Meta-Llama-3.1-8B"  \
       --dataset-name sharegpt \
       --dataset-path ~/data/ShareGPT_V3_unfiltered_cleaned_split.json  \
       --num-prompts 1000

Clean up

Delete the TPU and the queued resource request:

   gcloud compute tpus queued-resources delete QUEUED_RESOURCE_ID \
       --project PROJECT_ID \
       --zone ZONE \
       --force \
       --async
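
Because of the --async flag, the delete request returns immediately. You can confirm that the queued resource and its TPU are gone by listing what remains in the zone:

   gcloud compute tpus queued-resources list \
       --project PROJECT_ID \
       --zone ZONE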