[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# vLLM inference on v6e TPUs\n==========================\n\nThis tutorial shows you how to run [vLLM](https://docs.vllm.ai/en/stable/index.html)\ninference on v6e TPUs. It also shows you how to run the benchmark script for the\nMeta Llama-3.1-8B model.\n\nTo get started with vLLM on v6e TPUs, see the\n[vLLM quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html).\n\nIf you are using GKE, also see the [GKE tutorial](/kubernetes-engine/docs/tutorials/serve-vllm-tpu).\n| **Note:** After you complete the inference benchmark, be sure to [clean up](#clean-up) the TPU resources.\n\nBefore you begin\n----------------\n\nYou must sign the consent agreement to use Llama3 family of models in the\n[HuggingFace repo](https://huggingface.co/meta-llama). Go to [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B),\nfill out the consent agreement, and wait until you are approved.\n\nPrepare to provision a TPU v6e with 4 chips:\n\n1. Follow [Set up the Cloud TPU environment](/tpu/docs/setup-gcp-account)\n guide to set up a Google Cloud project, configure the Google Cloud CLI,\n enable the Cloud TPU API, and ensure you have access to use\n Cloud TPUs.\n\n2. Authenticate with Google Cloud and configure the default project and\n zone for Google Cloud CLI.\n\n ```bash\n gcloud auth login\n gcloud config set project PROJECT_ID\n gcloud config set compute/zone ZONE\n ```\n\n### Secure capacity\n\nWhen you are ready to secure TPU capacity, see [Cloud TPU\nQuotas](/tpu/docs/quota) for more information about the Cloud TPU quotas. If\nyou have additional questions about securing capacity, contact your Cloud TPU\nsales or account team.\n\n### Provision the Cloud TPU environment\n\nYou can provision TPU VMs with\n[GKE](/tpu/docs/tpus-in-gke), with GKE and\n[XPK](https://github.com/google/xpk/tree/main),\nor as [queued resources](/tpu/docs/queued-resources).\n| **Note:** This document describes how to provision TPUs using queued resources. 
Provision a TPU v6e
-------------------

```bash
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
    --node-id TPU_NAME \
    --project PROJECT_ID \
    --zone ZONE \
    --accelerator-type v6e-4 \
    --runtime-version v2-alpha-tpuv6e \
    --service-account SERVICE_ACCOUNT
```

#### Command flag descriptions

- `QUEUED_RESOURCE_ID`: A user-assigned ID for the queued resource request.
- `--node-id`: A user-assigned ID for the TPU that is created when the queued resource request is allocated.
- `--project`: Your Google Cloud project ID.
- `--zone`: The zone in which to create the Cloud TPU.
- `--accelerator-type`: The TPU version and number of chips. `v6e-4` specifies a TPU v6e with 4 chips.
- `--runtime-version`: The Cloud TPU software version. `v2-alpha-tpuv6e` is the runtime version for TPU v6e.
- `--service-account`: The email address of the service account that your TPU VM uses to access Google Cloud services.

Use the `list` or `describe` commands to query the status of your queued resource.

```bash
gcloud alpha compute tpus queued-resources describe QUEUED_RESOURCE_ID \
    --project PROJECT_ID --zone ZONE
```

For a complete list of queued resource request statuses, see the
[Queued resources](/tpu/docs/queued-resources) documentation.

Connect to the TPU using SSH
----------------------------

```bash
gcloud compute tpus tpu-vm ssh TPU_NAME
```

Install dependencies
--------------------

1. Create a directory for Miniconda:

   ```bash
   mkdir -p ~/miniconda3
   ```

2. Download the Miniconda installer script:

   ```bash
   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
   ```

3. Install Miniconda:

   ```bash
   bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
   ```

4. Remove the Miniconda installer script:

   ```bash
   rm -rf ~/miniconda3/miniconda.sh
   ```

5. Add Miniconda to your `PATH` variable:

   ```bash
   export PATH="$HOME/miniconda3/bin:$PATH"
   ```

6. Reload `~/.bashrc` to apply the changes to the `PATH` variable:

   ```bash
   source ~/.bashrc
   ```

7. Create and activate a Conda environment:

   | **Note:** If this is your first time running Conda, run `conda init` and reload your shell before running `conda activate vllm`.

   ```bash
   conda create -n vllm python=3.12 -y
   conda activate vllm
   ```

8. Clone the vLLM repository and navigate to the `vllm` directory:

   ```bash
   git clone https://github.com/vllm-project/vllm.git && cd vllm
   ```

9. Clean up the existing torch and torch-xla packages:

   ```bash
   pip uninstall torch torch-xla -y
   ```

   | **Note:** If you don't have these packages installed, you will see an error message saying they are not installed. These messages can be safely ignored.

10. Install the remaining build dependencies and build vLLM for TPU:

    ```bash
    pip install -r requirements/tpu.txt
    VLLM_TARGET_DEVICE="tpu" python -m pip install --editable .
    sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
    ```

    | **Note:** If you see an error similar to:
    |
    | - `E: Unable to locate package libopenblas-base`
    | - `E: Unable to locate package libopenmpi-dev`
    | - `E: Package 'libomp-dev' has no installation candidate`
    |
    | run `sudo apt-get update` before running the install command again.

### Get access to the model

Generate a new [Hugging Face token](https://huggingface.co/docs/hub/security-tokens)
if you don't already have one:

1. Go to **Your Profile > Settings > Access Tokens**.

2. Select **Create new token**.

3. Specify a Name of your choice and a Role with at least `Read` permissions.

4. Select **Generate a token**.

5. Copy the generated token to your clipboard, set it as an environment variable, and authenticate with the huggingface-cli:

   ```bash
   export TOKEN=YOUR_TOKEN
   git config --global credential.helper store
   huggingface-cli login --token $TOKEN
   ```

#### Launch the vLLM server

The following command downloads the model weights from the
[Hugging Face Model Hub](https://huggingface.co/docs/hub/en/models-the-hub)
to the TPU VM's `/tmp` directory, pre-compiles a range of input shapes, and
writes the compilation cache to `~/.cache/vllm/xla_cache`.

For more details, see the [vLLM docs](https://docs.vllm.ai/en/stable/getting_started/tpu-installation.html#build-from-source).

```bash
cd ~/vllm
vllm serve "meta-llama/Llama-3.1-8B" --download_dir /tmp --swap-space 16 --disable-log-requests --tensor_parallel_size=4 --max-model-len=2048 &> serve.log &
```

| **Note:** Before running the `benchmark_serving` script, wait until the logs indicate that the server has started on port 8000. Startup takes a minute or so.
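If you want to confirm that the server is ready before starting the benchmark, you can tail the server log and then send a small test request to the OpenAI-compatible completions endpoint that `vllm serve` exposes on port 8000 by default. This is an optional check, not part of the tutorial steps; the prompt and sampling parameters below are illustrative only.

```bash
# Watch the server log until it reports that the server is listening
# on port 8000, then stop tailing with Ctrl+C.
tail -f serve.log

# Send a small test request to the OpenAI-compatible completions endpoint.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 16,
        "temperature": 0
      }'
```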
### Run vLLM benchmarks

Run the vLLM benchmarking script:

```bash
export MODEL="meta-llama/Llama-3.1-8B"
pip install pandas
pip install datasets
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model $MODEL \
    --dataset-name random \
    --random-input-len 1820 \
    --random-output-len 128 \
    --random-prefix-len 0
```

#### Clean up

Delete the TPU:

```bash
gcloud compute tpus queued-resources delete QUEUED_RESOURCE_ID \
    --project PROJECT_ID \
    --zone ZONE \
    --force \
    --async
```
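Because the delete command is issued with `--async`, it returns immediately while the resources are torn down in the background. As an optional follow-up check (not part of the original steps), you can list the queued resources in the zone to confirm the deletion finished:

```bash
# List remaining queued resources in the zone; the deleted
# QUEUED_RESOURCE_ID should no longer appear once teardown completes.
gcloud compute tpus queued-resources list \
    --project PROJECT_ID \
    --zone ZONE
```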