vLLM TPU is a highly efficient serving framework for large language models (LLMs) that's optimized for Cloud TPU hardware. It's powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path.
Read more about this framework in the vLLM TPU blog post.
vLLM TPU is available in Model Garden through one-click deployment and notebook examples.
Get started in Model Garden
The vLLM TPU serving container is integrated into Model Garden. You can access this serving solution through one-click deployment or Colab Enterprise notebook examples for a variety of models.
Use one-click deployment
You can deploy a custom Vertex AI endpoint with vLLM TPU through the model card for the following models:
- google/gemma-3-27b-it
- meta-llama/Llama-3.3-70B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- Qwen/Qwen3-32B
- Qwen/Qwen3-8B
- Qwen/Qwen3-4B
- Qwen/Qwen3-4B-Instruct-2507
Steps:
1. Navigate to the model card page (such as google/gemma-3-27b-it) and click Deploy model to open the deployment panel.
2. Under Resource ID, select the model variant that you want to deploy.
3. For the model variant that you want to deploy, click Edit settings and select the vLLM TPU option under Machine spec for deployment.
4. Click Deploy at the bottom of the panel to begin the deployment process. You receive an email notification when the endpoint is ready; a sketch of how to query it follows these steps.
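After the endpoint is ready, you can send prediction requests to it with the Vertex AI SDK for Python. The following is a minimal sketch, not the exact payload of the deployed container: the project ID, region, endpoint ID, and the instance fields (prompt, max_tokens, temperature) are placeholder assumptions based on common vLLM-style serving payloads. Check the endpoint's sample request in the console for the exact format.

```python
# Minimal sketch: querying a Model Garden endpoint deployed with vLLM TPU.
# Project, region, endpoint ID, and instance fields are placeholders; the
# deployed container's sample request is the authoritative payload format.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="europe-west4")

# Reference the endpoint created by the one-click deployment.
endpoint = aiplatform.Endpoint(
    "projects/your-project-id/locations/europe-west4/endpoints/ENDPOINT_ID"
)

response = endpoint.predict(
    instances=[
        {
            "prompt": "What is a TPU?",
            "max_tokens": 128,    # assumed vLLM-style sampling parameters
            "temperature": 0.7,
        }
    ]
)
print(response.predictions)
```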
Use the Colab Enterprise notebook
For flexibility and customization, you can use Colab Enterprise notebook examples to deploy a Vertex AI endpoint with vLLM TPU by using the Vertex AI SDK for Python.
1. Open the vLLM TPU notebook in Colab Enterprise.
2. Run through the notebook to deploy a model with vLLM TPU and send prediction requests to the endpoint. A rough SDK sketch of this flow follows these steps.
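If you want to script the deployment outside the notebook, the flow looks roughly like the following sketch. It assumes the Vertex AI SDK's Model Garden OpenModel interface, and the deploy arguments and TPU machine type shown here are assumptions; the notebook is the authoritative reference for the model identifier format and the exact, supported configuration.

```python
# Minimal sketch of deploying a Model Garden model with vLLM TPU via the
# Vertex AI SDK for Python. Argument names and the TPU machine type below
# are assumptions; follow the Colab Enterprise notebook for exact values.
import vertexai
from vertexai import model_garden

vertexai.init(project="your-project-id", location="europe-west4")

# Model identifier format may differ (for example, a publisher model path).
model = model_garden.OpenModel("google/gemma-3-27b-it")

endpoint = model.deploy(
    accept_eula=True,                 # required for some licensed models
    machine_type="ct6e-standard-4t",  # assumed Cloud TPU v6e machine type
    endpoint_display_name="gemma-3-27b-it-vllm-tpu",
)

# Once deployment finishes, the endpoint can be queried as in the earlier
# prediction example.
print(endpoint.resource_name)
```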
Request Cloud TPU quota
In Model Garden, the default quota is 16 Cloud TPU v6e chips in the europe-west4 region. This quota applies to both one-click deployments and Colab Enterprise notebook deployments. If you have a default quota of 0 or want to request more quota, see Request a quota adjustment.