PyTorch/XLA 2.5: vLLM support and an improved developer experience

October 31, 2024

Manfei Bai

Software Engineer

Duncan Campbell

Developer Advocate, Google Cloud

Machine learning engineers are bullish on PyTorch/XLA, a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs. And now, PyTorch/XLA 2.5 is here, along with a set of improvements to add support for vLLM and enhance the overall developer experience. Featured in this release are:

A clarified proposal to deprecate the older torch_xla API in favor of the existing PyTorch API, which simplifies the developer experience. One example is the migration of the existing distributed API.
A series of improvements to the torch_xla.compile function that make debugging easier during development.
Experimental support in vLLM for TPUs, allowing you to extend your existing deployments while using the same vLLM interface across your TPUs.

Let’s take a look at each of these enhancements.

Streamlining the torch_xla API

With PyTorch/XLA 2.5, we're taking a significant step towards making the API more consistent with upstream PyTorch. Our north star is to minimize the learning curve for developers already familiar with PyTorch, making it easier to use XLA devices. Where functionality is mature enough, this means gradually deprecating custom PyTorch/XLA API calls and migrating them to their PyTorch counterparts. Features that have not yet been migrated remain in the existing Python module for now.

In the spirit of a simpler developer experience, in this release PyTorch/XLA uses some existing PyTorch distributed API functions when running models. Historically, the distributed API calls lived under the torch_xla module; in this update we migrated most of them to torch.distributed.

Improvements to ‘torch_xla.compile’

We’ve also added a few new compilation features to help you debug or notice potential issues within your model code. For example, a ‘full_graph’ mode emits an error message when there’s more than one compilation graph. This helps you discover potential issues caused by multiple compilation graphs early on (during compilation).
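
A minimal sketch of the ‘full_graph’ mode, assuming torch_xla 2.5 and an available XLA device. The function fused_step and the demo wrapper are illustrative names, not part of the library, and the demo only runs when torch_xla can be imported.

```python
import importlib.util


def fused_step(x, w):
    """Plain arithmetic only, so tracing is expected to yield a single graph."""
    return (x * w + 1.0) * 0.5


def run_full_graph_demo() -> None:
    import torch
    import torch_xla

    device = torch_xla.device()
    x = torch.ones(2, 2).to(device)
    w = torch.full((2, 2), 3.0).to(device)

    # full_graph=True asks torch_xla to raise an error if running this
    # function would produce more than one compilation graph.
    compiled_step = torch_xla.compile(fused_step, full_graph=True)
    print(compiled_step(x, w).cpu())

    # Reading a tensor value in the middle of fused_step (for example with
    # .item()) would typically force an early execution and trigger the error.


if __name__ == "__main__" and importlib.util.find_spec("torch_xla") is not None:
    run_full_graph_demo()
```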

Additionally, you can now specify an expected number of recompilations for compiled functions. This can help you debug performance issues in which a function might be getting recompiled more times than expected, for example, when it has unexpected dynamism.
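
The recompilation limit can be sketched as follows, assuming the max_different_graphs argument of torch_xla.compile in torch_xla 2.5. The function double and the input shapes are illustrative, and the exact exception type raised when the limit is exceeded is not assumed here.

```python
import importlib.util


def double(x):
    return x * 2


def run_recompile_budget_demo() -> None:
    import torch
    import torch_xla

    device = torch_xla.device()

    # Allow at most one distinct traced graph for this function.
    compiled_double = torch_xla.compile(double, max_different_graphs=1)

    # First input shape: the first (and only permitted) graph.
    compiled_double(torch.ones(2, 2).to(device))

    try:
        # A new input shape is a common source of an extra graph, so this
        # call is expected to exceed the limit.
        compiled_double(torch.ones(3, 3).to(device))
    except Exception as err:
        print(f"recompilation limit exceeded: {err}")


if __name__ == "__main__" and importlib.util.find_spec("torch_xla") is not None:
    run_recompile_budget_demo()
```

Setting the limit to the number of graphs you actually expect turns silent recompilation into an immediate, visible failure.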

You can now also give compiled functions an understandable name instead of an automatically created one. Naming compiled targets gives you more context in debugging messages, making it easier to figure out where a problem may be.

Without a name, the debugging output refers to the graph by the automatically generated name ‘SyncTensorsGraph’; with a name, the output and the related generated file carry the name you chose.
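
Naming a compiled target can be sketched as follows, assuming the name argument of torch_xla.compile in torch_xla 2.5. The function train_step and the label "toy_train_step" are arbitrary choices for this sketch, and no debug output is reproduced here.

```python
import importlib.util


def train_step(x):
    return x * x + 1.0


def run_named_compile_demo() -> None:
    import torch
    import torch_xla

    x = torch.ones(2, 2).to(torch_xla.device())

    # The label below replaces the automatically generated graph name
    # in debugging messages for this compiled target.
    compiled_step = torch_xla.compile(train_step, name="toy_train_step")
    print(compiled_step(x).cpu())


if __name__ == "__main__" and importlib.util.find_spec("torch_xla") is not None:
    run_named_compile_demo()
```

To see the name in practice, run the script with PyTorch/XLA debugging enabled, for example by setting the PT_XLA_DEBUG=1 environment variable.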

vLLM on TPU (experimental)

If you use vLLM to serve models on GPUs, you can now switch to TPU as a backend. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM on TPU retains the same vLLM interface that developers love, including direct integration into Hugging Face Model Hub to simplify model experimentation on TPU.

Switching your vLLM endpoint to TPU is a matter of a few config changes. Aside from the TPU image, everything else remains the same: request payload, metrics used for autoscaling, load balancing, model source code, etc. For details, see the installation guide.
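
To illustrate the point about the request payload, here is a small sketch that builds the same completion request for a GPU-backed and a TPU-backed endpoint. It assumes the endpoints expose vLLM's OpenAI-compatible /v1/completions route; the host names, model ID, and sampling values are placeholders, and nothing is sent over the network.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a completion request for a vLLM server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoints: only the host differs between the two backends.
gpu_request = build_completion_request(
    "http://vllm-gpu.example.internal:8000", "my-org/my-model", "Hello"
)
tpu_request = build_completion_request(
    "http://vllm-tpu.example.internal:8000", "my-org/my-model", "Hello"
)

# The request body is byte-for-byte identical for both backends.
print(gpu_request.data == tpu_request.data)
```

Client code written this way needs no changes when the serving backend moves to TPU; only the endpoint address is reconfigured.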

Other vLLM features we've extended to TPU include Pallas kernels such as paged attention and flash attention, as well as performance optimizations in the dynamo bridge, all of which are now part of the PyTorch/XLA repository. While vLLM is available to PyTorch TPU users, this work is still ongoing, and we look forward to rolling out additional features and optimizations in future releases.

Start using PyTorch/XLA 2.5

You can start taking advantage of these features by installing the latest release through your Python package manager. Or, if this is your first time hearing about PyTorch/XLA, check out the project’s GitHub page for installation instructions and more detailed information.

For a full list of changes, check out the release notes!
