Supercharge ML performance on xPUs with the new XProf profiler and Cloud Diagnostics XProf library

Rajesh Anantharaman
Product Management Lead, Google Cloud
Are you spending more time debugging ML model performance than you are building? You're not alone. In today's fast-paced AI landscape, optimizing models is a complex challenge, from navigating new model architectures to dealing with the ever-changing hardware and software stacks. Even at Google, where we optimize models for products like Gemini, Search, and YouTube, we've faced these exact challenges.
That's why we're providing a key tool to the community from our internal toolbox: XProf, our core ML profiler. In this post, we'll show you how to use this powerful tool and the new Cloud Diagnostics XProf library to easily identify model bottlenecks, find hidden opportunities for optimization, and get back to what you do best: innovating.
Updated XProf
A few years ago, we released a fork of our XProf tool as open source, called the "TensorFlow Profiler" and accessible through TensorBoard as "tensorboard_plugin_profile"; many ML engineers (MLEs) in the community have used it to optimize ML models on xPUs. However, given the rapid changes in the market and the huge advancements made to our internal XProf tool, we decided it was time to share these advanced features with the community. We have recently merged the functionality of internal XProf into our external tool, moved the external code base under OpenXLA, and renamed our external profiler to simply "XProf", reflecting its evolution beyond TensorFlow and TensorBoard. Now the community can enjoy the benefits of using the same profiling tool that is used within Google and Google DeepMind. Another major benefit of moving XProf under OpenXLA is that we now support all XLA-based frameworks equally, including JAX, PyTorch/XLA, and TensorFlow/Keras, with the same consistent experience.
Capture profiles
In order to use XProf, you first need to enable profile capture within your model workload code. There are two ways to capture profiles:
- Programmatic capture – annotate your model code to specify where you want to capture profiles.
- On-demand capture (also known as manual capture) – capture profiles ad hoc during a run, even for periods where programmatic capture was not enabled. Use this if you see a problem with your model metrics mid-run and want to capture profiles at that point in order to diagnose it.
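For example, with JAX both approaches look roughly like this (the port number, output path, and workload step are placeholders):

```python
import jax
import jax.numpy as jnp

# On-demand capture: start a profiler server in the workload so profiles
# can be requested later while the job is running.
jax.profiler.start_server(9012)  # port is a placeholder

# Programmatic capture: bracket the region of code you want profiled.
jax.profiler.start_trace("/tmp/profile-data")  # can also be a gs:// path
x = jnp.ones((1024, 1024))
y = (x @ x).block_until_ready()  # placeholder workload step
jax.profiler.stop_trace()
```

`jax.profiler.trace` can also be used as a context manager around the region of interest instead of explicit start/stop calls.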
Visualize profiles
In order to visualize your captured profiles, we recommend using TensorBoard, either locally or on Google Cloud. In this section, we focus on how we have enabled a seamless self-hosted profiling experience on Google Cloud with XProf and TensorBoard, using a library called Cloud Diagnostics XProf (also known as the XProfiler library), which is available as open source on GitHub.
With this library, we make the following improvements over running XProf and TensorBoard locally:
- Easy setup and packaging of the XProf and TensorBoard dependencies
- Storage of profiles in Google Cloud Storage, which is useful for long-term retention and post-run analysis (locally captured profiles are typically deleted once a run finishes)
- Fast loading of large or multiple profiles by provisioning TensorBoard on a Google Compute Engine VM or Google Kubernetes Engine pod, with the option to change the machine type to trade off loading speed and cost
- A shareable link for easy collaboration on profiles with team members and Google engineers
- Easier on-demand profiling of workloads on GKE and Compute Engine, letting you choose any host running your workload when capturing profiles
Set up the Cloud Diagnostics XProf library
First, capture profiles to a storage path of the form gs://<some_bucket>/<some_run>/
Next, set up the gcloud CLI and a Python virtual environment.
Finally, install the library with the following command:
pip install cloud-diagnostics-XProf
Create a TensorBoard instance with XProfiler
Create a VM or GKE pod to host TensorBoard:
XProfiler create -z $ZONE -l $GCS_PATH
or
XProfiler create --GKE -z $ZONE -l $GCS_PATH
Visualize Profiles on TensorBoard
One of the major changes we recently made to the TensorBoard profile plugin is that we relaxed the profile path requirements so that every TensorBoard instance can load profiles from multiple runs in a directory. This means you can now easily compare profiles across runs, or between sessions in a long-running job, without having to launch separate TensorBoard instances, speeding up your analysis workflow. During XProfiler create, you can point the TensorBoard instance at the root storage path gs://<some_bucket>, and TensorBoard will pick up all profiles under this root directory. In the TensorBoard UI below, you can see the profiles for all runs and sessions (run1/session1, run1/session2, run2/session1) in the dropdown.
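As a sketch of this directory convention, the following captures two sessions of one run under a single root; local paths stand in for the gs:// bucket, and the run/session names are illustrative:

```python
import jax
import jax.numpy as jnp

ROOT = "/tmp/xprof-root"  # in practice, a gs://<some_bucket> root path

# Each session is captured into its own <root>/<run>/<session> directory;
# pointing one TensorBoard instance at ROOT surfaces every session found
# beneath it in the run dropdown.
for session in ("session1", "session2"):
    jax.profiler.start_trace(f"{ROOT}/run1/{session}")
    x = jnp.ones((256, 256))
    (x @ x).block_until_ready()  # placeholder workload step
    jax.profiler.stop_trace()
```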


Faster profile loading and tool transitions
Another major change we have made to the TensorBoard XProf plugin is improving loading time. We added support for multithreading to the plugin, which allows larger profiles to load much faster. We also added caching, which allows the second load to be even faster, and allows users to move smoothly between the tools in XProf while doing their performance optimization.
Machine options for hosting TensorBoard + XProf with XProfiler
By default, XProfiler create will create a c4-highmem-8 VM. You can change the machine type with the -m flag. If you want to create the TensorBoard instance on a GKE pod instead, pass the --GKE flag to XProfiler create. Some customers prefer to host their TensorBoard instance on a GKE pod, as it is easier to manage alongside the rest of their workload deployed on GKE.
The VM or GKE pod that hosts TensorBoard makes loading large profiles, and multiple profiles, much faster on Google Cloud than a locally hosted TensorBoard. Based on our benchmarking, profiles on the order of 1 GB will load within a few minutes on the first load using the default c4-highmem-8 VM. You can choose different machine types based on your performance and cost needs.
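For example, a hypothetical invocation picking a larger machine type might look like the following (the zone, bucket, and machine type shown are placeholders):

```shell
# -m overrides the default c4-highmem-8 when larger profiles need to
# load faster; all values below are illustrative.
XProfiler create -z us-central1-a -l gs://my-bucket/my-run -m c4-highmem-32
```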
Link for sharing profiles
After you run XProfiler create, you will see a message like the following:
Instance for gs://<some-bucket> has been created.
You can access it via following,
1. https://<id>-dot-us-<region>.notebooks.googleusercontent.com
2. XProfiler connect -z <some zone> -l gs://<some-bucket> -m ssh
Instance is hosted at XProf-97db0ee6-93f6-46d4-b4c4-6d024b34a99f VM.
Note: The first option (1) is a link you can simply click to view your XProf profiles in TensorBoard. Performance optimization is a very iterative and collaborative process, so to enable that collaboration, the Cloud Diagnostics XProf library creates a link to the TensorBoard instance, letting users easily share their profiles with their teams and with Google engineers helping with performance optimization on Google Cloud. You control who has access to the link through the permissions set on the Cloud Storage bucket that the TensorBoard instance points to.
In case the link doesn't work for some reason, we also provide a way to SSH into the TensorBoard instance to view your profiles, using the XProfiler connect command.
On-demand profile capture
If you enabled the profiler server in your workload code and want to perform on-demand profiling, you can do this in two ways:
- Click the “Capture profile” button in the TensorBoard UI. We support on-demand capture for workloads running on GKE and Compute Engine.
- Use XProfiler capture in the CLI, providing similar information as you would through the “Capture profile” button in the TensorBoard UI.
New capabilities of XProf
With the updated XProf, users will see many updated features across the most popular tools, including:
- Trace viewer
- Memory viewer/memory profile
- Graph Viewer
- HLO Op profile/HLO Op stats
- Overview page
Most notably, in the memory viewer you can now see seven different types of memory, including HBM (high bandwidth memory), host memory, and, on TPUs, SparseCore, VMEM, SMEM, CMEM, and Sync Flags (SFlag).
You will also see that many of the links from the trace viewer and HLO op profile back to Graph Viewer now work seamlessly for all ops. We have also improved source line visibility to cover more ops.
The most common flow used for finding performance bottlenecks using XProf looks something like the following:
- The MLE opens XProf in TensorBoard and looks at different ops in the trace viewer or HLO op profile
- They click on the op they are interested in digging into to get more details
- For this op, they click the link to Graph Viewer to see how the op is placed in their model
- They take a look at the memory viewer to see whether HBM/host/SparseCore memory is utilized efficiently for the model and for the specific op
- Once they have determined which ops they want to optimize, they look at the source code line for those ops in order to implement any optimizations.
The updated XProf tool makes this entire flow smooth and easy.
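To make specific parts of a model easy to find during this flow, you can label regions of your code with named scopes, which show up as annotations on the compiled ops in the trace viewer and HLO op profile. A minimal JAX sketch (the scope names and toy computation are placeholders):

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Named scopes attach human-readable labels to the ops they enclose,
    # making them easy to locate in XProf's trace viewer.
    with jax.named_scope("attention_block"):  # placeholder name
        x = x @ x.T
    with jax.named_scope("mlp_block"):        # placeholder name
        x = jnp.tanh(x)
    return x

out = step(jnp.ones((128, 128))).block_until_ready()
```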
New XProf tools
In addition, we have also released a few new tools in XProf, including:
- Framework op stats– performance statistics of framework-level operations (e.g., JAX or TensorFlow).
- Roofline– visually see whether your program is memory-bound or compute-bound, and how close the program's performance is to the hardware's theoretical peak performance, represented as a "roofline".
- Megascale stats– analyze multi-slice communication performance of workloads spanning multiple TPU slices that communicate across the Data Center Network (DCN).
- GPU kernel stats– performance statistics and the originating framework operation for every GPU-accelerated kernel that was launched during a profiling session.
Pallas Kernel Visibility in XProf
One of the most requested areas of performance visibility in XProf has been Pallas kernels. These kernels were previously displayed in XProf as “custom calls”, but it was hard to see details of the custom call's performance and implementation. We are very happy to announce increased support and visibility for Pallas kernels within XProf. Now, you can see more details of your Pallas kernel in both the HLO Op Profile and Graph Viewer. For each Pallas kernel custom call, you will see the name of the kernel if it is a common Pallas kernel, and when you click it, the side panel shows performance and other information about the kernel. To get accurate performance metrics in the side panel, the kernel author must provide a cost model by passing a pl.CostEstimate object to their pallas_call function. In addition, there is a “custom call text” button where the user can see more details about the Pallas kernel implementation.


CUDA Graph Tracing with XProf
XProf also supports profiling on NVIDIA GPUs for workloads using XLA. You capture profiles for JAX on GPUs just as described above, and most XProf tools support NVIDIA GPU performance views, though the view of GPU traces is different, as GPU traces are organized by CUDA streams. One new feature we have added support for in XProf is CUDA Graphs. CUDA Graphs let users reduce kernel launch overhead by defining larger units of work as graphs that combine individual ops. XProf now allows users of CUDA Graphs to trace them for further performance optimization.
To enable CUDA Graph tracing, set the relevant XLA flag: XLA_FLAGS=--xla_enable_command_buffers_during_profiling.
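For example, when launching a JAX workload from the shell (the script name is a placeholder):

```shell
# Enable CUDA Graph (command buffer) tracing while profiling.
XLA_FLAGS=--xla_enable_command_buffers_during_profiling python train.py
```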
This will allow you to see the CUDA Graph in your XProf trace. You can also see the CUDA Graph in Graph Viewer as well.
To see the traces of the different nodes broken down within the CUDA Graph, you can add an advanced config value for ProfilerOptions: gpu_enable_cupti_activity_graph_trace = True.
You will now see the detailed nodes traced within the graph, as well as arrows linking hosts to the specific stream where the graph node execution commenced.


What customers are saying
There are many TPU and GPU customers on various frameworks that have benefited from the updated XProf tool and Cloud Diagnostics XProf library. Here are a couple of quotes to give you a sense of what they see.
"We are a technology hub, building next-gen apps using world-class expertise and cutting-edge tech. The improved XProf tool has helped us optimize our models and be more productive on both JAX and PyTorch XLA on TPUs. XProf lets us drill down from an identified bottleneck in the trace viewer to the graph view of our model to the source code so we know what we need to modify. We can now focus on tuning our code for the best performance with TPUs. With the Cloud Diagnostics XProf library, it is very easy to capture and load profiles for long term analysis and sharing links makes collaboration with others a breeze!"
- Mustafa Ozuysal, Senior ML Researcher, HubX
"Mathpix is on the cutting edge of computer vision, changing the way that individuals and businesses interact with documents. For our image inference performance challenge, XProf helped us collect and analyze profile traces through which we discovered our cache usage was the problem. Tuning this code improved our performance with JAX on TPUs. The Cloud Diagnostics XProf library made it easy for me to set up XProf on Google Cloud and load my profiles super fast!"
- Remy Ochei, Senior Machine Learning Engineer, Mathpix
Conclusion
Whether you're looking to identify bottlenecks to overlap compute and communication, optimize memory usage on-device, or get deep insights into custom kernel performance, the updated XProf and the new Cloud Diagnostics XProf library provide a comprehensive, end-to-end, and collaborative profiling solution. We've brought Google's internal tools to the entire community to help you achieve peak performance for your models on Google Cloud.
Get started today by checking out the following resources:
- XProf tools documentation: https://openxla.org/XProf
- XProf github: https://github.com/openxla/XProf
- Cloud-diagnostics-XProf library github: https://github.com/AI-Hypercomputer/cloud-diagnostics-XProf
- Profiling on Google Cloud: https://cloud.google.com/tpu/docs/profile-tpu-vm
In addition, please check out the "Unlocking ML Performance on TPUs and GPUs" talk, which provides a walkthrough of how to use the aforementioned tools, a quick “tour” of XProf tools, and some real performance optimization scenarios using XProf on TPUs and GPUs.
Acknowledgements
This work constituted a massive effort across Google Cloud, CoreML as well as multiple teams within Google. Special thanks to Kan Cai, Vaibhav Tyagi, Victor Geislinger, Pavel Dournov, Navid Khajouei, Clive Verghese, Yin Zhang, Kelvin Le, Matt Hurd, Mudit Gokhale, Sai Ganesh, Vikas Aggarwal, Sannidhya Chauhan, Stephanie Morton, Subham Soni, Ani Udipi, Newfel Harrat, Aspi Siganporia, Bryan Massoth, Chetna Jain, David Duan, George Vanica, Jiten Thakkar, Lei Zhang, Jiya Zhang, Jonah Weaver, Aditya Sharma, Fenghui Zhang, Bill Jia, Alex Spiridonov, and Niranjan Hira for their immense contributions and support to these products.