PyTorch on Google Cloud: Blog series recap

PyTorch is an open source machine learning framework, primarily developed by Meta (previously Facebook). PyTorch is extensively used in the research space, and in recent years it has gained immense traction in industry due to its ease of use and deployment. Vertex AI, a fully managed end-to-end data science and machine learning platform on Google Cloud, has first-class support for PyTorch, making it optimized, compatibility tested, and ready to deploy.

We started a new blog series, PyTorch on Google Cloud, to uncover, demonstrate and share how to build, train and deploy PyTorch models at scale on Cloud AI Infrastructure using GPUs and TPUs on Vertex AI, and how to create reproducible machine learning pipelines on Google Cloud. This blog post is the home page for the series, with links to the existing and upcoming posts. Here are the posts in this series:

PyTorch on Vertex AI

  1. How to train and tune PyTorch models on Vertex AI: In this post, learn how to use Vertex AI Training to build and train a sentiment text classification model using PyTorch, and Vertex AI Hyperparameter Tuning to tune the hyperparameters of PyTorch models.

  2. How to deploy PyTorch models on Vertex AI: This post walks through deploying a PyTorch model using TorchServe as a custom container, with the model artifacts deployed to a Vertex AI Prediction service.

  3. Orchestrating PyTorch ML Workflows on Vertex AI Pipelines: In this post, we show how to build and orchestrate ML pipelines for training and deploying PyTorch models on Google Cloud using Vertex AI Pipelines.

  4. Scalable ML Workflows using PyTorch on Kubeflow Pipelines and Vertex Pipelines: This post shows examples of PyTorch-based ML workflows on two pipeline frameworks: OSS Kubeflow Pipelines, part of the Kubeflow project, and Vertex AI Pipelines. We also share new PyTorch built-in components added to Kubeflow Pipelines.

PyTorch/XLA and Cloud TPU

  1. Scaling deep learning workloads with PyTorch / XLA and Cloud TPU VM: This post describes the challenges associated with scaling deep learning jobs to distributed training settings using the Cloud TPU VM, and shows how to stream training data from Google Cloud Storage (GCS) to PyTorch/XLA models running on Cloud TPU Pod slices.

  2. PyTorch/XLA: Performance debugging on Cloud TPU VM: Part I: In this first part of the performance debugging series on Cloud TPU, we lay out the conceptual framework for PyTorch/XLA in the context of training performance. We introduce a case study to make sense of preliminary profiler logs and identify corrective actions.

  3. PyTorch/XLA: Performance debugging on Cloud TPU VM: Part II: In the second part, we dive deeper into the performance analysis to discover more opportunities for improvement.

  4. PyTorch/XLA: Performance debugging on Cloud TPU VM: Part III: In the final part of the performance debugging series, we introduce user-defined code annotations and visualize these annotations in the form of a trace.

  5. Train ML models with PyTorch Lightning on TPUs: This post shows how easy it is to start training models with PyTorch Lightning on TPUs, using its built-in TPU support.

A few more articles related to PyTorch on Google Cloud

  1. How to train PyTorch models on AI Platform: In this post, learn how to set up a PyTorch development environment on Vertex AI Workbench (previously Notebooks) and use AI Platform Training to build and train a sentiment text classification model using PyTorch.

  2. Increase your productivity using PyTorch Lightning: Learn how to use PyTorch Lightning on Vertex AI Workbench (previously Notebooks).

We have more articles coming soon covering PyTorch and Google Cloud AI.

Stay tuned. Thank you for reading! Have a question or want to chat? Find us here: Vaibhav Singh, Rajesh Thallam, Jordan Totten and Karl Weinmeister.