PyTorch on Google Cloud: How to train and tune PyTorch models on Vertex AI

Since publishing the inaugural post of the PyTorch on Google Cloud blog series, we announced Vertex AI, Google Cloud’s end-to-end ML platform, at Google I/O 2021. Vertex AI unifies Google Cloud’s existing ML offerings into a single platform for efficiently building and managing the lifecycle of ML projects. It provides tools for every step of the machine learning workflow across various model types, for varying levels of machine learning expertise.

We will continue the blog series with Vertex AI to share how to build, train and deploy PyTorch models at scale and how to create reproducible machine learning pipelines on Google Cloud. 

Figure 1. What’s included in Vertex AI?

In this post, we will show how to use:

  1. Vertex AI Training to build and train a sentiment text classification model using PyTorch
  2. Vertex AI Hyperparameter Tuning to tune hyperparameters of PyTorch models

You can find the accompanying code for this blog post on the GitHub repository and the Jupyter Notebook.

Let’s get started!

Use case and dataset

In this article, we will fine-tune a transformer model (BERT-base) from the Hugging Face Transformers library for a sentiment analysis task using PyTorch. BERT (Bidirectional Encoder Representations from Transformers) is a Transformer model pre-trained on a large corpus of unlabeled text in a self-supervised fashion. We will begin experimentation with the IMDB sentiment classification dataset on Notebooks.

We recommend using a Notebook instance with limited compute for development and experimentation purposes. Once we are satisfied with the local experiment on the notebook, we show how you can submit a training job from the same Jupyter notebook to the Vertex Training service to scale the training with bigger GPU shapes. Vertex Training service optimizes the training pipeline by spinning up infrastructure for the training job and spinning it down after the training is complete, without you having to manage the infrastructure.

Figure 2. ML workflow on Vertex AI

In the upcoming posts, we will show how you can deploy and serve these PyTorch models on the Vertex Prediction service, followed by Vertex Pipelines to automate, monitor and govern your ML systems by orchestrating an ML workflow in a serverless manner and storing the workflow's artifacts using Vertex ML Metadata.

Creating a development environment on Notebooks

To set up a PyTorch development environment on JupyterLab notebooks with Notebooks, follow the setup section in the earlier post in this series.

To interact with the new notebook instance, go to the Notebooks page in the Google Cloud Console and click the “OPEN JUPYTERLAB” link next to the new instance, which becomes active when the instance is ready to use.

Figure 3. Notebook instance

Training a PyTorch model on Vertex Training

After creating a Notebooks instance, you can start with your experiments. Let’s look into the model specifics for the use case.

The model specifics

For analyzing sentiments of the movie reviews in the IMDB dataset, we will fine-tune a pre-trained BERT model from Hugging Face. The pre-trained BERT model already encodes a lot of information about the language, as the model was trained on a large corpus of English data in a self-supervised fashion. We only need to fine-tune it slightly, using its outputs as features for the sentiment classification task. This means quicker development iteration on a much smaller dataset, instead of training a specific Natural Language Processing (NLP) model with a larger training dataset.

Figure 4. Pretrained model with classification layer: The blue box indicates the pre-trained BERT encoder module. The output of the encoder is pooled and fed into a linear layer whose number of outputs equals the number of target labels (classes).

For training the sentiment classification model, we will:

  • Preprocess and transform (tokenize) the reviews data
  • Load the pre-trained BERT model and add the sequence classification head for sentiment analysis
  • Fine-tune the BERT model for sentence classification

The following code snippet shows how to preprocess the data and fine-tune a pre-trained BERT model. Please refer to the Jupyter Notebook for complete code and detailed explanation.

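  # code snippet from training
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# loading the dataset
dataset = load_dataset("imdb")

# instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

# tokenize the texts
# refer notebook for the preprocess function
datasets = dataset.map(preprocess_function,
                       batched=True, load_from_cache_file=True)

# load pre-trained BERT model with a classification head for the two sentiment labels
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)

# training hyperparameters (refer notebook for the complete list)
args = TrainingArguments(
    output_dir="./output",  # placeholder output directory
    evaluation_strategy="epoch",
    learning_rate=2e-5)

# refer notebook for the compute_metrics function
trainer = Trainer(model, args,
                  train_dataset=datasets["train"],
                  eval_dataset=datasets["test"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

# start training
trainer.train()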
In the snippet above, notice that the encoder (also referred to as the base model) weights are not frozen. This is why a very small learning rate (2e-5) is chosen to avoid loss of pre-trained representations. Learning rate and other hyperparameters are captured under the TrainingArguments object. During the training, we are only capturing accuracy metrics. You can modify the compute_metrics function to capture and report other metrics.
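
As a rough illustration of extending compute_metrics, the sketch below reports precision and recall next to accuracy. It assumes the function receives a (logits, labels) pair, which is how the Hugging Face Trainer hands over evaluation predictions, and that class 1 is the positive label. The function body is illustrative and is not the implementation from the notebook; plain Python is used so the sketch runs without extra dependencies.

```python
def compute_metrics(eval_pred):
    """Return accuracy, precision and recall for a binary classifier.

    eval_pred is assumed to unpack into (logits, labels), where logits holds
    one row of per-class scores per example.
    """
    logits, labels = eval_pred
    # predicted class = index of the largest score in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    pairs = list(zip(preds, labels))

    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)

    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# made-up two-class logits for four reviews and their true labels
example = ([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]], [1, 0, 0, 0])
print(compute_metrics(example))
```

The Trainer prefixes these keys with eval_ when it reports them, so accuracy shows up as eval_accuracy.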

Training the model on Vertex AI

While you can do local experimentation on your Notebooks instance, larger datasets or larger models often require vertically scaled compute resources or horizontally distributed training. The Vertex Training service is well suited to this task for the following reasons:

  • Automatically provision and deprovision resources: A training job on Vertex AI automatically provisions computing resources, performs the training task and deletes the compute resources once the training job is finished.
  • Reusability and portability: You can package training code with its parameters and dependencies into a container and create a portable component. This container can then be run in different scenarios, such as hyperparameter tuning or training on various data sources.
  • Training at scale: You can run a distributed training job on Vertex Training to train models in a cluster across multiple nodes in parallel, resulting in faster training time.
  • Logging and monitoring: The training service logs messages from the job to Cloud Logging, and the job can be monitored while it is running.

In this post, we show how to scale a training job with Vertex Training by packaging the code and creating a training pipeline to orchestrate a training job. 

There are three steps to run a training job using Vertex AI custom training service:

Figure 5. Custom training on Vertex AI

  • STEP 1 - Determine training code structure: Package training application code as a Python source distribution or as a custom container image (Docker)

  • STEP 2 - Choose a custom training method: You can run a training job on Vertex Training as a custom job, a hyperparameter tuning job or a training pipeline.

    • Custom jobs: With a custom job you configure the settings to run your training code on Vertex AI such as worker pool specs - machine types, accelerators, Python training spec or custom container spec.

    • Hyperparameter tuning jobs: Hyperparameter tuning jobs automate tuning of hyperparameters of your model based on the criteria you configure such as goal or metric to optimize, hyperparameters values and number of trials to run.

    • Training pipelines: Orchestrates custom training jobs or hyperparameter tuning jobs with additional steps after the training job is successfully completed.

  • STEP 3 - Run the training job: You can submit the training job to run on Vertex Training using the gcloud CLI or any of the client SDK libraries, such as the Vertex SDK for Python.

Refer to the documentation for further details on custom training methods.

Packaging the training application

Before running the training application on Vertex Training, the training application code with required dependencies must be packaged and uploaded to a Cloud Storage bucket that your Google Cloud project can access. There are two ways to package the application and run on Vertex Training:

  1. Create a Python source distribution with the training code and dependencies to use with a pre-built container on Vertex AI
  2. Use custom containers to package dependencies using Docker containers

You can structure your training code in any way you prefer. Refer to the GitHub repository or Jupyter Notebook for our recommended approach on structuring training code. 

Run Custom Job on Vertex Training with a pre-built container

Vertex AI provides Docker container images that can be run as pre-built containers for custom training. These containers include common dependencies used in training code based on the Machine Learning framework and framework version.

For the sentiment analysis task, we use Hugging Face Datasets and fine-tune a transformer model from the Hugging Face Transformers library using PyTorch. We use the pre-built container for PyTorch and package the training application code as a Python source distribution, adding the standard Python dependencies required by the training algorithm (transformers, datasets and tqdm) in the setup.py file.

Figure 6. Custom training with pre-built containers on Vertex Training

The find_packages() function inside setup.py includes the training code in the package.

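  # contents of setup.py
from setuptools import find_packages
from setuptools import setup

# dependencies installed on top of the pre-built PyTorch container
REQUIRED_PACKAGES = ["transformers", "datasets", "tqdm"]

setup(
    name="trainer",
    version="0.1",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    description='Vertex AI | Training | PyTorch | Text Classification | Python Package'
)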

We use the Vertex SDK for Python to create and submit the training job to the Vertex Training service by configuring a Custom Job resource with the pre-built container image for PyTorch and specifying the training code packaged as a Python source distribution. We attach an NVIDIA Tesla T4 GPU to the training job to accelerate the training.

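  # submit training job to Vertex Training with 
# pre-built container using Vertex SDK
from google.cloud import aiplatform
from datetime import datetime

# initialize the Vertex SDK for Python
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

# define variable names
APP_NAME = "finetuned-bert-classifier"
source_package_file_name = f"{PYTHON_PACKAGE_APPLICATION_DIR}/dist/trainer-0.1.tar.gz"
# Cloud Storage URI of the uploaded source distribution;
# PACKAGE_PATH is a placeholder, refer notebook for the exact path
python_package_gcs_uri = PACKAGE_PATH
python_module_name = "trainer.task"
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
JOB_NAME = f"{APP_NAME}-pytorch-pkg-ar-{TIMESTAMP}"

# configure the Custom Job resource
# IMAGE_URI is a placeholder for the pre-built PyTorch container image URI
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=JOB_NAME,
    python_package_gcs_uri=python_package_gcs_uri,
    python_module_name=python_module_name,
    container_uri=IMAGE_URI,
)

# define training code arguments
training_args = ["--num-epochs", "2", "--model-name", "finetuned-bert-classifier"]

# submit the Custom Job to Vertex Training service
model = job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=training_args,
)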

Alternatively, you can submit the training job to the Vertex AI training service using the gcloud beta ai custom-jobs create command. The gcloud command stages your training application in a Cloud Storage bucket and submits the training job.

  # submit training job to Vertex Training with 
# pre-built container using gcloud CLI
gcloud beta ai custom-jobs create \
    --display-name=${JOB_NAME} \
    --region ${REGION} \
    --python-package-uris=${PACKAGE_PATH} \
    --worker-pool-spec=replica-count=1,machine-type='n1-standard-8',accelerator-type='NVIDIA_TESLA_T4',accelerator-count=1,executor-image-uri=${IMAGE_URI},python-module='trainer.task',local-package-path="../python_package/" \
    --args="--num-epochs","2","--model-name","finetuned-bert-classifier"

The worker-pool-spec parameter in the command defines the worker pool configuration used by the custom job. The fields within worker-pool-spec are:

  • Set the executor-image-uri to the URI of the pre-built PyTorch v1.7 GPU image used for training
  • Set the local-package-path to the path to the training code
  • Set the python-module to trainer.task, which is the main module that starts the training application
  • Set the accelerator-type and machine-type to set the compute type to run the application
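
Because the worker-pool-spec value is a single comma-separated string of key=value pairs, it can be convenient to assemble it programmatically before calling the CLI. The helper below is a hypothetical sketch (build_worker_pool_spec is not part of the gcloud CLI or the Vertex SDK), and IMAGE_URI stands in for the pre-built image URI.

```python
def build_worker_pool_spec(fields):
    """Join worker pool fields into the key=value,... form used by
    the --worker-pool-spec flag. Fields set to None are skipped."""
    return ",".join(
        f"{key}={value}" for key, value in fields.items() if value is not None
    )


spec = build_worker_pool_spec({
    "replica-count": 1,
    "machine-type": "n1-standard-8",
    "accelerator-type": "NVIDIA_TESLA_T4",
    "accelerator-count": 1,
    "executor-image-uri": "IMAGE_URI",  # placeholder for the pre-built image URI
    "python-module": "trainer.task",
    "local-package-path": "../python_package/",
})
print(f"--worker-pool-spec={spec}")
```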

Refer to the documentation for the gcloud beta ai custom-jobs create command for details.

Run Custom Job on Vertex Training with custom container

To create a training job with a custom container, you define a Dockerfile to install or add the dependencies required for the training job. Then, you build and test your Docker image locally to verify, push the image to Container Registry and submit a Custom Job to Vertex Training service.

Figure 7. Custom training with custom containers on Vertex Training

We create a Dockerfile with a pre-built PyTorch container image provided by Vertex AI as the base image, install the dependencies (transformers, datasets, tqdm and cloudml-hypertune) and copy the training application code.

  # Dockerfile to build custom container for training
# use PyTorch DLVM as base image
# replace <base_image_uri> with the URI of the pre-built PyTorch image
FROM <base_image_uri>

# installs transformers, datasets, cloudml-hypertune and google-cloud-storage sdk
RUN pip install google-cloud-storage transformers datasets tqdm cloudml-hypertune

# Copies the trainer code to the docker image.
COPY ./trainer/ ./trainer/

# Set up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

Now, build and push the image to Google Cloud Container Registry.

  # build the image
docker build -f Dockerfile -t $CUSTOM_TRAIN_IMAGE_URI ../python_package
# push the container to Container Registry
docker push $CUSTOM_TRAIN_IMAGE_URI
# validate the custom container image in Container Registry
gcloud container images describe $CUSTOM_TRAIN_IMAGE_URI

Submit the custom training job to Vertex Training using Vertex SDK for Python. 

  # submit training job to Vertex Training with 
# custom container using Vertex SDK
from google.cloud import aiplatform
from datetime import datetime

# initialize the Vertex SDK for Python
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

# define variable names
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
JOB_NAME = f"{APP_NAME}-pytorch-cstm-cntr-{TIMESTAMP}"

# configure the job with container image spec
job = aiplatform.CustomContainerTrainingJob(
    display_name=JOB_NAME, container_uri=CUSTOM_TRAIN_IMAGE_URI
)

# define training code arguments
training_args = ["--num-epochs", "2", "--model-name", "finetuned-bert-classifier"]

# submit the Custom Job to Vertex Training service
model = job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=training_args,
)

Alternatively, you can submit the training job to the Vertex AI training service using the gcloud beta ai custom-jobs create command with a custom container spec. The gcloud command submits the training job and launches a worker pool with the specified custom container image.

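  # worker pool spec
worker_pool_spec="\
replica-count=1,\
machine-type=n1-standard-8,\
accelerator-type=NVIDIA_TESLA_T4,\
accelerator-count=1,\
container-image-uri=${CUSTOM_TRAIN_IMAGE_URI}"

# submit training job to Vertex Training with 
# custom container using gcloud CLI
gcloud beta ai custom-jobs create \
    --display-name=${JOB_NAME} \
    --region ${REGION} \
    --worker-pool-spec="${worker_pool_spec}" \
    --args="--num-epochs","2","--model-name","finetuned-bert-classifier"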

The worker-pool-spec parameter defines the worker pool configuration used by the custom job. The fields within worker-pool-spec are:

  • Set the container-image-uri to the custom container image pushed to Google Cloud Container Registry for training

  • Set the accelerator-type and machine-type to set the compute type to run the application

Once the job is submitted, you can monitor the status and progress of the training job either in the Google Cloud Console or with the gcloud CLI command gcloud beta ai custom-jobs stream-logs, as shown below:

  # get job id and status of the job based on JOB_NAME
gcloud ai custom-jobs list --region=$REGION --filter="displayName:"$JOB_NAME --format="get(name,state)"
# stream logs to console
gcloud ai custom-jobs stream-logs <job_id> --region=$REGION

Figure 8. Monitor progress and logs of custom training jobs from Google Cloud Console

Hyperparameter tuning on Vertex AI

The training application code for fine-tuning a transformer model uses hyperparameters such as learning rate and weight decay. These hyperparameters control the behavior of the training algorithm and can have a substantial effect on the performance of the resulting model. In this section, we show how you can automate tuning these hyperparameters with Vertex Training.

We submit a Hyperparameter Tuning job to the Vertex Training service by packaging the training application code and dependencies in a Docker container and pushing the container to Google Container Registry, similar to running a Custom Job on Vertex AI with a custom container as shown in the earlier section.

Figure 9. Hyperparameter Tuning on Vertex Training

How does hyperparameter tuning work in Vertex AI?

The high-level steps involved in running a Hyperparameter Tuning job on the Vertex Training service are:

  • You define the hyperparameters to tune along with the metric to optimize.
  • The Vertex Training service runs multiple trials of the training application with the hyperparameters and limits you specify: the maximum number of trials to run and the number of parallel trials.
  • Vertex AI keeps track of the results from each trial and makes adjustments for subsequent trials. This requires your training application to report the metrics to Vertex AI using the Python package cloudml-hypertune.
  • When the job is finished, you get a summary of all the trials along with the most effective configuration of values based on the criteria you configured.

Refer to the Vertex AI documentation to understand how to configure and select hyperparameters for tuning, configure tuning strategy and how Vertex AI optimizes the hyperparameter tuning jobs. The default tuning strategy uses results from previous trials to inform the assignment of values in subsequent trials.
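
To make the trial limits concrete, here is a toy simulation of a tuning loop. It is not how Vertex AI selects values (this sketch samples at random and does not learn from earlier trials), and run_trial is a made-up stand-in for a training run. It only shows, in simplified form, how the maximum number of trials and the number of parallel trials shape the schedule: trials are launched in batches, and the results of a batch are known only after it finishes.

```python
import random


def run_trial(learning_rate):
    """Made-up objective standing in for a training run; peaks near 2e-5."""
    return 1.0 / (1.0 + abs(learning_rate - 2e-5) * 1e5)


def tune(max_trial_count, parallel_trial_count, seed=0):
    rng = random.Random(seed)
    results = []  # (accuracy, learning_rate) of finished trials
    batches = 0
    while len(results) < max_trial_count:
        # launch at most parallel_trial_count trials at once
        batch_size = min(parallel_trial_count, max_trial_count - len(results))
        candidates = [10 ** rng.uniform(-6, -3) for _ in range(batch_size)]
        # results of this batch are only known once all of its trials finish
        results.extend((run_trial(lr), lr) for lr in candidates)
        batches += 1
    best_accuracy, best_lr = max(results)
    return best_accuracy, best_lr, len(results), batches


best_accuracy, best_lr, trials, batches = tune(max_trial_count=5, parallel_trial_count=2)
print(f"{trials} trials in {batches} batches, best learning rate {best_lr:.2e}")
```

With five trials and two running in parallel, the loop needs three batches; raising the parallel count shortens the schedule but leaves fewer finished results to learn from between batches.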

Changes to training application code for hyperparameter tuning

There are a few requirements to follow that are specific to hyperparameter tuning in Vertex AI:

  1. To pass the hyperparameter values to training code, you must define a command-line argument in the main training module for each tuned hyperparameter. Use the value passed in those arguments to set the corresponding hyperparameter in the training application's code.
  2. You must pass metrics from the training application to Vertex AI to evaluate the efficacy of a trial. You can use the cloudml-hypertune Python package to report metrics.

Previously, in the training application code, we instantiated Trainer with hyperparameters passed as training arguments (training_args). These hyperparameters are passed as command line arguments to the training module trainer.task which are then passed to the training_args. Refer to ./python_package/trainer module for training application code.

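  # define hyperparameters as arguments in the main training module
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=2e-5,
    help='Learning rate value for the optimizers.')
parser.add_argument("--weight-decay", type=float, default=0.01,  # illustrative default
    help='Weight decay coefficient applied by the optimizer.')

# enable hyperparameter
parser.add_argument("--hp-tune", default="n",
    help='Enable hyperparameter tuning. Valid values are: "y" - enable, "n" - disable')
args = parser.parse_args()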
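
The sketch below simulates how one tuning trial might hand its values to the training module. The argument values are made up for illustration, and the exact form in which the service passes them is an assumption; the point is that argparse exposes --hp-tune as args.hp_tune and --learning-rate as args.learning_rate, which is how the training code reads them.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=2e-5)
parser.add_argument("--weight-decay", type=float, default=0.01)
parser.add_argument("--hp-tune", default="n", choices=["y", "n"])

# simulate the arguments one tuning trial might receive (illustrative values)
trial_argv = ["--learning-rate=0.0001", "--weight-decay=0.001", "--hp-tune=y"]
args = parser.parse_args(trial_argv)

# argparse converts dashes in option names to underscores
print(args.learning_rate, args.weight_decay, args.hp_tune)
```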

To report metrics to Vertex AI when hyperparameter tuning is enabled, we call the cloudml-hypertune Python package after the evaluation phase, through a callback added to the trainer object. The trainer object passes the metrics computed in the last evaluation phase to the callback, and the hypertune library reports them to Vertex AI for evaluating trials.

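  # callback that reports the evaluation metric to Vertex AI
import hypertune
from transformers import Trainer, TrainerCallback

class HPTuneCallback(TrainerCallback):
    """
    A custom callback class that reports a metric to hypertuner
    at the end of each epoch.
    """

    def __init__(self, metric_tag, metric_value):
        super(HPTuneCallback, self).__init__()
        self.metric_tag = metric_tag
        self.metric_value = metric_value
        self.hpt = hypertune.HyperTune()

    def on_evaluate(self, args, state, control, **kwargs):
        print(f"HP metric {self.metric_tag}={kwargs['metrics'][self.metric_value]}")
        self.hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag=self.metric_tag,
            metric_value=kwargs['metrics'][self.metric_value],
            global_step=state.epoch)

# initialize Trainer object (refer notebook for the dataset variables)
trainer = Trainer(model, training_args,
                  train_dataset=train_dataset, eval_dataset=test_dataset,
                  tokenizer=tokenizer, compute_metrics=compute_metrics)

# add hyperparameter tuning callback to report metrics when enabled
if args.hp_tune == "y":
    trainer.add_callback(HPTuneCallback("accuracy", "eval_accuracy"))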

Run Hyperparameter Tuning Job on Vertex AI

Before submitting the Hyperparameter Tuning job to Vertex AI, push the custom container image with the training application to Cloud Container Registry repository and then submit the job to Vertex AI using Vertex SDK for Python. We use the same image as before when running the Custom Job on Vertex Training service.

  # initialize the Vertex SDK for Python
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

# define variable names
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
JOB_NAME = f"{APP_NAME}-pytorch-hptune-{TIMESTAMP}"

Define the training arguments with hp-tune argument set to y so that training application code can report metrics to Vertex Training service.

training_args = ["--num-epochs", "2",
                 "--model-name", "finetuned-bert-classifier",
                 "--hp-tune", "y"]

Create a CustomJob with worker pool specs to define the machine types, accelerators and custom container spec with the training application code.

  # The spec of the worker pools including machine type and Docker image
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": CUSTOM_TRAIN_IMAGE_URI, "args": training_args},
    }
]

# define the custom job
custom_job = aiplatform.CustomJob(
    display_name=JOB_NAME, worker_pool_specs=worker_pool_specs
)

Next, define the parameter and metric specifications:

  • parameter_spec defines the search space, i.e. the parameters to search and optimize. Each hyperparameter is specified as an instance of a parameter value specification matching its data type. Refer to the documentation on selecting the hyperparameters to tune and how to define them.
  • metric_spec defines the goal of the metric to optimize. The goal specifies whether you want to tune your model to maximize or minimize the value of this metric.

  # import the hyperparameter specification helpers
from google.cloud.aiplatform import hyperparameter_tuning as hpt

# Dictionary representing parameters to optimize.
# The dictionary key is the parameter_id, which is passed into your training
# job as a command line argument,
# and the dictionary value is the parameter specification of the hyperparameter.
parameter_spec = {
    "learning-rate": hpt.DoubleParameterSpec(min=1e-6, max=0.001, scale="log"),
    "weight-decay": hpt.DiscreteParameterSpec(
        values=[0.0001, 0.001, 0.01, 0.1], scale=None
    ),
}

# Dictionary representing metrics to optimize.
# The dictionary key is the metric_id, which is reported by your training job,
# and the dictionary value is the optimization goal of the metric.
metric_spec = {"accuracy": "maximize"}
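
The scale="log" setting on learning-rate means candidate values are spread evenly across orders of magnitude rather than evenly across the raw interval. The sketch below illustrates that idea with plain random sampling over the same range (1e-6 to 1e-3). It is an illustration of log scaling in general, written under that assumption, and not the sampling code Vertex AI runs.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(1000)]

# with log scaling each decade receives roughly a third of the samples
for exponent in (-6, -5, -4):
    lo, hi = 10.0 ** exponent, 10.0 ** (exponent + 1)
    share = sum(lo <= s < hi for s in samples) / len(samples)
    print(f"[1e{exponent}, 1e{exponent + 1}): {share:.0%}")
```

Sampling the raw interval uniformly instead would place about 99% of the candidates above 1e-5, leaving the small learning rates that suit fine-tuning almost unexplored.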

Configure and submit a HyperparameterTuningJob with the CustomJob, metric_spec, parameter_spec and trial limits. Trial limits define how many trials to allow the service to run: 

  • max_trial_count: Maximum number of trials run by the service. Start with a smaller value to understand the impact of the chosen hyperparameters before scaling up.
  • parallel_trial_count: Number of trials to run in parallel. Start with a smaller value, as Vertex AI uses results from previous trials to inform the assignment of values in subsequent trials. A higher number of parallel trials means that these trials start without the benefit of the results of any trials still running.
  • search_algorithm: Search algorithm specified for the study. When not specified, Vertex AI by default applies Bayesian optimization to arrive at the optimal solution to search over the parameter space.
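
hp_job = aiplatform.HyperparameterTuningJob(
    display_name=JOB_NAME, custom_job=custom_job, metric_spec=metric_spec,
    parameter_spec=parameter_spec, max_trial_count=5, parallel_trial_count=2)
hp_job.run()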

Refer to the documentation to understand the hyperparameter tuning job configuration.

Alternatively, you can submit a hyperparameter tuning job to Vertex AI training service using gcloud beta ai hp-tuning-jobs create. The gcloud command submits the hyperparameter tuning job and launches multiple trials with a worker pool based on custom container image specified, number of trials and the criteria set. The command requires hyperparameter tuning job configuration provided as configuration file in YAML format with job name. Refer to the Jupyter notebook on creating the YAML configuration and submitting the job via gcloud command.

  # submit hyperparameter tuning job
gcloud beta ai hp-tuning-jobs create \
   --config ./python_package/hptuning_job.yaml \
   --display-name $JOB_NAME \
   --algorithm algorithm-unspecified \
   --max-trial-count 5 \
   --parallel-trial-count 2 \
   --region $REGION

You can monitor the hyperparameter tuning job from the Google Cloud Console or with the gcloud CLI command gcloud beta ai hp-tuning-jobs stream-logs.

Figure 10. Monitor progress and logs of hyperparameter tuning jobs from Google Cloud Console

After the job is finished, you can view and format the results of the hyperparameter tuning Trials (run by Vertex Training service) and pick the best performing Trial to deploy to Vertex Prediction service.
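
Selecting the best trial comes down to comparing the final metric of each finished trial against the goal. The sketch below uses a hypothetical TrialResult record with made-up values in place of the trial objects returned by the service, so the field names are assumptions; refer to the notebook for reading real trial results.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """Hypothetical summary of one trial (not a Vertex AI type)."""
    trial_id: str
    parameters: dict
    accuracy: float


def best_trial(trials, goal="maximize"):
    """Return the trial with the best metric for the given goal."""
    pick = max if goal == "maximize" else min
    return pick(trials, key=lambda trial: trial.accuracy)


# made-up results for illustration only
trials = [
    TrialResult("1", {"learning-rate": 1e-4, "weight-decay": 0.01}, 0.87),
    TrialResult("2", {"learning-rate": 2e-5, "weight-decay": 0.001}, 0.91),
    TrialResult("3", {"learning-rate": 5e-6, "weight-decay": 0.1}, 0.89),
]
best = best_trial(trials)
print(best.trial_id, best.parameters)
```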

Run predictions locally

Let’s run prediction calls on the trained model locally with a few examples (refer to the notebook for the complete code). The next post in this series will show you how to deploy this model on Vertex Prediction service.

  # map predicted class ids to sentiment labels
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

label_text = {0: 'Negative', 1: 'Positive'}

def predict(input_text, saved_model_path):
    # initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
    # preprocess and encode input text
    predict_input = tokenizer.encode(input_text,
                                     truncation=True,
                                     return_tensors="pt")
    # load trained model
    loaded_model = AutoModelForSequenceClassification.from_pretrained(
        saved_model_path)
    # get predictions
    output = loaded_model(predict_input)
    # return labels
    label_id = torch.argmax(output.logits, dim=1)
    print(f'Sentiment : {label_text[label_id.item()]}\n')
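
The model returns raw scores (logits) for each class, and argmax picks the larger one. If a confidence value is also useful, a softmax turns the logits into probabilities. The helper below is a dependency-free sketch with made-up logits; with PyTorch, torch.softmax performs the same computation on tensors.

```python
import math

label_text = {0: "Negative", 1: "Positive"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [-1.2, 2.3]  # made-up scores for [Negative, Positive]
probs = softmax(logits)
label_id = max(range(len(probs)), key=probs.__getitem__)
print(f"Sentiment : {label_text[label_id]} ({probs[label_id]:.2f})")
```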

Cleaning up the Notebook environment

After you are done experimenting, you can either stop or delete the Notebooks instance. Delete the Notebooks instance to prevent any further charges. If you want to save your work, you can choose to stop the instance instead.

What’s next?

In this article, we explored Notebooks for PyTorch model development. We then trained and tuned the model on Vertex Training service, a fully managed service for training machine learning models at scale. We looked at how you can submit training jobs as Custom Job and Hyperparameter Tuning Job to Vertex Training using Vertex SDK for Python and gcloud CLI commands with both pre-built and custom containers for PyTorch.

In the next installments of this series, we will show how to deploy PyTorch models on Vertex Prediction service and orchestrate a machine learning workflow using Vertex Pipelines. We encourage you to explore the Vertex AI features and read the reference guide on best practices for implementing machine learning on Google Cloud.


Stay tuned. Thank you for reading! Have a question or want to chat? Find the authors here: Rajesh [Twitter | LinkedIn] and Vaibhav [LinkedIn].

Thanks to Karl Weinmeister and Jordan Totten for helping with and reviewing the post.