Training code requirements

Perform custom training on Vertex AI to run your own machine learning (ML) training code in the cloud, instead of using AutoML. This document describes requirements to consider as you write training code.

Choose a training code structure

First, determine what structure you want your ML training code to take. You can provide training code to Vertex AI in one of the following forms:

  • A Python training application to use with a pre-built container. Create a Python source distribution with code that trains an ML model and exports it to Cloud Storage. This training application can use any of the dependencies included in the pre-built container that you plan to use it with.

    Use this option if one of the Vertex AI pre-built containers for training includes all the dependencies that you need for training. For example, if you want to train with PyTorch, scikit-learn, TensorFlow, or XGBoost, then this is likely the better option.

    To learn about requirements specific to this option, read the guide to creating a Python training application.

  • A custom container image. Create a Docker container image with code that trains an ML model and exports it to Cloud Storage. Include any dependencies required by your code in the container image.

    Use this option if you want to use dependencies that are not included in one of the Vertex AI pre-built containers for training. For example, if you want to train using a Python ML framework that is not available in a pre-built container, or if you want to train using a programming language other than Python, then this is the better option.

    To learn about requirements specific to this option, read the guide to creating a custom container image.

The rest of this document describes requirements relevant to both training code structures.

Requirements for all custom training code

When you write custom training code for Vertex AI, keep in mind that the code will run on one or more virtual machine (VM) instances managed by Google Cloud. This section describes requirements applicable to all custom training code.

Access Google Cloud services in your code

Several of the following sections describe accessing other Google Cloud services from your code. To access Google Cloud services, write your training code to use Application Default Credentials (ADC). Many Google Cloud client libraries authenticate with ADC by default. You don't need to configure any environment variables; Vertex AI automatically configures ADC to authenticate as either the Vertex AI Custom Code Service Agent for your project (by default) or a custom service account (if you have configured one).

However, when you use a Google Cloud client library in your code, Vertex AI might not always connect to the correct Google Cloud project by default. If you encounter permission errors, connecting to the wrong project might be the problem.

This problem occurs because Vertex AI does not run your code directly in your Google Cloud project. Instead, Vertex AI runs your code in one of several separate projects managed by Google. Vertex AI uses these projects exclusively for operations related to your project. Therefore, don't try to infer a project ID from the environment in your training or prediction code; specify project IDs explicitly.

If you don't want to hardcode a project ID in your training code, you can reference the CLOUD_ML_PROJECT_ID environment variable: Vertex AI sets this variable in every custom training container to the project number of the project where you initiated custom training. Many Google Cloud tools accept a project number wherever they accept a project ID.

For example, if you want to use the Python Client for Google BigQuery to access a BigQuery table in the same project, then do not try to infer the project in your training code:

Implicit project selection

from google.cloud import bigquery

client = bigquery.Client()

Instead use code that explicitly selects a project:

Explicit project selection

import os

from google.cloud import bigquery

# CLOUD_ML_PROJECT_ID contains the project number of the project that
# initiated custom training; BigQuery accepts it in place of a project ID.
project_number = os.environ["CLOUD_ML_PROJECT_ID"]
client = bigquery.Client(project=project_number)

If you encounter permission errors after configuring your code in this way, then read the following section about which resources your code can access to adjust the permissions available to your training code.

Which resources your code can access

By default, your training code has access to any resources available to the Vertex AI Custom Code Service Agent for your project. You can also configure Vertex AI so that your training code can access more or fewer resources.

For example, consider your training code's access to Cloud Storage resources:

By default, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing custom training. You can also grant Vertex AI access to Cloud Storage buckets in other projects, or you can precisely customize what buckets a specific job can access by using a custom service account.

Similar access rules and techniques apply to BigQuery tables and other Google Cloud resources. In general, you can change the resources available to your training code in one of the following ways:

  • Grant additional access to, or revoke default access from, the Vertex AI Custom Code Service Agent for your project.

  • Run your custom training job with a custom service account that has exactly the roles that you want your code to have.

Read and write Cloud Storage files with Cloud Storage FUSE

In all custom training jobs, Vertex AI mounts the Cloud Storage buckets that you have access to in the /gcs/ directory of each training node's filesystem. As a convenient alternative to using the Python Client for Cloud Storage or another library, you can read from and write to Cloud Storage directly through the local filesystem. For example, to load data from gs://BUCKET/data.csv, you can use the following Python code:

# Read a file from Cloud Storage through the Cloud Storage FUSE mount.
with open('/gcs/BUCKET/data.csv', 'r') as file:
    data = file.read()

Vertex AI uses Cloud Storage FUSE to mount the storage buckets. Note that directories mounted by Cloud Storage FUSE are not POSIX compliant.
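Writing works the same way. For example, a minimal sketch that writes a file to Cloud Storage through the mount; BUCKET and the file path are placeholders:

# Write a file directly to Cloud Storage through the Cloud Storage FUSE mount.
with open('/gcs/BUCKET/output/summary.txt', 'w') as f:
    f.write('training complete\n')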

The credentials that you are using for custom training determine which buckets you can access in this way. The preceding section about which resources your code can access describes exactly which buckets you can access by default and how to customize this access.

Load input data

ML code usually operates on training data in order to train a model. Don't store training data together with your code, whether you create a Python training application or a custom container image. Storing data with code can lead to a poorly organized project, make it difficult to reuse code on different datasets, and cause errors for large datasets.

We recommend that you write your code to load data from a source outside of Vertex AI, such as BigQuery or Cloud Storage. Alternatively, you can use a Vertex AI managed dataset, but we don't recommend it; managed datasets are better suited for AutoML models.

For best performance when you load data from Cloud Storage, use a bucket in the region where you are performing custom training. To learn how to store data in Cloud Storage, read Creating storage buckets and Uploading objects.

To learn about which Cloud Storage buckets you can load data from, read the previous section about which resources your code can access.

To load data from Cloud Storage in your training code, use the Cloud Storage FUSE feature described in the preceding section, or use any library that supports ADC. You don't need to explicitly provide any authentication credentials in your code.

For example, you can use one of the client libraries demonstrated in the Cloud Storage guide to Downloading objects. The Python Client for Cloud Storage, in particular, is included in pre-built containers. TensorFlow's tf.io.gfile.GFile class also supports ADC.
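As an illustration, the following sketch uses the Python Client for Cloud Storage to download an object to local disk; BUCKET and data.csv are placeholders for your own bucket and object names:

import os

from google.cloud import storage

# Authenticate with ADC and connect explicitly to the project that initiated
# training, as described earlier in this document.
client = storage.Client(project=os.environ["CLOUD_ML_PROJECT_ID"])

# Download gs://BUCKET/data.csv to the local filesystem.
client.bucket("BUCKET").blob("data.csv").download_to_filename("/tmp/data.csv")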

Load a large dataset

Depending on which machine types you plan to use during custom training, your VMs might not be able to load the entirety of a large dataset into memory.

If you need to read data that is too large to fit in memory, stream the data or read it incrementally. Different ML frameworks have different best practices for doing this. For example, TensorFlow's tf.data.Dataset class can stream TFRecord or text data from Cloud Storage.
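For example, a minimal sketch that streams TFRecord files from Cloud Storage with tf.data; the gs:// path and batch size are placeholders:

import tensorflow as tf

# List TFRecord files in Cloud Storage and stream them in batches instead of
# loading the whole dataset into memory.
filenames = tf.io.gfile.glob("gs://BUCKET/data/*.tfrecord")
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)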

Performing custom training on multiple VMs with data parallelism is another way to reduce the amount of data that each VM loads into memory. See the Write code for distributed training section of this document.

Export a trained ML model

ML code usually exports a trained model at the end of training in the form of one or more model artifacts. You can then use the model artifacts to get predictions.

After custom training completes, you can no longer access the VMs that ran your training code. Therefore, your training code must export model artifacts to a location outside of Vertex AI.

We recommend that you export model artifacts to a Cloud Storage bucket. As described in the previous section about which resources your code can access, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing custom training. Use a library that supports ADC to export your model artifacts. For example, TensorFlow's APIs for saving Keras models can export artifacts directly to a Cloud Storage path.
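For example, a minimal sketch that exports a trained Keras model directly to Cloud Storage, assuming TensorFlow's SavedModel format; model is assumed to be a trained tf.keras.Model, and the gs:// path is a placeholder:

# TensorFlow authenticates with ADC, so model artifacts can be written
# straight to a Cloud Storage path.
model.save("gs://BUCKET/model-output/")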

If you want to use your trained model to serve predictions on Vertex AI, then your code must export model artifacts in a format compatible with one of the pre-built containers for prediction. Learn more in the guide to exporting model artifacts for prediction.

Environment variables for special Cloud Storage directories

If you specify the baseOutputDirectory API field, Vertex AI sets the following environment variables when it runs your training code:

  • AIP_MODEL_DIR: the Cloud Storage URI of a directory intended for saving model artifacts.

  • AIP_CHECKPOINT_DIR: the Cloud Storage URI of a directory intended for saving checkpoints.

  • AIP_TENSORBOARD_LOG_DIR: the Cloud Storage URI of a directory intended for saving TensorBoard logs.

The values of these environment variables differ slightly depending on whether you are using hyperparameter tuning. To learn more, see the API reference for baseOutputDirectory.

Using these environment variables makes it easy to reuse the same training code multiple times—for example with different data or configuration options—and save model artifacts and checkpoints to different locations, just by changing the baseOutputDirectory API field. However, you are not required to use the environment variables in your code if you don't want to. For example, you can alternatively hardcode locations for saving checkpoints and exporting model artifacts.
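For example, a minimal sketch that reads these environment variables to decide where to save artifacts; model is assumed to be a trained Keras model:

import os

# Vertex AI populates these variables when baseOutputDirectory is specified.
model_dir = os.environ["AIP_MODEL_DIR"]
checkpoint_dir = os.environ["AIP_CHECKPOINT_DIR"]

model.save(model_dir)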

Additionally, if you use a TrainingPipeline for custom training and do not specify the modelToUpload.artifactUri field, then Vertex AI uses the value of the AIP_MODEL_DIR environment variable for modelToUpload.artifactUri. (For hyperparameter tuning, Vertex AI uses the value of the AIP_MODEL_DIR environment variable from the best trial.)

Ensure resilience to restarts

The VMs that run your training code restart occasionally. For example, Google Cloud might need to restart a VM for maintenance. When a VM restarts, Vertex AI runs your code again from the beginning.

If you expect your training code to run for a non-negligible amount of time, build the following behaviors into your code to make it resilient to restarts:

  • Frequently export your training progress to Cloud Storage, so that you don't lose it if your VMs restart.

  • At the start of your training code, check whether any training progress already exists in your export location. If so, load the saved training state instead of starting training from scratch.

How to accomplish these behaviors depends on which ML framework you use. For example, if you use TensorFlow Keras, learn how to use the ModelCheckpoint callback for this purpose.
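For example, a minimal sketch of restart-resilient checkpointing with the Keras ModelCheckpoint callback, assuming the TensorFlow 2 checkpoint format; model and train_data are assumed to come from your own code:

import os

import tensorflow as tf

checkpoint_dir = os.environ["AIP_CHECKPOINT_DIR"]

# If the VM restarted mid-training, resume from the most recent checkpoint
# instead of starting from scratch.
latest = tf.train.latest_checkpoint(checkpoint_dir)
if latest:
    model.load_weights(latest)

# Save weights at the end of every epoch so that progress is never lost.
callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(checkpoint_dir, "ckpt-{epoch:02d}"),
    save_weights_only=True)

model.fit(train_data, epochs=10, callbacks=[callback])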

Requirements for optional custom training features

If you want to use certain optional custom training features, you might need to make additional changes to your training code. This section describes code requirements for hyperparameter tuning, GPUs, distributed training, and Vertex TensorBoard.

Write code for hyperparameter tuning

Vertex AI can perform hyperparameter tuning on your ML training code. Learn more about how hyperparameter tuning on Vertex AI works and how to configure a HyperparameterTuningJob resource.

If you want to use hyperparameter tuning, your training code must do the following:

  • Parse command-line arguments representing the hyperparameters that you want to tune, and use the parsed values to set the hyperparameters for training.

  • Intermittently report the hyperparameter tuning metric to Vertex AI.

Parse command-line arguments

For hyperparameter tuning, Vertex AI runs your training code multiple times, with different command-line arguments each time. Your training code must parse these command-line arguments and use them as hyperparameters for training. For example, to tune your optimizer's learning rate, you might want to parse a command-line argument named --learning_rate. Learn how to configure which command-line arguments Vertex AI provides.

We recommend that you use Python's argparse library to parse command-line arguments.
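For example, a minimal sketch that parses the --learning_rate argument mentioned earlier; the optimizer shown is only an illustration:

import argparse

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)
args = parser.parse_args()

# Use the parsed hyperparameter to configure training.
optimizer = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)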

Report the hyperparameter tuning metric

Your training code must intermittently report the hyperparameter metric that you are trying to optimize to Vertex AI. For example, if you want to maximize your model's accuracy, you might want to report this metric at the end of every training epoch. Vertex AI uses this information to decide what hyperparameters to use for the next training trial. Learn more about selecting and specifying a hyperparameter tuning metric.

Use the cloudml-hypertune Python library to report the hyperparameter tuning metric. This library is included in all pre-built containers for training, and you can use pip to install it in a custom container.
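For example, a minimal sketch that reports a model's accuracy at the end of each epoch; accuracy and epoch are assumed to come from your own training loop:

import hypertune

hpt = hypertune.HyperTune()

# Report the metric that the HyperparameterTuningJob is configured to optimize.
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',
    metric_value=accuracy,
    global_step=epoch)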

To learn how to install and use this library, see the cloudml-hypertune GitHub repository, or refer to the Vertex AI: Hyperparameter Tuning codelab.

Write code for GPUs

You can select VMs with graphics processing units (GPUs) to run your custom training code. Learn more about configuring custom training to use GPU-enabled VMs.

If you want to train with GPUs, make sure your training code can take advantage of them. Depending on which ML framework you use, this might require changes to your code. For example, if you use TensorFlow Keras, you only need to adjust your code if you want to use more than one GPU. Some ML frameworks can't use GPUs at all.
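For example, if you use TensorFlow Keras with multiple GPUs, a minimal sketch of single-VM data parallelism looks like the following; build_model is a hypothetical function that constructs and compiles your model:

import tensorflow as tf

# MirroredStrategy replicates the model across all GPUs attached to the VM
# and synchronizes their gradients.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()  # hypothetical: build and compile your model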

In addition, make sure that your container supports GPUs: Select a pre-built container for training that supports GPUs, or install the NVIDIA CUDA Toolkit and NVIDIA cuDNN on your custom container. One way to do this is to use a base image from the nvidia/cuda Docker repository; another way is to use a Deep Learning Containers image as your base image.

Write code for distributed training

To train on large datasets, you can run your code on multiple VMs in a distributed cluster managed by Vertex AI. Learn how to configure multiple VMs for training.

Some ML frameworks, such as TensorFlow and PyTorch, let you run identical training code on multiple machines that automatically coordinate how to divide the work, based on environment variables set on each machine. Find out whether Vertex AI sets environment variables to make this possible for your ML framework.
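For example, if Vertex AI sets the TF_CONFIG environment variable for your TensorFlow cluster, a minimal sketch of multi-worker training looks like the following; build_model is again hypothetical:

import tensorflow as tf

# The strategy reads TF_CONFIG on each VM to discover the other workers and
# coordinate synchronous training across them.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = build_model()  # hypothetical: build and compile your model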

Alternatively, you can run a different container on each of several worker pools. A worker pool is a group of VMs that you configure to use the same compute options and container. In this case, you probably still want to rely on the environment variables set by Vertex AI to coordinate communication between the VMs. You can customize the training code of each worker pool to perform arbitrary tasks; how you do this depends on your goal and which ML framework you use.

Track and visualize custom training experiments using Vertex TensorBoard

Vertex TensorBoard is a managed version of TensorBoard, a Google open source project for visualizing machine learning experiments. With Vertex TensorBoard you can track, visualize, and compare ML experiments and then share them with your team.

To use Vertex TensorBoard with custom training, you must do the following:

  • Create a Vertex TensorBoard instance in your project to store your experiments.

  • Configure a service account to run the custom training job with appropriate permissions.

  • Adjust your custom training code to write TensorBoard-compatible logs to Cloud Storage, as sketched after this list.
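For the last requirement, a minimal sketch using the Keras TensorBoard callback and the AIP_TENSORBOARD_LOG_DIR environment variable; model and train_data are assumed to come from your own code:

import os

import tensorflow as tf

# Write TensorBoard-compatible logs to the Cloud Storage directory that
# Vertex AI provides.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=os.environ["AIP_TENSORBOARD_LOG_DIR"])

model.fit(train_data, epochs=10, callbacks=[tensorboard_callback])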

For a step-by-step guide on getting started with these requirements, see Using Vertex TensorBoard with custom training.

What's next