Perform custom training on Vertex AI to run your own machine learning (ML) training code in the cloud, instead of using AutoML. This document describes requirements to consider as you write training code.
Choose a training code structure
First, determine what structure you want your ML training code to take. You can provide training code to Vertex AI in one of the following forms:
A Python script to use with a prebuilt container. Use the Vertex AI SDK to create a custom job. This method lets you provide your training application as a single Python script.
A Python training application to use with a prebuilt container. Create a Python source distribution with code that trains an ML model and exports it to Cloud Storage. This training application can use any of the dependencies included in the prebuilt container that you plan to use it with.
Use this option if one of the Vertex AI prebuilt containers for training includes all the dependencies that you need for training. For example, if you want to train with PyTorch, scikit-learn, TensorFlow, or XGBoost, then this is likely the better option.
To learn about requirements specific to this option, read the guide to creating a Python training application.
A custom container image. Create a Docker container image with code that trains an ML model and exports it to Cloud Storage. Include any dependencies required by your code in the container image.
Use this option if you want to use dependencies that are not included in one of the Vertex AI prebuilt containers for training. For example, if you want to train using a Python ML framework that is not available in a prebuilt container, or if you want to train using a programming language other than Python, then this is the better option.
To learn about requirements specific to this option, read the guide to creating a custom container image.
The rest of this document describes requirements relevant to both training code structures.
Requirements for all custom training code
When you write custom training code for Vertex AI, keep in mind that the code will run on one or more virtual machine (VM) instances managed by Google Cloud. This section describes requirements applicable to all custom training code.
Access Google Cloud services in your code
Several of the following sections describe accessing other Google Cloud services from your code. To access Google Cloud services, write your training code to use Application Default Credentials (ADC). Many Google Cloud client libraries authenticate with ADC by default. You don't need to configure any environment variables; Vertex AI automatically configures ADC to authenticate as either the Vertex AI Custom Code Service Agent for your project (by default) or a custom service account (if you have configured one).
However, when you use a Google Cloud client library in your code, Vertex AI might not always connect to the correct Google Cloud project by default. If you encounter permission errors, connecting to the wrong project might be the problem.
This problem occurs because Vertex AI does not run your code directly in your Google Cloud project. Instead, Vertex AI runs your code in one of several separate projects managed by Google. Vertex AI uses these projects exclusively for operations related to your project. Therefore, don't try to infer a project ID from the environment in your training or prediction code; specify project IDs explicitly.
If you don't want to hardcode a project ID in your training code, you can use the CLOUD_ML_PROJECT_ID environment variable: Vertex AI sets this environment variable in every custom training container to contain the project number of the project where you initiated custom training. Many Google Cloud tools can accept a project number wherever they take a project ID.
For example, if you want to use the Python Client for Google BigQuery to access a BigQuery table in the same project, then do not try to infer the project in your training code:
Implicit project selection
```python
from google.cloud import bigquery

client = bigquery.Client()
```
Instead use code that explicitly selects a project:
Explicit project selection
```python
import os

from google.cloud import bigquery

project_number = os.environ["CLOUD_ML_PROJECT_ID"]
client = bigquery.Client(project=project_number)
```
If you encounter permission errors after configuring your code in this way, then read the following section about which resources your code can access to adjust the permissions available to your training code.
Which resources your code can access
By default, your training code has access to any resources available to the Vertex AI Custom Code Service Agent for your project. You can also configure Vertex AI so that your training code can access more or fewer resources.
For example, consider your training code's access to Cloud Storage resources:
By default, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing custom training. You can also grant Vertex AI access to Cloud Storage buckets in other projects, or you can precisely customize what buckets a specific job can access by using a custom service account.
Similar access rules and techniques apply to BigQuery tables and other Google Cloud resources. In general, you can change the resources available to your training code in one of the following ways:
Grant additional roles to the Vertex AI Custom Code Service Agent for your project.
Use a custom service account with the permissions you need.
If you want your training code to obtain an OAuth 2.0 access token with the https://www.googleapis.com/auth/cloud-platform scope, then you must use a custom service account. You can't give this level of access to the Vertex AI Custom Code Service Agent.
Read and write Cloud Storage files with Cloud Storage FUSE
In all custom training jobs, Vertex AI mounts the Cloud Storage buckets that you have access to in the /gcs/ directory of each training node's file system. As a convenient alternative to using the Python Client for Cloud Storage or another library, you can read from and write to Cloud Storage directly through the local file system. For example, to load data from gs://BUCKET/data.csv, you can use the following Python code:

```python
file = open('/gcs/BUCKET/data.csv', 'r')
```
Vertex AI uses Cloud Storage FUSE to mount the storage buckets. Note that directories mounted by Cloud Storage FUSE are not POSIX compliant.
The credentials that you are using for custom training determine which buckets you can access in this way. The preceding section about which resources your code can access describes exactly which buckets you can access by default and how to customize this access.
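Because the mount behaves like an ordinary directory, writing training output to Cloud Storage takes only standard file I/O. The following sketch assumes the default /gcs mount point; BUCKET is a placeholder for your bucket name, and the mount root is parameterized only so the same logic is easy to exercise outside Vertex AI:

```python
import os

def write_metrics(bucket, text, mount_root="/gcs"):
    """Write a text file to Cloud Storage through the FUSE mount.

    On a Vertex AI training node, mount_root is "/gcs" and the write
    lands in gs://<bucket>/output/metrics.txt. The file layout here is
    illustrative, not part of the service contract.
    """
    out_dir = os.path.join(mount_root, bucket, "output")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "metrics.txt")
    with open(path, "w") as f:
        f.write(text)
    return path
```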
Load input data
ML code usually operates on training data in order to train a model. Don't store training data together with your code, whether you create a Python training application or a custom container image. Storing data with code can lead to a poorly organized project, make it difficult to reuse code on different datasets, and cause errors for large datasets.
You can load data from a Vertex AI managed dataset or write your own code to load data from a source outside of Vertex AI, such as BigQuery or Cloud Storage.
For best performance when you load data from Cloud Storage, use a bucket in the region where you are performing custom training. To learn how to store data in Cloud Storage, read Creating storage buckets and Uploading objects.
To learn about which Cloud Storage buckets you can load data from, read the previous section about which resources your code can access.
To load data from Cloud Storage in your training code, use the Cloud Storage FUSE feature described in the preceding section, or use any library that supports ADC. You don't need to explicitly provide any authentication credentials in your code.
For example, you can use one of the client libraries demonstrated in the Cloud Storage guide to downloading objects. The Python Client for Cloud Storage, in particular, is included in prebuilt containers and also supports ADC.
Load a large dataset
Depending on which machine types you plan to use during custom training, your VMs might not be able to load the entirety of a large dataset into memory.
If you need to read data that is too large to fit in memory, stream the data or
read it incrementally. Different ML frameworks have different best practices for
doing this. For example, TensorFlow's tf.data API can stream TFRecord or text data from Cloud Storage.
Performing custom training on multiple VMs with data parallelism is another way to reduce the amount of data each VM loads into memory. See the Writing code for distributed training section of this document.
Export a trained ML model
ML code usually exports a trained model at the end of training in the form of one or more model artifacts. You can then use the model artifacts to get predictions.
After custom training completes, you can no longer access the VMs that ran your training code. Therefore, your training code must export model artifacts to a location outside of Vertex AI.
We recommend that you export model artifacts to a Cloud Storage bucket. As described in the previous section about which resources your code can access, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing custom training. Use a library that supports ADC to export your model artifacts. For example, TensorFlow's APIs for saving Keras models can export artifacts directly to a Cloud Storage path.
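If your framework doesn't save directly to a Cloud Storage path, you can serialize artifacts through the Cloud Storage FUSE mount described earlier. The following is a minimal sketch using pickle; the artifact filename and directory layout are illustrative, and the destination would typically be a path under /gcs/ on a Vertex AI training node:

```python
import os
import pickle

def export_model_artifacts(model, model_dir):
    """Serialize a trained model object to model_dir.

    On Vertex AI, model_dir might be a FUSE path such as
    "/gcs/BUCKET/model" (placeholder bucket name), so the pickle file
    lands in Cloud Storage. "model.pkl" is an illustrative name.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```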
If you want to use your trained model to serve predictions on Vertex AI, then your code must export model artifacts in a format compatible with one of the prebuilt containers for prediction. Learn more in the guide to exporting model artifacts for prediction.
Environment variables for special Cloud Storage directories
If you specify the baseOutputDirectory API field, Vertex AI sets the following environment variables when it runs your training code:
AIP_MODEL_DIR: a Cloud Storage URI of a directory intended for saving model artifacts.
AIP_CHECKPOINT_DIR: a Cloud Storage URI of a directory intended for saving checkpoints.
AIP_TENSORBOARD_LOG_DIR: a Cloud Storage URI of a directory intended for saving TensorBoard logs. See Using Vertex AI TensorBoard with custom training.
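Reading these directories from the environment keeps training code portable. A minimal sketch, assuming local fallback paths (which are illustrative, not part of the service contract) so the same script also runs outside Vertex AI:

```python
import os

def output_dirs():
    """Return the output directories Vertex AI provides, with local
    fallbacks for running outside a custom training job."""
    return {
        "model": os.environ.get("AIP_MODEL_DIR", "/tmp/model"),
        "checkpoints": os.environ.get("AIP_CHECKPOINT_DIR", "/tmp/checkpoints"),
        "tensorboard": os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/logs"),
    }
```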
The values of these environment variables differ slightly depending on whether you are using hyperparameter tuning. To learn more, see the API reference for baseOutputDirectory.

Using these environment variables makes it easy to reuse the same training code multiple times (for example, with different data or configuration options) and save model artifacts and checkpoints to different locations, just by changing the baseOutputDirectory API field. However, you are not required to use the environment variables in your code if you don't want to. For example, you can alternatively hardcode locations for saving checkpoints and exporting model artifacts.
Additionally, if you use a TrainingPipeline for custom training and do not specify the modelToUpload.artifactUri field, then Vertex AI uses the value of the AIP_MODEL_DIR environment variable for modelToUpload.artifactUri. (For hyperparameter tuning, Vertex AI uses the value of the AIP_MODEL_DIR environment variable from the best trial.)
Ensure resilience to restarts
The VMs that run your training code restart occasionally. For example, Google Cloud might need to restart a VM for maintenance reasons. When a VM restarts, Vertex AI starts running your code again from the beginning.
If you expect your training code to run for more than four hours, add several behaviors to your code to make it resilient to restarts:
Frequently export your training progress to Cloud Storage, at least once every four hours, so that you don't lose progress if your VMs restart.
At the start of your training code, check whether any training progress already exists in your export location. If so, load the saved training state instead of starting training from scratch.
Four hours is a guideline, not a hard limit. If ensuring resilience is a priority, consider adding these behaviors to your code even if you don't expect it to run for that long.
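The two behaviors above (periodic export, resume on startup) can be sketched framework-agnostically. The checkpoint format and path below are illustrative; on Vertex AI the checkpoint path would typically live under AIP_CHECKPOINT_DIR or a /gcs/ path:

```python
import json
import os

def load_state(ckpt_path):
    """Resume from a saved checkpoint if one exists, else start fresh."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)
    return {"epoch": 0}

def save_state(ckpt_path, state):
    """Persist training progress so a restart doesn't lose work."""
    os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)
    with open(ckpt_path, "w") as f:
        json.dump(state, f)

def train(ckpt_path, total_epochs=5):
    state = load_state(ckpt_path)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would happen here ...
        state["epoch"] = epoch + 1
        save_state(ckpt_path, state)  # export progress frequently
    return state["epoch"]
```

If the VM restarts mid-run, calling train() again picks up from the last saved epoch instead of epoch 0.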
How to accomplish these behaviors depends on which ML framework you use. For example, if you use TensorFlow Keras, learn how to use a checkpoint callback, such as tf.keras.callbacks.BackupAndRestore, for this purpose.
To learn more about how Vertex AI manages VMs, see Understand the custom training service.
Requirements for optional custom training features
If you want to use certain optional custom training features, you might need to make additional changes to your training code. This section describes code requirements for hyperparameter tuning, GPUs, distributed training, and Vertex AI TensorBoard.
Write code to enable autologging
You can enable autologging using the Vertex AI SDK for Python to automatically capture parameters and performance metrics when you submit the custom job. For details, see Run training job with experiment tracking.
Write code for hyperparameter tuning
Vertex AI can perform hyperparameter tuning on your ML training code. Learn more about how hyperparameter tuning on Vertex AI works and how to configure a hyperparameter tuning job.
If you want to use hyperparameter tuning, your training code must do the following:
Parse command-line arguments representing the hyperparameters that you want to tune, and use the parsed values to set the hyperparameters for training.
Intermittently report the hyperparameter tuning metric to Vertex AI.
Parse command-line arguments
For hyperparameter tuning, Vertex AI runs your training code
multiple times, with different command-line arguments each time. Your training
code must parse these command-line arguments and use them as hyperparameters for
training. For example, to tune your optimizer's learning rate, you might parse a command-line argument named learning_rate. Learn how to configure which command-line arguments Vertex AI passes to your training code.

We recommend that you use Python's argparse library to parse command-line arguments.
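For example, hyperparameters arriving as command-line flags can be parsed with argparse as follows; the flag names below are illustrative and must match the parameter names you configure in the hyperparameter tuning job:

```python
import argparse

def parse_hparams(argv=None):
    """Parse hyperparameters passed as command-line flags.

    Vertex AI invokes the training code with one flag per tuned
    hyperparameter; the names and defaults here are illustrative.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=32)
    return parser.parse_args(argv)
```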
Report the hyperparameter tuning metric
Your training code must intermittently report the hyperparameter metric that you are trying to optimize to Vertex AI. For example, if you want to maximize your model's accuracy, you might want to report this metric at the end of every training epoch. Vertex AI uses this information to decide what hyperparameters to use for the next training trial. Learn more about selecting and specifying a hyperparameter tuning metric.
Use the cloudml-hypertune Python library to report the hyperparameter tuning metric. This library is included in all prebuilt containers for training, and you can use pip to install it in a custom container.

To learn how to install and use this library, see the cloudml-hypertune GitHub repository, or refer to the Vertex AI: Hyperparameter Tuning codelab.
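Reporting the metric with cloudml-hypertune looks like the following sketch. The import guard and fallback print are conveniences so the script stays runnable where the library isn't installed; the metric tag must match the one you specify in the tuning job configuration:

```python
def report_metric(tag, value, step):
    """Report a hyperparameter tuning metric to Vertex AI.

    Uses the cloudml-hypertune library when available (it is
    preinstalled in the prebuilt training containers); otherwise just
    logs the value and returns False.
    """
    try:
        import hypertune  # provided by the cloudml-hypertune package
    except ImportError:
        print(f"{tag}={value} (step {step}); hypertune not installed")
        return False
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=tag,
        metric_value=value,
        global_step=step,
    )
    return True
```

You might call report_metric("accuracy", val_accuracy, epoch) at the end of every training epoch.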
Write code for GPUs
You can select VMs with graphics processing units (GPUs) to run your custom training code. Learn more about configuring custom training to use GPU-enabled VMs.
If you want to train with GPUs, make sure your training code can take advantage of them. Depending on which ML framework you use, this might require changes to your code. For example, if you use TensorFlow Keras, you only need to adjust your code if you want to use more than one GPU. Some ML frameworks can't use GPUs at all.
In addition, make sure that your container supports GPUs: select a prebuilt container for training that supports GPUs, or install the NVIDIA CUDA Toolkit and NVIDIA cuDNN on your custom container. One way to do this is to use a base image from the nvidia/cuda repository; another way is to use a Deep Learning Containers instance as your base image.
Write code for distributed training
To train on large datasets, you can run your code on multiple VMs in a distributed cluster managed by Vertex AI. Learn how to configure multiple VMs for training.
Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines which automatically coordinate how to divide the work based on environment variables set on each machine. Find out if Vertex AI sets environment variables to make this possible for your desired ML framework.
Alternatively, you can run a different container on each of several worker pools. A worker pool is a group of VMs that you configure to use the same compute options and container. In this case, you still probably want to rely on the environment variables set by Vertex AI to coordinate communication between the VMs. You can customize the training code of each worker pool to perform whatever arbitrary tasks you want; how you do this depends on your goal and which ML framework you use.
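One of the environment variables Vertex AI sets on each replica is CLUSTER_SPEC, a JSON description of the cluster and of the current replica's role. The sketch below assumes a layout with "cluster" and "task" fields, as described in the distributed training documentation; verify the current schema before relying on it:

```python
import json
import os

def replica_info():
    """Parse the CLUSTER_SPEC environment variable on a training node.

    Returns which worker pool this replica belongs to, its index
    within the pool, and the full cluster layout. Field names follow
    the distributed-training documentation; treat this as a sketch.
    """
    spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))
    task = spec.get("task", {})
    return {
        "pool": task.get("type"),    # e.g. "workerpool0"
        "index": task.get("index"),  # replica index within the pool
        "cluster": spec.get("cluster", {}),
    }
```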
Track and visualize custom training experiments using Vertex AI TensorBoard
Vertex AI TensorBoard is a managed version of TensorBoard, a Google open source project for visualizing machine learning experiments. With Vertex AI TensorBoard you can track, visualize, and compare ML experiments and then share them with your team. You can also use TensorBoard Profiler to pinpoint and fix performance bottlenecks to train models faster and cheaper.
To use Vertex AI TensorBoard with custom training, you must do the following:
Create a Vertex AI TensorBoard instance in your project to store your experiments.
Configure a service account to run the custom training job with appropriate permissions.
Adjust your custom training code to write out TensorBoard compatible logs to Cloud Storage.
For a step-by-step guide on getting started with these requirements, see Using Vertex AI TensorBoard with custom training.
Learn the details of creating a Python training application to use with a prebuilt container or creating a custom container image.
If you aren't sure that you want to perform custom training, read a comparison of custom training and AutoML.