Perform custom training on Vertex AI to run your own machine learning (ML) training code in the cloud, instead of using AutoML. This document describes requirements to consider as you write training code.
Choose a training code structure
First, determine what structure you want your ML training code to take. You can provide training code to Vertex AI in one of the following forms:
A Python script to use with a prebuilt container. Use the Vertex AI SDK to create a custom job. This method lets you provide your training application as a single Python script.
A Python training application to use with a prebuilt container. Create a Python source distribution with code that trains an ML model and exports it to Cloud Storage. This training application can use any of the dependencies included in the prebuilt container that you plan to use it with.
Use this option if one of the Vertex AI prebuilt containers for training includes all the dependencies that you need for training. For example, if you want to train with PyTorch, scikit-learn, TensorFlow, or XGBoost, then this is likely the better option.
To learn about requirements specific to this option, read the guide to creating a Python training application.
A custom container image. Create a Docker container image with code that trains an ML model and exports it to Cloud Storage. Include any dependencies required by your code in the container image.
Use this option if you want to use dependencies that are not included in one of the Vertex AI prebuilt containers for training. For example, if you want to train using a Python ML framework that is not available in a prebuilt container, or if you want to train using a programming language other than Python, then this is the better option.
To learn about requirements specific to this option, read the guide to creating a custom container image.
The rest of this document describes requirements relevant to both training code structures.
Requirements for all custom training code
When you write custom training code for Vertex AI, keep in mind that the code will run on one or more virtual machine (VM) instances managed by Google Cloud. This section describes requirements applicable to all custom training code.
Access Google Cloud services in your code
Several of the following sections describe accessing other Google Cloud services from your code. To access Google Cloud services, write your training code to use Application Default Credentials (ADC). Many Google Cloud client libraries authenticate with ADC by default. You don't need to configure any environment variables; Vertex AI automatically configures ADC to authenticate as either the Vertex AI Custom Code Service Agent for your project (by default) or a custom service account (if you have configured one).
However, when you use a Google Cloud client library in your code, Vertex AI might not always connect to the correct Google Cloud project by default. If you encounter permission errors, connecting to the wrong project might be the problem.
This problem occurs because Vertex AI does not run your code directly in your Google Cloud project. Instead, Vertex AI runs your code in one of several separate projects managed by Google. Vertex AI uses these projects exclusively for operations related to your project. Therefore, don't try to infer a project ID from the environment in your training or prediction code; specify project IDs explicitly.
If you don't want to hardcode a project ID in your training code, you can use the CLOUD_ML_PROJECT_ID environment variable: Vertex AI sets this environment variable in every custom training container to contain the project number of the project where you initiated custom training. Many Google Cloud tools can accept a project number wherever they take a project ID.
For example, if you want to use the Python Client for Google BigQuery to access a BigQuery table in the same project, then do not try to infer the project in your training code:
Implicit project selection
```python
from google.cloud import bigquery

client = bigquery.Client()
```
Instead use code that explicitly selects a project:
Explicit project selection
```python
import os

from google.cloud import bigquery

project_number = os.environ["CLOUD_ML_PROJECT_ID"]
client = bigquery.Client(project=project_number)
```
If you encounter permission errors after configuring your code in this way, then read the following section about which resources your code can access to adjust the permissions available to your training code.
Which resources your code can access
By default, your training code has access to any resources available to the Vertex AI Custom Code Service Agent for your project. You can also configure Vertex AI so that your training code can access more or fewer resources.
For example, consider your training code's access to Cloud Storage resources:
By default, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing custom training. You can also grant Vertex AI access to Cloud Storage buckets in other projects, or you can precisely customize what buckets a specific job can access by using a custom service account.
Similar access rules and techniques apply to BigQuery tables and other Google Cloud resources. In general, you can change the resources available to your training code in one of the following ways:
Grant additional roles to the Vertex AI Custom Code Service Agent for your project.
Use a custom service account with the permissions you need.
If you want your training code to obtain an OAuth 2.0 access token with the https://www.googleapis.com/auth/cloud-platform scope, then you must use a custom service account. You can't give this level of access to the Vertex AI Custom Code Service Agent.
Read and write Cloud Storage files with Cloud Storage FUSE
In all custom training jobs, Vertex AI mounts the Cloud Storage buckets that you have access to in the /gcs/ directory of each training node's file system. As a convenient alternative to using the Python Client for Cloud Storage or another library, you can read from and write to Cloud Storage directly through the local file system. For example, to load data from gs://BUCKET/data.csv, you can use the following Python code:

```python
file = open('/gcs/BUCKET/data.csv', 'r')
```
Vertex AI uses Cloud Storage FUSE to mount the storage buckets. Note that directories mounted by Cloud Storage FUSE are not POSIX compliant.
The credentials that you are using for custom training determine which buckets you can access in this way. The preceding section about which resources your code can access describes exactly which buckets you can access by default and how to customize this access.
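Because the mount behaves like an ordinary directory, writing training output to Cloud Storage takes only standard file I/O. The following sketch assumes the default /gcs mount point; BUCKET is a placeholder for your bucket name, and the mount root is parameterized only so the same logic is easy to exercise outside Vertex AI:

```python
import os

def write_metrics(bucket, text, mount_root="/gcs"):
    """Write a text file to Cloud Storage through the FUSE mount.

    On a Vertex AI training node, mount_root is "/gcs" and the write
    lands in gs://<bucket>/output/metrics.txt. The file layout here is
    illustrative, not part of the service contract.
    """
    out_dir = os.path.join(mount_root, bucket, "output")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "metrics.txt")
    with open(path, "w") as f:
        f.write(text)
    return path
```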
Load input data
ML code usually operates on training data in order to train a model. Don't store training data together with your code, whether you create a Python training application or a custom container image. Storing data with code can lead to a poorly organized project, make it difficult to reuse code on different datasets, and cause errors for large datasets.
You can load data from a Vertex AI managed dataset or write your own code to load data from a source outside of Vertex AI, such as BigQuery or Cloud Storage.
For best performance when you load data from Cloud Storage, use a bucket in the region where you are performing custom training. To learn how to store data in Cloud Storage, read Creating storage buckets and Uploading objects.
To learn about which Cloud Storage buckets you can load data from, read the previous section about which resources your code can access.
To load data from Cloud Storage in your training code, use the Cloud Storage FUSE feature described in the preceding section, or use any library that supports ADC. You don't need to explicitly provide any authentication credentials in your code.
For example, you can use one of the client libraries demonstrated in the Cloud Storage guide to downloading objects. The Python Client for Cloud Storage, in particular, is included in prebuilt containers and also supports ADC.
Load a large dataset
Depending on which machine types you plan to use during custom training, your VMs might not be able to load the entirety of a large dataset into memory.
If you need to read data that is too large to fit in memory, stream the data or
read it incrementally. Different ML frameworks have different best practices for
doing this. For example, TensorFlow's tf.data API can stream TFRecord or text data from Cloud Storage.
Performing custom training on multiple VMs with data parallelism is another way to reduce the amount of data each VM loads into memory. See the Writing code for distributed training section of this document.
Export a trained ML model
ML code usually exports a trained model at the end of training in the form of one or more model artifacts. You can then use the model artifacts to get predictions.
After custom training completes, you can no longer access the VMs that ran your training code. Therefore, your training code must export model artifacts to a location outside of Vertex AI.
We recommend that you export model artifacts to a Cloud Storage bucket. As described in the previous section about which resources your code can access, Vertex AI can access any Cloud Storage bucket in the Google Cloud project where you are performing custom training. Use a library that supports ADC to export your model artifacts. For example, TensorFlow's APIs for saving Keras models can export artifacts directly to a Cloud Storage path.
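If your framework doesn't save directly to a Cloud Storage path, you can serialize artifacts through the Cloud Storage FUSE mount described earlier. The following is a minimal sketch using pickle; the artifact filename and directory layout are illustrative, and the destination would typically be a path under /gcs/ on a Vertex AI training node:

```python
import os
import pickle

def export_model_artifacts(model, model_dir):
    """Serialize a trained model object to model_dir.

    On Vertex AI, model_dir might be a FUSE path such as
    "/gcs/BUCKET/model" (placeholder bucket name), so the pickle file
    lands in Cloud Storage. "model.pkl" is an illustrative name.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```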
If you want to use your trained model to serve predictions on Vertex AI, then your code must export model artifacts in a format compatible with one of the prebuilt containers for prediction. Learn more in the guide to exporting model artifacts for prediction.
Environment variables for special Cloud Storage directories
If you specify the baseOutputDirectory API field, Vertex AI sets the following environment variables when it runs your training code:
AIP_MODEL_DIR: a Cloud Storage URI of a directory intended for saving model artifacts.
AIP_CHECKPOINT_DIR: a Cloud Storage URI of a directory intended for saving checkpoints.
AIP_TENSORBOARD_LOG_DIR: a Cloud Storage URI of a directory intended for saving TensorBoard logs. See Using Vertex AI TensorBoard with custom training.
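Reading these directories from the environment keeps training code portable. A minimal sketch, assuming local fallback paths (which are illustrative, not part of the service contract) so the same script also runs outside Vertex AI:

```python
import os

def output_dirs():
    """Return the output directories Vertex AI provides, with local
    fallbacks for running outside a custom training job."""
    return {
        "model": os.environ.get("AIP_MODEL_DIR", "/tmp/model"),
        "checkpoints": os.environ.get("AIP_CHECKPOINT_DIR", "/tmp/checkpoints"),
        "tensorboard": os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/logs"),
    }
```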
The values of these environment variables differ slightly depending on whether you are using hyperparameter tuning. To learn more, see the API reference for baseOutputDirectory.

Using these environment variables makes it easy to reuse the same training code multiple times (for example, with different data or configuration options) and save model artifacts and checkpoints to different locations, just by changing the baseOutputDirectory API field. However, you are not required to use the environment variables in your code if you don't want to. For example, you can alternatively hardcode locations for saving checkpoints and exporting model artifacts.
Additionally, if you use a TrainingPipeline for custom training and do not specify the modelToUpload.artifactUri field, then Vertex AI uses the value of the AIP_MODEL_DIR environment variable for modelToUpload.artifactUri. (For hyperparameter tuning, Vertex AI uses the value of the AIP_MODEL_DIR environment variable from the best trial.)
Ensure resilience to restarts
The VMs that run your training code restart occasionally. For example, Google Cloud might need to restart a VM for maintenance reasons. When a VM restarts, Vertex AI starts running your code again from the beginning.
If you expect your training code to run for more than four hours, add several behaviors to your code to make it resilient to restarts:
Frequently export your training progress to Cloud Storage, at least once every four hours, so that you don't lose progress if your VMs restart.
At the start of your training code, check whether any training progress already exists in your export location. If so, load the saved training state instead of starting training from scratch.
Four hours is a guideline, not a hard limit. If ensuring resilience is a priority, consider adding these behaviors to your code even if you don't expect it to run for that long.
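The two behaviors above (periodic export, resume on startup) can be sketched framework-agnostically. The checkpoint format and path below are illustrative; on Vertex AI the checkpoint path would typically live under AIP_CHECKPOINT_DIR or a /gcs/ path:

```python
import json
import os

def load_state(ckpt_path):
    """Resume from a saved checkpoint if one exists, else start fresh."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)
    return {"epoch": 0}

def save_state(ckpt_path, state):
    """Persist training progress so a restart doesn't lose work."""
    os.makedirs(os.path.dirname(ckpt_path), exist_ok=True)
    with open(ckpt_path, "w") as f:
        json.dump(state, f)

def train(ckpt_path, total_epochs=5):
    state = load_state(ckpt_path)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would happen here ...
        state["epoch"] = epoch + 1
        save_state(ckpt_path, state)  # export progress frequently
    return state["epoch"]
```

If the VM restarts mid-run, calling train() again picks up from the last saved epoch instead of epoch 0.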
How to accomplish these behaviors depends on which ML framework you use. For example, if you use TensorFlow Keras, learn how to use a checkpoint callback, such as tf.keras.callbacks.BackupAndRestore, for this purpose.
To learn more about how Vertex AI manages VMs, see Understand the custom training service.
Requirements for optional custom training features
If you want to use certain optional custom training features, you might need to make additional changes to your training code. This section describes code requirements for hyperparameter tuning, GPUs, distributed training, and Vertex AI TensorBoard.
Write code to enable autologging
You can enable autologging using the Vertex AI SDK for Python to automatically capture parameters and performance metrics when you submit the custom job. For details, see Run training job with experiment tracking.
Write code for hyperparameter tuning
Vertex AI can perform hyperparameter tuning on your ML training code. Learn more about how hyperparameter tuning on Vertex AI works and how to configure a hyperparameter tuning job.
If you want to use hyperparameter tuning, your training code must do the following:
Parse command-line arguments representing the hyperparameters that you want to tune, and use the parsed values to set the hyperparameters for training.
Intermittently report the hyperparameter tuning metric to Vertex AI.
Parse command-line arguments
For hyperparameter tuning, Vertex AI runs your training code
multiple times, with different command-line arguments each time. Your training
code must parse these command-line arguments and use them as hyperparameters for
training. For example, to tune your optimizer's learning rate, you might parse a command-line argument named learning_rate. Learn how to configure which command-line arguments Vertex AI passes to your training code.

We recommend that you use Python's argparse library to parse command-line arguments.
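For example, hyperparameters arriving as command-line flags can be parsed with argparse as follows; the flag names below are illustrative and must match the parameter names you configure in the hyperparameter tuning job:

```python
import argparse

def parse_hparams(argv=None):
    """Parse hyperparameters passed as command-line flags.

    Vertex AI invokes the training code with one flag per tuned
    hyperparameter; the names and defaults here are illustrative.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=32)
    return parser.parse_args(argv)
```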
Report the hyperparameter tuning metric
Your training code must intermittently report the hyperparameter metric that you are trying to optimize to Vertex AI. For example, if you want to maximize your model's accuracy, you might want to report this metric at the end of every training epoch. Vertex AI uses this information to decide what hyperparameters to use for the next training trial. Learn more about selecting and specifying a hyperparameter tuning metric.
Use the cloudml-hypertune Python library to report the hyperparameter tuning metric. This library is included in all prebuilt containers for training, and you can use pip to install it in a custom container.

To learn how to install and use this library, see the cloudml-hypertune GitHub repository, or refer to the Vertex AI: Hyperparameter Tuning codelab.
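Reporting the metric with cloudml-hypertune looks like the following sketch. The import guard and fallback print are conveniences so the script stays runnable where the library isn't installed; the metric tag must match the one you specify in the tuning job configuration:

```python
def report_metric(tag, value, step):
    """Report a hyperparameter tuning metric to Vertex AI.

    Uses the cloudml-hypertune library when available (it is
    preinstalled in the prebuilt training containers); otherwise just
    logs the value and returns False.
    """
    try:
        import hypertune  # provided by the cloudml-hypertune package
    except ImportError:
        print(f"{tag}={value} (step {step}); hypertune not installed")
        return False
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=tag,
        metric_value=value,
        global_step=step,
    )
    return True
```

You might call report_metric("accuracy", val_accuracy, epoch) at the end of every training epoch.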
Write code for GPUs
You can select VMs with graphics processing units (GPUs) to run your custom training code. Learn more about configuring custom training to use GPU-enabled VMs.
If you want to train with GPUs, make sure your training code can take advantage of them. Depending on which ML framework you use, this might require changes to your code. For example, if you use TensorFlow Keras, you only need to adjust your code if you want to use more than one GPU. Some ML frameworks can't use GPUs at all.
In addition, make sure that your container supports GPUs: select a prebuilt container for training that supports GPUs, or install the NVIDIA CUDA Toolkit and NVIDIA cuDNN on your custom container. One way to do this is to use a base image from the nvidia/cuda repository; another way is to use a Deep Learning Containers instance as your base image.
Write code for distributed training
To train on large datasets, you can run your code on multiple VMs in a distributed cluster managed by Vertex AI. Learn how to configure multiple VMs for training.
Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines which automatically coordinate how to divide the work based on environment variables set on each machine. Find out if Vertex AI sets environment variables to make this possible for your desired ML framework.
Alternatively, you can run a different container on each of several worker pools. A worker pool is a group of VMs that you configure to use the same compute options and container. In this case, you still probably want to rely on the environment variables set by Vertex AI to coordinate communication between the VMs. You can customize the training code of each worker pool to perform whatever arbitrary tasks you want; how you do this depends on your goal and which ML framework you use.
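One of the environment variables Vertex AI sets on each replica is CLUSTER_SPEC, a JSON description of the cluster and of the current replica's role. The sketch below assumes a layout with "cluster" and "task" fields, as described in the distributed training documentation; verify the current schema before relying on it:

```python
import json
import os

def replica_info():
    """Parse the CLUSTER_SPEC environment variable on a training node.

    Returns which worker pool this replica belongs to, its index
    within the pool, and the full cluster layout. Field names follow
    the distributed-training documentation; treat this as a sketch.
    """
    spec = json.loads(os.environ.get("CLUSTER_SPEC", "{}"))
    task = spec.get("task", {})
    return {
        "pool": task.get("type"),    # e.g. "workerpool0"
        "index": task.get("index"),  # replica index within the pool
        "cluster": spec.get("cluster", {}),
    }
```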
Track and visualize custom training experiments using Vertex AI TensorBoard
Vertex AI TensorBoard is a managed version of TensorBoard, a Google open source project for visualizing machine learning experiments. With Vertex AI TensorBoard you can track, visualize, and compare ML experiments and then share them with your team. You can also use TensorBoard Profiler to pinpoint and fix performance bottlenecks to train models faster and cheaper.
To use Vertex AI TensorBoard with custom training, you must do the following:
Create a Vertex AI TensorBoard instance in your project to store your experiments.
Configure a service account to run the custom training job with appropriate permissions.
Adjust your custom training code to write out TensorBoard compatible logs to Cloud Storage.
For a step-by-step guide on getting started with these requirements, see Using Vertex AI TensorBoard with custom training.
Learn the details of creating a Python training application to use with a prebuilt container or creating a custom container image.
If you aren't sure that you want to perform custom training, read a comparison of custom training and AutoML.