Use Vertex AI TensorBoard with custom training

If you use custom training to train models, you can set up your training job to automatically upload its TensorBoard logs to Vertex AI TensorBoard. For existing Vertex AI TensorBoard users, this requires no change to your model training code.

You can use this integration to monitor your training in near real time, because Vertex AI TensorBoard streams in the logs as they are written to Cloud Storage.

Follow the steps below to use Vertex AI TensorBoard with custom training.

Create a service account with required permissions

The Vertex AI TensorBoard integration with custom training requires that you attach a service account to the training job.

  1. Create a service account:

    gcloud --project=PROJECT_NAME iam service-accounts create USER_SA_NAME
    

    Replace the following:

    • PROJECT_NAME: the ID of the project in which you are creating the service account

    • USER_SA_NAME: a unique name for the service account you are creating

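  2. The Vertex AI Training Service uses the new service account to access Google Cloud services and resources. If the account does not already have them, use the following commands to grant it the Storage Admin (roles/storage.admin) and Vertex AI User (roles/aiplatform.user) roles: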

    SA_EMAIL="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com"
    
    gcloud projects add-iam-policy-binding PROJECT_NAME \
       --member="serviceAccount:${SA_EMAIL}" \
       --role="roles/storage.admin"
    
    gcloud projects add-iam-policy-binding PROJECT_NAME \
       --member="serviceAccount:${SA_EMAIL}" \
       --role="roles/aiplatform.user"
    

Create a Cloud Storage bucket to store Vertex AI TensorBoard logs

A Cloud Storage bucket is required to store the Vertex AI TensorBoard logs that your training script generates. The bucket must be regional (that is, not multi-region or dual-region), and the following resources must be in the same region:

  • the Cloud Storage bucket
  • the Vertex AI training job
  • the Vertex AI TensorBoard instance

You can use an existing bucket instead of following the bucket creation step below.

REGION=REGION
GCS_BUCKET_NAME="PROJECT_NAME-tensorboard-logs-${REGION}"
gsutil mb -l ${REGION} "gs://${GCS_BUCKET_NAME}"

Replace REGION with the region in which your Vertex AI TensorBoard instance was created, for example us-central1, and PROJECT_NAME with the project you used when creating the service account.
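The same-region requirement can be checked before a job is submitted. The following is a minimal sketch, not part of the Vertex AI tooling: the helper names tensorboard_region and same_region are hypothetical, and the resource name layout is the one given for the tensorboard field later in this guide. The comparison is case-insensitive on the assumption that Cloud Storage may report bucket locations in upper case.

```python
import re

# Layout taken from this guide:
# projects/PROJECT_NUMBER_OR_ID/locations/REGION/tensorboards/TENSORBOARD_INSTANCE_ID
_TENSORBOARD_NAME = re.compile(
    r"^projects/[^/]+/locations/(?P<region>[^/]+)/tensorboards/[^/]+$"
)


def tensorboard_region(resource_name: str) -> str:
    """Extract the region segment from a TensorBoard resource name."""
    match = _TENSORBOARD_NAME.match(resource_name)
    if match is None:
        raise ValueError(f"Unexpected TensorBoard resource name: {resource_name!r}")
    return match.group("region")


def same_region(bucket_location: str, job_region: str, tensorboard_name: str) -> bool:
    """Return True when the bucket, the job, and the TensorBoard share one region."""
    regions = {
        bucket_location.lower(),
        job_region.lower(),
        tensorboard_region(tensorboard_name).lower(),
    }
    return len(regions) == 1
```

A multi-region bucket location such as US never equals a region such as us-central1, so this check also rejects multi-region buckets as a side effect.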

Changes to your training script

Your training script must write its TensorBoard logs to a Cloud Storage location. The Vertex AI Training Service automatically makes this location available to the script through the predefined environment variable AIP_TENSORBOARD_LOG_DIR.

This can usually be done by providing os.environ['AIP_TENSORBOARD_LOG_DIR'] as the log directory to the open source TensorBoard log writing APIs.

For example, in TensorFlow 2.x, create the TensorBoard callback as follows:

import os
import tensorflow as tf

tensorboard_callback = tf.keras.callbacks.TensorBoard(
  log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
  histogram_freq=1)
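When the same script is also run outside of Vertex AI, AIP_TENSORBOARD_LOG_DIR is not set and indexing os.environ directly raises a KeyError. One possible way to handle this is sketched below; the helper name resolve_log_dir and the local default directory "logs" are assumptions made for illustration, not part of Vertex AI.

```python
import os


def resolve_log_dir(default: str = "logs") -> str:
    """Return the TensorBoard log directory for the current environment."""
    # Vertex AI sets AIP_TENSORBOARD_LOG_DIR inside the training job.
    # If it is unset or empty, fall back to the (hypothetical) local default.
    return os.environ.get("AIP_TENSORBOARD_LOG_DIR") or default
```

The returned value can then be passed as the log directory to whichever TensorBoard log writer the script uses.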

For more examples, see the TensorBoard open source docs.

Learn more about how Vertex AI sets environment variables in your custom training environment.

Create a custom training job

The following example request shows how to create your own custom training job.

For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.

The following three fields are required:

  1. base_output_directory (BASE_OUTPUT_DIR): the Cloud Storage path where all the output of the training is written.
  2. serviceAccount (USER_SA_EMAIL): the service account created in the previous steps, or your own service account.
  3. tensorboard (TENSORBOARD_INSTANCE_NAME): the full resource name of the Vertex AI TensorBoard instance that stores your Vertex AI TensorBoard logs, in the form projects/PROJECT_NUMBER_OR_ID/locations/REGION/tensorboards/TENSORBOARD_INSTANCE_ID.

# Fill in *all* the following environment variables for your project.
REGION=REGION
PROJECT_NAME=PROJECT_NAME
# Full resource name of the TensorBoard instance:
# projects/PROJECT_NUMBER_OR_ID/locations/REGION/tensorboards/TENSORBOARD_INSTANCE_ID
TENSORBOARD_INSTANCE_NAME=TENSORBOARD_INSTANCE
GCS_BUCKET_NAME="${PROJECT_NAME}-tensorboard-logs-${REGION}"
USER_SA_EMAIL="USER_SA_NAME@${PROJECT_NAME}.iam.gserviceaccount.com"
TRAINING_CONTAINER=TRAINING_CONTAINER

ENDPOINT="$REGION-aiplatform.googleapis.com"
INVOCATION_TIMESTAMP="$(date  +'%Y%m%d-%H%M%S')"
JOB_NAME="tensorboard-example-job-${INVOCATION_TIMESTAMP}"
BASE_OUTPUT_DIR="gs://$GCS_BUCKET_NAME/$JOB_NAME"

# The JSON body is read from stdin (-d @-) so it can use plain double quotes.
curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @- \
  "https://${ENDPOINT}/v1/projects/${PROJECT_NAME}/locations/${REGION}/customJobs" <<EOF
{
  "displayName": "${JOB_NAME}",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": "${TRAINING_CONTAINER}"
        }
      }
    ],
    "base_output_directory": {
      "output_uri_prefix": "${BASE_OUTPUT_DIR}"
    },
    "serviceAccount": "${USER_SA_EMAIL}",
    "tensorboard": "${TENSORBOARD_INSTANCE_NAME}"
  }
}
EOF
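If you prefer to assemble the request body in code rather than in a shell string, the same structure can be built as a dictionary and serialized with a JSON library, which avoids quoting mistakes. The following is a minimal sketch under that assumption: the function names are hypothetical, the keys and the machine type mirror the request above, and nothing is sent to the API.

```python
import json


def tensorboard_resource_name(project: str, region: str, instance_id: str) -> str:
    """Build the full TensorBoard resource name in the form used by this guide."""
    return f"projects/{project}/locations/{region}/tensorboards/{instance_id}"


def build_custom_job_body(
    display_name: str,
    image_uri: str,
    base_output_dir: str,
    service_account: str,
    tensorboard: str,
) -> dict:
    """Return a request body with the three required fields filled in."""
    return {
        "displayName": display_name,
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "replicaCount": "1",
                    "machineSpec": {"machineType": "n1-standard-8"},
                    "containerSpec": {"imageUri": image_uri},
                }
            ],
            # The three fields required for the TensorBoard integration.
            "base_output_directory": {"output_uri_prefix": base_output_dir},
            "serviceAccount": service_account,
            "tensorboard": tensorboard,
        },
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    body = build_custom_job_body(
        display_name="tensorboard-example-job",
        image_uri="TRAINING_CONTAINER",
        base_output_dir="gs://GCS_BUCKET_NAME/tensorboard-example-job",
        service_account="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com",
        tensorboard=tensorboard_resource_name(
            "PROJECT_NAME", "us-central1", "TENSORBOARD_INSTANCE_ID"
        ),
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and passed to curl with -d @FILE in place of an inline body.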

View your Vertex AI TensorBoard in the Google Cloud Console

To view the Vertex AI TensorBoard associated with a training job:

  1. Navigate to the Training page in the Vertex AI section of the Google Cloud Console.

  2. Select the Custom Jobs tab.

  3. Select the training job to view the Training Detail page.

  4. Click the Open TensorBoard button at the top of the page.

Alternatively, you can navigate to the Experiments page and view the list of all experiments. Your experiment will have the same name as the training job.
