Use Vertex AI TensorBoard with custom training

When you use custom training to train models, you can set up your training job to automatically upload your TensorBoard logs to Vertex AI TensorBoard. For existing TensorBoard users, this requires no change to your model training code.

You can use this integration to monitor your training in near real time, because Vertex AI TensorBoard streams in the logs as they are written to Cloud Storage.

Follow the steps below to use Vertex AI TensorBoard with custom training.

Create a service account with required permissions

The Vertex AI TensorBoard integration with custom training requires attaching a service account.

  1. Create a service account:

    gcloud --project=PROJECT_ID iam service-accounts create USER_SA_NAME
    

    Replace the following:

    • PROJECT_ID: the ID of the project in which you are creating a service account

    • USER_SA_NAME: a unique name for the service account you're creating

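  2. The Vertex AI Training Service uses the new service account to access Google Cloud services and resources. If the account does not already have them, grant it the Storage Admin (roles/storage.admin) and Vertex AI User (roles/aiplatform.user) roles with the following commands: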

    SA_EMAIL="USER_SA_NAME@PROJECT_ID.iam.gserviceaccount.com"
    
    gcloud projects add-iam-policy-binding PROJECT_ID \
       --member="serviceAccount:${SA_EMAIL}" \
       --role="roles/storage.admin"
    
    gcloud projects add-iam-policy-binding PROJECT_ID \
       --member="serviceAccount:${SA_EMAIL}" \
       --role="roles/aiplatform.user"
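The two role bindings above follow one pattern, so they can be generated instead of typed by hand. The following Python sketch only assembles the gcloud command lines shown in this section; it does not run them. The helper names service_account_email and role_binding_commands are hypothetical and chosen for illustration.

```python
import shlex
from typing import List

# Roles granted in the steps above.
REQUIRED_ROLES = ("roles/storage.admin", "roles/aiplatform.user")


def service_account_email(project_id: str, sa_name: str) -> str:
    """Build the email address of a user-managed service account."""
    return f"{sa_name}@{project_id}.iam.gserviceaccount.com"


def role_binding_commands(project_id: str, sa_name: str) -> List[str]:
    """Return one gcloud command string per required role (not executed)."""
    member = f"serviceAccount:{service_account_email(project_id, sa_name)}"
    commands = []
    for role in REQUIRED_ROLES:
        argv = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        # shlex.join quotes any argument that needs it for a POSIX shell.
        commands.append(shlex.join(argv))
    return commands
```

Printing the result of role_binding_commands("my-project", "tb-training-sa") lets you review the commands before pasting them into a shell. The project and account names here are placeholders.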
    

Create a Cloud Storage bucket to store Vertex AI TensorBoard logs

A Cloud Storage bucket is required to store the Vertex AI TensorBoard logs that your training script generates. The bucket must be regional (that is, not multi-region or dual-region), and the following resources must be in the same region:

  • the Cloud Storage bucket
  • the Vertex AI training job
  • the Vertex AI TensorBoard instance

You can use an existing bucket instead of following the bucket creation step below. An existing bucket must be in the same region as your Vertex AI TensorBoard instance.

LOCATION=LOCATION_ID
GCS_BUCKET_NAME="PROJECT_ID-tensorboard-logs-${LOCATION}"
gsutil mb -l "${LOCATION}" "gs://${GCS_BUCKET_NAME}"

Replace LOCATION_ID with the region that your Vertex AI TensorBoard instance was created in, for example us-central1, and replace PROJECT_ID with your project ID.
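The same-region requirement is easy to get wrong when the bucket, the job, and the TensorBoard instance are configured in different places. The following Python sketch checks it offline from values you already know. It assumes the TensorBoard resource name has the form projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID used in this guide. The function names are hypothetical, and the case-insensitive comparison is a precaution in case a tool reports the bucket location in upper case.

```python
import re

# Full Vertex AI TensorBoard resource name, in the form used in this guide.
_TENSORBOARD_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/tensorboards/(?P<tensorboard>[^/]+)$"
)


def default_bucket_name(project_id: str, location: str) -> str:
    """Bucket name following the naming convention used in this guide."""
    return f"{project_id}-tensorboard-logs-{location}"


def tensorboard_location(tensorboard_name: str) -> str:
    """Extract the region from a full TensorBoard resource name."""
    match = _TENSORBOARD_NAME.match(tensorboard_name)
    if match is None:
        raise ValueError(f"Not a full TensorBoard name: {tensorboard_name!r}")
    return match.group("location")


def check_same_region(bucket_location: str, job_location: str,
                      tensorboard_name: str) -> None:
    """Raise ValueError unless all three resources share one region."""
    locations = {
        "bucket": bucket_location.lower(),
        "training job": job_location.lower(),
        "TensorBoard instance": tensorboard_location(tensorboard_name).lower(),
    }
    if len(set(locations.values())) != 1:
        raise ValueError(f"Region mismatch: {locations}")
```

This check only compares strings. It does not verify that the bucket is regional rather than multi-region or dual-region.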

Changes to your training script

Your training script must write its TensorBoard logs to the Cloud Storage bucket. The Vertex AI Training Service automatically makes the location of the bucket available through the predefined environment variable AIP_TENSORBOARD_LOG_DIR.

This can usually be done by providing os.environ['AIP_TENSORBOARD_LOG_DIR'] as the log directory to the open source TensorBoard log writing APIs.

For example, in TensorFlow 2.x, create the TensorBoard callback as follows:

import os
import tensorflow as tf

tensorboard_callback = tf.keras.callbacks.TensorBoard(
  log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
  histogram_freq=1)

For more examples, see the TensorBoard open source docs.
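Reading os.environ['AIP_TENSORBOARD_LOG_DIR'] directly raises a KeyError when the variable is not set, for example when you run the script on your workstation. The following sketch shows one way to keep a single script usable in both places. The helper name resolve_log_dir and the local default path are assumptions made for illustration, not part of Vertex AI.

```python
import os


def resolve_log_dir(local_default: str = "./tensorboard-logs") -> str:
    """Return the TensorBoard log directory for the current environment.

    Uses AIP_TENSORBOARD_LOG_DIR when it is set and non-empty, and
    otherwise falls back to a local directory (an assumed default).
    """
    return os.environ.get("AIP_TENSORBOARD_LOG_DIR") or local_default
```

You can then pass the returned value as the log directory to whichever TensorBoard log writing API your framework provides.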

Learn more about how Vertex AI sets environment variables in your custom training environment.

Create a custom training job

The following example request shows how to create your own custom training job.

For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.

REST

Before using any of the request data, make the following replacements:

  • LOCATION_ID: Your region.
  • PROJECT_ID: Your project ID.
  • TENSORBOARD_INSTANCE_NAME: (Required) The full name of the existing Vertex AI TensorBoard instance that stores your Vertex AI TensorBoard logs:
    projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
    Note: If the TensorBoard instance does not exist, the customJobs creation request fails with a 404 error.
  • GCS_BUCKET_NAME: "${PROJECT_ID}-tensorboard-logs-${LOCATION}"
  • USER_SA_EMAIL: (Required) The service account created in the previous steps, or your own service account: "USER_SA_NAME@${PROJECT_ID}.iam.gserviceaccount.com"
  • TRAINING_CONTAINER: The URI of your training container image.
  • INVOCATION_TIMESTAMP: "$(date +'%Y%m%d-%H%M%S')"
  • JOB_NAME: "tensorboard-example-job-${INVOCATION_TIMESTAMP}"
  • BASE_OUTPUT_DIR: (Required) The Cloud Storage path where all the output of the training is written: "gs://$GCS_BUCKET_NAME/$JOB_NAME"
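Several of the values above are derived from one another, so it can help to compute them in one place. The following Python sketch mirrors the shell expressions in the list, including the %Y%m%d-%H%M%S timestamp format. The function name job_settings and the optional now parameter (which makes the result reproducible) are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Optional


def job_settings(project_id: str, location: str,
                 now: Optional[datetime] = None) -> Dict[str, str]:
    """Derive the bucket, job name, and output directory used in the request."""
    now = now or datetime.now()
    # Same format as: date +'%Y%m%d-%H%M%S'
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    bucket = f"{project_id}-tensorboard-logs-{location}"
    job_name = f"tensorboard-example-job-{timestamp}"
    return {
        "GCS_BUCKET_NAME": bucket,
        "INVOCATION_TIMESTAMP": timestamp,
        "JOB_NAME": job_name,
        "BASE_OUTPUT_DIR": f"gs://{bucket}/{job_name}",
    }
```

Because the timestamp is part of the job name, each invocation writes its output under a distinct directory in the bucket.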

HTTP method and URL:

POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": "TRAINING_CONTAINER"
        }
      }
    ],
    "baseOutputDirectory": {
      "outputUriPrefix": "BASE_OUTPUT_DIR"
    },
    "serviceAccount": "USER_SA_EMAIL",
    "tensorboard": "TENSORBOARD_INSTANCE_NAME"
  }
}
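Building the request body in code avoids hand-editing JSON, where a stray trailing comma or a missing brace makes the request invalid. The following Python sketch assembles the URL and body from the replacement values described above and leaves authentication and sending to the HTTP client of your choice. The function name build_custom_job_request is hypothetical, and the machine type is the one from the example request.

```python
import json
from typing import Any, Dict, Tuple


def build_custom_job_request(
    project_id: str,
    location: str,
    job_name: str,
    training_container: str,
    base_output_dir: str,
    service_account: str,
    tensorboard_name: str,
) -> Tuple[str, Dict[str, Any]]:
    """Return the customJobs URL and a JSON-serializable request body."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/customJobs"
    )
    body = {
        "displayName": job_name,
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "replicaCount": "1",
                    "machineSpec": {"machineType": "n1-standard-8"},
                    "containerSpec": {"imageUri": training_container},
                }
            ],
            "baseOutputDirectory": {"outputUriPrefix": base_output_dir},
            "serviceAccount": service_account,
            "tensorboard": tensorboard_name,
        },
    }
    return url, body


def to_json(body: Dict[str, Any]) -> str:
    """Serialize the body; json.dumps always emits syntactically valid JSON."""
    return json.dumps(body, indent=2)
```

The three fields marked as required in the list above (baseOutputDirectory, serviceAccount, and tensorboard) are always present in the generated body.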


You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
  "displayName": "DISPLAY_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "IMAGE_URI"
        }
      }
    ],
    "serviceAccount": "SERVICE_ACCOUNT",
    "baseOutputDirectory": {
      "outputUriPrefix": "OUTPUT_URI_PREFIX"
    },
    "tensorboard": "projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID"
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "CREATE-TIME",
  "updateTime": "UPDATE-TIME"
}
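After the job is created, the fields you are most likely to need from the response are the job ID, the state, and the attached TensorBoard instance. The following Python sketch extracts them from a response with the shape shown above. The function name summarize_custom_job is hypothetical, and the sketch assumes the response text is the JSON document returned by the request.

```python
import json
from typing import Dict, Optional


def summarize_custom_job(response_text: str) -> Dict[str, Optional[str]]:
    """Pick the job ID, state, and TensorBoard name out of a customJobs response."""
    job = json.loads(response_text)
    return {
        # The job ID is the last path segment of the resource name.
        "job_id": job["name"].rsplit("/", 1)[-1],
        "state": job.get("state"),
        "tensorboard": job.get("jobSpec", {}).get("tensorboard"),
    }
```

If the tensorboard entry is missing from the summary, the job was created without a TensorBoard instance attached, and no logs are uploaded for it.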

Access your Vertex AI TensorBoard through the Google Cloud console

To view a Vertex AI TensorBoard associated with a training job,

  1. navigate to the Training page in the Vertex AI section of the Google Cloud console,

  2. select the Custom Jobs tab,

  3. select the training job to view the Training Detail page,

  4. click the Open TensorBoard button at the top of the page.


Alternatively, navigate to the Experiments page and view the list of all experiments. Your experiment will have the same name as the training job.


The TensorBoard view appears. You can share the link to this view with others who have permission.


View your Vertex AI TensorBoard experiment timeseries metrics in the Google Cloud console

The Compare and analyze runs document explains how to view details of your experiment runs and compare the runs to each other in the Google Cloud console. Timeseries metrics appear under the Charts tab in the navigation bar.


Notebooks