Use Vertex TensorBoard with custom training

If you use custom training to train models, you can set up your training job to automatically upload its TensorBoard logs to Vertex TensorBoard. If you already use Vertex TensorBoard, this requires no change to your model training code.

With this integration you can monitor training in near real time, because Vertex TensorBoard streams in the logs as they are written to Cloud Storage.

Follow the steps below to use Vertex TensorBoard with custom training.

Create a service account with required permissions

The Vertex TensorBoard integration with custom training requires attaching a service account.

  1. Create a service account:

    gcloud beta --project=PROJECT_NAME iam service-accounts create USER_SA_NAME
    

    Replace the following:

    • PROJECT_NAME: the ID of the project in which you are creating the service account

    • USER_SA_NAME: the unique name of the service account you are creating

  2. Grant your new service account IAM roles that provide access to the Google Cloud services and resources that you want Vertex AI to be able to use during custom training. The Storage Admin role (roles/storage.admin) and Vertex AI User role (roles/aiplatform.user) are required. Use the following commands to grant these roles if needed:

    SA_EMAIL="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com"
    
    gcloud beta projects add-iam-policy-binding PROJECT_NAME \
       --member="serviceAccount:${SA_EMAIL}" \
       --role="roles/storage.admin"
    
    gcloud beta projects add-iam-policy-binding PROJECT_NAME \
       --member="serviceAccount:${SA_EMAIL}" \
       --role="roles/aiplatform.user"
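If you script this setup, the service account email and the two role bindings can be derived from the project and account names. The following is a minimal sketch, not part of the official tooling: the helper names service_account_email and role_binding_commands are hypothetical, and the functions only build the command strings shown above without running them.

```python
import shlex
from typing import List

# The two roles this guide lists as required.
REQUIRED_ROLES = ("roles/storage.admin", "roles/aiplatform.user")


def service_account_email(project_id: str, sa_name: str) -> str:
    """Return the email address of a service account in the given project."""
    return f"{sa_name}@{project_id}.iam.gserviceaccount.com"


def role_binding_commands(project_id: str, sa_name: str) -> List[str]:
    """Build one add-iam-policy-binding command per required role (hypothetical helper)."""
    member = f"serviceAccount:{service_account_email(project_id, sa_name)}"
    return [
        shlex.join([
            "gcloud", "beta", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ])
        for role in REQUIRED_ROLES
    ]
```

Printing the returned strings lets you review the commands before running them in a shell.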
    

Create a Cloud Storage bucket to store Vertex TensorBoard logs

Set up your training script to write TensorBoard logs to a Cloud Storage bucket using the TensorBoard APIs. The custom training service ingests these logs into Vertex TensorBoard. The bucket should be a regional bucket in the same region where the training job is created.

If you have an existing regional Cloud Storage bucket, you can skip the following bucket creation step and use the existing bucket instead.

Create a new regional bucket:

REGION=REGION
GCS_BUCKET_NAME="PROJECT_NAME-tensorboard-logs-${REGION}"
gsutil mb -l ${REGION} "gs://${GCS_BUCKET_NAME}"

Replace REGION with the region that your Vertex TensorBoard resource was created in; for example, us-central1.
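Because the bucket and the training job should share a region, it can help to check this before submitting a job. The sketch below is an illustration under stated assumptions: log_bucket_name mirrors the naming pattern used in the command above, and same_region is a hypothetical helper that compares a bucket location string with the job region without assuming a particular letter case.

```python
def log_bucket_name(project_id: str, region: str) -> str:
    """Compose the bucket name used in this guide's example."""
    return f"{project_id}-tensorboard-logs-{region}"


def same_region(bucket_location: str, job_region: str) -> bool:
    """Return True when the bucket location matches the job region.

    The comparison ignores case and surrounding whitespace. A multi-region
    location such as "US" does not match a region such as "us-central1".
    """
    return bucket_location.strip().lower() == job_region.strip().lower()
```

For example, same_region("US-CENTRAL1", "us-central1") is true, while same_region("US", "us-central1") is false.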

Changes to your training script

To use Vertex TensorBoard with custom training, your training script must write its logs to the location that the training service provides.

  • Set the destination of the TensorBoard logs to the value of the environment variable AIP_TENSORBOARD_LOG_DIR, which the training service populates automatically.

    For example, in TensorFlow 2.x, create the TensorBoard callback as follows:

    import os
    import tensorflow as tf

    tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],  # set by the training service
      histogram_freq=1)
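The environment variable is only populated when the script runs as a custom training job, so a script that is also run locally may want a fallback. The following sketch assumes a hypothetical helper named resolve_log_dir and a hypothetical local default of "./logs"; neither is part of Vertex AI.

```python
import os


def resolve_log_dir(default: str = "./logs") -> str:
    """Return the TensorBoard log directory for this run.

    Uses AIP_TENSORBOARD_LOG_DIR when it is set and non-empty (as it is
    inside a custom training job); otherwise falls back to `default`
    (an assumption for local runs).
    """
    return os.environ.get("AIP_TENSORBOARD_LOG_DIR") or default
```

The returned value can then be passed as log_dir to the TensorBoard callback.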
    

Configure your training job for Vertex TensorBoard

Add the following three parameters to your custom training resource configuration:

  • base_output_directory: the Cloud Storage path where all the output of the training is written.

  • serviceAccount: the service account used to run your training job. It must have the necessary permissions to access Cloud Storage and Vertex TensorBoard.

  • tensorboard: the full name of the Vertex TensorBoard instance where the TensorBoard logs should be stored.
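The three parameters above can be assembled and sanity-checked before a job is submitted. This is a minimal sketch: tensorboard_job_fields is a hypothetical helper, and the resource-name pattern projects/.../locations/.../tensorboards/... is an assumption about what "full name" means here, so adjust it if your resource names differ.

```python
import re

# Assumed shape of a full Vertex TensorBoard resource name.
TENSORBOARD_NAME = re.compile(
    r"^projects/[^/]+/locations/[^/]+/tensorboards/[^/]+$"
)


def tensorboard_job_fields(base_output_dir: str,
                           service_account: str,
                           tensorboard: str) -> dict:
    """Return the three TensorBoard-related job spec fields as a dict."""
    if not base_output_dir.startswith("gs://"):
        raise ValueError("base_output_dir must be a Cloud Storage path (gs://...)")
    if not TENSORBOARD_NAME.match(tensorboard):
        raise ValueError("tensorboard must be a full resource name")
    return {
        "base_output_directory": {"output_uri_prefix": base_output_dir},
        "serviceAccount": service_account,
        "tensorboard": tensorboard,
    }
```

The returned dict can be merged into the jobSpec portion of a custom job request.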

Create a custom training job

The following example request shows how to create your own custom training job.

For a detailed example of how to create a custom training job, see Hello custom training.

# Fill in *all* the following env vars for your project
PROJECT_NAME=PROJECT_NAME
# REGION must be set before GCS_BUCKET_NAME, which references it
REGION=REGION
TENSORBOARD_RESOURCE_NAME=TENSORBOARD_RESOURCE_NAME
GCS_BUCKET_NAME="PROJECT_NAME-tensorboard-logs-${REGION}"
USER_SA_EMAIL="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com"
TRAINING_CONTAINER=TRAINING_CONTAINER

ENDPOINT="$REGION-aiplatform.googleapis.com"
INVOCATION_TIMESTAMP="$(date  +'%Y%m%d-%H%M%S')"
JOB_NAME="tensorboard-example-job-${INVOCATION_TIMESTAMP}"
BASE_OUTPUT_DIR="gs://$GCS_BUCKET_NAME/$JOB_NAME"

curl \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-X POST \
-d "{
'displayName':'$JOB_NAME',
'jobSpec':{
'workerPoolSpecs':[
     {
        'replicaCount': '1',
        'machineSpec': {
          'machineType': 'n1-standard-8',
        },
        'containerSpec': {
          'imageUri': '$TRAINING_CONTAINER',
        }
      }
],
'base_output_directory': {
'output_uri_prefix': '$BASE_OUTPUT_DIR',
 },
'serviceAccount': '$USER_SA_EMAIL',
'tensorboard':'$TENSORBOARD_RESOURCE_NAME',
}
}" \
"https://${ENDPOINT}/v1beta1/projects/${PROJECT_NAME}/locations/${REGION}/customJobs"
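The same request can be prepared from Python. The sketch below uses only the standard library and builds the request object without sending it. build_custom_job_request is a hypothetical helper, the access token is assumed to come from elsewhere (for example, the output of gcloud auth print-access-token), and the region is assumed to be used for both the endpoint host and the locations segment of the path.

```python
import json
import urllib.request


def build_custom_job_request(project: str, region: str, access_token: str,
                             display_name: str, job_spec: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a custom job."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project}/locations/{region}/customJobs"
    )
    body = json.dumps({"displayName": display_name, "jobSpec": job_spec})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job; inspect its full_url and data attributes first if you want to review the request.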

View your Vertex TensorBoard

To view the Vertex TensorBoard associated with a training job, go to the Training page in the Vertex AI section of the Google Cloud console. Click the training job to open the Training Detail page, then click the Open TensorBoard button at the top of the page.

Alternatively, you can navigate to the Experiments tab and view the list of all experiments. Your experiment will have the same name as the training job.