Using Vertex TensorBoard with custom training

If you train models with custom training, you can set up your training job to automatically upload its TensorBoard logs to Vertex TensorBoard. For existing Vertex TensorBoard users, this requires no change to your model training code.

You can use this integration to monitor training in near real time, because Vertex TensorBoard streams in the logs as they are written to Cloud Storage.

Follow the steps below to use Vertex TensorBoard with custom training.

Create a service account with required permissions

The Vertex TensorBoard integration with custom training requires the Bring Your Own Service Account (BYOSA) feature. If you already have a service account that you use for custom training, you can skip this step.

  1. Create a service account:

    gcloud beta --project=PROJECT_NAME iam service-accounts create USER_SA_NAME
    

    Replace the following:

    • PROJECT_NAME: the ID of the project in which you are creating the service account

    • USER_SA_NAME: the unique name of the service account you are creating

  2. Look up the Vertex AI service account for your project and store it in GOOGLE_SA:

    GOOGLE_SA=$(gcloud projects get-iam-policy PROJECT_NAME \
        --flatten="bindings[].members" --format="table(bindings.members)" \
        --filter="bindings.role:roles/aiplatform.customCodeServiceAgent" | \
        grep "serviceAccount:" | head -n1)
    
  3. GOOGLE_SA is the service account for Vertex AI. The following command grants GOOGLE_SA permission to impersonate your service account (SA_EMAIL):

    SA_EMAIL="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com"
    
    gcloud beta --project=PROJECT_NAME iam service-accounts add-iam-policy-binding \
        --role roles/iam.serviceAccountAdmin \
        --member $GOOGLE_SA $SA_EMAIL
    
  4. Grant your service account permission to write Vertex TensorBoard logs to Cloud Storage buckets and to access the Vertex TensorBoard service:

    gcloud beta projects add-iam-policy-binding PROJECT_NAME \
        --member="serviceAccount:${SA_EMAIL}" \
        --role="roles/storage.admin"
    
    gcloud beta projects add-iam-policy-binding PROJECT_NAME \
        --member="serviceAccount:${SA_EMAIL}" \
        --role="roles/aiplatform.user"
    

Create a Cloud Storage bucket to store Vertex TensorBoard logs

Set up your training script to write its TensorBoard logs to a Cloud Storage bucket; the custom training service then ingests them into the Vertex TensorBoard service. The bucket should be a regional bucket in the same region where the training job is created.

If you already have a regional Cloud Storage bucket in that region, you can skip the bucket creation step below and use the existing bucket instead.

Create a new regional bucket:

REGION=REGION
GCS_BUCKET_NAME="PROJECT_NAME-tensorboard-logs-${REGION}"
gsutil mb -l ${REGION} "gs://${GCS_BUCKET_NAME}"

Replace REGION with the region that your Vertex TensorBoard resource was created in; for example us-central1.
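
The bucket name in the command above is assembled from the project and the region, so it is worth checking that the result is a legal bucket name before calling gsutil. The following is a minimal sketch: the helper name tensorboard_bucket_name is hypothetical, and the pattern is a simplified version of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit), not a complete validator.

```python
import re

# Simplified bucket-name rule (an approximation, not the full set of
# Cloud Storage naming requirements).
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def tensorboard_bucket_name(project, region):
    """Build the bucket name used on this page and sanity-check it."""
    name = f"{project}-tensorboard-logs-{region}"
    if not _BUCKET_RE.match(name):
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


if __name__ == "__main__":
    # Hypothetical project ID and region, for illustration only.
    print("gs://" + tensorboard_bucket_name("my-project", "us-central1"))
```

Long project IDs can push the combined name past 63 characters, in which case a shorter suffix is needed.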

Changes to your training script

To use Vertex TensorBoard with custom training, you must make a change to your training script.

  • Set the destination of the TensorBoard logs to the value of the environment variable AIP_TENSORBOARD_LOG_DIR, which the training service populates automatically.

    For example, in TensorFlow 2.x, create the TensorBoard callback as follows:

    import os
    import tensorflow as tf

    tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
      histogram_freq=1)
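
Reading os.environ['AIP_TENSORBOARD_LOG_DIR'] directly raises a KeyError when the script runs outside the training service, for example on a workstation. One possible way to keep the same script usable in both places is a small fallback helper. This is a sketch only: the function name resolve_log_dir and the default path ./logs are illustrative assumptions, not part of any Vertex AI API.

```python
import os


def resolve_log_dir(default="./logs", environ=None):
    """Return AIP_TENSORBOARD_LOG_DIR if it is set and non-empty, else `default`.

    `environ` can be passed explicitly to make the function easy to test;
    by default the real process environment is used.
    """
    env = os.environ if environ is None else environ
    return env.get("AIP_TENSORBOARD_LOG_DIR") or default


if __name__ == "__main__":
    # Inside a custom training job this prints the path set by the service;
    # elsewhere it prints the local default.
    print(resolve_log_dir())
```

The returned value can then be passed as log_dir to the TensorBoard callback instead of the direct environment lookup.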
    

Configure your training job for Vertex TensorBoard

Add the following three parameters to your custom training resource configuration:

  • base_output_directory: the Cloud Storage path where all the output of the training is written.

  • serviceAccount: the service account used to run your training job. It must have the permissions needed to access Cloud Storage and Vertex TensorBoard.

  • tensorboard: the full name of the Vertex TensorBoard instance where the Vertex TensorBoard logs should be stored.
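
To make the placement of these three parameters concrete, the sketch below assembles a request body as a Python dictionary and serializes it with the standard json module. The field names are copied from this page; the helper name build_custom_job_body, its validation rules, and the sample values in the usage section are assumptions made for illustration.

```python
import json


def build_custom_job_body(display_name, image_uri, base_output_dir,
                          service_account, tensorboard):
    """Return a custom job request body carrying the three TensorBoard fields."""
    required = {
        "base_output_dir": base_output_dir,
        "service_account": service_account,
        "tensorboard": tensorboard,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError("missing required value(s): " + ", ".join(missing))
    if not base_output_dir.startswith("gs://"):
        raise ValueError("base_output_dir must be a Cloud Storage path (gs://...)")
    return {
        "displayName": display_name,
        "jobSpec": {
            "workerPoolSpecs": [{
                "replicaCount": "1",
                "machineSpec": {"machineType": "n1-standard-8"},
                "containerSpec": {"imageUri": image_uri},
            }],
            # The three parameters described above all live inside jobSpec.
            "base_output_directory": {"output_uri_prefix": base_output_dir},
            "serviceAccount": service_account,
            "tensorboard": tensorboard,
        },
    }


if __name__ == "__main__":
    # All values below are placeholders.
    body = build_custom_job_body(
        display_name="tensorboard-example-job",
        image_uri="TRAINING_CONTAINER",
        base_output_dir="gs://GCS_BUCKET_NAME/tensorboard-example-job",
        service_account="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com",
        tensorboard="TENSORBOARD_RESOURCE_NAME",
    )
    print(json.dumps(body, indent=2))
```

Serializing with json.dumps yields strict JSON, which avoids the quoting pitfalls of building the body inline in a shell string.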

Creating a custom training job

Use the following example request to help you understand how to create your own custom training job.

For a detailed example of how to create a custom training job, please refer to Hello custom training.

# Fill in *all* the following env vars for your project
PROJECT_NAME=PROJECT_NAME
# REGION must be set before GCS_BUCKET_NAME, which references it
REGION=REGION
TENSORBOARD_RESOURCE_NAME=TENSORBOARD_RESOURCE_NAME
GCS_BUCKET_NAME="PROJECT_NAME-tensorboard-logs-${REGION}"
USER_SA_EMAIL="USER_SA_NAME@PROJECT_NAME.iam.gserviceaccount.com"
TRAINING_CONTAINER=TRAINING_CONTAINER

ENDPOINT="$REGION-aiplatform.googleapis.com"
INVOCATION_TIMESTAMP="$(date  +'%Y%m%d-%H%M%S')"
JOB_NAME="tensorboard-example-job-${INVOCATION_TIMESTAMP}"
BASE_OUTPUT_DIR="gs://$GCS_BUCKET_NAME/$JOB_NAME"

curl \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-X POST \
-d "{
'displayName':'$JOB_NAME',
'jobSpec':{
'workerPoolSpecs':[
     {
        'replicaCount': '1',
        'machineSpec': {
          'machineType': 'n1-standard-8',
        },
        'containerSpec': {
          'imageUri': '$TRAINING_CONTAINER',
        }
      }
],
'base_output_directory': {
'output_uri_prefix': '$BASE_OUTPUT_DIR',
 },
'serviceAccount': '$USER_SA_EMAIL',
'tensorboard':'$TENSORBOARD_RESOURCE_NAME',
}
}" \
https://${ENDPOINT}/v1beta1/projects/${PROJECT_NAME}/locations/${REGION}/customJobs
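
The region appears twice in the request URL, once in the host name and once in the resource path, and it should also match the region of the Vertex TensorBoard instance. The sketch below builds the URL from a single region value and checks it against the TensorBoard resource name. It assumes the resource name follows the layout projects/PROJECT/locations/REGION/tensorboards/ID; the helper names are hypothetical.

```python
import re

# Assumed layout of a full TensorBoard resource name.
_TENSORBOARD_NAME = re.compile(
    r"^projects/[^/]+/locations/(?P<region>[^/]+)/tensorboards/[^/]+$")


def tensorboard_region(resource_name):
    """Extract the region segment from a TensorBoard resource name."""
    match = _TENSORBOARD_NAME.match(resource_name)
    if match is None:
        raise ValueError(f"unexpected TensorBoard resource name: {resource_name!r}")
    return match.group("region")


def custom_jobs_url(project, region, api_version="v1beta1"):
    """Build the customJobs URL, using the same region for host and path."""
    host = f"{region}-aiplatform.googleapis.com"
    return (f"https://{host}/{api_version}/projects/{project}"
            f"/locations/{region}/customJobs")


def checked_custom_jobs_url(project, region, tensorboard_name):
    """Return the customJobs URL after checking the TensorBoard region matches."""
    tb_region = tensorboard_region(tensorboard_name)
    if tb_region != region:
        raise ValueError(
            f"job region {region!r} differs from TensorBoard region {tb_region!r}")
    return custom_jobs_url(project, region)
```

Deriving both occurrences from one variable prevents a request from being sent to one regional endpoint while naming a different location in the path.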

Viewing your Vertex TensorBoard

To view the Vertex TensorBoard associated with a training job, navigate to the Training page in the Vertex AI section of the Google Cloud Console. Click the training job to open the Training Detail page, then click the Open TensorBoard button at the top of the page.

Alternatively, you can navigate to the Experiments tab and view the list of all experiments. Your experiment will have the same name as the training job.