When using custom training to train models, you can set up your training job to automatically upload your Vertex AI TensorBoard logs to Vertex AI TensorBoard. For existing Vertex AI TensorBoard users, this requires no change to your model training code.
You can use this integration to monitor your training in near real time: Vertex AI TensorBoard streams in the logs as they are written to Cloud Storage.
Follow the steps below to use Vertex AI TensorBoard with custom training.
Create a service account with required permissions
The Vertex AI TensorBoard integration with custom training requires attaching a service account.
Create a service account:
gcloud --project=PROJECT_ID iam service-accounts create USER_SA_NAME
Replace the following:
- PROJECT_ID: the ID of the project in which you are creating a service account
- USER_SA_NAME: a unique name for the service account you're creating
The Vertex AI Training Service uses the new service account to access Google Cloud services and resources. The account needs the roles/storage.admin and roles/aiplatform.user roles. Use the following commands to grant them if needed:
SA_EMAIL="USER_SA_NAME@PROJECT_ID.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/aiplatform.user"
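If you script this setup, the two role bindings can be generated instead of typed by hand. The following is a minimal sketch, assuming Python; the helper names `service_account_email` and `iam_binding_commands` are hypothetical, and the code only builds the argument lists for the commands above without running gcloud.

```python
from typing import List

# Roles granted to the service account by the commands above.
ROLES = ["roles/storage.admin", "roles/aiplatform.user"]


def service_account_email(project_id: str, sa_name: str) -> str:
    """Return the email address of a user-managed service account."""
    return f"{sa_name}@{project_id}.iam.gserviceaccount.com"


def iam_binding_commands(project_id: str, sa_name: str) -> List[List[str]]:
    """Build one `gcloud projects add-iam-policy-binding` argv per role.

    Hypothetical helper: it returns the commands and does not execute them.
    """
    member = f"serviceAccount:{service_account_email(project_id, sa_name)}"
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        for role in ROLES
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where gcloud is installed and authenticated.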
Create a Cloud Storage bucket to store Vertex AI TensorBoard logs
A Cloud Storage bucket is required to store the Vertex AI TensorBoard logs your training script generates. The bucket must be regional (that is, not multi-region or dual-region), and the following resources must be in the same region:
- the Cloud Storage bucket
- the Vertex AI training job
- the Vertex AI TensorBoard instance
You can use an existing bucket instead of following the bucket creation step below. An existing bucket must be in the same region in which your Vertex AI TensorBoard instance was created.
LOCATION=LOCATION_ID
GCS_BUCKET_NAME="PROJECT_ID-tensorboard-logs-${LOCATION}"
gsutil mb -l ${LOCATION} "gs://${GCS_BUCKET_NAME}"
Replace LOCATION_ID with the region that your Vertex AI TensorBoard instance was created in, for example us-central1.
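The bucket constraints in this section can be checked before a job is submitted. The following is a minimal sketch, assuming Python; the function name `check_tensorboard_bucket` is hypothetical, and the inputs are assumed to come from the bucket's metadata, where Cloud Storage reports the location type as "region", "dual-region", or "multi-region".

```python
def check_tensorboard_bucket(
    bucket_location: str,
    bucket_location_type: str,
    job_region: str,
    tensorboard_region: str,
) -> None:
    """Raise ValueError if the bucket violates the constraints above.

    Hypothetical helper: it only compares the strings it is given and
    makes no API calls.
    """
    if bucket_location_type != "region":
        raise ValueError(
            f"bucket must be regional, got {bucket_location_type!r}"
        )
    # Bucket locations are often reported in upper case (for example
    # US-CENTRAL1), so compare case-insensitively.
    regions = {
        "bucket": bucket_location.lower(),
        "training job": job_region.lower(),
        "TensorBoard instance": tensorboard_region.lower(),
    }
    if len(set(regions.values())) != 1:
        raise ValueError(f"resources are in different regions: {regions}")
```

The function returns nothing when all three resources share one region, so it can be placed directly in front of the job submission step.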
Changes to your training script
Your training script must write its TensorBoard logs to the Cloud Storage bucket. The Vertex AI Training Service automatically makes the location available through the predefined environment variable AIP_TENSORBOARD_LOG_DIR.
This can usually be done by providing os.environ['AIP_TENSORBOARD_LOG_DIR']
as the log directory to the open source TensorBoard log writing APIs.
For example, in TensorFlow 2.x, create the TensorBoard callback as follows:
import os
import tensorflow as tf

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
    histogram_freq=1)
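Reading the variable directly fails with a KeyError when the script runs outside a Vertex AI training job, for example during local testing. The following is a minimal sketch of one way to handle that, assuming a fallback is acceptable; the function name `resolve_log_dir` and the default path `./logs` are hypothetical.

```python
import os


def resolve_log_dir(default: str = "./logs") -> str:
    """Return the TensorBoard log directory for this run.

    Uses AIP_TENSORBOARD_LOG_DIR when it is set to a non-empty value;
    otherwise falls back to `default` (a hypothetical local path) so the
    script also runs outside a training job.
    """
    return os.environ.get("AIP_TENSORBOARD_LOG_DIR") or default
```

The result can be passed as `log_dir` to the TensorBoard callback in place of the direct `os.environ` lookup.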
For more examples, see the TensorBoard open source docs.
Learn more about how Vertex AI sets environment variables in your custom training environment.
Create a custom training job
The following example request shows how to create your own custom training job.
For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: Your region.
- PROJECT_ID: Your project ID.
- TENSORBOARD_INSTANCE_NAME: (Required) The full name of the existing Vertex AI TensorBoard instance storing your Vertex AI TensorBoard logs: projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
  Note: If the TensorBoard instance does not exist, the customJobs creation request returns a 404 error.
- GCS_BUCKET_NAME: "${PROJECT_ID}-tensorboard-logs-${LOCATION}"
- USER_SA_EMAIL: (Required) The service account created in previous steps, or your own service account. "USER_SA_NAME@${PROJECT_ID}.iam.gserviceaccount.com"
- TRAINING_CONTAINER: The URI of your training container image.
- INVOCATION_TIMESTAMP: "$(date +'%Y%m%d-%H%M%S')"
- JOB_NAME: "tensorboard-example-job-${INVOCATION_TIMESTAMP}"
- BASE_OUTPUT_DIR: (Required) The Cloud Storage path where all the output of the training is written. "gs://$GCS_BUCKET_NAME/$JOB_NAME"
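Several of the values above are derived from one another with shell expressions. The following is a minimal Python sketch of the same derivation, assuming the naming patterns shown in the list; the function name `derive_job_values` is hypothetical, and the optional `now` parameter exists only to make the timestamp reproducible.

```python
from datetime import datetime
from typing import Dict, Optional


def derive_job_values(
    project_id: str,
    location: str,
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Derive bucket, job name, and output directory as in the list above."""
    now = now or datetime.now()
    # Same format as: date +'%Y%m%d-%H%M%S'
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    bucket = f"{project_id}-tensorboard-logs-{location}"
    job_name = f"tensorboard-example-job-{timestamp}"
    return {
        "GCS_BUCKET_NAME": bucket,
        "INVOCATION_TIMESTAMP": timestamp,
        "JOB_NAME": job_name,
        "BASE_OUTPUT_DIR": f"gs://{bucket}/{job_name}",
    }
```

Because the timestamp is part of the job name, each invocation writes to its own output directory in the bucket.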
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs
Request JSON body:
{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": "TRAINING_CONTAINER"
        }
      }
    ],
    "baseOutputDirectory": {
      "outputUriPrefix": "BASE_OUTPUT_DIR"
    },
    "serviceAccount": "USER_SA_EMAIL",
    "tensorboard": "TENSORBOARD_INSTANCE_NAME"
  }
}
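The URL and request body can also be assembled programmatically, which avoids hand-editing JSON. The following is a minimal sketch, assuming Python and the lowerCamelCase field names that appear in the sample response in this section; the function name `build_custom_job_request` is hypothetical, and the code only builds the request without sending it.

```python
import json
from typing import Any, Dict, Tuple


def build_custom_job_request(
    project_id: str,
    location: str,
    job_name: str,
    training_container: str,
    base_output_dir: str,
    service_account: str,
    tensorboard: str,
    machine_type: str = "n1-standard-8",
) -> Tuple[str, Dict[str, Any]]:
    """Return the POST URL and JSON-serializable body for customJobs."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/customJobs"
    )
    body = {
        "displayName": job_name,
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "replicaCount": "1",
                    "machineSpec": {"machineType": machine_type},
                    "containerSpec": {"imageUri": training_container},
                }
            ],
            "baseOutputDirectory": {"outputUriPrefix": base_output_dir},
            "serviceAccount": service_account,
            "tensorboard": tensorboard,
        },
    }
    return url, body
```

Calling `json.dumps(body, indent=2)` yields the payload, which can then be sent with any HTTP client that attaches your Google Cloud credentials.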
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
  "displayName": "DISPLAY_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "IMAGE_URI"
        }
      }
    ],
    "serviceAccount": "SERVICE_ACCOUNT",
    "baseOutputDirectory": {
      "outputUriPrefix": "OUTPUT_URI_PREFIX"
    },
    "tensorboard": "projects//locations/LOCATION_ID/tensorboards/tensorboard-id"
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "CREATE-TIME",
  "updateTime": "UPDATE-TIME"
}
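A script that submits the job usually needs only a few fields from this response. The following is a minimal sketch, assuming Python and a response shaped like the sample above; the function name `summarize_custom_job` is hypothetical.

```python
import json
from typing import Dict


def summarize_custom_job(response_text: str) -> Dict[str, str]:
    """Extract the job ID, state, and TensorBoard name from the response."""
    job = json.loads(response_text)
    return {
        # The job ID is the last path segment of the resource name.
        "job_id": job["name"].rsplit("/", 1)[-1],
        "state": job["state"],
        # Empty string if the job was created without a TensorBoard instance.
        "tensorboard": job.get("jobSpec", {}).get("tensorboard", ""),
    }
```

A freshly created job is expected to report `JOB_STATE_PENDING`, as in the sample response.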
Access your Vertex AI TensorBoard through the Google Cloud console
To view a Vertex AI TensorBoard associated with a training job:
- Navigate to the Training page in the Vertex AI section of the Google Cloud console.
- Select the Custom Jobs tab.
- Select the training job to view the Training Detail page.
- Click the Open TensorBoard button at the top of the page.
Alternatively, navigate to the Experiments page and view the list of all experiments. Your experiment will have the same name as the training job.
The TensorBoard view appears. You can share the link to this view with others who have permission.
View your Vertex AI TensorBoard experiment timeseries metrics in the Google Cloud console
The Compare and analyze runs
document explains how to view details of your experiment runs and compare the
experiment runs to each other in the Google Cloud console. Timeseries metrics
appear under the Charts tab in the navigation bar.