When you use custom training to train models, you can set up your training job to automatically upload Vertex AI TensorBoard logs to Vertex AI TensorBoard.
You can use this integration to monitor your training in near real time, because Vertex AI TensorBoard streams in the logs as they're written to Cloud Storage.
For initial setup, see Set up for Vertex AI TensorBoard.
Changes to your training script
Your training script must be configured to write TensorBoard logs to a Cloud Storage bucket. The Vertex AI training service automatically makes this location available through the predefined environment variable AIP_TENSORBOARD_LOG_DIR.
You can usually do this by providing os.environ['AIP_TENSORBOARD_LOG_DIR'] as the log directory to the open source TensorBoard log writing APIs. The location of AIP_TENSORBOARD_LOG_DIR is typically set with the staging_bucket variable.
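As a minimal sketch, a training script typically reads this environment variable at startup; the local fallback directory below is an assumption for running the same script outside Vertex AI (for example, during development), not something the training service provides:

```python
import os

# Inside a Vertex AI custom training job, the training service sets
# AIP_TENSORBOARD_LOG_DIR to a Cloud Storage path under staging_bucket.
# The "./tensorboard-logs" fallback is an assumption for local runs only.
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "./tensorboard-logs")
print(log_dir)
```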
To configure your training script in TensorFlow 2.x, create a TensorBoard callback and set the log_dir variable to os.environ['AIP_TENSORBOARD_LOG_DIR']. Then include the TensorBoard callback in the TensorFlow model.fit callbacks list.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=os.environ['AIP_TENSORBOARD_LOG_DIR'],
    histogram_freq=1,
)
model.fit(
    x=x_train,
    y=y_train,
    epochs=epochs,
    validation_data=(x_test, y_test),
    callbacks=[tensorboard_callback],
)
Learn more about how Vertex AI sets environment variables in your custom training environment.
Create a custom training job
The following example shows how to create your own custom training job.
For a detailed example of how to create a custom training job, see Hello custom training. For steps to build custom training containers, see Create a custom container image for training.
To create a custom training job, use either the Vertex AI SDK for Python or REST.
Python
- project: Your project ID. You can find this ID on the Google Cloud console welcome page.
- location: The location to run the CustomJob in. This should be the same location as the provided TensorBoard instance.
- staging_bucket: The Cloud Storage bucket to stage artifacts during API calls, including TensorBoard logs.
- display_name: Display name of the custom training job.
- script_path: The path, relative to the working directory on your local file system, to the script that is the entry point for your training code.
- container_uri: The URI of the training container image. This can be a Vertex AI prebuilt training container or a custom container.
- model_serving_container_image_uri: The URI of the model serving container suitable for serving the model produced by the training script.
- dataset_id: The ID number of the dataset to use for training.
- model_display_name: Display name of the trained model.
- args: Command-line arguments to be passed to the Python script.
- replica_count: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
- machine_type: The type of VM to use. For a list of supported VMs, see Machine types.
- accelerator_type: The type of GPU to attach to each VM in the resource pool. For a list of supported GPUs, see GPUs.
- accelerator_count: The number of GPUs to attach to each VM in the resource pool. The default value is 1.
- training_fraction_split: The fraction of the dataset to use to train your model.
- validation_fraction_split: The fraction of the dataset to use to validate your model.
- test_fraction_split: The fraction of the dataset to use to evaluate your model.
- sync: Whether to execute this method synchronously.
- tensorboard_resource_name: The resource name of the Vertex AI TensorBoard instance to which the CustomJob will upload TensorBoard logs.
- service_account: Required when running with TensorBoard. See Create a service account with required permissions.
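The parameters above can be sketched together with the Vertex AI SDK for Python roughly as follows. This is a hedged sketch, not a definitive implementation: it assumes the google-cloud-aiplatform package is installed, and the display names, script path, and container image URIs are hypothetical placeholders you would replace with your own values.

```python
def tensorboard_resource_name(project: str, location: str, instance_id: str) -> str:
    # Builds the fully qualified TensorBoard resource name expected by the job.
    return f"projects/{project}/locations/{location}/tensorboards/{instance_id}"


def create_tensorboard_custom_job(
    project: str,
    location: str,
    staging_bucket: str,
    tensorboard_name: str,
    service_account: str,
):
    """Sketch of a custom training job that uploads logs to Vertex AI TensorBoard.

    All display names, paths, and image URIs below are hypothetical placeholders.
    """
    # Imported inside the function so this sketch can be read without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(
        project=project,
        location=location,
        staging_bucket=staging_bucket,
    )

    job = aiplatform.CustomTrainingJob(
        display_name="tensorboard-example-job",          # placeholder
        script_path="task.py",                            # placeholder entry point
        container_uri="TRAINING_CONTAINER_URI",           # prebuilt or custom image
    )

    model = job.run(
        model_display_name="tensorboard-example-model",   # placeholder
        replica_count=1,
        machine_type="n1-standard-8",
        tensorboard=tensorboard_name,
        service_account=service_account,
        sync=True,
    )
    return model
```

The helper keeps the TensorBoard resource name in the projects/…/locations/…/tensorboards/… form that the tensorboard parameter expects.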
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The location to run the CustomJob in, for example, us-central1. This should be the same location as the provided TensorBoard instance.
- PROJECT_ID: Your project ID.
- TENSORBOARD_INSTANCE_NAME: (Required) The full name of the existing Vertex AI TensorBoard instance storing your Vertex AI TensorBoard logs: projects/PROJECT_ID/locations/LOCATION_ID/tensorboards/TENSORBOARD_INSTANCE_ID
  Note: If the TensorBoard instance doesn't exist, the customJobs creation returns a 404 error.
- GCS_BUCKET_NAME: "${PROJECT_ID}-tensorboard-logs-${LOCATION}"
- USER_SA_EMAIL: (Required) The service account created in previous steps, or your own service account. "USER_SA_NAME@${PROJECT_ID}.iam.gserviceaccount.com"
- TRAINING_CONTAINER: The URI of your training container image.
- INVOCATION_TIMESTAMP: "$(date +'%Y%m%d-%H%M%S')"
- JOB_NAME: "tensorboard-example-job-${INVOCATION_TIMESTAMP}"
- BASE_OUTPUT_DIR: (Required) The Google Cloud path where all output of the training is written. "gs://$GCS_BUCKET_NAME/$JOB_NAME"
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs
Request JSON body:
{
  "displayName": JOB_NAME,
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "replicaCount": "1",
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "containerSpec": {
          "imageUri": TRAINING_CONTAINER
        }
      }
    ],
    "base_output_directory": {
      "output_uri_prefix": BASE_OUTPUT_DIR
    },
    "serviceAccount": USER_SA_EMAIL,
    "tensorboard": TENSORBOARD_INSTANCE_NAME
  }
}
To send your request, use a tool such as curl or the Google Cloud console API Explorer.
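For example, assuming the request body above is saved to a file named request.json (an assumption for this sketch) and the gcloud CLI is authenticated, the request can be sent with curl:

```shell
# Replace LOCATION_ID and PROJECT_ID with the values chosen above.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/customJobs"
```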
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT_ID/locations/LOCATION_ID/customJobs/CUSTOM_JOB_ID",
  "displayName": "DISPLAY_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-8"
        },
        "replicaCount": "1",
        "diskSpec": {
          "bootDiskType": "pd-ssd",
          "bootDiskSizeGb": 100
        },
        "containerSpec": {
          "imageUri": "IMAGE_URI"
        }
      }
    ],
    "serviceAccount": "SERVICE_ACCOUNT",
    "baseOutputDirectory": {
      "outputUriPrefix": "OUTPUT_URI_PREFIX"
    },
    "tensorboard": "projects//locations/LOCATION_ID/tensorboards/tensorboard-id"
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "CREATE-TIME",
  "updateTime": "UPDATE-TIME"
}
What's next
- Check out View Vertex AI TensorBoard.
- Learn how to optimize the performance of your custom training jobs using Cloud Profiler.