This page shows you how to enable Cloud Profiler so you can debug model training performance for your custom training jobs.
Training models can be computationally expensive. Profiler lets you monitor and optimize your model training performance by helping you understand the resource consumption of training operations. With this information, you can pinpoint and fix performance bottlenecks to train models faster and cheaper.
Before you begin
- Ensure that you're using TensorFlow 2.4 or a later version.
- Install the Vertex AI SDK with the `cloud_profiler` plugin. From your local Docker container, run:
  `pip install google-cloud-aiplatform[cloud_profiler]`
- You must have a Vertex AI TensorBoard instance. See Create a Vertex AI TensorBoard instance for instructions.
- You must have a service account with the `roles/storage.admin` and `roles/aiplatform.user` roles. See Create a service account with required permissions for instructions.
- You must have a Cloud Storage bucket to store Vertex AI TensorBoard logs. See Create a Cloud Storage bucket to store Vertex AI TensorBoard logs for instructions.
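The guides linked above create these prerequisites in the Google Cloud console. If you prefer to script them, here is a minimal sketch using the Vertex AI SDK and the Cloud Storage client; the project, region, bucket name, and display name are hypothetical placeholders:

```python
from google.cloud import aiplatform, storage

# Hypothetical project and region; replace with your own values.
PROJECT = "your-project"
REGION = "us-central1"

# Create a Cloud Storage bucket for the Vertex AI TensorBoard logs.
storage.Client(project=PROJECT).create_bucket(
    "your-tensorboard-logs-bucket", location=REGION
)

# Create a Vertex AI TensorBoard instance to attach to the training job.
aiplatform.init(project=PROJECT, location=REGION)
tensorboard = aiplatform.Tensorboard.create(display_name="profiler-tensorboard")
print(tensorboard.resource_name)  # projects/.../locations/.../tensorboards/...
```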
Enable Profiler
To enable Profiler for your training job, add the following to your training script:
- Add the `cloud_profiler` import to your top-level imports:
  `from google.cloud.aiplatform.training_utils import cloud_profiler`
- Initialize the `cloud_profiler` plugin by adding:
  `cloud_profiler.init()`
Example
Here's a sample training script:
```python
#!/usr/bin/env python
"""Train an MNIST model and use cloud_profiler for profiling."""

import argparse
import os

import tensorflow as tf
from google.cloud.aiplatform.training_utils import cloud_profiler


def _create_model():
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ]
    )
    return model


def main(args):
    # Create the distribution strategy early, before any other
    # TensorFlow operations, and only once.
    strategy = None
    if args.distributed:
        strategy = tf.distribute.MultiWorkerMirroredStrategy()

    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    # Scale pixel values to the [0, 1] range.
    x_train, x_test = x_train / 255.0, x_test / 255.0

    if args.distributed:
        with strategy.scope():
            model = _create_model()
            model.compile(
                optimizer="adam",
                loss=tf.keras.losses.sparse_categorical_crossentropy,
                metrics=["accuracy"],
            )
    else:
        model = _create_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.sparse_categorical_crossentropy,
            metrics=["accuracy"],
        )

    # Initialize the profiler.
    cloud_profiler.init()

    # Use AIP_TENSORBOARD_LOG_DIR to update where logs are written to.
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=os.environ["AIP_TENSORBOARD_LOG_DIR"], histogram_freq=1
    )

    model.fit(
        x_train,
        y_train,
        epochs=args.epochs,
        verbose=0,
        callbacks=[tensorboard_callback],
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--epochs", type=int, default=100, help="Number of epochs to run model."
    )
    parser.add_argument(
        "--distributed", action="store_true", help="Use MultiWorkerMirroredStrategy"
    )
    args = parser.parse_args()
    main(args)
```
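A note on the log directory: Vertex AI sets the `AIP_TENSORBOARD_LOG_DIR` environment variable automatically when the custom job runs with an attached Vertex AI TensorBoard instance. If you want to smoke-test the script outside Vertex AI, where that variable is unset, one option is a local fallback (a sketch; the fallback path is illustrative):

```python
import os

# AIP_TENSORBOARD_LOG_DIR is set by Vertex AI for jobs that run with a
# TensorBoard instance attached; fall back to a local path for test runs.
log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR", "/tmp/tensorboard_logs")
```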
Access the Profiler dashboard
After you've configured your training script to enable Profiler, run the training job with a Vertex AI TensorBoard instance.
When you create the custom training job, configure the following fields (a sketch using the Vertex AI SDK follows this list):
- Set `BASE_OUTPUT_DIR` to the Cloud Storage bucket where you want to store the Vertex AI TensorBoard logs generated by your training script.
- Set `'serviceAccount'` to the service account that you created with the `roles/storage.admin` and `roles/aiplatform.user` roles.
- Set `'tensorboard'` to the fully qualified name of the Vertex AI TensorBoard instance that you want to use with this training job. The fully qualified name has the following format:
  `projects/PROJECT_NUMBER_OR_ID/locations/REGION/tensorboards/TENSORBOARD_INSTANCE_ID`
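For example, with the Vertex AI SDK for Python, these fields map onto `CustomJob` arguments. A minimal sketch, in which the display name, container image, bucket, service account, and TensorBoard resource name are hypothetical placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

job = aiplatform.CustomJob.from_local_script(
    display_name="mnist-profiler-job",
    script_path="task.py",  # the training script shown above
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11.py310:latest",  # assumed prebuilt image
    requirements=["google-cloud-aiplatform[cloud_profiler]"],
    base_output_dir="gs://your-tensorboard-logs-bucket/output",  # BASE_OUTPUT_DIR
)

job.run(
    service_account="training-sa@your-project.iam.gserviceaccount.com",  # 'serviceAccount'
    tensorboard=(
        "projects/PROJECT_NUMBER_OR_ID/locations/us-central1/"
        "tensorboards/TENSORBOARD_INSTANCE_ID"  # 'tensorboard'
    ),
)
```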
There are two ways to access the Profiler dashboard from the Google Cloud console:
- From the Custom jobs page.
- From the Experiments page.
Access the Profiler dashboard through the Custom jobs page
You can use this method to access the Profiler dashboard even if the training job is in the Finished state.
- In the Google Cloud console, go to the Custom jobs tab on the Training page.
- Click the name of the training job you just created to go to the job details page.
- Click Open TensorBoard.
- Click the Profile tab.
Access the Profiler dashboard through the Experiments page
You can use this method to access the Profiler dashboard only when the training job is in the Running state.
- In the Google Cloud console, go to the Vertex AI Experiments page.
- Select the region of the training job that you just created.
- Click Open TensorBoard next to the name of the training job.
- Click the Profile tab.
Capture a profiling session
To capture a profiling session, your training job must be in the Running state. From the Profile tab in the Vertex AI TensorBoard instance, perform the following steps:
- Click Capture profile.
- In the Profile Service URL(s) or TPU name field, enter: `workerpool0-0`
- For Address type, select IP address.
- Click Capture.
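If you'd rather not trigger each capture by hand, Keras's TensorBoard callback can also record a profile for a fixed batch range through its `profile_batch` argument. A minimal sketch (the batch range is illustrative):

```python
import os
import tensorflow as tf

# Profile batches 10 through 20 of the first epoch automatically,
# in addition to (or instead of) on-demand captures from the dashboard.
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=os.environ["AIP_TENSORBOARD_LOG_DIR"],
    histogram_freq=1,
    profile_batch=(10, 20),  # illustrative range of batches to profile
)
```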
What's next
- See the TensorFlow Profiler documentation to learn about the profiler tools and how to use them to optimize model performance.