This page shows you how to use an interactive shell to inspect the container where your training code is running. You can browse the file system and run debugging utilities in each prebuilt container or custom container running on Vertex AI.
Using an interactive shell to inspect your training container can help you debug problems with your training code or your Vertex AI configuration. For example, you can use an interactive shell to do the following:
- Run tracing and profiling tools.
- Analyze GPU usage.
- Check Google Cloud permissions available to the container.
You can also use Cloud Profiler to debug model training performance for your custom training jobs. For details, see Profile model training performance using Profiler.
Before you begin
You can use an interactive shell when you perform custom training with a CustomJob resource, a HyperparameterTuningJob resource, or a custom TrainingPipeline resource. As you prepare your training code and configure the custom training resource of your choice, make sure to meet the following requirements:
- Ensure that your training container has bash installed.
  All prebuilt training containers have bash installed. If you create a custom container for training, use a base container that includes bash or install bash in your Dockerfile.
- Perform custom training in a region that supports interactive shells.
- Ensure that anyone who wants to access an interactive shell has the following permissions for the Google Cloud project where custom training is running:
  - aiplatform.customJobs.create
  - aiplatform.customJobs.get
  - aiplatform.customJobs.cancel
If you initiate custom training yourself, then you most likely already have these permissions and can access an interactive shell. However, if you want to use an interactive shell to inspect a custom training resource created by someone else in your organization, then you might need to obtain these permissions.
One way to obtain these permissions is to ask an administrator of your organization to grant you the Vertex AI User role (roles/aiplatform.user).
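The permission requirement above can be checked mechanically. The following is a minimal sketch, not an official tool: it assumes you have already called the Resource Manager `projects.testIamPermissions` method with the three permissions listed above and have its JSON response, which lists the subset of permissions that the caller holds. The function name `missing_permissions` is hypothetical.

```python
from typing import Dict, Iterable, List

# The three permissions this page lists as required for interactive shells.
REQUIRED_PERMISSIONS = (
    "aiplatform.customJobs.create",
    "aiplatform.customJobs.get",
    "aiplatform.customJobs.cancel",
)


def missing_permissions(
    response: Dict[str, List[str]],
    required: Iterable[str] = REQUIRED_PERMISSIONS,
) -> List[str]:
    """Return the required permissions that are absent from the response.

    `response` is assumed to be the parsed JSON body returned by
    projects.testIamPermissions, whose "permissions" field contains only
    the permissions that the caller actually has.
    """
    granted = set(response.get("permissions", []))
    return [permission for permission in required if permission not in granted]


if __name__ == "__main__":
    example = {
        "permissions": [
            "aiplatform.customJobs.create",
            "aiplatform.customJobs.get",
        ]
    }
    print(missing_permissions(example))
```

An empty result suggests that the user can open an interactive shell for jobs in that project, subject to the advanced requirements in the next section.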
Requirements for advanced cases
If you are using certain advanced features, meet the following additional requirements:
- If you attach a custom service account to your custom training resource, then make sure that any user who wants to access an interactive shell has the iam.serviceAccounts.actAs permission for the attached service account.
  The guide to custom service accounts notes that you must have this permission to attach a service account. You also need this permission to view an interactive shell during custom training.
  For example, to create a CustomJob with a service account attached, you must have the iam.serviceAccounts.actAs permission for the service account. If one of your colleagues then wants to view an interactive shell for this CustomJob, they must also have the same iam.serviceAccounts.actAs permission.
- If you have configured your project to use VPC Service Controls with Vertex AI, then account for the following additional limitations:
  - You can't use private IP for custom training. If you require VPC-SC with VPC Peering, extra setup is required to use the interactive shell. Follow the instructions in Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering to configure the interactive shell with VPC-SC and VPC Peering in your user project.
  - From within an interactive shell, you can't access the public internet or Google Cloud resources outside your service perimeter.
  - To secure access to interactive shells, you must add notebooks.googleapis.com as a restricted service in your service perimeter, in addition to aiplatform.googleapis.com. If you only restrict aiplatform.googleapis.com and not notebooks.googleapis.com, then users can access interactive shells from machines outside the service perimeter, which reduces the security benefit of using VPC Service Controls.
Enable interactive shells
To enable interactive shells for a custom training resource, set the enableWebAccess API field to true when you create a CustomJob, HyperparameterTuningJob, or custom TrainingPipeline.
The following examples show how to do this using several different tools:
Console
Follow the guide to creating a custom TrainingPipeline in the Google Cloud console. In the Train new model pane, when you reach the Model details step, do the following:
- Click Advanced options.
- Select the Enable training debugging checkbox.
Then, complete the rest of the Train new model workflow.
gcloud
- If you want to create a CustomJob, run the gcloud ai custom-jobs create command, and specify the --enable-web-access flag on this command.
- If you want to create a HyperparameterTuningJob, run the gcloud ai hp-tuning-jobs create command, and specify the --enable-web-access flag on this command.
To learn how to use these commands, see the guide to creating a CustomJob and the guide to creating a HyperparameterTuningJob.
API
The following partial REST request bodies show where to specify the enableWebAccess field for each type of custom training resource:
CustomJob
The following example is a partial request body for the projects.locations.customJobs.create API method:
{
  ...
  "jobSpec": {
    ...
    "enableWebAccess": true
  }
  ...
}
For an example of sending an API request to create a CustomJob, see Creating custom training jobs.
HyperparameterTuningJob
The following example is a partial request body for the projects.locations.hyperparameterTuningJobs.create API method:
{
  ...
  "trialJobSpec": {
    ...
    "enableWebAccess": true
  }
  ...
}
For an example of sending an API request to create a HyperparameterTuningJob, see Using hyperparameter tuning.
Custom TrainingPipeline
The following examples show partial request bodies for the projects.locations.trainingPipelines.create API method. Select one of the following tabs, depending on whether you are using hyperparameter tuning:
Without hyperparameter tuning
{
  ...
  "trainingTaskInputs": {
    ...
    "enableWebAccess": true
  }
  ...
}
With hyperparameter tuning
{
  ...
  "trainingTaskInputs": {
    ...
    "trialJobSpec": {
      ...
      "enableWebAccess": true
    }
  }
  ...
}
For an example of sending an API request to create a custom TrainingPipeline, see Creating training pipelines.
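The placement of enableWebAccess differs by resource type, which is easy to get wrong when you assemble request bodies in code. The following sketch summarizes the four placements shown in the partial request bodies above. The helper name `with_web_access` and the resource labels used as dictionary keys (in particular `TrainingPipelineWithTuning`) are hypothetical names chosen for this illustration; only the JSON field names come from this page.

```python
import copy
import json
from typing import Any, Dict, List

# Path from the top of the request body to the object that holds
# enableWebAccess. The dictionary keys are illustrative labels only.
_SPEC_PATHS: Dict[str, List[str]] = {
    "CustomJob": ["jobSpec"],
    "HyperparameterTuningJob": ["trialJobSpec"],
    "TrainingPipeline": ["trainingTaskInputs"],
    "TrainingPipelineWithTuning": ["trainingTaskInputs", "trialJobSpec"],
}


def with_web_access(body: Dict[str, Any], kind: str) -> Dict[str, Any]:
    """Return a copy of `body` with enableWebAccess set for the given kind."""
    result = copy.deepcopy(body)
    node = result
    for key in _SPEC_PATHS[kind]:
        # Create intermediate objects if the partial body lacks them.
        node = node.setdefault(key, {})
    node["enableWebAccess"] = True
    return result


if __name__ == "__main__":
    partial_body = {"displayName": "my-job", "jobSpec": {}}
    print(json.dumps(with_web_access(partial_body, "CustomJob"), indent=2))
```

The function copies its input, so the same partial body can be reused for several requests.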
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Set the enable_web_access parameter to True when you run one of the following methods:
- If you want to create a CustomJob, use the CustomJob.run method.
- If you want to create a HyperparameterTuningJob, use the HyperparameterTuningJob.run method.
- If you want to create a custom TrainingPipeline, use one of the following methods:
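As a rough illustration of the CustomJob case, the following sketch creates a single-replica job from a custom container and passes enable_web_access=True to CustomJob.run. The display name, machine type, and all function arguments are placeholder assumptions, and the SDK is imported lazily so that the worker pool helper can be used without the SDK installed.

```python
from typing import Any, Dict, List


def build_worker_pool_specs(
    image_uri: str, machine_type: str = "n1-standard-4"
) -> List[Dict[str, Any]]:
    """Build a one-replica worker pool spec for a custom container.

    The machine type default is an arbitrary placeholder.
    """
    return [
        {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": 1,
            "container_spec": {"image_uri": image_uri},
        }
    ]


def run_custom_job_with_shell(
    project: str, location: str, staging_bucket: str, image_uri: str
):
    """Create and run a CustomJob with interactive shells enabled."""
    # Imported here so this module loads even without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )
    job = aiplatform.CustomJob(
        display_name="debug-shell-job",  # placeholder name
        worker_pool_specs=build_worker_pool_specs(image_uri),
    )
    # This is the parameter described above.
    job.run(enable_web_access=True)
    return job
```

By default, run waits for the job to finish, so you would typically open the shell from the Google Cloud console or a second terminal while this call is in progress.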
Navigate to an interactive shell
After you have initiated custom training according to the guidance in the preceding section, Vertex AI generates one or more URIs that you can use to access interactive shells. Vertex AI generates a unique URI for each training node in your job.
You can navigate to an interactive shell in one of the following ways:
- Click a link in the Google Cloud console
- Use the Vertex AI API to get the shell's web access URI
Navigate from the Google Cloud console
- In the Google Cloud console, in the Vertex AI section, go to one of the following pages:
  - If you aren't using hyperparameter tuning, go to the Custom jobs page.
  - If you are using hyperparameter tuning, go to the Hyperparameter tuning jobs page.
- Click the name of your custom training resource.
  If you created a TrainingPipeline for custom training, click the name of the CustomJob or HyperparameterTuningJob that was created by your TrainingPipeline. For example, if your pipeline has the name PIPELINE_NAME, this might be called PIPELINE_NAME-custom-job or PIPELINE_NAME-hyperparameter-tuning-job.
- On the page for your job, click Launch web terminal. If your job uses multiple nodes, click Launch web terminal next to the node for which you want an interactive shell.
  Note that you can only access an interactive shell while the job is running. If you don't see Launch web terminal, this might be because Vertex AI hasn't started running your job yet, or because the job has already finished or failed. If the job's Status is Queued or Pending, wait a minute; then try refreshing the page.
  If you are using hyperparameter tuning, there are separate Launch web terminal links for each trial.
Get the web access URI from the API
Use the projects.locations.customJobs.get API method or the projects.locations.hyperparameterTuningJobs.get API method to see the URIs that you can use to access interactive shells.
Depending on which type of custom training resource you are using, select one of the following tabs to see examples of how to find the webAccessUris API field, which contains an interactive shell URI for each node in your job:
CustomJob
The following tabs show different ways to send a projects.locations.customJobs.get request:
gcloud
Run the gcloud ai custom-jobs describe command:
gcloud ai custom-jobs describe JOB_ID \
  --region=LOCATION \
  --format=json
Replace the following:
- JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai custom-jobs list command and look for the appropriate job.)
- LOCATION: The region where you created the job.
REST
Before using any of the request data, make the following replacements:
- LOCATION: The region where you created the job.
- PROJECT_ID: Your project ID.
- JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job.
HTTP method and URL:
GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs/JOB_ID
In the output, look for the following:
{
  ...
  "state": "JOB_STATE_RUNNING",
  ...
  "webAccessUris": {
    "workerpool0-0": "INTERACTIVE_SHELL_URI"
  }
}
If you don't see the webAccessUris field, this might be because Vertex AI hasn't started running your job yet. Verify that you see JOB_STATE_RUNNING in the state field. If the state is JOB_STATE_QUEUED or JOB_STATE_PENDING, wait a minute; then try getting the job info again.
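The wait-and-retry advice above can be automated. The following is a minimal polling sketch under a few assumptions: the job is fetched with the gcloud ai custom-jobs describe command shown earlier, gcloud is installed and authenticated, and the names `describe_custom_job` and `wait_for_web_access_uris` as well as the poll interval and retry limit are hypothetical choices for this example.

```python
import json
import subprocess
import time
from typing import Any, Callable, Dict

# States in which the job has not started yet, as named on this page.
WAITING_STATES = {"JOB_STATE_QUEUED", "JOB_STATE_PENDING"}


def describe_custom_job(job_id: str, region: str) -> Dict[str, Any]:
    """Run `gcloud ai custom-jobs describe` and parse its JSON output."""
    completed = subprocess.run(
        ["gcloud", "ai", "custom-jobs", "describe", job_id,
         f"--region={region}", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def wait_for_web_access_uris(
    fetch_job: Callable[[], Dict[str, Any]],
    poll_seconds: float = 60.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll until the job reports webAccessUris, then return them."""
    for attempt in range(max_polls):
        job = fetch_job()
        uris = job.get("webAccessUris")
        if uris:
            return dict(uris)
        state = job.get("state")
        # Keep waiting only while the job is queued, pending, or running.
        if state not in WAITING_STATES and state != "JOB_STATE_RUNNING":
            raise RuntimeError(
                f"Job is in state {state}; no interactive shell is available."
            )
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError("The job did not report webAccessUris in time.")
```

For example, `wait_for_web_access_uris(lambda: describe_custom_job("JOB_ID", "LOCATION"))` would poll once a minute until the URIs appear. Passing the fetch function as an argument keeps the loop independent of how the job info is retrieved.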
HyperparameterTuningJob
The following tabs show different ways to send a projects.locations.hyperparameterTuningJobs.get request:
gcloud
Run the gcloud ai hp-tuning-jobs describe command:
gcloud ai hp-tuning-jobs describe JOB_ID \
  --region=LOCATION \
  --format=json
Replace the following:
- JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai hp-tuning-jobs list command and look for the appropriate job.)
- LOCATION: The region where you created the job.
REST
Before using any of the request data, make the following replacements:
- LOCATION: The region where you created the job.
- PROJECT_ID: Your project ID.
- JOB_ID: The numerical ID of your job. This ID is the last part of the job's name field. You might have seen the ID when you created the job.
HTTP method and URL:
GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/hyperparameterTuningJobs/JOB_ID
In the output, look for the following:
{
  ...
  "state": "JOB_STATE_RUNNING",
  ...
  "trials": [
    ...
    {
      ...
      "state": "ACTIVE",
      ...
      "webAccessUris": {
        "workerpool0-0": "INTERACTIVE_SHELL_URI"
      }
    }
  ]
}
If you don't see the webAccessUris field, this might be because Vertex AI hasn't started running your job yet. Verify that you see JOB_STATE_RUNNING in the state field. If the state is JOB_STATE_QUEUED or JOB_STATE_PENDING, wait a minute; then try getting the job info again.
Vertex AI provides a set of interactive shell URIs for each hyperparameter tuning trial as the trial enters the ACTIVE state. If you want to get interactive shell URIs for later trials, get the job info again after those trials start.
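Because each trial carries its own webAccessUris field, it can help to collect the URIs of all active trials from one response. The following sketch assumes the parsed JSON output of gcloud ai hp-tuning-jobs describe. The function name is hypothetical, and the use of a per-trial "id" field as the dictionary key is an assumption, since the example output above elides the other trial fields; the sketch falls back to the trial's position in the list when no "id" is present.

```python
from typing import Any, Dict


def active_trial_shell_uris(job: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Map each ACTIVE trial to its interactive shell URIs.

    Trials that are not ACTIVE, or that have no webAccessUris yet,
    are skipped.
    """
    result: Dict[str, Dict[str, str]] = {}
    for position, trial in enumerate(job.get("trials", [])):
        uris = trial.get("webAccessUris")
        if trial.get("state") != "ACTIVE" or not uris:
            continue
        # "id" is assumed; fall back to the list position if it is missing.
        key = str(trial.get("id", position))
        result[key] = dict(uris)
    return result


if __name__ == "__main__":
    example_job = {
        "state": "JOB_STATE_RUNNING",
        "trials": [
            {"id": "1", "state": "SUCCEEDED"},
            {
                "id": "2",
                "state": "ACTIVE",
                "webAccessUris": {"workerpool0-0": "INTERACTIVE_SHELL_URI"},
            },
        ],
    }
    print(active_trial_shell_uris(example_job))
```

Calling the function again on a fresh response picks up trials that have started since the previous call.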
The preceding example shows the expected output for single-replica training: one URI for the primary training node. If you are performing distributed training, the output contains one URI for each training node, identified by worker pool.
For example, if your job has a primary worker pool with one replica and a secondary worker pool with two replicas, then the webAccessUris field looks similar to the following:
{
  "workerpool0-0": "URI_FOR_PRIMARY",
  "workerpool1-0": "URI_FOR_FIRST_SECONDARY",
  "workerpool1-1": "URI_FOR_SECOND_SECONDARY"
}
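For distributed jobs, grouping the URIs by worker pool makes it easier to pick the node you want. The following sketch assumes that every key follows the workerpoolN-M pattern seen in the examples above, where N is the worker pool index and M is the replica index. That pattern is inferred from the sample output rather than from a documented guarantee, and the function name is hypothetical.

```python
import re
from typing import Dict

# Inferred from the examples: "workerpool<pool index>-<replica index>".
_KEY_PATTERN = re.compile(r"^workerpool(\d+)-(\d+)$")


def group_by_worker_pool(uris: Dict[str, str]) -> Dict[int, Dict[int, str]]:
    """Return {pool index: {replica index: URI}} for a webAccessUris map."""
    pools: Dict[int, Dict[int, str]] = {}
    for key, uri in uris.items():
        match = _KEY_PATTERN.match(key)
        if match is None:
            raise ValueError(f"Unexpected webAccessUris key: {key!r}")
        pool_index = int(match.group(1))
        replica_index = int(match.group(2))
        pools.setdefault(pool_index, {})[replica_index] = uri
    return pools


if __name__ == "__main__":
    web_access_uris = {
        "workerpool0-0": "URI_FOR_PRIMARY",
        "workerpool1-0": "URI_FOR_FIRST_SECONDARY",
        "workerpool1-1": "URI_FOR_SECOND_SECONDARY",
    }
    print(group_by_worker_pool(web_access_uris))
```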
Use an interactive shell
To use the interactive shell for a training node, navigate to one of the URIs that you found in the preceding section. A Bash shell appears in your browser, giving you access to the file system of the container where Vertex AI is running your training code.
The following sections describe some things to consider as you use the shell and provide some examples of monitoring tools you might use in the shell.
Prevent the job from ending
When Vertex AI finishes running your job or trial, you immediately lose access to your interactive shell. If this happens, you might see the message "command terminated with exit code 137", or the shell might stop responding. Any files that you created in the container's file system don't persist after the job ends.
In some cases, you might want to purposefully make your job run longer in order to debug with an interactive shell. For example, you can add code like the following to your training code in order to make the job keep running for at least an hour after an exception occurs:
import time
import traceback

try:
    # Replace with a function that runs your training code
    train_model()
except Exception as e:
    traceback.print_exc()
    time.sleep(60 * 60)  # 1 hour
However, note that you incur Vertex AI Training charges as long as the job keeps running.
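Because of these charges, you might prefer to make the delay opt-in. The following variation is a sketch, not part of the Vertex AI tooling: it wraps the training call in a context manager that sleeps only when an environment variable is set, and then re-raises the exception so that the job still ends as failed. The variable name DEBUG_KEEP_ALIVE_SECONDS and the name `keep_alive_on_failure` are hypothetical.

```python
import contextlib
import os
import time
import traceback
from typing import Callable, Iterator


@contextlib.contextmanager
def keep_alive_on_failure(
    env_var: str = "DEBUG_KEEP_ALIVE_SECONDS",  # hypothetical variable name
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[None]:
    """Keep the process alive after an exception if env_var is set."""
    try:
        yield
    except Exception:
        traceback.print_exc()
        seconds = float(os.environ.get(env_var, "0"))
        if seconds > 0:
            # Leaves time to open an interactive shell and inspect the node.
            sleep(seconds)
        # Unlike the example above, re-raise so the job is marked as failed.
        raise
```

You would wrap the training entry point, for example `with keep_alive_on_failure(): train_model()`, and set the variable to 3600 in the job's environment only for debugging runs.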
Check permissions issues
The interactive shell environment is authenticated using application default credentials (ADC) for the service account that Vertex AI uses to run your training code. You can run gcloud auth list in the shell for more details.
In the shell, you can use bq and other tools that support ADC. This can help you verify that the job is able to access a particular Cloud Storage bucket, BigQuery table, or other Google Cloud resource that your training code needs.
Visualize Python execution with py-spy
py-spy lets you profile an executing Python program, without modifying it. To use py-spy in an interactive shell, do the following:
- Install py-spy:
    pip3 install py-spy
- Run ps aux in the shell, and look for the PID of the Python training program.
- Run any of the subcommands described in the py-spy documentation, using the PID that you found in the preceding step.
- If you use py-spy record to create an SVG file, copy this file to a Cloud Storage bucket so you can view it later on your local computer. For example:
    gcloud storage cp profile.svg gs://BUCKET
  Replace BUCKET with the name of a bucket you have access to.
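The PID lookup in the steps above can also be scripted. The following sketch parses the text printed by ps aux and returns the processes whose executable name starts with "python". It assumes the usual 11-column ps aux layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); the function name is hypothetical.

```python
import os
from typing import List, Tuple


def find_python_pids(ps_output: str) -> List[Tuple[int, str]]:
    """Return (PID, command) pairs for Python processes in `ps aux` output."""
    matches: List[Tuple[int, str]] = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # Skip the header row.
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        executable = os.path.basename(command.split()[0])
        if executable.startswith("python"):
            matches.append((int(fields[1]), command))
    return matches


if __name__ == "__main__":
    sample = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4628 820 ? Ss 10:00 0:00 /bin/sh -c python3 -m trainer.task
root 7 98.5 3.2 912340 262144 ? Rl 10:00 5:12 python3 -m trainer.task
root 42 0.0 0.0 18508 3400 pts/0 Ss 10:05 0:00 bash
"""
    print(find_python_pids(sample))
```

In the shell you could feed it live data with `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout` and pass the resulting PID to py-spy.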
Analyze performance with perf
perf lets you analyze the performance of your training node. To install the version of perf appropriate for your node's Linux kernel, run the following commands:
apt-get update
apt-get install -y linux-tools-generic
rm /usr/bin/perf
LINUX_TOOLS_VERSION=$(ls /usr/lib/linux-tools | tail -n 1)
ln -s "/usr/lib/linux-tools/${LINUX_TOOLS_VERSION}/perf" /usr/bin/perf
After this, you can run any of the subcommands described in the perf documentation.
Retrieve information about GPU usage
GPU-enabled containers running on nodes with GPUs typically have several command-line tools preinstalled that can help you monitor GPU usage. For example:
- Use nvidia-smi to monitor GPU utilization of various processes.
- Use nvprof to collect a variety of GPU profiling information. Since nvprof can't attach to an existing process, you might want to use the tool to start an additional process running your training code. (This means your training code will run twice on the node.) For example:
    nvprof -o prof.nvvp python3 -m MODULE_NAME
  Replace MODULE_NAME with the fully-qualified name of your training application's entry point module; for example, trainer.task.
  Then transfer the output file to a Cloud Storage bucket so you can analyze it later on your local computer. For example:
    gcloud storage cp prof.nvvp gs://BUCKET
  Replace BUCKET with the name of a bucket you have access to.
- If you encounter a GPU error (not a problem with your configuration or with Vertex AI), use nvidia-bug-report.sh to create a bug report.
  Then transfer the report to a Cloud Storage bucket so you can analyze it later on your local computer or send it to NVIDIA. For example:
    gcloud storage cp nvidia-bug-report.log.gz gs://BUCKET
  Replace BUCKET with the name of a bucket you have access to.
If bash can't find any of these NVIDIA commands, try adding /usr/local/nvidia/bin and /usr/local/cuda/bin to the shell's PATH:
export PATH="/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}"
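If you want to log GPU usage over time rather than read nvidia-smi output by eye, you can query it in CSV form and parse the result. The following sketch assumes nvidia-smi is on the PATH and uses its --query-gpu and --format=csv,noheader,nounits options. It reports per-GPU totals rather than per-process figures, and it doesn't handle non-numeric values such as "[N/A]" that some GPUs report. The function names are hypothetical.

```python
import csv
import io
import subprocess
from typing import Dict, List

QUERY_FIELDS = ("index", "utilization.gpu", "memory.used", "memory.total")


def parse_gpu_stats(csv_text: str) -> List[Dict[str, int]]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    stats: List[Dict[str, int]] = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # Skip blank lines.
        index, utilization, used, total = (cell.strip() for cell in row)
        stats.append(
            {
                "index": int(index),
                "utilization_percent": int(utilization),
                "memory_used_mib": int(used),
                "memory_total_mib": int(total),
            }
        )
    return stats


def query_gpu_stats() -> List[Dict[str, int]]:
    """Run nvidia-smi on the current node and return parsed statistics."""
    completed = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_stats(completed.stdout)
```

Calling query_gpu_stats in a loop with a short sleep gives a simple utilization trace that you can copy to Cloud Storage before the job ends.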
Ray Dashboard and Interactive Shell with VPC-SC + VPC Peering
- Configure peered-dns-domains.
    {
      VPC_NAME=NETWORK_NAME
      REGION=LOCATION
      gcloud services peered-dns-domains create training-cloud \
        --network=$VPC_NAME \
        --dns-suffix=$REGION.aiplatform-training.cloud.google.com.

      # Verify
      gcloud beta services peered-dns-domains list --network $VPC_NAME;
    }
  Replace the following:
  - NETWORK_NAME: The name of your peered network.
  - LOCATION: The desired location (for example, us-central1).
- Configure a DNS managed zone.
    {
      PROJECT_ID=PROJECT_ID
      ZONE_NAME=$PROJECT_ID-aiplatform-training-cloud-google-com
      DNS_NAME=aiplatform-training.cloud.google.com
      DESCRIPTION=aiplatform-training.cloud.google.com
      gcloud dns managed-zones create $ZONE_NAME \
        --visibility=private \
        --networks=https://www.googleapis.com/compute/v1/projects/$PROJECT_ID/global/networks/$VPC_NAME \
        --dns-name=$DNS_NAME \
        --description="Training $DESCRIPTION"
    }
  Replace the following:
  - PROJECT_ID: Your project ID. You can find this ID on the Google Cloud console welcome page.
- Record the DNS transaction.
    {
      gcloud dns record-sets transaction start --zone=$ZONE_NAME

      gcloud dns record-sets transaction add \
        --name=$DNS_NAME. \
        --type=A 199.36.153.4 199.36.153.5 199.36.153.6 199.36.153.7 \
        --zone=$ZONE_NAME \
        --ttl=300

      gcloud dns record-sets transaction add \
        --name=*.$DNS_NAME. \
        --type=CNAME $DNS_NAME. \
        --zone=$ZONE_NAME \
        --ttl=300

      gcloud dns record-sets transaction execute --zone=$ZONE_NAME
    }
- Submit a training job with the interactive shell, VPC-SC, and VPC Peering enabled.
What's next
- Learn how to optimize the performance of your custom training jobs using Profiler.
- Learn more about how Vertex AI orchestrates custom training.
- Read about Training code requirements.