During training, you can use an interactive shell to inspect the container where your training code is running. You can browse the file system and run debugging utilities in each runtime version or custom container running on AI Platform Training.
Using an interactive shell to inspect your training container can help you debug problems with your training code or your AI Platform Training configuration. For example, you can use an interactive shell to do the following:
- Run tracing and profiling tools
- Analyze GPU usage
- Check Google Cloud permissions available to the container
Before you begin
You can use an interactive shell when you run a training or hyperparameter tuning job. As you prepare your training code and run a training job, make sure to meet the following requirements:
- Ensure that your training container has `bash` installed. All runtime version containers have `bash` installed. If you create a custom container for training, use a base container that includes `bash` or install `bash` in your Dockerfile.
- Perform training in a region that supports interactive shells.
- Ensure that anyone who wants to access an interactive shell has the following permissions for the Google Cloud project where training is running:
  - `ml.jobs.create`
  - `ml.jobs.get`
  - `ml.jobs.cancel`
If you initiate training yourself, then you most likely already have these permissions and can access an interactive shell. However, if you want to use an interactive shell to inspect a training job created by someone else in your organization, then you might need to obtain these permissions.
One way to obtain these permissions is to ask an administrator of your organization to grant you the AI Platform Training Admin role (`roles/ml.admin`). If you are granted the AI Platform Training Developer role (`roles/ml.developer`), you have access to the interactive shell for jobs that you create.
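For example, an administrator could grant the admin role at the project level with a command like the following. This is only a sketch: PROJECT_ID and USER_EMAIL are placeholders for your project and for the user who needs shell access.

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/ml.admin"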
Requirements for advanced cases
If you are using certain advanced features, meet the following additional requirements:
- If you attach a custom service account to your training job, then make sure that any user who wants to access an interactive shell has the `iam.serviceAccounts.actAs` permission for the attached service account. The guide to custom service accounts notes that you must have this permission to attach a service account; you also need it to view an interactive shell for the job.
  For example, to create a job with a service account attached, you must have the `iam.serviceAccounts.actAs` permission for the service account. If one of your colleagues then wants to view an interactive shell for this job, they must also have the same `iam.serviceAccounts.actAs` permission.
- If you have configured your project to use VPC Service Controls with AI Platform Training, then account for the following additional limitations:
  - You cannot use VPC Service Controls with VPC Network Peering.
  - From within an interactive shell, you cannot access the public internet or Google Cloud resources outside your service perimeter.
  - To secure access to the interactive shells, you must add `notebooks.googleapis.com` as a restricted service in your service perimeter, in addition to `ml.googleapis.com` (see the example command after this list). If you only restrict `ml.googleapis.com` and not `notebooks.googleapis.com`, then users can access interactive shells from machines outside the service perimeter, which reduces the security benefit of using VPC Service Controls.
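For example, an administrator could add both services as restricted services to an existing perimeter with a command like the following. This is a sketch: PERIMETER_NAME and POLICY_ID are placeholders for your service perimeter and access policy.

gcloud access-context-manager perimeters update PERIMETER_NAME \
  --policy=POLICY_ID \
  --add-restricted-services=ml.googleapis.com,notebooks.googleapis.com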
Enabling interactive shells
To enable interactive shells for a training job, set the `enableWebAccess` API field to `true` in your job's `trainingInput` field when you create the training job.
The following example shows how to do this by adding the `--enable-web-access` flag when using the gcloud CLI. You cannot currently create a training job with an interactive shell enabled in the Google Cloud console.

The example assumes that you have a training application on your local filesystem in a directory named `trainer` with a module named `task`.
To create the training job, run the following command:
gcloud ai-platform jobs submit training JOB_ID \
--enable-web-access \
--job-dir=JOB_DIR \
--module-name=trainer.task \
--package-path=trainer \
--python-version=3.7 \
--region=REGION \
--runtime-version=2.11 \
--scale-tier=CUSTOM \
--master-machine-type=n1-highmem-8
In this command, replace the following placeholders:
- JOB_ID: A name that you choose for the job.
- JOB_DIR: A path to a Cloud Storage directory where your training application will be uploaded.
- REGION: The region where you plan to create the training job. It must be a region that supports interactive shells.
The command produces the following output if successful:
Job [JOB_ID] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe JOB_ID

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs JOB_ID
jobId: JOB_ID
state: QUEUED
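If you create jobs by calling the AI Platform Training REST API directly instead of using the gcloud CLI, set the same field in the request body. The following is a minimal sketch of a `projects.jobs.create` request; it assumes you have already packaged your trainer and uploaded it to Cloud Storage (the gcloud command does this for you), and PROJECT_ID, JOB_ID, BUCKET, JOB_DIR, and REGION are placeholders. Only `enableWebAccess` is specific to this feature.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://ml.googleapis.com/v1/projects/PROJECT_ID/jobs" \
  -d '{
    "jobId": "JOB_ID",
    "trainingInput": {
      "enableWebAccess": true,
      "scaleTier": "CUSTOM",
      "masterType": "n1-highmem-8",
      "packageUris": ["gs://BUCKET/trainer-0.1.tar.gz"],
      "pythonModule": "trainer.task",
      "jobDir": "JOB_DIR",
      "region": "REGION",
      "runtimeVersion": "2.11",
      "pythonVersion": "3.7"
    }
  }'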
Getting the web access URI
After you have initiated training according to the guidance in the
preceding section, use the Google Cloud console or the gcloud
command-line
tool to see the URIs that you can use to access interactive shells.
AI Platform Training provides a URI for each training
node that is part of your job.
Depending on which type of training job you created, see one of the following sections for examples of how to find the `webAccessUris` API field, which contains an interactive shell URI for each node in your job.
Training Job
The following sections show different ways to access the `TrainingOutput` for a standard training job.
gcloud
Run the `gcloud ai-platform jobs describe` command:
gcloud ai-platform jobs describe JOB_ID
Replace the following:

- JOB_ID: The ID for your job. You set this ID when you created the job. If you don't know your job's ID, you can run the `gcloud ai-platform jobs list` command and look for the appropriate job.
In the output, look for the following:
trainingOutput:
webAccessUris:
master-replica-0: INTERACTIVE_SHELL_URI
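If you only want the shell URIs, you can also ask gcloud to print just that field. The following is a sketch that uses a `--format` projection on the same describe command:

gcloud ai-platform jobs describe JOB_ID \
  --format="yaml(trainingOutput.webAccessUris)"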
Console
1. Open the AI Platform Training Jobs page in the Google Cloud console.
2. Click your job name in the list to open the Job Details page.
3. Click the Show JSON button in the Training output section to expand a JSON view of the `TrainingOutput` for the job.
In the output, look for the following:
{
"webAccessUris": {
"master-replica-0": "INTERACTIVE_SHELL_URI"
}
}
If you don't see the `webAccessUris` field, this might be because AI Platform Training hasn't started running your job or trial yet. Verify that you see `RUNNING` in the `state` field. If the state is `QUEUED` or `PREPARING`, wait a minute, and then try getting the job info again.
Hyperparameter Tuning Job
The following sections show different ways to access the `TrainingOutput` for a hyperparameter tuning job.
gcloud
Run the `gcloud ai-platform jobs describe` command:
gcloud ai-platform jobs describe JOB_ID
Replace the following:

- JOB_ID: The ID for your job. You set this ID when you created the job. If you don't know your job's ID, you can run the `gcloud ai-platform jobs list` command and look for the appropriate job.
In the output, look for the following:
trainingOutput:
trials:
- trialId: '1'
webAccessUris:
master-replica-0: INTERACTIVE_SHELL_URI
Console
1. Open the AI Platform Training Jobs page in the Google Cloud console.
2. Click your job name in the list to open the Job Details page.
3. Click the Show JSON button in the Training output section to expand a JSON view of the `TrainingOutput` for the job.
In the output, look for the following:
{
"trials": [
{
...
"webAccessUris": {
"master-replica-0": "INTERACTIVE_SHELL_URI"
}
},
...
]
}
If you don't see the `webAccessUris` field, this might be because AI Platform Training hasn't started running your job or trial yet. Verify that you see `RUNNING` in the `state` field. If the state is `QUEUED` or `PREPARING`, wait a minute, and then try getting the job info again.
AI Platform Training provides a set of interactive shell URIs for each hyperparameter tuning trial as the trial enters the `RUNNING` state. If you want to get the interactive shell URIs for later trials, get the job info again after those trials start.
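For example, to re-check just the trial output as new trials start, you could use a sketch like the following, which uses a `--format` projection:

gcloud ai-platform jobs describe JOB_ID \
  --format="yaml(trainingOutput.trials)"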
The preceding output examples show what to expect for single-replica training: one URI for the primary training node. If you are performing distributed training, the output contains one URI for each training node, identified by task name.

For example, if your job has a master and two workers, then the `webAccessUris` field looks similar to the following:
{
"master-replica-0": "URI_FOR_PRIMARY",
"worker-replica-0": "URI_FOR_FIRST_SECONDARY",
"worker-replica-1": "URI_FOR_SECOND_SECONDARY"
}
Available regions
Using an interactive shell for AI Platform Training is supported in the following regions:
Americas
- Oregon (us-west1)
- Los Angeles (us-west2)
- Iowa (us-central1)
- South Carolina (us-east1)
- N. Virginia (us-east4)
- Montréal (northamerica-northeast1)
Europe
- London (europe-west2)
- Belgium (europe-west1)
- Zurich (europe-west6)
- Frankfurt (europe-west3)
Asia Pacific
- Singapore (asia-southeast1)
- Taiwan (asia-east1)
- Tokyo (asia-northeast1)
- Sydney (australia-southeast1)
- Seoul (asia-northeast3)
AI Platform Training also provides additional regions for training.
Using an interactive shell
To use the interactive shell for a training node, navigate to one of the URIs that you found in the preceding section. A Bash shell appears in your browser, giving you access to the file system of the container where AI Platform Training is running your training code.
The following sections describe some things to consider as you use the shell and provide some examples of monitoring tools you might use in the shell.
Preventing the job from ending
When AI Platform Training finishes running your job or trial, you will immediately lose access to your interactive shell. If this happens, you might see the message `command terminated with exit code 137`, or the shell might stop responding. If you created any files in the container's file system, they will not persist after the job ends.
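If you want to keep files that you create during a debugging session, copy them to Cloud Storage from within the shell before the job ends. The following is a sketch; the /tmp/debug directory and BUCKET are placeholders for wherever you saved your files and for a bucket you can write to:

# Copy debugging artifacts out of the container before the job ends
gcloud storage cp --recursive /tmp/debug gs://BUCKET/debug-artifacts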
In some cases, you might want to purposefully make your job run longer in order to debug with an interactive shell. For example, you can add code like the following to your training code in order to make the job keep running for at least an hour after an exception occurs:
import time
import traceback

try:
    # Replace with a function that runs your training code
    train_model()
except Exception as e:
    traceback.print_exc()
    time.sleep(60 * 60)  # 1 hour
However, note that you incur AI Platform Training charges as long as the job keeps running.
Checking permissions issues
The interactive shell environment is authenticated using application default credentials (ADC) for the service account that AI Platform Training uses to run your training code. You can run `gcloud auth list` in the shell for more details.
In the shell, you can use `gcloud storage`, `bq`, and other tools that support ADC. This can help you verify that the job is able to access a particular Cloud Storage bucket, BigQuery table, or other Google Cloud resource that your training code needs.
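For example, the following commands check whether the job's service account can reach the resources your training code depends on. This is a sketch; BUCKET, PROJECT_ID, and DATASET are placeholders for resources your code actually uses:

# Show which account the shell is authenticated as
gcloud auth list

# Verify that the job can list objects in a Cloud Storage bucket it reads from
gcloud storage ls gs://BUCKET

# Verify that the job can see a BigQuery dataset it queries
bq ls PROJECT_ID:DATASET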
Visualizing Python execution with py-spy
`py-spy` lets you profile an executing Python program without modifying it. To use `py-spy` in an interactive shell, do the following:

1. Install `py-spy`:

   pip3 install py-spy

2. Run `ps aux` in the shell, and look for the PID of the Python training program.

3. Run any of the subcommands described in the `py-spy` documentation, using the PID that you found in the preceding step.

4. If you use `py-spy record` to create an SVG file, copy this file to a Cloud Storage bucket so you can view it later on your local computer. For example:

   gcloud storage cp profile.svg gs://BUCKET

   Replace BUCKET with the name of a bucket you have access to.
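For example, assuming the training process you found with `ps aux` has PID 1234, the following commands print its current call stack and record a one-minute flame graph:

# Print the current Python call stack of the training process
py-spy dump --pid 1234

# Sample the process for 60 seconds and write a flame graph to an SVG file
py-spy record --pid 1234 --duration 60 --output profile.svg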
Analyzing performance with perf
`perf` lets you analyze the performance of your training node. To install the version of `perf` appropriate for your node's Linux kernel, run the following commands:
apt-get update
apt-get install -y linux-tools-generic
rm /usr/bin/perf
LINUX_TOOLS_VERSION=$(ls /usr/lib/linux-tools | tail -n 1)
ln -s "/usr/lib/linux-tools/${LINUX_TOOLS_VERSION}/perf" /usr/bin/perf
After this, you can run any of the subcommands described in the `perf` documentation.
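For example, assuming the training process has PID 1234, you could sample it for 30 seconds and then browse the hottest call paths:

# Sample the process, including call graphs, for 30 seconds
perf record -g -p 1234 -- sleep 30

# Summarize the recorded samples
perf report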
Getting information about GPU usage
GPU-enabled containers running on nodes with GPUs typically have several command-line tools preinstalled that can help you monitor GPU usage. For example:
- Use `nvidia-smi` to monitor GPU utilization of various processes, as shown in the example after this list.
- Use `nvprof` to collect a variety of GPU profiling information. Since `nvprof` can't attach to an existing process, you might want to use the tool to start an additional process running your training code. (This means your training code will be running twice on the node.) For example:

  nvprof -o prof.nvvp python3 -m MODULE_NAME

  Replace MODULE_NAME with the fully qualified name of your training application's entry point module; for example, `trainer.task`.

  Then transfer the output file to a Cloud Storage bucket so you can analyze it later on your local computer. For example:

  gcloud storage cp prof.nvvp gs://BUCKET

  Replace BUCKET with the name of a bucket you have access to.
- If you encounter a GPU error (not a problem with your configuration or with AI Platform Training), use `nvidia-bug-report.sh` to create a bug report. Then transfer the report to a Cloud Storage bucket so you can analyze it later on your local computer or send it to NVIDIA. For example:

  gcloud storage cp nvidia-bug-report.log.gz gs://BUCKET

  Replace BUCKET with the name of a bucket you have access to.
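For example, to watch GPU and memory utilization refresh every few seconds while your job trains, you could run a sketch like the following, which uses the `nvidia-smi` query flags:

# Print GPU utilization and memory use every 5 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 5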
If `bash` can't find any of these NVIDIA commands, try adding `/usr/local/nvidia/bin` and `/usr/local/cuda/bin` to the shell's `PATH`:
export PATH="/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}"
What's next
- Learn more about the AI Platform Training service.
- Read about Packaging a training application.