Monitoring and debugging training with an interactive shell

During training, you can use an interactive shell to inspect the container where your training code is running. You can browse the file system and run debugging utilities in each runtime version or custom container running on AI Platform Training.

Using an interactive shell to inspect your training container can help you debug problems with your training code or your AI Platform Training configuration. For example, you can use an interactive shell to do the following:

  • Run tracing and profiling tools
  • Analyze GPU usage
  • Check Google Cloud permissions available to the container

Before you begin

You can use an interactive shell when you run a training or hyperparameter tuning job. As you prepare your training code and run a training job, make sure to meet the following requirements:

Enabling interactive shells

To enable interactive shells for a training job, set the enableWebAccess API field to true in your job's trainingInput field when you create a training job.

The following example shows how to do this using the gcloud tool. You cannot currently create a training job with an interactive shell enabled in the Cloud Console.

The example assumes that you have a training application on your local filesystem in a directory named trainer with a module named task.

  1. Create a config.yaml configuration file that contains the following:

    trainingInput:
      enableWebAccess: true
    
  2. To create the training job, run the following command:

    gcloud ai-platform jobs submit training JOB_ID \
      --config=config.yaml \
      --job-dir=JOB_DIR \
      --module-name=trainer.task \
      --package-path=trainer \
      --python-version=3.7 \
      --region=REGION \
      --runtime-version=2.5 \
      --scale-tier=CUSTOM \
      --master-machine-type=n1-highmem-8
    

    In this command, replace the following placeholders:

    • JOB_ID: A name that you choose for the job.
    • JOB_DIR: The path to a Cloud Storage directory to which your training application is uploaded.
    • REGION: The region where you plan to create the training job. Note that it must be a region that supports interactive shells.

    The command produces the following output if successful:

    Job [JOB_ID] submitted successfully.
    Your job is still active. You may view the status of your job with the command
    
      $ gcloud ai-platform jobs describe JOB_ID
    
    or continue streaming the logs with the command
    
      $ gcloud ai-platform jobs stream-logs JOB_ID
    jobId: JOB_ID
    state: QUEUED
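
If you create the job by calling the API directly instead of using gcloud, the same setting belongs in the request body. The following Python sketch builds a body that roughly mirrors the command above. It assumes the field names of the TrainingInput resource, and the package URI passed in the example is a hypothetical placeholder for your packaged trainer.

```python
import json


def build_job_body(job_id, job_dir, region, package_uri):
    """Return a job request body with interactive shells enabled."""
    return {
        "jobId": job_id,
        "trainingInput": {
            "scaleTier": "CUSTOM",
            "masterType": "n1-highmem-8",
            # Assumption: the trainer package was already uploaded here.
            "packageUris": [package_uri],
            "pythonModule": "trainer.task",
            "region": region,
            "jobDir": job_dir,
            "runtimeVersion": "2.5",
            "pythonVersion": "3.7",
            # The field that turns on interactive shells for the job.
            "enableWebAccess": True,
        },
    }


# Example with placeholder values; replace them with your own.
example_body = build_job_body(
    "JOB_ID", "gs://BUCKET/job-dir", "us-central1", "gs://BUCKET/trainer.tar.gz"
)
print(json.dumps(example_body, indent=2))
```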
    

Getting the web access URI

After you have initiated training according to the guidance in the preceding section, use the Cloud Console or the gcloud command-line tool to see the URIs that you can use to access interactive shells. AI Platform Training provides a URI for each training node that is part of your job.

Depending on which type of training job you created, see one of the following sections for examples of how to find the webAccessUris API field, which contains an interactive shell URI for each node in your job.

Training Job

The following subsections show different ways to access the TrainingOutput for a standard training job.

gcloud

Run the gcloud ai-platform jobs describe command:

gcloud ai-platform jobs describe JOB_ID

Replace the following:

  • JOB_ID: The ID for your job. You set this ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai-platform jobs list command and look for the appropriate job.)

In the output, look for the following:

trainingOutput:
  webAccessUris:
    master-replica-0: INTERACTIVE_SHELL_URI

Console

  1. Open the AI Platform Training Jobs page in the Cloud Console.

  2. Click your job name in the list to open the Job Details page.

  3. Click the Show Json button in the Training output section to expand a JSON view of the TrainingOutput for the job.

In the output, look for the following:

{
  "webAccessUris": {
    "master-replica-0": "INTERACTIVE_SHELL_URI"
  }
}

If you don't see the webAccessUris field, this might be because AI Platform Training hasn't started running your job or trial yet. Verify that you see RUNNING in the state field. If the state is QUEUED or PREPARING, wait a minute; then try getting the job info again.
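
The check described above can be sketched in Python. This example assumes the job description has already been parsed into a dictionary, for instance from the output of gcloud ai-platform jobs describe JOB_ID --format=json; the function name and its return values are illustrative only.

```python
def shell_status(job):
    """Classify a standard training job by interactive shell availability.

    Returns "ready" when URIs are present, "wait" when the job has not
    started running yet, and "not-available" in every other case.
    """
    uris = job.get("trainingOutput", {}).get("webAccessUris", {})
    if uris:
        return "ready"
    if job.get("state") in ("QUEUED", "PREPARING"):
        # The job has not started yet; describe it again in a minute.
        return "wait"
    # For example, the job already ended or web access was not enabled.
    return "not-available"
```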

Hyperparameter Tuning Job

The following subsections show different ways to access the TrainingOutput for a hyperparameter tuning job.

gcloud

Run the gcloud ai-platform jobs describe command:

gcloud ai-platform jobs describe JOB_ID

Replace the following:

  • JOB_ID: The ID for your job. You set this ID when you created the job. (If you don't know your job's ID, you can run the gcloud ai-platform jobs list command and look for the appropriate job.)

In the output, look for the following:

trainingOutput:
  trials:
  - trialId: '1'
    webAccessUris:
      master-replica-0: INTERACTIVE_SHELL_URI

Console

  1. Open the AI Platform Training Jobs page in the Cloud Console.

  2. Click your job name in the list to open the Job Details page.

  3. Click the Show Json button in the Training output section to expand a JSON view of the TrainingOutput for the job.

In the output, look for the following:

{
  "trials": [
    {
      ...
      "webAccessUris": {
        "master-replica-0": "INTERACTIVE_SHELL_URI"
      }
    },
    ...
  ]
}

If you don't see the webAccessUris field, this might be because AI Platform Training hasn't started running your job or trial yet. Verify that you see RUNNING in the state field. If the state is QUEUED or PREPARING, wait a minute; then try getting the job info again.

AI Platform Training provides a set of interactive shell URIs for each hyperparameter tuning trial as the trial enters the RUNNING state. If you want to get the interactive shell URIs for later trials, get the job info again after those trials start.
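
Because URIs appear trial by trial, it can help to collect them from each fresh job description. The following is a minimal sketch, assuming the job description was parsed into a dictionary shaped like the JSON shown above; the function name is hypothetical.

```python
def trial_shell_uris(job):
    """Map each trial ID to its {node name: URI} dictionary.

    Trials that have not started yet have no webAccessUris and are skipped.
    """
    trials = job.get("trainingOutput", {}).get("trials", [])
    return {
        trial["trialId"]: trial["webAccessUris"]
        for trial in trials
        if "trialId" in trial and trial.get("webAccessUris")
    }
```

Calling the function again on a newer job description picks up the URIs of trials that started in the meantime.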

The preceding example shows the expected output for single-replica training: one URI for the primary training node. If you are performing distributed training, the output contains one URI for each training node, identified by task name.

For example, if your job has a master and two workers, then the webAccessUris field looks similar to the following:

{
  "master-replica-0": "URI_FOR_PRIMARY",
  "worker-replica-0": "URI_FOR_FIRST_SECONDARY",
  "worker-replica-1": "URI_FOR_SECOND_SECONDARY"
}

Available regions

Using an interactive shell for AI Platform Training is supported in the following regions:

Americas

  • Oregon (us-west1)
  • Los Angeles (us-west2)
  • Iowa (us-central1)
  • South Carolina (us-east1)
  • N. Virginia (us-east4)
  • Montréal (northamerica-northeast1)

Europe

  • London (europe-west2)
  • Belgium (europe-west1)
  • Zurich (europe-west6)
  • Frankfurt (europe-west3)

Asia Pacific

  • Singapore (asia-southeast1)
  • Taiwan (asia-east1)
  • Tokyo (asia-northeast1)
  • Sydney (australia-southeast1)
  • Seoul (asia-northeast3)

AI Platform Training also provides additional regions for training.
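
If you submit jobs from scripts, you might want to validate the region before submitting. The following Python sketch hard-codes the regions listed above; the list is a snapshot of this page and is an assumption that can go out of date.

```python
# Regions copied from the list above (assumption: snapshot of this page).
SUPPORTED_SHELL_REGIONS = frozenset({
    # Americas
    "us-west1", "us-west2", "us-central1", "us-east1", "us-east4",
    "northamerica-northeast1",
    # Europe
    "europe-west2", "europe-west1", "europe-west6", "europe-west3",
    # Asia Pacific
    "asia-southeast1", "asia-east1", "asia-northeast1",
    "australia-southeast1", "asia-northeast3",
})


def supports_interactive_shell(region):
    """Return True if the region ID is in the list above."""
    return region.strip().lower() in SUPPORTED_SHELL_REGIONS
```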

Using an interactive shell

To use the interactive shell for a training node, navigate to one of the URIs that you found in the preceding section. A Bash shell appears in your browser, giving you access to the file system of the container where AI Platform Training is running your training code.

The following sections describe some things to consider as you use the shell and provide some examples of monitoring tools you might use in the shell.

Preventing the job from ending

When AI Platform Training finishes running your job or trial, you will immediately lose access to your interactive shell. If this happens, you might see the message command terminated with exit code 137 or the shell might stop responding. If you created any files in the container's file system, they will not persist after the job ends.

In some cases, you might want to purposefully make your job run longer in order to debug with an interactive shell. For example, you can add code like the following to your training code in order to make the job keep running for at least an hour after an exception occurs:

import time
import traceback

try:
    # Replace with a function that runs your training code
    train_model()
except Exception:
    # Log the failure, then keep the container alive for debugging
    traceback.print_exc()
    time.sleep(60 * 60)  # 1 hour

However, note that you incur AI Platform Training charges as long as the job keeps running.
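
To limit those charges, one option is to let the job end as soon as you finish debugging instead of always sleeping for the full hour. The following sketch waits until a sentinel file appears or a time limit passes. The function name and the /tmp/debug_done path are hypothetical choices, not part of AI Platform Training.

```python
import os
import time


def wait_for_debugging(max_seconds=60 * 60, sentinel="/tmp/debug_done",
                       poll_seconds=5):
    """Block until the sentinel path exists or max_seconds elapse.

    Returns True if the sentinel was seen, False if the wait timed out.
    """
    deadline = time.monotonic() + max_seconds
    while time.monotonic() < deadline:
        if os.path.exists(sentinel):
            return True
        time.sleep(poll_seconds)
    return False
```

You could call wait_for_debugging() in place of time.sleep(60 * 60) in the exception handler, and then run touch /tmp/debug_done in the interactive shell when you are done.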

Checking permissions issues

The interactive shell environment is authenticated using application default credentials (ADC) for the service account that AI Platform Training uses to run your training code. You can run gcloud auth list in the shell for more details.

In the shell, you can use gsutil, bq, and other tools that support ADC. This can help you verify that the job is able to access a particular Cloud Storage bucket, BigQuery table, or other Google Cloud resource that your training code needs.
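
If you need to check several resources, you can script these commands. The following is a minimal Python sketch that reports which commands exit successfully. The helper names are hypothetical, and the commands mentioned in the note after the code use placeholder resource names that you must replace.

```python
import subprocess
import sys


def command_succeeds(cmd, timeout=60):
    """Return True if cmd runs and exits with status 0."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # The tool is missing, not executable, or did not finish in time.
        return False
    return result.returncode == 0


def run_checks(checks):
    """Run each named command and return {name: succeeded}."""
    results = {name: command_succeeds(cmd) for name, cmd in checks.items()}
    for name, ok in results.items():
        if not ok:
            print(f"{name}: FAILED", file=sys.stderr)
    return results
```

For example, run_checks({"bucket": ["gsutil", "ls", "gs://BUCKET"], "table": ["bq", "show", "DATASET.TABLE"]}) reports whether the job's service account can list the bucket and read the table's metadata.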

Visualizing Python execution with py-spy

py-spy lets you profile a running Python program without modifying it. To use py-spy in an interactive shell, do the following:

  1. Install py-spy:

    pip3 install py-spy
    
  2. Run ps aux in the shell, and look for the PID of the Python training program.

  3. Run any of the subcommands described in the py-spy documentation, using the PID that you found in the preceding step.

  4. If you use py-spy record to create an SVG file, copy this file to a Cloud Storage bucket so you can view it later on your local computer. For example:

    gsutil cp profile.svg gs://BUCKET
    

    Replace BUCKET with the name of a bucket you have access to.
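
If several Python processes are running, picking the right PID from ps aux by eye can be error-prone. The following sketch filters ps aux output by a substring of the command line. It assumes the usual 11-column ps aux layout, and the function name is hypothetical.

```python
def find_pids(ps_output, needle="python"):
    """Return the PIDs of processes whose command line contains needle."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Split into at most 11 fields so COMMAND keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) == 11 and needle in fields[10]:
            pids.append(int(fields[1]))  # PID is the second column
    return pids
```

You can then pass a resulting PID to a py-spy subcommand, for example py-spy dump --pid PID.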

Analyzing performance with perf

perf lets you analyze the performance of your training node. To install the version of perf appropriate for your node's Linux kernel, run the following commands:

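# Install the generic perf tooling
apt-get update
apt-get install -y linux-tools-generic
# Replace /usr/bin/perf with a link to the installed perf binary
rm -f /usr/bin/perf
LINUX_TOOLS_VERSION=$(ls /usr/lib/linux-tools | tail -n 1)
ln -s "/usr/lib/linux-tools/${LINUX_TOOLS_VERSION}/perf" /usr/bin/perf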

After this, you can run any of the subcommands described in the perf documentation.

Getting information about GPU usage

GPU-enabled containers running on nodes with GPUs typically have several command-line tools preinstalled that can help you monitor GPU usage. For example:

  • Use nvidia-smi to monitor GPU utilization of various processes.

  • Use nvprof to collect a variety of GPU profiling information. Since nvprof can't attach to an existing process, you might want to use the tool to start an additional process running your training code. (This means your training code will be running twice on the node.) For example:

    nvprof -o prof.nvvp python3 -m MODULE_NAME
    

    Replace MODULE_NAME with the fully-qualified name of your training application's entry point module; for example, trainer.task.

    Then transfer the output file to a Cloud Storage bucket so you can analyze it later on your local computer. For example:

    gsutil cp prof.nvvp gs://BUCKET
    

    Replace BUCKET with the name of a bucket you have access to.

  • If you encounter a GPU error (not a problem with your configuration or with AI Platform Training), use nvidia-bug-report.sh to create a bug report.

    Then transfer the report to a Cloud Storage bucket so you can analyze it later on your local computer or send it to NVIDIA. For example:

    gsutil cp nvidia-bug-report.log.gz gs://BUCKET
    

    Replace BUCKET with the name of a bucket you have access to.

If bash can't find any of these NVIDIA commands, try adding /usr/local/nvidia/bin and /usr/local/cuda/bin to the shell's PATH:

export PATH="/usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}"
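
To record GPU utilization over time rather than reading nvidia-smi output by eye, you can ask for machine-readable output with nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits and parse it. The following is a minimal sketch of such a parser. It assumes every value is a plain integer, which might not hold on GPUs that report a field as unsupported, and the function name is hypothetical.

```python
import csv
import io


def parse_gpu_stats(csv_text):
    """Parse 'utilization.gpu, memory.used, memory.total' rows, one per GPU."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        utilization, used, total = (int(value) for value in row)
        stats.append({
            "utilization_pct": utilization,
            "memory_used_mib": used,
            "memory_total_mib": total,
        })
    return stats
```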

What's next