Monitoring training jobs

Depending on the size of your dataset and the complexity of your model, training can take a long time. Training from real-world data can last many hours. You can monitor several aspects of your job while it runs.

Checking job status

For overall status, the easiest way to check on your job is the AI Platform Jobs page on the GCP Console. You can get the same details programmatically and with the gcloud command-line tool.

Console

  1. Open the AI Platform Jobs page in the GCP Console.

  2. Click your job name in the list to open the Job Details page.

  3. Find your job status at the top of the report. The icon and text describe the current state of the job.

Filtering jobs

On the Jobs page, you can filter your jobs by several different parameters, including Type, JobID, State, and job creation time.

  1. Click within the Filter by prefix field, which is located above your list of jobs. Select a prefix you want to use for filtering. For example, select Type.
  2. To complete the filter, click the filter suffix you want to use. For example, the suffix options for the Type prefix are:

    • Custom code training
    • Built-in algorithms training
    • Prediction
  3. The filter is applied to your Jobs list, and the name of the filter appears in the filter field. For example, if you selected Custom code training, the filter Type:Custom code training appears at the top and filters your jobs list. You can add multiple filters if needed.

Viewing hyperparameter trials

On the Job Details page, you can view metrics for each trial in the HyperTune trials table. This table appears only for jobs that use hyperparameter tuning. You can toggle the metrics to order the trials from highest to lowest, or lowest to highest, by rmse, Training steps, or learning_rate.

To view logs for a specific trial, click the more actions icon (three vertical dots) for that trial, and then click View logs.

gcloud

Use gcloud ai-platform jobs describe to get details about the current state of the job on the command line:

gcloud ai-platform jobs describe job_name

You can get a list of jobs associated with your project that includes job status and creation time with gcloud ai-platform jobs list. Note that this command in its simplest form lists all of the jobs ever created for your project. You should scope your request to limit the number of jobs reported. The following examples should get you started:

Use the --limit argument to restrict the number of jobs. This example lists the 5 most recent jobs:

gcloud ai-platform jobs list --limit=5

Use the --filter argument to restrict the list of jobs to those with a given attribute value. You can filter on one or more attributes of the Job object. As well as the core job attributes, you can filter on objects within the job, such as the TrainingInput object.

Examples of filtering the list:

  • List all jobs that were started after a particular time. This example uses 7 o'clock on the evening of January 15, 2017:

    gcloud ai-platform jobs list --filter='createTime>2017-01-15T19:00'
    
  • List the last three jobs with names that start with a given string. For example, the string may represent the name that you use for all training jobs for a particular model. This example uses a model where the job identifier is 'census' with a suffix that's an index incremented for each job:

    gcloud ai-platform jobs list --filter='jobId:census*' --limit=3
    
  • List all failed jobs with names that start with 'rnn':

    gcloud ai-platform jobs list --filter='jobId:rnn* AND state:FAILED'
    

For details of the expressions supported by the filter option, see the documentation for the gcloud command.
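If you want to run the same listings from a script, the following is a minimal sketch that assembles the gcloud command shown above as an argument list. The function names build_jobs_list_command and list_jobs are hypothetical helpers, not part of any Google library, and the sketch assumes that the gcloud tool is installed and authenticated on the machine that runs it.

```python
import subprocess


def build_jobs_list_command(filter_expr=None, limit=None):
    """Return the argument list for `gcloud ai-platform jobs list`.

    Hypothetical helper: it only assembles the flags shown in the
    examples above (--filter and --limit).
    """
    command = ['gcloud', 'ai-platform', 'jobs', 'list']
    if filter_expr:
        command.append('--filter={}'.format(filter_expr))
    if limit is not None:
        command.append('--limit={}'.format(limit))
    return command


def list_jobs(filter_expr=None, limit=None):
    """Run the command and return its text output.

    Assumes gcloud is on the PATH and already authenticated.
    """
    command = build_jobs_list_command(filter_expr, limit)
    result = subprocess.run(command, check=True,
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout


# Same request as the last example above: failed jobs named 'rnn*'.
failed_rnn_command = build_jobs_list_command('jobId:rnn* AND state:FAILED')
```

Because the arguments are passed as a list rather than through a shell, the filter expression does not need the surrounding single quotes used on the command line.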

Python

  1. Assemble your job identifier string by combining your project name and job name into the form: 'projects/your_project_name/jobs/your_job_name':

    projectName = 'your_project_name'
    projectId = 'projects/{}'.format(projectName)
    jobName = 'your_job_name'
    jobId = '{}/jobs/{}'.format(projectId, jobName)
    
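  2. Form the request to projects.jobs.get. This step assumes that ml is an AI Platform API client, for example one created with discovery.build('ml', 'v1') from the googleapiclient package:

    request = ml.projects().jobs().get(name=jobId)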
    
  3. Execute the request (this example puts the execute call in a try block to catch exceptions):

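    from googleapiclient import errors

    response = None

    try:
        response = request.execute()
    except errors.HttpError as err:
        # Something went wrong. Handle the exception in an appropriate
        # way for your application. Printing it is only a placeholder.
        print('Request failed: {}'.format(err))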
    
  4. Check the response to ensure that, irrespective of HTTP errors, the service call returned data.

    if response is None:
        # Treat this condition as an error as best suits your
        # application. Raising an exception is one option.
        raise RuntimeError('The service call returned no data.')
    
  5. Get status data. The response object is a dictionary containing all applicable members of the Job resource, including the full TrainingInput resource and the applicable members of the TrainingOutput resource. The following example prints the job status and the number of ML units consumed by the job.

    print('Job status for {}.{}:'.format(projectName, jobName))
    print('    state : {}'.format(response['state']))
    print('    consumedMLUnits : {}'.format(
        response['trainingOutput']['consumedMLUnits']))
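The steps above fetch the job status once. To wait for a job to finish, you can repeat the request until the state field reaches a terminal value. The following is a minimal sketch of such a polling loop. The name wait_for_job and its parameters are hypothetical, and the job lookup is passed in as a callable so that the loop does not depend on how you built the client.

```python
import time

# Terminal values of the Job resource's state field.
TERMINAL_STATES = ('SUCCEEDED', 'FAILED', 'CANCELLED')


def wait_for_job(get_job, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return the Job dict.

    get_job: zero-argument callable that returns the Job dictionary,
        for example:
        lambda: ml.projects().jobs().get(name=jobId).execute()
    poll_seconds: delay between requests.
    max_polls: optional upper bound on the number of requests.
    sleep: injectable for testing; defaults to time.sleep.
    """
    polls = 0
    while True:
        job = get_job()
        polls += 1
        state = job.get('state')
        if state in TERMINAL_STATES:
            return job
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(
                'Job still in state {} after {} polls'.format(state, polls))
        sleep(poll_seconds)
```

A long polling interval is usually enough, because training jobs often run for hours.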
    

Jobs can fail if there is a problem with your training application or with the AI Platform infrastructure. You can use Stackdriver Logging to start debugging.

Monitoring resource consumption

You can find the following resource utilization charts for your training jobs on the Job Details page:

  • The job's aggregate CPU or GPU utilization, and the memory utilization. These are broken down by master, worker, and parameter server.
  • The job's network usage, measured in bytes per second. There are separate charts for bytes sent and bytes received.

To view the charts:

  1. Go to the AI Platform Jobs page in the GCP Console.

  2. Find your job in the list.

  3. Click your job name in the list to open the Job Details page.

  4. Select the tabs labeled CPU, GPU, or Network to view the associated resource utilization charts.

You can also access information about the online resources that your training jobs use with Stackdriver Monitoring. AI Platform exports metrics to Stackdriver.

Each AI Platform Training metric type includes "training" in its name. For example, ml.googleapis.com/training/cpu/utilization or ml.googleapis.com/training/accelerator/memory/utilization.
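When you query these metrics, you typically select them by metric type. The following is a minimal sketch that builds a Stackdriver Monitoring filter string for a training metric. The helper name training_metric_filter is hypothetical, and the sketch only formats the string; it assumes that you pass the result to whichever Monitoring client or console field you use.

```python
# Common prefix of the AI Platform Training metric types named above.
TRAINING_METRIC_PREFIX = 'ml.googleapis.com/training/'


def training_metric_filter(metric_suffix):
    """Return a Monitoring filter that selects one training metric type.

    metric_suffix: the part of the metric type after 'training/',
        for example 'cpu/utilization'.
    """
    return 'metric.type = "{}{}"'.format(TRAINING_METRIC_PREFIX,
                                         metric_suffix)


cpu_filter = training_metric_filter('cpu/utilization')
gpu_memory_filter = training_metric_filter('accelerator/memory/utilization')
```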

Monitoring with TensorBoard

You can configure your training application to save summary data that you can examine and visualize using TensorBoard.

Save your summary data to a Cloud Storage location and point TensorBoard to that location to examine the data. You can also point TensorBoard to a directory with subdirectories that contain the output from multiple jobs.
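One way to arrange this is to give each job its own subdirectory under a shared parent, and to point TensorBoard at the parent. The following is a minimal sketch of that layout. The bucket path and job names are placeholders, and summary_dir and tensorboard_command are hypothetical helpers that only build strings.

```python
import posixpath


def summary_dir(base_dir, job_name):
    """Return the per-job summary location under a shared parent."""
    return posixpath.join(base_dir, job_name)


def tensorboard_command(logdir):
    """Return the argument list that starts TensorBoard on logdir."""
    return ['tensorboard', '--logdir={}'.format(logdir)]


# Placeholder Cloud Storage location; replace it with your own bucket.
base = 'gs://your_bucket/summaries'
job_dirs = [summary_dir(base, name) for name in ('census_1', 'census_2')]

# Pointing TensorBoard at the parent shows both jobs side by side.
command = tensorboard_command(base)
```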

See more information about TensorBoard and AI Platform in the getting-started guide.
