Monitoring Training Jobs

Depending on the size of your dataset and the complexity of your model, training can take a long time; it's not uncommon for a job training on real-world data to last many hours. You can monitor several aspects of your job while it runs.

Before You Begin

Getting information about a running training job is one step in the larger model training process. Complete these steps before you monitor training jobs:

  1. Develop your trainer application with TensorFlow.

  2. Package your application and upload it and any unusual dependencies to a Google Cloud Storage bucket (this step is usually included in job creation with gcloud).

  3. Configure and submit a training job.

Checking Job Status

For overall status, the easiest way to check on your job is the Machine Learning Jobs page in the Google Cloud Platform Console. You can get the same details programmatically and with the gcloud command-line tool.


  1. Open the Machine Learning Jobs page in the Google Cloud Platform Console.

  2. Click on your job name in the list to open the Job details page.

  3. Find your job status at the top of the report. The icon and text describe the current state of the job.

    The job status information at the top of the Job details page.


Use gcloud ml-engine jobs describe to get details about the current state of the job on the command line:

gcloud ml-engine jobs describe job_name

You can get a list of jobs associated with your project that includes job status and creation time with gcloud ml-engine jobs list. Note that this command in its simplest form lists all of the jobs ever created for your project. You should scope your request to limit the number of jobs reported. The following examples should get you started:

Use the --limit argument to restrict the number of jobs. This example lists the 5 most recent jobs:

gcloud ml-engine jobs list --limit=5

Use the --filter argument to restrict the list of jobs to those with a given attribute value. You can filter on one or more attributes of the job object. As well as the core job attributes, you can filter on objects within the job, such as the TrainingInput object.

Examples of filtering the list:

  • List all jobs that were started after a particular time. This example uses 7 o'clock on the evening of January 15, 2017:

    gcloud ml-engine jobs list --filter='createTime>2017-01-15T19:00'
  • List the last three jobs with names that start with a given string. For example, the string may represent the name that you use for all training jobs for a particular model. This example uses a model where the job identifier is 'census' with a suffix that's an index incremented for each job:

    gcloud ml-engine jobs list --filter='jobId:census*' --limit=3
  • List all failed jobs with names that start with 'rnn':

    gcloud ml-engine jobs list --filter='jobId:rnn* AND state:FAILED'

For details of the expressions supported by the filter option, see the documentation for the gcloud command.
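
If you run these list commands from a script, it can help to assemble the arguments in one place. The following is a minimal sketch, not part of the gcloud tool: build_jobs_list_command is a hypothetical helper name, and it uses only the --filter and --limit flags shown above.

```python
import shlex


def build_jobs_list_command(filter_expr=None, limit=None):
    """Return the argv list for `gcloud ml-engine jobs list`.

    Hypothetical helper: it only combines the --filter and --limit
    flags described in this section.
    """
    argv = ['gcloud', 'ml-engine', 'jobs', 'list']
    if filter_expr:
        argv.append('--filter={}'.format(filter_expr))
    if limit is not None:
        argv.append('--limit={}'.format(int(limit)))
    return argv


if __name__ == '__main__':
    argv = build_jobs_list_command('jobId:rnn* AND state:FAILED', limit=3)
    # Print a shell-quoted form that can be pasted into a terminal.
    print(' '.join(shlex.quote(arg) for arg in argv))
```

Because the result is a list of arguments rather than a single string, you can pass it to subprocess.run without shell quoting, assuming gcloud is installed and on your PATH.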


You can also get the job status programmatically. The following steps use Python:

  1. Assemble your job identifier string by combining your project name and job name into the form: 'projects/your_project_name/jobs/your_job_name':

    projectName = 'your_project_name'
    projectId = 'projects/{}'.format(projectName)
    jobName = 'your_job_name'
    jobId = '{}/jobs/{}'.format(projectId, jobName)
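  2. Form the request to get the job. This example assumes that ml is an API client object that you have already built, for example with discovery.build('ml', 'v1') from the Google API Python client library: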

    request = ml.projects().jobs().get(name=jobId)
  3. Execute the request (this example puts the execute call in a try block to catch exceptions):

    from googleapiclient import errors

    response = None
    try:
        response = request.execute()
    except errors.HttpError as err:
        # Something went wrong. Handle the exception in an appropriate
        # way for your application. Printing it is only a placeholder.
        print(err)
  4. Check the response to ensure that, irrespective of HTTP errors, the service call returned data.

    if response is None:
        # Treat this condition as an error as best suits your
        # application. Raising an exception is one option.
        raise RuntimeError('The service call returned no data.')
  5. Get status data. The response object is a dictionary containing all applicable members of the Job resource, including the full TrainingInput resource and the applicable members of the TrainingOutput resource. The following example prints the job status and the number of ML units consumed by the job.

    print('Job status for {}.{}:'.format(projectName, jobName))
    print('    state : {}'.format(response['state']))
    print('    consumedMLUnits : {}'.format(
        response['trainingOutput']['consumedMLUnits']))
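
Because training can last many hours, you may want to repeat the status request until the job finishes. The following is a minimal polling sketch built on the same request as the steps above. The function name wait_for_job, the polling interval, and the max_polls cap are illustrative choices, and the set of terminal state names is an assumption taken from the Job resource's state values, so check it against the API reference before relying on it.

```python
import time

# Assumed terminal values of the Job resource's `state` field.
TERMINAL_STATES = ('SUCCEEDED', 'FAILED', 'CANCELLED')


def wait_for_job(ml, job_id, poll_seconds=60, max_polls=None,
                 sleep=time.sleep):
    """Poll a job until it reaches a terminal state; return the response.

    `ml` is assumed to be a client such as discovery.build('ml', 'v1').
    `job_id` has the form 'projects/your_project_name/jobs/your_job_name'.
    If `max_polls` is reached first, the last response seen is returned,
    so callers should still check its 'state' value.
    """
    polls = 0
    while True:
        response = ml.projects().jobs().get(name=job_id).execute()
        polls += 1
        if response.get('state') in TERMINAL_STATES:
            return response
        if max_polls is not None and polls >= max_polls:
            return response
        sleep(poll_seconds)
```

The sleep parameter exists only so that the wait can be replaced in tests. HTTP errors are not caught here; wrap the call in a try block as shown in step 3 if your application needs to handle them.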

Jobs can fail if there is a problem with your trainer or with the Cloud ML Engine infrastructure. You can use Stackdriver Logging to start debugging. Troubleshooting is described in its own page.

Monitoring Resource Consumption

You can find charts of your job's aggregate CPU and memory utilization on the Job details page:

  1. Go to the ML Engine Jobs page in the GCP Console.

  2. Find your job in the list.

  3. Click on the job name to open the Job details page.

  4. Scroll down to see the resource utilization charts.

    CPU and memory usage charts on the Cloud ML Engine Job details page

You can get more detailed information about the online resources that your training jobs use with Stackdriver Monitoring. Cloud ML Engine exports two metrics to Stackdriver:

  • ml/training/memory/utilization shows the fraction of allocated memory that is currently in use.

  • ml/training/cpu/utilization shows the fraction of allocated CPU that is currently in use.

You can see CPU and memory utilization for each task (worker, parameter server, and master) in a job using these metrics.
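
Both metrics are fractions of the allocated resource, so once you have utilization samples for each task you can compare tasks directly. The sketch below assumes you have already collected the samples into plain Python lists by some means not shown here. The function name summarize_utilization and the sample values are hypothetical and are not output from a real job.

```python
def summarize_utilization(samples_by_task):
    """Return {task: (mean, peak)} for utilization fractions in [0, 1]."""
    summary = {}
    for task, samples in samples_by_task.items():
        if not samples:
            continue  # Skip tasks that reported no data points.
        summary[task] = (sum(samples) / len(samples), max(samples))
    return summary


if __name__ == '__main__':
    # Hypothetical CPU utilization samples, for illustration only.
    cpu = {
        'master': [0.25, 0.75],
        'worker': [0.5, 0.5, 1.0],
        'parameter server': [],
    }
    for task, (mean, peak) in sorted(summarize_utilization(cpu).items()):
        print('{}: mean {:.0%}, peak {:.0%}'.format(task, mean, peak))
```

A task whose peak stays well below 1.0 may have more capacity allocated than it uses, while a task that sits near 1.0 may be a bottleneck.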

Monitoring with TensorBoard

You can configure your trainer to save summary data that you can examine and visualize using TensorBoard.

Save your summary data to a Google Cloud Storage location and point TensorBoard to that location to examine it. You can point TensorBoard to a directory that contains subdirectories for the output from multiple jobs. In this case it shows information about all of the jobs.
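
One way to arrange this is to give each job its own subdirectory under a shared base location and then start TensorBoard on the base. The following sketch only builds the paths and the command line. The bucket name, the job names, and both helper functions are hypothetical, and it assumes that your trainer accepts an output path to which it writes its summaries.

```python
import posixpath


def job_summary_dir(base_dir, job_name):
    """Return the per-job summary directory under a shared base location."""
    return posixpath.join(base_dir, job_name)


def tensorboard_command(logdir):
    """Return the argv list that points TensorBoard at `logdir`."""
    return ['tensorboard', '--logdir={}'.format(logdir)]


if __name__ == '__main__':
    base = 'gs://your_bucket/summaries'  # hypothetical location
    for job in ('census_1', 'census_2'):  # hypothetical job names
        print(job_summary_dir(base, job))
    # Pointing TensorBoard at the base shows every job beneath it.
    print(' '.join(tensorboard_command(base)))
```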
