Managing models and jobs

While training and deploying models and getting predictions, you need to manage resources on Google Cloud Platform. This page describes how to work with models, versions, and jobs.

Naming AI Platform Training resources

You must specify a name for every job, model, and version you create. The naming rules are the same for all three types of resources. Each name:

  • May contain only letters, numbers, and underscores.
  • Is case-sensitive.
  • Must start with a letter.
  • Must be no more than 128 characters long.
  • Must be unique within its namespace (your project for models and jobs, the parent model for versions).
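The format rules above can be checked before a request is sent. The following is a minimal sketch, not an official validator: the function name is hypothetical, "letters" is assumed to mean ASCII letters, and uniqueness within the namespace is not checked because only the service can confirm it.

```python
import re

# Assumption: "letters" means ASCII letters A-Z and a-z.
# One leading letter plus up to 127 more characters gives the 128-character limit.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_resource_name(name: str) -> bool:
    """Return True if name follows the format rules listed above.

    Hypothetical helper: it does not check uniqueness within the namespace.
    """
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, census_wide_deep passes this check, while 1model (starts with a digit) and my-model (contains a hyphen) do not.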

You should create names that are easy to distinguish in lists of resources, such as job logs. Here are some suggestions:

  • Name all jobs for the same model using the model name and a job index (the timestamp when the job is created works well).
  • Name your models so that they are easily identified by the dataset they use (census_wide_deep is usually better than my_new_model, for example).
  • Keep version names easy to read. Instead of a timestamp or a similar unique value, we recommend simple version designators like v1.
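One way to follow the job-naming suggestion is to append a creation timestamp to the model name. This is only a sketch: the function name and the timestamp format are assumptions, chosen so that the result uses only letters, numbers, and underscores.

```python
from datetime import datetime, timezone
from typing import Optional


def make_job_name(model_name: str, created: Optional[datetime] = None) -> str:
    """Build a job name from a model name and a creation timestamp.

    Hypothetical helper; the YYYYMMDD_HHMMSS format is an illustrative choice.
    """
    if created is None:
        created = datetime.now(timezone.utc)
    return f"{model_name}_{created.strftime('%Y%m%d_%H%M%S')}"
```

Because the timestamp sorts chronologically, jobs for the same model appear in creation order when a list is sorted by name.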

Managing jobs

AI Platform Training supports two types of jobs: training and batch prediction. The details for each are different, but the basic operation is the same.

The following table summarizes the job operations and lists the interfaces you can use to perform them:

Operation | Interfaces | Notes
create | projects.jobs.create; gcloud ai-platform jobs submit training; gcloud ai-platform jobs submit prediction; no console implementation | Creating a job is described in detail in the training and batch prediction guides.
cancel | projects.jobs.cancel; gcloud ai-platform jobs cancel; Cancel on the Job details page | Cancels a running job.
get | projects.jobs.get; gcloud ai-platform jobs describe; Job details page (enter with a link from the Jobs list) | The information you get is described in the Jobs resource reference.
list | projects.jobs.list; gcloud ai-platform jobs list; Jobs list | Only jobs created in the last 90 days are displayed.
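If you script these operations, it can help to build the gcloud command lines in one place. The sketch below only assembles the argument list for the cancel, get, and list operations; the function name is hypothetical, and it assumes the job name is passed as a positional argument to the cancel and describe commands. Job creation is left out because it needs additional arguments covered in the training and batch prediction guides.

```python
from typing import List, Optional

# Maps each operation to its gcloud subcommand. Note that "get" uses "describe".
_SUBCOMMANDS = {"cancel": "cancel", "get": "describe", "list": "list"}


def gcloud_job_command(operation: str, job_name: Optional[str] = None) -> List[str]:
    """Return the gcloud argument list for a job operation (hypothetical helper)."""
    if operation not in _SUBCOMMANDS:
        raise ValueError(f"unsupported operation: {operation!r}")
    command = ["gcloud", "ai-platform", "jobs", _SUBCOMMANDS[operation]]
    if operation != "list":
        if not job_name:
            raise ValueError(f"{operation!r} requires a job name")
        command.append(job_name)
    return command
```

The returned list can be passed to subprocess.run to execute the command on a machine where the Google Cloud CLI is installed and authenticated.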

Handling asynchronous operations

Most AI Platform Training resource management operations return as quickly as possible and provide a complete response. However, there are two kinds of asynchronous operations that you should understand: jobs and long-running operations.

When you start an asynchronous operation, you usually want to know when it completes. The process for getting status is different for jobs and long-running operations:

Getting the status of a job

You can use projects.jobs.get to get the status of a job. This method is also available as gcloud ai-platform jobs describe and on the Jobs page in the Google Cloud console. Regardless of how you get the status, the information is based on the members of the Job resource. The job is complete when Job.state in the response equals one of these values:

  • SUCCEEDED
  • FAILED
  • CANCELLED
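The completion check above can be wrapped in a small polling loop. The following is a sketch under stated assumptions: get_job is a hypothetical callable that wraps projects.jobs.get (however you call the API) and returns the Job resource as a dictionary, and the polling interval is an arbitrary choice.

```python
import time
from typing import Any, Callable, Dict

# The three values of Job.state that indicate the job is complete.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "CANCELLED"})


def is_job_complete(job: Dict[str, Any]) -> bool:
    """Return True if the Job resource is in a terminal state."""
    return job.get("state") in TERMINAL_STATES


def wait_for_job(
    get_job: Callable[[], Dict[str, Any]],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_job (a hypothetical wrapper around projects.jobs.get) until done."""
    while True:
        job = get_job()
        if is_job_complete(job):
            return job
        sleep(poll_seconds)
```

The sleep parameter exists only so the loop can be exercised without real delays. A completed job is not necessarily a successful one, so callers should still inspect the returned state.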

Getting the status of a long-running operation

AI Platform Training has three long-running operations:

  • Creating a version
  • Deleting a model
  • Deleting a version

Of the long-running operations, only creating a version is likely to take much time to complete. Deleting models and versions is typically accomplished in near real time.

If you create a version by using the Google Cloud CLI or the Google Cloud console, the interface automatically informs you when the operation is complete. If you create a version with the API, you can track the status of the operation yourself:

  1. Get the service-assigned operation name from the Operation object in the response to your call to projects.models.versions.create. The key for the name value is "name".

  2. Use projects.operations.get to periodically poll the status of the operation.

    1. Use the operation name from the first step to form a name string of the form:

      'projects/my_project/operations/operation_name'
      

      The response message contains an Operation object.

    2. Get the value for the "done" key. This is a Boolean indicator of operation completion. It is true if the operation is complete.

  3. The Operation object will include one of two keys on completion:

    • The "response" key is present if the operation was successful. Its value should be google.protobuf.Empty, as none of the AI Platform Training long-running operations have response objects.

    • The "error" key is present if there was an error. Its value is a Status object.
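The polling steps above can be sketched as follows. This is an illustration rather than a definitive implementation: get_operation is a hypothetical callable that wraps projects.operations.get and returns the Operation object as a dictionary, the polling interval is arbitrary, and the check for an already fully qualified name is a defensive assumption.

```python
import time
from typing import Any, Callable, Dict


def operation_path(project_id: str, operation_name: str) -> str:
    """Form 'projects/<project>/operations/<operation>' as described in step 2."""
    # Assumption: if the name is already fully qualified, use it unchanged.
    if operation_name.startswith("projects/"):
        return operation_name
    return f"projects/{project_id}/operations/{operation_name}"


def wait_for_operation(
    get_operation: Callable[[str], Dict[str, Any]],
    name: str,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until "done" is true, then return "response" or raise on "error"."""
    while True:
        operation = get_operation(name)
        if operation.get("done"):
            if "error" in operation:
                # The value of "error" is a Status object.
                raise RuntimeError(f"operation {name} failed: {operation['error']}")
            return operation.get("response", {})
        sleep(poll_seconds)
```

A caller would take the "name" value from the projects.models.versions.create response, pass it through operation_path, and hand the result to wait_for_operation.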

What's next