Managing models and jobs

As you train and deploy models and get predictions, you need to manage resources on Google Cloud Platform. This page describes how to work with models, versions, and jobs.

Naming AI Platform Prediction resources

You must specify a name for every model, version, and job you create. The rules for naming are consistent across all three types of resources. Each name:

  • May only contain letters, numbers, and underscores.
  • Is case-sensitive.
  • Must start with a letter.
  • Must be no more than 128 characters long.
  • Must be unique within its namespace (your project for models and jobs, the parent model for versions).
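
If you generate resource names programmatically, it can help to validate them before making API calls. The following is a minimal sketch of such a check in Python; the pattern and function are an illustration of the rules above, not part of the AI Platform Prediction API.

    import re

    # Illustrative check derived from the naming rules above; this is not an
    # official AI Platform Prediction validator.
    # Letters, numbers, and underscores only; must start with a letter;
    # at most 128 characters in total.
    _NAME_PATTERN = re.compile(r'^[A-Za-z][A-Za-z0-9_]{0,127}$')

    def is_valid_resource_name(name):
        """Returns True if the name satisfies the rules listed above."""
        return bool(_NAME_PATTERN.match(name))

    print(is_valid_resource_name('census_wide_deep_v1'))  # True
    print(is_valid_resource_name('1st_model'))            # False: starts with a number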

You should create names that are easy to distinguish in lists of resources, such as job logs. Here are some suggestions:

  • Name all jobs for the same model using the model name and a job index (the timestamp when the job is created works well).
  • Name your models so that they are easily identified by the dataset they use (census_wide_deep is usually better than my_new_model, for example).
  • Keep version names simple and readable. Instead of a timestamp or a similar unique value, we recommend simple version designators like v1.

Managing models

Your model resources in AI Platform Prediction are logical containers for individual implementations of your machine learning model. They are the simplest resources to work with because they have no complex operations or additional resources to allocate and maintain.

The following list summarizes the model operations and the interfaces you can use to perform them:

create
  • REST API: projects.models.create
  • gcloud: gcloud ai-platform models create
  • Console: Create Model on the AI Platform Prediction Models page.

delete
  • REST API: projects.models.delete
  • gcloud: gcloud ai-platform models delete
  • Console: Delete in the Models list, or on the Model details page.
  • Notes: Deleting a model is a long-running operation. The model must have no versions associated with it before you can delete it.

get
  • REST API: projects.models.get
  • gcloud: gcloud ai-platform models describe
  • Console: Model details page (enter with a link from the Models list).
  • Notes: The information you get is described in the Model resource reference.

list
  • REST API: projects.models.list
  • gcloud: gcloud ai-platform models list
  • Console: AI Platform Prediction Models page.
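
For example, calling the REST API through the Google API Client Library for Python to list the models in a project might look like the sketch below; the project ID my-project is a placeholder.

    from googleapiclient import discovery

    # Build a client for the AI Platform Training and Prediction API ('ml', v1).
    ml = discovery.build('ml', 'v1')

    project_id = 'my-project'  # Placeholder: replace with your project ID.

    # projects.models.list takes the project as its parent.
    request = ml.projects().models().list(parent='projects/{}'.format(project_id))
    response = request.execute()

    for model in response.get('models', []):
        print(model['name'])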

Managing versions

Your versions are specific iterations of your models. The core of a model version is a TensorFlow SavedModel.

The following list summarizes the version operations and the interfaces you can use to perform them:

create
  • REST API: projects.models.versions.create
  • gcloud: gcloud ai-platform versions create
  • Console: Create Version on the Model details page (enter with a link from the Models list).
  • Notes: Creating a version deploys a SavedModel to AI Platform Prediction. Refer to the model deployment guide for more information.

delete
  • REST API: projects.models.versions.delete
  • gcloud: gcloud ai-platform versions delete
  • Console: Delete in the Versions list on the Model details page.
  • Notes: Deleting a version is a long-running operation. You cannot delete the default version of a model unless it is the only version assigned to that model.

get
  • REST API: projects.models.versions.get
  • gcloud: gcloud ai-platform versions describe
  • Console: Version details page (from a link in the Versions list on the Model details page).
  • Notes: The information you get is described in the Version resource reference.

list
  • REST API: projects.models.versions.list
  • gcloud: gcloud ai-platform versions list
  • Console: Versions list on the Model details page.

setDefault
  • REST API: projects.models.versions.setDefault
  • gcloud: gcloud ai-platform versions set-default
  • Console: Set as default on the Versions list on the Model details page.
  • Notes: This is the only way to assign a new default version for a model. The first version you create automatically becomes the default; creating later versions does not change it.
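
As a sketch, assigning a new default version through the Google API Client Library for Python might look like the following; the project, model, and version names are placeholders.

    from googleapiclient import discovery

    ml = discovery.build('ml', 'v1')

    # Placeholder: replace with your project, model, and version names.
    version_name = 'projects/my-project/models/census_wide_deep/versions/v2'

    # projects.models.versions.setDefault takes the full version name and an
    # empty request body.
    request = ml.projects().models().versions().setDefault(name=version_name, body={})
    response = request.execute()
    print(response)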

Managing jobs

AI Platform Prediction supports two types of jobs: training and batch prediction. The details for each are different, but the basic operation is the same.

The following list summarizes the job operations and the interfaces you can use to perform them:

create
  • REST API: projects.jobs.create
  • gcloud: gcloud ai-platform jobs submit training or gcloud ai-platform jobs submit prediction
  • Console: No console implementation.
  • Notes: Creating a job is described in detail in the training and batch prediction guides.

cancel
  • REST API: projects.jobs.cancel
  • gcloud: gcloud ai-platform jobs cancel
  • Console: Cancel on the Job details page.
  • Notes: Cancels a running job.

get
  • REST API: projects.jobs.get
  • gcloud: gcloud ai-platform jobs describe
  • Console: Job details page (enter with a link from the Jobs list).
  • Notes: The information you get is described in the Job resource reference.

list
  • REST API: projects.jobs.list
  • gcloud: gcloud ai-platform jobs list
  • Console: Jobs list.
  • Notes: Only jobs created in the last 90 days are displayed.
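
As a rough sketch, submitting a batch prediction job through the Google API Client Library for Python might look like the following; the job ID, model name, and Cloud Storage paths are placeholders, and the batch prediction guide covers the full set of options.

    from googleapiclient import discovery

    ml = discovery.build('ml', 'v1')

    project_id = 'my-project'  # Placeholder: replace with your project ID.

    # Minimal batch prediction job; all values below are placeholders.
    job = {
        'jobId': 'census_wide_deep_batch_20240101_120000',
        'predictionInput': {
            'modelName': 'projects/{}/models/census_wide_deep'.format(project_id),
            'dataFormat': 'TEXT',
            'inputPaths': ['gs://my-bucket/batch-input/*'],
            'outputPath': 'gs://my-bucket/batch-output/',
            'region': 'us-central1',
        },
    }

    # projects.jobs.create takes the project as its parent and the Job as its body.
    request = ml.projects().jobs().create(parent='projects/{}'.format(project_id), body=job)
    response = request.execute()
    print(response)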

Handling asynchronous operations

Most AI Platform Prediction resource management operations return quickly and provide a complete response. However, there are two kinds of asynchronous operations that you should understand: jobs and long-running operations.

When you start an asynchronous operation, you usually want to know when it completes. The process for getting status differs for jobs and long-running operations.

Getting the status of a job

You can use projects.jobs.get to get the status of a job. This method is also provided as gcloud ai-platform jobs describe and in the Jobs page in the Google Cloud console. Regardless of how you get the status, the information is based on the members of the Job resource. You'll know the job is complete when Job.state in the response is equal to one of these values:

  • SUCCEEDED
  • FAILED
  • CANCELLED
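
A minimal polling sketch with the Google API Client Library for Python might look like this; the job name is a placeholder, and the 30-second interval is an arbitrary choice.

    import time

    from googleapiclient import discovery

    ml = discovery.build('ml', 'v1')

    # Placeholder: replace with your project and job IDs.
    job_name = 'projects/my-project/jobs/census_wide_deep_20240101_120000'

    # Poll projects.jobs.get until the job reaches a terminal state.
    while True:
        job = ml.projects().jobs().get(name=job_name).execute()
        if job.get('state') in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(30)

    print('Job {} finished with state {}'.format(job_name, job['state']))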

Getting the status of a long-running operation

AI Platform Prediction has three long-running operations:

  • Creating a version
  • Deleting a model
  • Deleting a version

Of the long-running operations, only creating a version is likely to take much time to complete. Deleting models and versions is typically accomplished in near real time.

If you create a version by using the Google Cloud CLI or the Google Cloud console, the interface automatically informs you when the operation is complete. If you create a version with the API, you can track the status of the operation yourself:

  1. Get the service-assigned operation name from the Operation object in the response to your call to projects.models.versions.create. The key for the name value is "name".

  2. Use projects.operations.get to periodically poll the status of the operation.

    1. Use the operation name from the first step to form a name string of the form:

      'projects/my_project/operations/operation_name'
      

      The response message contains an Operation object.

    2. Get the value of the "done" key, a Boolean that is true when the operation is complete.

  3. The Operation object will include one of two keys on completion:

    • The "response" key is present if the operation was successful. Its value should be google.protobuf.Empty, as none of the AI Platform Prediction long-running operations have response objects.

    • The "error" key is present if there was an error. Its value is a Status object.

What's next