Managing Models and Jobs

During the process of training and deploying models and getting predictions, you need to manage resources on Google Cloud Platform. This page describes how to work with models, versions, and jobs.

Creating names for models, versions, and jobs

You must specify a name for every model, version, and job you create. The rules for naming are consistent across all three types of resources. Each name:

  • May only contain letters, numbers, and underscores.
  • Is case-sensitive.
  • Must start with a letter.
  • Must be no more than 128 characters long.
  • Must be unique within its namespace (your project for models and jobs, the parent model for versions).
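The naming rules above can be checked locally before you call the service. The following is a minimal sketch, not part of Cloud ML Engine: the function name is_valid_resource_name and the existing_names parameter are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# Letters, digits, and underscores only; must start with a letter;
# at most 128 characters in total (1 leading letter + up to 127 more).
# Assumption: "letters" means ASCII letters A-Z and a-z.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_resource_name(name, existing_names=()):
    """Return True if `name` satisfies the naming rules listed above.

    `existing_names` is a hypothetical collection of names already used in
    the same namespace (the project for models and jobs, the parent model
    for versions). The uniqueness comparison is case-sensitive.
    """
    if _NAME_PATTERN.fullmatch(name) is None:
        return False
    return name not in existing_names
```

For example, is_valid_resource_name("census_wide_deep") passes, while "my-model" (hyphen) and "1model" (leading digit) do not.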

You should create names that are easy to distinguish in lists of resources, such as job logs. Here are some suggestions:

  • Name all jobs for the same model using the model name and a job index (the timestamp when the job is created works well).
  • Name your models so that they are easily identified by the dataset they use (census_wide_deep is usually better than my_new_model, for example).
  • Name versions so that they are easy to read. Instead of using a timestamp or a similar unique value, we recommend simple version designators that follow the naming rules (v1 or v0_2_4, for example).
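One way to apply the first suggestion is to derive each job name from the model name plus a creation timestamp. The sketch below is illustrative only: make_job_name and its timestamp format are assumptions, chosen so that the result contains only letters, digits, and underscores.

```python
from datetime import datetime, timezone


def make_job_name(model_name, created_at=None):
    """Build a job name from a model name and a creation timestamp.

    `created_at` defaults to the current UTC time. The timestamp is
    formatted with digits and underscores only, so the result stays
    within the naming rules as long as `model_name` does.
    """
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    return "{}_{}".format(model_name, created_at.strftime("%Y%m%d_%H%M%S"))
```

Because the timestamp sorts chronologically, jobs for the same model appear in creation order when a list of resources is sorted by name.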

Managing models

Your model resources in Cloud ML Engine are logical containers for individual implementations of your machine learning model. They are the simplest resources to work with because they have no complex operations or additional resources to allocate and maintain.

The following summary lists the model operations and the interfaces you can use to perform them:

create
  • API: projects.models.create
  • gcloud: gcloud ml-engine models create
  • Console: Create Model on the ML Engine Models page.

delete
  • API: projects.models.delete
  • gcloud: gcloud ml-engine models delete
  • Console: Delete in the Models list, or on the Model details page.
  • Notes: Deleting a model is a long-running operation. The model must have no versions associated with it before you can delete it.

get
  • API: projects.models.get
  • gcloud: gcloud ml-engine models describe
  • Console: Model details page (enter with a link from the Models list).
  • Notes: The information you get is described in the Model resource reference.

list
  • API: projects.models.list
  • gcloud: gcloud ml-engine models list
  • Console: ML Engine Models page.

Managing versions

Your versions are specific iterations of your models. The core of a model version is a TensorFlow SavedModel.

The following summary lists the version operations and the interfaces you can use to perform them:

create
  • API: projects.models.versions.create
  • gcloud: gcloud ml-engine versions create
  • Console: Create Version on the Model details page (enter with a link from the Models list).
  • Notes: Creating a version deploys a SavedModel to Cloud ML Engine. Refer to the model deployment guide for more information.

delete
  • API: projects.models.versions.delete
  • gcloud: gcloud ml-engine versions delete
  • Console: Delete in the Versions list on the Model details page.
  • Notes: Deleting a version is a long-running operation. You cannot delete the default version of a model unless it is the only version assigned to that model.

get
  • API: projects.models.versions.get
  • gcloud: gcloud ml-engine versions describe
  • Console: Version details page (from a link in the Versions list on the Model details page).
  • Notes: The information you get is described in the Version resource reference.

list
  • API: projects.models.versions.list
  • gcloud: gcloud ml-engine versions list
  • Console: Versions list on the Model details page.

setDefault
  • API: projects.models.versions.setDefault
  • gcloud: gcloud ml-engine versions set-default
  • Console: Set as default on the Versions list on the Model details page.
  • Notes: This is the only way to assign a new default version for a model. After the first version, creating a version doesn't make the new version the default.
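The deletion rules for models and versions can be expressed as simple pre-checks that a client might run before issuing a delete request. This is a sketch under stated assumptions: the function names and the plain list-of-names representation are hypothetical, and the service remains the authority on whether a delete succeeds.

```python
def can_delete_version(version_name, default_version, all_versions):
    """Return True if deleting `version_name` respects the default-version rule.

    `all_versions` is the list of version names currently under the model,
    and `default_version` is the name of the model's default version.
    """
    if version_name not in all_versions:
        return False
    if version_name != default_version:
        return True
    # The default version may be deleted only when it is the only one left.
    return len(all_versions) == 1


def can_delete_model(all_versions):
    """Return True if the model has no versions and can therefore be deleted."""
    return len(all_versions) == 0
```

With versions v1 and v2 and v2 as the default, v1 can be deleted directly, but v2 can be deleted only after another version is set as the default or after v1 is removed.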

Managing jobs

Cloud ML Engine supports two types of jobs: training and batch prediction. The details for each are different, but the basic operation is the same.

The following summary lists the job operations and the interfaces you can use to perform them:

create
  • API: projects.jobs.create
  • gcloud: gcloud ml-engine jobs submit training
  • gcloud: gcloud ml-engine jobs submit prediction
  • Console: No console implementation.
  • Notes: Creating a job is described in detail in the training and batch prediction guides.

cancel
  • API: projects.jobs.cancel
  • gcloud: gcloud ml-engine jobs cancel
  • Console: Stop on the Job details page.
  • Notes: Cancels a running job.

get
  • API: projects.jobs.get
  • gcloud: gcloud ml-engine jobs describe
  • Console: Job details page (enter with a link from the Jobs list).
  • Notes: The information you get is described in the Jobs resource reference.

list
  • API: projects.jobs.list
  • gcloud: gcloud ml-engine jobs list
  • Console: Jobs list.

Handling asynchronous operations

Most of the Cloud ML Engine resource management operations return as quickly as possible, and provide a complete response. However, there are two kinds of asynchronous operations that you should understand: jobs and long-running operations.

When you start an asynchronous operation, you usually want to know when it completes. The process for getting status is different for jobs and long-running operations:

Getting the status of a job

You can use projects.jobs.get to get the status of a job. This method is also provided as gcloud ml-engine jobs describe and in the Jobs page in the console. Regardless of how you get the status, the information is based on the members of the Job resource. You'll know the job is complete when Job.state in the response is equal to one of these values:

  • SUCCEEDED
  • FAILED
  • CANCELLED
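A polling loop that waits for one of these terminal states might look like the following sketch. The names wait_for_job and get_job are hypothetical: get_job stands for any caller-supplied function that returns the current Job resource as a dictionary (for example, a wrapper around projects.jobs.get), and the 30-second interval is an arbitrary choice.

```python
import time

# The three values of Job.state that mean the job is finished.
TERMINAL_STATES = frozenset(["SUCCEEDED", "FAILED", "CANCELLED"])


def wait_for_job(get_job, poll_seconds=30.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return the Job dict.

    `get_job` is a caller-supplied function (an assumption of this sketch)
    that fetches and returns the current Job resource as a dict.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    while True:
        job = get_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
```

The caller should still inspect the returned state, because a finished job is not necessarily a successful one.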

Getting the status of a long-running operation

Cloud ML Engine has three long-running operations:

  • Creating a version
  • Deleting a model
  • Deleting a version

Of the long-running operations, only creating a version is likely to take much time to complete. Deleting models and versions is typically accomplished in near real time.

If you create a version by using the gcloud command-line tool or the Google Cloud Platform console, the interface automatically informs you when the operation is complete. If you create a version with the API, you can track the status of the operation yourself:

  1. Get the service-assigned operation name from the Operation object in the response to your call to projects.models.versions.create. The key for the name value is "name".

  2. Use projects.operations.get to periodically poll the status of the operation.

    1. Use the operation name from the first step to form a name string of the form:

      'projects/my_project/operations/operation_name'

      The response message contains an Operation object.

    2. Get the value for the "done" key. This is a Boolean indicator of operation completion. It is true if the operation is complete.

  3. The Operation object will include one of two keys on completion:

    • The "response" key is present if the operation was successful. Its value should be google.protobuf.Empty, as none of the Cloud ML Engine long-running operations have response objects.

    • The "error" key is present if there was an error. Its value is a Status object.
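The steps above can be sketched as a small polling helper. The names operation_path, wait_for_operation, and get_operation are hypothetical: get_operation stands for any caller-supplied function that returns the current Operation object as a dictionary (for example, a wrapper around projects.operations.get), and the 5-second interval is an arbitrary choice.

```python
import time


def operation_path(project_id, operation_name):
    """Form the name string used to look up a long-running operation."""
    return "projects/{}/operations/{}".format(project_id, operation_name)


def wait_for_operation(get_operation, poll_seconds=5.0, sleep=time.sleep):
    """Poll until "done" is true, then return the response or raise on error.

    `get_operation` is a caller-supplied function (an assumption of this
    sketch) that fetches and returns the current Operation as a dict.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    while True:
        operation = get_operation()
        if operation.get("done"):
            break
        sleep(poll_seconds)
    # On completion, exactly one of "error" or "response" is expected.
    if "error" in operation:
        raise RuntimeError("Operation failed: {}".format(operation["error"]))
    return operation.get("response")
```

Raising on the "error" key is a design choice for this sketch; a caller that needs the Status details can catch the exception or inspect the Operation dictionary directly instead.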
