During the process of training and deploying models and getting predictions, you need to manage resources on Google Cloud Platform. This page describes how to work with models, versions, and jobs.
Naming AI Platform Training resources
You must specify a name for every job you create. The rules for naming are consistent across all three types of resources (models, versions, and jobs). Each name:
- May only contain letters, numbers, and underscores.
- Is case-sensitive.
- Must start with a letter.
- Must be no more than 128 characters long.
- Must be unique within its namespace (your project for models and jobs, the parent model for versions).
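The format rules above can be checked locally before you submit a request. The following is a minimal sketch, not part of the service: the helper name is_valid_resource_name is hypothetical, and it assumes that "letters" means ASCII letters. Uniqueness within the namespace can only be verified by the service itself, so it is not checked here.

```python
import re

# Assumption: "letters" means ASCII letters. The pattern encodes the listed
# rules: first character is a letter, the rest are letters, digits, or
# underscores, and the total length is at most 128 characters.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_resource_name(name: str) -> bool:
    """Return True if `name` satisfies the format rules listed above.

    Hypothetical helper for illustration; it does not check uniqueness.
    """
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("census_wide_deep", "1st_model", "my-model"):
        print(candidate, is_valid_resource_name(candidate))
```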
You should create names that are easy to distinguish in lists of resources, such as job logs. Here are some suggestions:
- Name all jobs for the same model using the model name and a job index (the timestamp when the job is created works well).
- Name your models so that they are easily identified by the dataset they use (census_wide_deep is usually better than my_new_model, for example).
- Versions are best if easily readable. Instead of using a timestamp or a similar unique value, we recommend using simple version designators like v1.
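The first suggestion (model name plus a timestamp as the job index) could be implemented along these lines. This is only a sketch: the function name make_job_name and the timestamp layout are assumptions chosen for illustration, not a convention required by the service. The layout uses only digits and underscores so that the result stays within the naming rules.

```python
from datetime import datetime, timezone
from typing import Optional


def make_job_name(model_name: str, now: Optional[datetime] = None) -> str:
    """Build a job name from a model name and a UTC creation timestamp.

    Hypothetical helper: the "<model>_<YYYYMMDD>_<HHMMSS>" layout is an
    assumption, chosen because it sorts chronologically in job lists.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return f"{model_name}_{stamp}"


if __name__ == "__main__":
    print(make_job_name("census_wide_deep"))
```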
Managing jobs
AI Platform Training supports two types of jobs: training and batch prediction. The details for each are different, but the basic operation is the same.
The following table summarizes the job operations and lists the interfaces you can use to perform them:
Operation | Interfaces | Notes |
---|---|---|
create | API: projects.jobs.create. Console: no console implementation. | Creating a job is described in detail in the training and batch prediction guides. |
cancel | API: projects.jobs.cancel. Console: Cancel on the Job details page. | Cancels a running job. |
get | API: projects.jobs.get. Console: Job details page (enter with a link from the Jobs list). | The information you get is described in the Jobs resource reference. |
list | API: projects.jobs.list. Console: Jobs list. | Only jobs created in the last 90 days will be displayed. |
Handling asynchronous operations
Most of the AI Platform Training resource management operations return as quickly as possible, and provide a complete response. However, there are two kinds of asynchronous operations that you should understand: jobs and long-running operations.
When you start an asynchronous operation, you usually want to know when it completes. The process for getting status is different for jobs and long-running operations:
Getting the status of a job
You can use projects.jobs.get to get the status of a job. This method is also provided as gcloud ai-platform jobs describe and in the Jobs page in the Google Cloud console. Regardless of how you get the status, the information is based on the members of the Job resource. You'll know the job is complete when Job.state in the response is equal to one of these values:
- SUCCEEDED
- FAILED
- CANCELLED
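A polling loop built on this check might look like the sketch below. To keep it self-contained, the call that fetches the job is passed in as a function: get_job is a hypothetical callable that you would back with projects.jobs.get through whichever client you use, and it is assumed to return the Job resource as a dictionary with a "state" key. The polling interval and attempt limit are arbitrary illustrative defaults.

```python
import time
from typing import Any, Callable, Dict

# The three values of Job.state that indicate the job is complete.
TERMINAL_STATES = frozenset({"SUCCEEDED", "FAILED", "CANCELLED"})


def wait_for_job(
    get_job: Callable[[], Dict[str, Any]],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal state, then return the job.

    `get_job` is a hypothetical callable that wraps projects.jobs.get and
    returns the Job resource as a dict (an assumption of this sketch).
    """
    for attempt in range(max_polls):
        job = get_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    raise TimeoutError(f"job not complete after {max_polls} polls")
```

Note that a terminal state only means the job has stopped; the caller still needs to inspect the returned state to distinguish SUCCEEDED from FAILED or CANCELLED.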
Getting the status of a long-running operation
AI Platform Training has three long-running operations:
- Creating a version
- Deleting a model
- Deleting a version
Of the long-running operations, only creating a version is likely to take much time to complete. Deleting models and versions is typically accomplished in near real time.
If you create a version by using the Google Cloud CLI or the Google Cloud console, the interface automatically informs you when the operation is complete. If you create a version with the API, you can track the status of the operation yourself:
1. Get the service-assigned operation name from the Operation object in the response to your call to projects.models.versions.create. The key for the name value is "name".
2. Use projects.operations.get to periodically poll the status of the operation.
   - Use the operation name from the first step to form a name string of the form: 'projects/my_project/operations/operation_name'
   - The response message contains an Operation object.
   - Get the value for the "done" key. This is a Boolean indicator of operation completion. It is true if the operation is complete.
3. The Operation object will include one of two keys on completion:
   - The "response" key is present if the operation was successful. Its value should be google.protobuf.Empty, as none of the AI Platform Training long-running operations have response objects.
   - The "error" key is present if there was an error. Its value is a Status object.
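The polling procedure described here can be sketched as follows. The fetch is injected as a function so the example stays self-contained: get_operation is a hypothetical callable that you would implement with projects.operations.get, and it is assumed to return the Operation object as a dictionary. The helper names, the polling interval, and the choice to raise RuntimeError on failure are all illustrative assumptions.

```python
import time
from typing import Any, Callable, Dict


def operation_path(project: str, operation_name: str) -> str:
    """Form the name string 'projects/<project>/operations/<operation_name>'."""
    return f"projects/{project}/operations/{operation_name}"


def wait_for_operation(
    get_operation: Callable[[str], Dict[str, Any]],
    name: str,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> Dict[str, Any]:
    """Poll an operation until "done" is true, then return its response.

    `get_operation` is a hypothetical wrapper around projects.operations.get
    that returns the Operation object as a dict (assumption of this sketch).
    """
    for attempt in range(max_polls):
        operation = get_operation(name)
        if operation.get("done"):
            if "error" in operation:
                # The value is a Status object; surface it to the caller.
                raise RuntimeError(f"operation failed: {operation['error']}")
            # Expected to be empty for these long-running operations.
            return operation.get("response", {})
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    raise TimeoutError(f"operation {name} not done after {max_polls} polls")
```

Checking "error" before "response" means a failed operation is never mistaken for a successful one, even though only one of the two keys is expected to be present.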
What's next
- Train a model.
- Learn about using labels to organize your resources.