Depending on the size of your dataset and the complexity of your model, training can take a long time. Training from real-world data can last many hours. You can monitor several aspects of your job while it runs.
Checking job status
For overall status, the easiest way to check on your job is the AI Platform Training Jobs page in the Google Cloud console. You can get the same details programmatically and with the Google Cloud CLI.
Console
Open the AI Platform Training Jobs page in the Google Cloud console.
Click your job name in the list to open the Job Details page.
Find your job status at the top of the report. The icon and text describe the current state of the job.
Filtering jobs
On the Jobs page, you can filter your jobs by several different parameters, including Type, JobID, State, and job creation time.
- Click within the Filter by prefix field, which is located above your list of jobs, and select a prefix you want to use for filtering. For example, select Type.
- To complete the filter, click the filter suffix you want to use. For example, the suffix options for the Type prefix are:
  - Custom code training
  - Built-in algorithms training
  - Prediction
The filter is applied to your Jobs list, and the name of the filter displays in the filter field. For example, if you selected Custom code training, the filter Type:Custom code training displays at the top, and filters your jobs list. You can add multiple filters, if needed.
Viewing hyperparameter trials
On the Job Details page, you can view metrics for each trial in the HyperTune trials table. This table appears only for jobs that use hyperparameter tuning. You can sort the table to display trials by highest or lowest rmse, Training steps, or learning_rate.
To view logs for a specific trial, open the options menu for that trial and select View logs.
gcloud
Use gcloud ai-platform jobs describe to get details about the current state of the job on the command line:
gcloud ai-platform jobs describe job_name
You can get a list of jobs associated with your project, including job status and creation time, with gcloud ai-platform jobs list.
Note that this command, in its simplest form, lists all of the jobs ever created for your project. You should scope your request to limit the number of jobs reported. The following examples should get you started:
Use the --limit argument to restrict the number of jobs. This example lists the 5 most recent jobs:
gcloud ai-platform jobs list --limit=5
Use the --filter argument to restrict the list to jobs with a given attribute value. You can filter on one or more attributes of the Job object. In addition to the core job attributes, you can filter on objects within the job, such as the TrainingInput object.
Examples of filtering the list:
List all jobs that were started after a particular time. This example uses 7 o'clock on the evening of January 15, 2017:
gcloud ai-platform jobs list --filter='createTime>2017-01-15T19:00'
List the last three jobs with names that start with a given string. For example, the string may represent the name that you use for all training jobs for a particular model. This example uses a model where the job identifier is 'census' with a suffix that's an index incremented for each job:
gcloud ai-platform jobs list --filter='jobId:census*' --limit=3
List all failed jobs with names that start with 'rnn':
gcloud ai-platform jobs list --filter='jobId:rnn* AND state:FAILED'
For details of the expressions supported by the filter option, see the documentation for the gcloud command.
Python
Assemble your job identifier string by combining your project name and job name into the form:
'projects/your_project_name/jobs/your_job_name'
projectName = 'your_project_name'
projectId = 'projects/{}'.format(projectName)
jobName = 'your_job_name'
jobId = '{}/jobs/{}'.format(projectId, jobName)
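The snippets that follow call methods on an ml client object that isn't constructed here. As a minimal sketch, assuming the google-api-python-client library is installed and Application Default Credentials are available, you could build it as follows (the errors import is used later when handling HTTP errors):

from googleapiclient import discovery
from googleapiclient import errors

# Build a client for the AI Platform Training and Prediction API
# ('ml', version 'v1'). Credentials are read from the environment
# (Application Default Credentials).
ml = discovery.build('ml', 'v1')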
Form the request to projects.jobs.get:
request = ml.projects().jobs().get(name=jobId)
Execute the request (this example puts the execute call in a try block to catch exceptions):
response = None
try:
    response = request.execute()
except errors.HttpError as err:
    # Something went wrong. Handle the exception in an appropriate
    # way for your application.
    pass
Check the response to ensure that, irrespective of HTTP errors, the service call returned data.
if response is None:
    # Treat this condition as an error as best suits your
    # application.
    pass
Get status data. The response object is a dictionary containing all applicable members of the Job resource, including the full TrainingInput resource and the applicable members of the TrainingOutput resource. The following example prints the job status and the number of ML units consumed by the job.
print('Job status for {}.{}:'.format(projectName, jobName))
print('    state : {}'.format(response['state']))
print('    consumedMLUnits : {}'.format(
    response['trainingOutput']['consumedMLUnits']))
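If you want your script to wait for the job to finish rather than check its status once, one option is to poll the same projects.jobs.get endpoint until the job reaches a terminal state. This is a sketch rather than part of the API; the choice of terminal states and the 60-second interval are assumptions you should adapt:

import time

# States in which the job is no longer running.
TERMINAL_STATES = frozenset(['SUCCEEDED', 'FAILED', 'CANCELLED'])

while True:
    job = ml.projects().jobs().get(name=jobId).execute()
    print('Current state: {}'.format(job['state']))
    if job['state'] in TERMINAL_STATES:
        break
    time.sleep(60)  # Wait a minute between checks to limit API calls.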
Jobs can fail if there is a problem with your training application or with the AI Platform Training infrastructure. You can use Cloud Logging to start debugging.
You can also use an interactive shell to inspect your training containers while the training job is running.
Monitoring resource consumption
You can find the following resource utilization charts for your training jobs on the Job Details page:
- The job's aggregate CPU or GPU utilization, and the memory utilization. These are broken down by master, worker, and parameter server.
- The job's network usage, measured in bytes per second. There are separate charts for bytes sent, and bytes received.
To view these charts, go to the AI Platform Training Jobs page in the Google Cloud console.
Find your job in the list.
Click your job name in the list to open the Job Details page.
Select the tabs labeled CPU, GPU, or Network to view the associated resource utilization charts.
You can also access information about the online resources that your training jobs use with Cloud Monitoring. AI Platform Training exports metrics to Cloud Monitoring.
Each AI Platform Training metric type includes "training" in its name, for example, ml.googleapis.com/training/cpu/utilization or ml.googleapis.com/training/accelerator/memory/utilization.
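If you prefer to read these metrics programmatically rather than in the console, you can query them through the Cloud Monitoring API. The following is a minimal sketch, assuming the google-cloud-monitoring Python client (v2.x) is installed; your_project_name is a placeholder and the one-hour window is arbitrary:

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = 'projects/your_project_name'

# Query the last hour of data.
now = int(time.time())
interval = monitoring_v3.TimeInterval({
    'end_time': {'seconds': now},
    'start_time': {'seconds': now - 3600},
})

# List time series for the training CPU utilization metric type.
results = client.list_time_series(request={
    'name': project_name,
    'filter': 'metric.type = "ml.googleapis.com/training/cpu/utilization"',
    'interval': interval,
    'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
})
for series in results:
    print(series.resource.labels, series.points[0].value)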
Monitoring with TensorBoard
You can configure your training application to save summary data that you can examine and visualize using TensorBoard.
Save your summary data to a Cloud Storage location and point TensorBoard to that location to examine the data. You can also point TensorBoard to a directory with subdirectories that contain the output from multiple jobs.
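As a sketch of what saving summary data to Cloud Storage can look like in TensorFlow 2 (the bucket path, metric name, and values here are placeholders, not something this guide prescribes):

import tensorflow as tf

# Write TensorBoard summaries directly to a Cloud Storage path.
writer = tf.summary.create_file_writer('gs://your-bucket/logs/your_job_name')

with writer.as_default():
    for step in range(100):
        # Replace with metrics from your own training loop.
        tf.summary.scalar('loss', 1.0 / (step + 1), step=step)

writer.flush()

Pointing TensorBoard at gs://your-bucket/logs then shows every job written under that prefix.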
See more information about TensorBoard and AI Platform Training in the getting-started guide.
What's next
- Troubleshoot problems with your training job.
- Deploy your trained model for online testing and prediction serving.