This page explains the state of a training cluster through the lifecycle of a training job, and how AI Platform handles training errors. You can use this information to adapt your training code accordingly.
Lifecycle of a training job
This section explains how AI Platform handles worker VMs through the lifecycle of a training job.
Queueing a new job
When you create a
HyperparameterTuningJob, the job might remain in the
JOB_STATE_QUEUED state for some time before
AI Platform runs it. This period is usually brief, but if your
Google Cloud project does not have sufficient remaining custom training
quotas for your job, then AI Platform
keeps the job queued until you have sufficient quotas.
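As a rough client-side sketch, one way to wait out the queued period is to poll the job state until it is no longer JOB_STATE_QUEUED. The `wait_until_dequeued` function, the `get_state` callable, and the polling and timeout values are hypothetical placeholders, not part of any AI Platform client library; in practice `get_state` would wrap whichever client call returns the job's current state.

```python
import time
from typing import Callable

QUEUED = "JOB_STATE_QUEUED"


def wait_until_dequeued(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the queued state and return the new state.

    get_state is a hypothetical callable that returns the job's state string.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state != QUEUED:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("job is still queued; check your project quotas")
        sleep(poll_seconds)
```

A long wait in this loop is a hint to check the project's custom training quotas rather than a sign that the job has failed.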
Starting workers in parallel
When a training job starts, AI Platform schedules as many workers as
possible in a short amount of time. As a result, workers may start up in
parallel instead of sequentially. In order to reduce startup latency,
AI Platform starts running your code on each worker as soon as it
becomes available. When all the workers are available, AI Platform
sets the job state to
In most cases, your machine learning framework automatically handles the workers starting in parallel. If you're using a distribution strategy in your training code, you may need to adjust it manually to handle workers starting in parallel. Learn more about distribution strategies in TensorFlow and in PyTorch.
Restarting workers during the training job
During a training job, AI Platform can restart your workers from any worker pool with the same hostname. This can occur for the following reasons:
- VM maintenance: When the VM running a worker is subjected to VM maintenance, AI Platform restarts the worker on another VM. Learn more about live migration for VM maintenance.
- Non-zero exits: If any worker exits with a non-zero exit code, AI Platform restarts that worker immediately in the same VM.
  - If a worker fails due to a common error, it is treated as a permanent error, and AI Platform shuts down the entire job. If any containers restart before AI Platform shuts down the entire job, these containers may produce logs in Cloud Logging.
  - If a worker fails due to a non-permanent error (any error not listed in the common errors), AI Platform allows the restarted worker to continue running, with up to five restarts per worker. After five restarts, if a worker fails again, AI Platform retries the entire job up to three times before failing the entire job.
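The restart rules for non-zero exits can be modelled as a small decision function. This is only an illustrative sketch of the behaviour described here, not AI Platform's implementation; `next_action`, the `Action` enum, and the two counters are hypothetical names.

```python
from enum import Enum

MAX_WORKER_RESTARTS = 5  # restarts allowed per worker, per the text above
MAX_JOB_RETRIES = 3      # whole-job retries allowed, per the text above


class Action(Enum):
    SHUT_DOWN_JOB = "shut_down_job"    # permanent (common) error
    RESTART_WORKER = "restart_worker"  # restart the failed worker only
    RETRY_JOB = "retry_job"            # retry the entire job
    FAIL_JOB = "fail_job"              # all restarts and retries are used up


def next_action(is_permanent_error: bool, worker_restarts: int, job_retries: int) -> Action:
    """Model what happens after a worker exits with a non-zero code.

    worker_restarts: restarts this worker has already used.
    job_retries: whole-job retries that have already been used.
    """
    if is_permanent_error:
        return Action.SHUT_DOWN_JOB
    if worker_restarts < MAX_WORKER_RESTARTS:
        return Action.RESTART_WORKER
    if job_retries < MAX_JOB_RETRIES:
        return Action.RETRY_JOB
    return Action.FAIL_JOB
```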
To handle worker restarts in your training code, save checkpoints regularly during training so that you can restore from checkpoints when a worker restarts. Learn how to use training checkpoints in TensorFlow and in PyTorch.
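The checkpoint-and-resume pattern can be sketched without any framework. The example below assumes the training state fits in a small JSON file; `save_checkpoint`, `load_checkpoint`, `train`, and the checkpoint path are hypothetical names, and real training code would normally use the checkpoint utilities of TensorFlow or PyTorch instead.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so a restart never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int = 10) -> int:
    """Resume from the last checkpoint and run until total_steps is reached."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # ... one training step would run here ...
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always persist the final state
    return state["step"]
```

If a worker is restarted mid-run, calling `train` again with the same path picks up at the last saved step instead of step 0. The checkpoint interval trades lost work after a restart against the cost of writing checkpoints.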
Successfully completing a job
A training job completes successfully when its primary replica exits with exit code 0. At that point, AI Platform shuts down all the other running workers.
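Because job success depends on the primary replica's exit code, it can help to make the entry point's exit status explicit. A minimal sketch, assuming a hypothetical `run_training` callable stands in for your training loop; the choice of 1 as the failure code is arbitrary.

```python
import sys
import traceback
from typing import Callable


def main(run_training: Callable[[], None]) -> int:
    """Return 0 if training finished, or 1 if it raised an exception."""
    try:
        run_training()
    except Exception:
        traceback.print_exc()  # keep the stack trace in the job's logs
        return 1
    return 0


def run_and_exit(run_training: Callable[[], None]) -> None:
    """Exit the process with the status code computed by main()."""
    sys.exit(main(run_training))
```

Calling `run_and_exit` from the script's entry point means the process ends with exit code 0 only when `run_training` returns normally.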
How AI Platform handles training job errors
This section explains how AI Platform handles common training job errors and internal errors.
About one minute after a job ends, AI Platform sets the error code on the training job object, based on the exit code.
Handling common errors
AI Platform shuts down all workers if it encounters any of the following issues:
|Error Type||Error Message/Log||Note|
|User code exception||The replica REPLICA_NAME exited with a non-zero status of EXIT_CODE. Termination reason: REASON.||If the job encountered exit codes that could be transient, AI Platform tries to restart the job up to three times. The potentially transient error codes that prompt AI Platform to retry the job include the following:|
|Out-of-memory||The replica REPLICA_NAME ran out of memory and exited with a non-zero status of EXIT_CODE.||GKE reserves memory on AI Platform nodes. On the smallest machine types (such as|
|Insufficient capacity in your region (Compute Engine stockout)||Resources are insufficient in region: REGION_NAME. Please try a different region. If you use||A stockout happens when Compute Engine is at capacity for your selected CPU or GPU in your region. It is unrelated to your project quota. When this happens, AI Platform attempts to restart the job up to three times.|
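When inspecting job errors programmatically, the user code exception message in the table above can be matched with a regular expression. This sketch assumes the message follows exactly the template shown, that REPLICA_NAME contains no whitespace, and that EXIT_CODE is an integer; `parse_replica_failure` and `ReplicaFailure` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ReplicaFailure(NamedTuple):
    replica: str
    exit_code: int
    reason: str


# Mirrors the "User code exception" template; the exact wording is assumed.
_PATTERN = re.compile(
    r"The replica (?P<replica>\S+) exited with a non-zero status of "
    r"(?P<code>-?\d+)\. Termination reason: (?P<reason>.+?)\.?$"
)


def parse_replica_failure(message: str) -> Optional[ReplicaFailure]:
    """Extract replica name, exit code, and reason, or None if no match."""
    match = _PATTERN.search(message.strip())
    if match is None:
        return None
    return ReplicaFailure(match["replica"], int(match["code"]), match["reason"])
```

Messages that use a different template, such as the out-of-memory message, do not match and return `None`, so callers can fall back to treating the text as opaque.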
Handling internal errors
If AI Platform has an internal error, it attempts to restart a job
twice (three attempts in total). If the restart attempts also fail,
AI Platform returns an internal error with the message:
Internal error occurred for the current attempt.