Learn about troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI.
Missing labels in the test, validation, or training set
When you use the default data split when training an AutoML classification model, Vertex AI might assign too few instances of a class to a particular set (test, validation, or training), which causes an error during training. This issue more frequently occurs when you have imbalanced classes or a small amount of training data. To resolve this issue, add more training data, manually split your data to assign enough classes to every set, or remove the less frequently occurring labels from your dataset. For more information, see About data splits for AutoML models.
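As an illustration of the manual-split option, the following is a minimal sketch of a stratified split that guarantees every label appears in every set and reports the labels that are too rare to split. The function name `stratified_split`, its parameters, and the split names in `SPLITS` are assumptions made for this example; check the split values that your dataset type expects before you use them in an import file.

```python
import random
from collections import defaultdict

# Assumed split names for illustration only.
SPLITS = ("training", "validation", "test")


def stratified_split(labels, val_fraction=0.1, test_fraction=0.1,
                     min_per_split=1, seed=0):
    """Assign each example to a split so every label appears in every split.

    Returns (assignment, too_rare): a list of split names parallel to
    `labels` (None for unassigned examples) and the set of labels that have
    too few examples to cover all three splits.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    assignment = [None] * len(labels)
    too_rare = set()

    for label, indices in by_label.items():
        if len(indices) < min_per_split * len(SPLITS):
            # Not enough examples: add more data or remove this label.
            too_rare.add(label)
            continue
        rng.shuffle(indices)
        # Reserve at least the minimum for validation and test;
        # everything that remains goes to training.
        n_val = max(min_per_split, int(len(indices) * val_fraction))
        n_test = max(min_per_split, int(len(indices) * test_fraction))
        for position, index in enumerate(indices):
            if position < n_val:
                assignment[index] = "validation"
            elif position < n_val + n_test:
                assignment[index] = "test"
            else:
                assignment[index] = "training"

    return assignment, too_rare
```

Labels returned in `too_rare` are the candidates for the other two remedies: collect more examples for them, or remove them from the dataset.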
Custom training issues
The following issues can occur during custom training. The issues apply to
HyperparameterTuningJob resources, including those created
Replica exited with a non-zero status code
During distributed training, an error from any worker causes training to fail. To check the stack trace for the worker, view your custom training logs in the Google Cloud Console.
View the other troubleshooting topics to fix common errors and then create a new TrainingPipeline resource. In many cases, the error codes are caused by problems in your training code, not by the Vertex AI service. To determine if this is the case, you can run your training code outside Vertex AI, for example on your local machine.
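One way to check whether a non-zero exit comes from the training code itself is to run the same entry point locally and inspect its exit code and stack trace. The sketch below uses only the Python standard library; the helper name `run_locally` and the script name `train.py` in the usage note are hypothetical.

```python
import subprocess
import sys


def run_locally(command):
    """Run a training command as a subprocess and report how it exited.

    Returns (exit_code, stderr_text).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # The tail of stderr usually holds the Python stack trace.
        print(f"Training exited with code {result.returncode}", file=sys.stderr)
        print(result.stderr[-2000:], file=sys.stderr)
    return result.returncode, result.stderr
```

For example, `run_locally([sys.executable, "train.py", "--epochs", "1"])` runs a hypothetical `train.py` for one epoch. If the same non-zero exit code appears locally, the problem is likely in the training code rather than in the service.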
Replica ran out of memory
This error occurs if a training virtual machine (VM) instance runs out of memory during training. You can view the memory usage of your training VMs in the Cloud Console.
Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application that run on the VM also consume resources. For machine types that have less memory, other services might consume a relatively large percentage of memory. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory.
You can optimize the memory consumption of your training application, or you can choose a larger machine type with more memory.
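Before you move to a larger machine type, it can help to measure where the training application allocates memory. The following is a rough sketch that uses the standard-library `tracemalloc` module to compare the peak memory of two data-loading strategies. It assumes a pure-Python workload: `tracemalloc` reports only allocations made through Python's allocator, so memory held by native libraries or GPU buffers might not appear. The helper and the two example functions are hypothetical.

```python
import tracemalloc


def peak_python_memory_mib(step_fn, num_steps):
    """Run step_fn(step) num_steps times and return peak traced memory in MiB."""
    tracemalloc.start()
    try:
        for step in range(num_steps):
            step_fn(step)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak_bytes / (1024 * 1024)


def load_everything(step):
    # Holds one million values in memory at once.
    data = [0.0] * 1_000_000
    return sum(data)


def stream_batches(step):
    # Processes the same number of values, 10,000 at a time.
    total = 0.0
    for _ in range(100):
        batch = [0.0] * 10_000
        total += sum(batch)
    return total
```

Comparing `peak_python_memory_mib(load_everything, 1)` with `peak_python_memory_mib(stream_batches, 1)` shows how much streaming data in batches lowers the peak for the same amount of work.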
Insufficient resources in a region
Vertex AI trains your models by using Compute Engine resources. Vertex AI cannot schedule your workload if Compute Engine is at capacity for a certain CPU or GPU in a region. This issue is also known as a stockout, and it is unrelated to your project quota.
When Compute Engine reaches capacity, Vertex AI automatically retries your HyperparameterTuningJob up to three times. The job fails if all retries fail.
A stockout usually occurs when you are using GPUs. If you encounter this error when using GPUs, try switching to a different GPU type. If you can use another region, try training in a different region.
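The advice to switch GPU types or regions can be automated with a small fallback loop. The sketch below is deliberately independent of any client library: `submit` is a caller-supplied function, and `StockoutError` is a hypothetical exception that the function is assumed to raise when a region has no capacity. Neither is part of the Vertex AI API, and the region and accelerator names in the usage note are illustrative only.

```python
class StockoutError(Exception):
    """Hypothetical error raised by the caller's submit function on a stockout."""


def submit_with_fallback(submit, candidates):
    """Try each (region, accelerator_type) pair in order until one is accepted.

    `submit` is supplied by the caller and must accept the keyword arguments
    `region` and `accelerator_type`. Returns (job, failures), where `failures`
    lists the candidates that were skipped because of a stockout.
    """
    failures = []
    for region, accelerator_type in candidates:
        try:
            job = submit(region=region, accelerator_type=accelerator_type)
        except StockoutError as error:
            failures.append((region, accelerator_type, str(error)))
            continue
        return job, failures
    raise RuntimeError(f"No capacity for any candidate: {failures}")
```

A caller might pass candidates such as `[("us-central1", "NVIDIA_TESLA_V100"), ("us-central1", "NVIDIA_TESLA_T4"), ("europe-west4", "NVIDIA_TESLA_T4")]`, ordered by preference, after confirming that each GPU type is offered in the listed region.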
Internal error
This error occurs if training failed because of a system error. The issue might be transient; try to resubmit the TrainingPipeline. If the error persists, contact support.
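Resubmitting after a transient failure is commonly wrapped in a retry loop with exponential backoff. The following is a minimal sketch under stated assumptions: `TransientError` is a hypothetical exception that stands in for whatever error your client code raises on an internal error, and the attempt count and delays are arbitrary defaults, not values taken from Vertex AI.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for an error that is worth retrying."""


def retry_transient(operation, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay_s, then twice that, and so on between attempts.
    The last TransientError is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Persistent failure: surface it so that it can be escalated.
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
```

If the error is still raised after the final attempt, treat it as persistent and contact support rather than retrying indefinitely.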
When using Vertex Vizier, you might get the following issues.
The internal error occurs when there is a system error. It might be transient. Try to resend the request, and if the error persists, contact support.