Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI.

AutoML models

Missing labels in the test, validation, or training set

When you use the default data split to train an AutoML classification model, Vertex AI might assign too few instances of a class to a particular set (test, validation, or training), which causes an error during training. This issue occurs more frequently when you have imbalanced classes or a small amount of training data. To resolve this issue, add more training data, manually split your data so that every set contains enough instances of each label, or remove the less frequently occurring labels from your dataset. For more information, see About data splits for AutoML models.
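For example, here is a minimal sketch of a manual fraction split using the Vertex AI SDK for Python; the project, dataset ID, target column, and fraction values are illustrative assumptions:

```python
# Minimal sketch: overriding the default data split with explicit fractions
# by using the Vertex AI SDK for Python. The project, dataset, target column,
# and fraction values below are illustrative assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Reference an existing tabular dataset by its resource name or ID.
dataset = aiplatform.TabularDataset("1234567890")

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="classification-manual-split",
    optimization_prediction_type="classification",
)

# Replace the default 80/10/10 split with fractions that leave enough
# instances of every label in each set.
model = job.run(
    dataset=dataset,
    target_column="label",
    training_fraction_split=0.7,
    validation_fraction_split=0.15,
    test_fraction_split=0.15,
)
```

If adjusting fractions alone doesn't give every label enough coverage, the SDK also accepts a predefined split column that assigns each row to a set explicitly.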

Custom-trained models

Custom training issues

The following issues can occur during custom training. They apply to CustomJob and HyperparameterTuningJob resources, including those created by TrainingPipeline resources.

Replica exited with a non-zero status code

During distributed training, an error from any worker causes training to fail. To check the stack trace from the failed worker, view your custom training logs in the Google Cloud console.
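You can also query the same logs programmatically with the Cloud Logging client library. A minimal sketch, assuming the job's log entries are indexed under the ml_job resource type and that 1234567890 stands in for your job's numeric ID:

```python
# Minimal sketch: pulling error-level log entries for a custom training job
# with the Cloud Logging client library. Assumes the job's logs are indexed
# under the "ml_job" resource type and that JOB_ID is the job's numeric ID;
# adjust the filter to match what you see in the Logs Explorer.
from google.cloud import logging

client = logging.Client(project="my-project")  # illustrative project ID

JOB_ID = "1234567890"
log_filter = (
    f'resource.type="ml_job" AND resource.labels.job_id="{JOB_ID}" '
    "AND severity>=ERROR"
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    # Print the worker's error message, which typically includes the stack trace.
    print(entry.payload)
```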

Review the other troubleshooting topics to fix common errors, and then create a new CustomJob, HyperparameterTuningJob, or TrainingPipeline resource. In many cases, non-zero exit codes are caused by problems in your training code rather than by the Vertex AI service. To determine whether this is the case, run your training code on your local machine or on a Compute Engine VM.
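After you fix the underlying problem, you can recreate the job. A minimal sketch using the Vertex AI SDK for Python, with an illustrative project, staging bucket, machine type, and container image:

```python
# Minimal sketch: recreating a CustomJob with the Vertex AI SDK for Python
# after fixing the error in the training code. The project, bucket, image,
# and machine settings are illustrative assumptions.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

worker_pool_specs = [
    {
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest"
        },
    }
]

job = aiplatform.CustomJob(
    display_name="my-custom-job-retry",
    worker_pool_specs=worker_pool_specs,
)
job.run()  # blocks until the job completes or fails
```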

Replica ran out of memory

This error occurs if a training virtual machine (VM) instance runs out of memory during training. You can view the memory usage of your training VMs in the Google Cloud console.

Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application that run on the VM also consume resources. For machine types that have less memory, other services might consume a relatively large percentage of memory. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory.

You can optimize the memory consumption of your training application, or you can choose a larger machine type with more memory.
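If you created the job with the Vertex AI SDK for Python, switching to a larger machine type is a small change in the worker pool spec. A minimal sketch, with illustrative machine types and image URI:

```python
# Minimal sketch: requesting a machine type with more memory for a CustomJob.
# The machine types and image URI are illustrative; pick values that fit
# your workload.
worker_pool_specs = [
    {
        # n1-highmem-8 provides 52 GB of RAM, versus 15 GB on n1-standard-4.
        "machine_spec": {"machine_type": "n1-highmem-8"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest"
        },
    }
]
```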

Insufficient resources in a region

Vertex AI trains your models by using Compute Engine resources. Vertex AI cannot schedule your workload if Compute Engine is at capacity for a certain CPU or GPU in a region. This issue is also known as a stockout, and it is unrelated to your project quota.

When Compute Engine is at capacity, Vertex AI automatically retries your CustomJob or HyperparameterTuningJob up to three times. If all of the retries fail, the job fails.

Stockouts usually occur when you use GPUs. If you encounter this error, try switching to a different GPU type or, if your workload allows, train in a different region.
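For example, a minimal sketch of recreating a CustomJob with the Vertex AI SDK for Python using a different GPU type and region; the region, accelerator, and image values are illustrative assumptions, so confirm availability before you rely on them:

```python
# Minimal sketch: switching the GPU type and region for a CustomJob when the
# original combination is out of capacity. The region, GPU type, and other
# values are illustrative; confirm that the accelerator is available in the
# region you choose.
from google.cloud import aiplatform

# Initialize against a different region that has capacity.
aiplatform.init(
    project="my-project",
    location="europe-west4",
    staging_bucket="gs://my-staging-bucket",
)

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            # Try another GPU type, for example T4 instead of V100.
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest"
        },
    }
]

job = aiplatform.CustomJob(
    display_name="training-in-another-region",
    worker_pool_specs=worker_pool_specs,
)
job.run()
```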

Internal error

This error occurs if training fails because of a system error. The issue might be transient; try resubmitting the CustomJob, HyperparameterTuningJob, or TrainingPipeline. If the error persists, contact support.
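If you want to automate the resubmission, a rough sketch follows; the attempt count, backoff, and broad exception handling are illustrative assumptions, and the exact exception raised depends on how the job client surfaces the failure:

```python
# Minimal sketch: resubmitting a job a few times when a transient system
# error occurs. The number of attempts, the backoff, and the broad exception
# handling are illustrative assumptions; inspect the error before retrying
# in real code.
import time

def run_with_retries(make_job, attempts=3, backoff_seconds=120):
    """make_job is a callable that returns a freshly constructed job object."""
    last_error = None
    for attempt in range(1, attempts + 1):
        job = make_job()
        try:
            job.run()
            return job
        except Exception as err:  # the exact exception type depends on the failure
            last_error = err
            print(f"Attempt {attempt} failed: {err}")
            time.sleep(backoff_seconds * attempt)
    raise last_error
```

You might call this as run_with_retries(lambda: aiplatform.CustomJob(...)) so that each attempt gets a fresh job resource.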

Vertex AI Vizier

When you use Vertex AI Vizier, you might encounter the following issues.

Internal error

An internal error occurs when there is a system error. The issue might be transient; try resending the request, and if the error persists, contact support.