This page describes troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI.
Troubleshooting steps for some Vertex AI components are documented separately.
AutoML models
This section describes troubleshooting steps that you might find helpful if you run into problems with AutoML.
Missing labels in the test, validation, or training set
Issue
When you use the default data split when training an AutoML classification model, Vertex AI might assign too few instances of a class to a particular set (test, validation, or training), which causes an error during training. This issue more frequently occurs when you have imbalanced classes or a small amount of training data.
Solution
To resolve this issue, add more training data, manually split your data to assign enough classes to every set, or remove the less frequently occurring labels from your dataset. For more information, see About data splits for AutoML models.
Vertex AI Studio
When working with Vertex AI Studio, you might receive the following errors:
Attempting to tune a model returns Internal error encountered
Issue
You encounter an Internal error encountered error when trying to tune a model.
Solution
Run the following curl command to create an empty Vertex AI dataset. Ensure that you configure your project ID in the command.
PROJECT_ID=PROJECT_ID
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://europe-west4-aiplatform.googleapis.com/ui/projects/$PROJECT_ID/locations/europe-west4/datasets \
-d '{
"display_name": "test-name1",
"metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/image_1.0.0.yaml",
"saved_queries": [{"display_name": "saved_query_name", "problem_type": "IMAGE_CLASSIFICATION_MULTI_LABEL"}]
}'
After the command completes, wait five minutes and try model tuning again.
Error code: 429
Issue
You encounter the following error:
429: The online prediction request quota is exceeded for PUBLIC_BASE_MODEL_NAME.
Solution
Try again later with backoff. If you still experience errors, contact Vertex AI support.
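A retry loop with exponential backoff might look like the following sketch. The helper and the RuntimeError stand-in for a quota error are illustrative, not part of the Vertex AI SDK:

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5, base_delay=1.0):
    """Retry a callable that raises on a 429-style quota error,
    backing off exponentially with jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except RuntimeError:  # stand-in for a quota-exceeded (429) error
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: base, 2x, 4x, ... plus noise.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

In production code, catch the specific quota-exceeded exception raised by your client library rather than RuntimeError.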
Error code: 401
Issue
You encounter the following error:
401: The request is missing the required authentication credential. Expected OAuth 2.0 access token, login cookie, or other valid authentication credential.
Solution
See the Authentication overview to learn more.
Error code: 403
Issue
You encounter the following error:
403: Permission denied.
Solution
Ensure that the account accessing the API has the right permissions.
Vertex AI Pipelines
This section describes troubleshooting steps that you might find helpful if you run into problems with Vertex AI Pipelines.
You don't have permission to act as service account
Issue
When you run your Vertex AI Pipelines workflow, you might encounter the following error message:
You do not have permission to act as service account: SERVICE_ACCOUNT. (or it may not exist).
Solution
This error means that the service account running your workflow doesn't have the access that it needs.
To resolve this issue, try one of the following:
- Add the Vertex AI Service Agent role to the service account.
- Grant the user the iam.serviceAccounts.actAs permission on the service account.
Internal error happened
Issue
If your pipeline fails with an Internal error happened message, check Logs Explorer and search for the pipeline's name. You might see an error like the following:
java.lang.IllegalStateException: Failed to validate vpc network projects/PROJECT_ID/global/networks/VPC_NETWORK.APPLICATION_ERROR;google.cloud.servicenetworking.v1/ServicePeeringManagerV1.GetConsumerConfig;Reserved range: 'RANGE_NAME' not found for consumer project: 'PROJECT_ID' network: 'VPC_NETWORK'. com.google.api.tenant.error.TenantManagerException: Reserved range: 'RANGE_NAME' not found for consumer project
This means that VPC peering for Vertex AI includes an IP range that has been deleted.
Solution
To resolve this issue, update the VPC peering by running gcloud services vpc-peerings update and include valid IP ranges.
Invalid OAuth scope or ID token audience provided
Issue
When you run your Vertex AI Pipelines workflow, you encounter the following error message:
google.auth.exceptions.RefreshError: ('invalid_scope: Invalid OAuth scope
or ID token audience provided.', {'error': 'invalid_scope',
'error_description': 'Invalid OAuth scope or ID token audience provided.'})
Solution
This means that you haven't provided credentials in one of the pipeline's components or didn't use aiplatform.init() to set credentials.
To resolve this issue, set the credentials for the relevant pipeline component, or set the environment credentials and call aiplatform.init() at the beginning of your code:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = PATH_TO_JSON_KEY
Vertex AI Pipelines components require more disk space than 100 GB
Issue
The default disk space allocated to Vertex AI Pipelines components is 100 GB and increasing the disk space isn't supported. See the Public Issue Tracker for this issue.
Solution
For a component to use more than 100 GB of disk space, convert the component to a Vertex AI custom job by using the custom training job utility in Google Cloud Pipeline Components. With a custom job, you can assign the machine type and disk size that the component uses.
For an example of how to use this operator, see Vertex AI Pipelines: Custom training with prebuilt Google Cloud Pipeline Components, in the Convert the component to a Vertex AI Custom Job section.
Vertex AI networking issues
This section describes troubleshooting steps that you might find helpful if you run into problems with networking for Vertex AI.
To check whether VPC Service Controls is enabled for the service networking peering, run the following command:
gcloud services vpc-peerings get-vpc-service-controls \
--network YOUR_NETWORK
Workloads can't access endpoints in your VPC network when using privately-used public IP ranges for Vertex AI
Issue
Privately used public IP ranges are not imported by default.
Solution
To use privately used public IP ranges, you must enable the import of privately used public IP ranges on your VPC Network Peering connection.
com.google.api.tenant.error.TenantManagerException: Reserved range: xxx not found for consumer project
Issue
You receive errors of the form com.google.api.tenant.error.TenantManagerException: Reserved range: xxx not found for consumer project when running workloads or deploying endpoints.
This occurs when you change the private services access reservations for your workloads. Deleted ranges might not yet be registered with the Vertex AI API.
Solution
Run gcloud services vpc-peerings update for the servicenetworking service after updating private services access allocations.
Pipeline or job can't access endpoints within your peered VPC network
Issue
Your Vertex AI pipeline times out when it attempts to connect to resources in your VPC network.
Solution
Try the following to resolve the problem:
- Ensure that you have completed all of the steps in Set up VPC Network Peering.
- Review the configuration of your peered VPC network. Ensure that your network imports routes from the correct service networking range while your job is running.
- Ensure that you have a firewall rule that allows connections from this range to the target in your network.
If the peering connection doesn't import any routes while your job is running, the service networking configuration isn't being used. This is likely because you completed the peering configuration with a network other than the default network. If this is the case, ensure that you specify your network when you launch a job. Use the fully qualified network name in the following format:
projects/$PROJECT_ID/global/networks/$NETWORK_NAME
For more information, see the Routes overview.
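As a small illustrative helper, the fully qualified network name can be built like this:

```python
def network_resource_name(project_id, network_name):
    """Return the fully qualified VPC network name that Vertex AI
    expects when you launch a job with a custom network."""
    return f"projects/{project_id}/global/networks/{network_name}"
```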
Pipeline or job can't reach endpoints in networks beyond your network
Issue
Your pipeline or job is unable to reach endpoints in networks beyond your directly peered network.
Solution
By default, your peering configuration only exports routes to the local subnets in your VPC.
Additionally, transitive peering is not supported and only directly peered networks can communicate.
- To allow Vertex AI to connect through your network and reach endpoints in other networks, export your network routes to your peering connection. Edit the configuration of your peered VPC network and enable Export custom routes.
- Because transitive peering is not supported, Vertex AI doesn't learn routes to other peered networks and services, even with Export custom routes enabled. For information about workarounds, see Extending network reachability of Vertex AI Pipelines.
No route to host without route conflicts evident in the Google Cloud console
Issue
The only routes you can see in the Google Cloud console are those known to your own VPC as well as the ranges reserved when you complete the VPC Network Peering configuration.
On rare occasions, a Vertex AI job might throw a no route to host error when trying to reach an IP address that your VPC is exporting to the Vertex AI network.
This might be because Vertex AI jobs run within a networking namespace in a managed GKE cluster whose IP range conflicts with the target IP. See GKE networking fundamentals for further discussion.
Under these conditions, the workload tries to connect to the IP within its own networking namespace and throws the error if it's unable to reach it.
Solution
Craft your workload to return its local namespace IP addresses and confirm that these don't conflict with any routes you are exporting over the peering connection. If there is a conflict, pass a list of reservedIpRanges[] in the job parameters that doesn't overlap with any ranges in your VPC network. The job uses these ranges for the workload's internal IP addresses.
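Before passing reservedIpRanges[], you can check candidate CIDR blocks against the routes your VPC exports. The following sketch uses Python's ipaddress module; the range values are hypothetical:

```python
import ipaddress

def pick_non_overlapping(candidates, exported_routes):
    """Return the candidate CIDR ranges that don't overlap any
    route exported over the peering connection."""
    routes = [ipaddress.ip_network(r) for r in exported_routes]
    safe = []
    for cidr in candidates:
        net = ipaddress.ip_network(cidr)
        if not any(net.overlaps(r) for r in routes):
            safe.append(cidr)
    return safe

# Hypothetical values: candidate job ranges vs. routes your VPC exports.
job_ranges = pick_non_overlapping(
    ["10.0.8.0/21", "192.168.0.0/20"], ["10.0.0.0/16"])
```

Here 10.0.8.0/21 is rejected because it overlaps the exported 10.0.0.0/16 route, leaving 192.168.0.0/20 as a safe reservedIpRanges[] entry.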
RANGES_EXHAUSTED, RANGES_NOT_RESERVED
Issue
Errors of the form RANGES_EXHAUSTED, RANGES_NOT_RESERVED, and RANGES_DELETED_LATER indicate a problem with the underlying VPC Network Peering configuration. These are networking errors, not errors from the Vertex AI service itself.
Solution
When faced with a RANGES_EXHAUSTED error, first consider whether the complaint is valid.
- Visit Network Analyzer in the Google Cloud console and look for insights of the form "Summary of IP address allocation" in the VPC network. If these indicate that the allocation is at or near 100%, you can add a new range to the reservation.
- Also consider the maximum number of parallel jobs that can run with a reservation of a given size.
For more information, see Service Infrastructure Validation Errors.
If the error persists, contact support.
Router status is temporarily unavailable
Issue
When you launch Vertex AI Pipelines, you receive an error message similar to the following:
Router status is temporarily unavailable. Please try again later
Solution
The error message indicates that this is a temporary condition. Try launching Vertex AI Pipelines again.
If the error persists, contact support.
Vertex AI prediction
This section describes troubleshooting steps that you might find helpful if you run into problems with Vertex AI prediction.
Exceeded retries error
Issue
You get an error such as the following when running batch prediction jobs, indicating that the machine running the custom model might not be able to complete the predictions within the time limit.
('Post request fails. Cannot get predictions. Error: Exceeded retries: Non-OK
result 504 (upstream request timeout) from server, retry=3, elapsed=600.04s.', 16)
This can happen when the Vertex AI prediction service registers itself with the Google Front End service, which proxies connections from the client to the Vertex AI Prediction API.
The Google Front End service times out the connection and returns a 500 HTTP response code to the client if it doesn't receive a response from the API within 10 minutes.
Solution
To resolve this issue, try either of the following:
- Increase the compute nodes, or change the machine type.
- Craft your prediction container to send periodic 102 (Processing) HTTP responses. Each response resets the 10-minute timer on the Google Front End service.
Project already linked to VPC
Issue
When deploying an endpoint, you might see an error message such as the following, which indicates that your Vertex AI endpoints have previously used a Virtual Private Cloud network and the resources were not appropriately cleaned.
Currently only one VPC network per user project is supported. Your project is
already linked to "projects/YOUR_SHARED_VPC_HOST_PROJECT/global/networks/YOUR_SHARED_VPC_NETWORK".
To change the VPC network, please undeploy all Vertex AI deployment resources,
delete all endpoint resources, and then retry creating resources in 30 mins.
Solution
To resolve this issue, try running this command in Cloud Shell.
gcloud services vpc-peerings delete \
--service=servicenetworking.googleapis.com \
--network=YOUR_SHARED_VPC_NETWORK \
--project=YOUR_SHARED_VPC_HOST_PROJECT
This manually disconnects your old VPC network from the Service Networking VPC.
Unexpected deployment failure or endpoint deletion
Issue
A model deployment unexpectedly fails, an endpoint is found to be deleted, or a previously deployed model has become undeployed.
Your billing account may be invalid. If it remains invalid for a long time, some resources might be removed from the projects associated with your account. For example, your endpoints and models might be deleted. Removed resources aren't recoverable.
Solution
To resolve this issue, you can try the following:
- Verify the billing status of your projects.
- Contact Cloud Billing Support to request help with billing questions.
For more information, see Billing questions.
Vertex AI custom service account issues
This section describes troubleshooting steps that you might find helpful if you run into problems with service accounts.
Model deployment fails with a serviceAccountAdmin error
Issue
Your model deployment fails with an error such as the following:
Failed to deploy model MODEL_NAME to
endpoint ENDPOINT_NAME due to the error: Failed to add IAM policy binding.
Please grant SERVICE_ACC_NAME@gcp-sa-aiplatform.iam.gserviceaccount.com the
iam.serviceAccountAdmin role on service account
vertex-prediction-role@PROJECT_INFO.iam.gserviceaccount.com
Solution
This error means that your custom service account might not have been configured correctly. To create a custom service account with the correct IAM permissions, see Use a custom service account.
Unable to fetch identity token when using custom service account
Issue
When using a custom service account, training jobs that run on a single replica are not able to reach the Compute Engine metadata service required to retrieve a token.
You see an error similar to the following:
Failed to refresh jwt, retry number 0: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=...&format=full
from the Google Compute Engine Metadata service. Status: 404 Response:
\nb'Not Found\n'", <google.auth.transport.requests._Response object at
0x7fb19f058c50>)
Solution
To fetch an identity token with a custom service account, use the IAM Credentials API (iamcredentials.googleapis.com) instead of the metadata server.
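For reference, the IAM Credentials API exposes a generateIdToken method; the helper below is illustrative and only builds the method's URL for a given service account. The caller must hold the Service Account Token Creator role on the target account:

```python
def generate_id_token_url(service_account_email):
    """Return the IAM Credentials generateIdToken endpoint for a
    service account (the '-' wildcard stands for the project)."""
    return (
        "https://iamcredentials.googleapis.com/v1/projects/-/"
        f"serviceAccounts/{service_account_email}:generateIdToken"
    )
```

POST a JSON body with an "audience" field to this URL, authenticated as a principal with the Token Creator role, to receive the ID token.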
Custom-trained models
This section describes troubleshooting steps that you might find helpful if you run into problems with custom-trained models.
Custom training issues
The following issues can occur during custom training. The issues apply to CustomJob and HyperparameterTuningJob resources, including those created by TrainingPipeline resources.
Error code: 400
Issue
You encounter the following error:
400 Machine type MACHINE_TYPE is not supported.
You may see this error message if the selected machine type isn't supported for Vertex AI training, or if a specific resource isn't available in the selected region.
Solution
Use only available machine types in the appropriate regions.
Replica exited with a non-zero status code
Issue
During distributed training, an error from any worker causes training to fail.
Solution
To check the stack trace for the worker, view your custom training logs in the Google Cloud console.
View the other troubleshooting topics to fix common errors, and then create a new CustomJob, HyperparameterTuningJob, or TrainingPipeline resource. In many cases, the error codes are caused by problems in your training code, not by the Vertex AI service. To determine whether this is the case, run your training code on your local machine or on Compute Engine.
Replica ran out of memory
Issue
An error can occur if a training virtual machine (VM) instance runs out of memory during training.
Solution
You can view the memory usage of your training VMs in the Google Cloud console.
Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application that run on the VM also consume resources. For machine types that have less memory, other services might consume a relatively large percentage of memory. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory.
You can optimize the memory consumption of your training application, or you can choose a larger machine type with more memory.
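The arithmetic for the n1-standard-4 example works out as follows (15 GB is the published memory size for that machine type; the 40% overhead is the worst case cited above):

```python
total_gb = 15             # n1-standard-4 memory
overhead_fraction = 0.40  # worst-case share used by non-training services
available_gb = total_gb * (1 - overhead_fraction)
print(available_gb)       # roughly 9 GB left for the training application
```

If your training application needs more than that budget, pick a larger machine type rather than sizing against the nominal 15 GB.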
Insufficient resources in a region
Issue
You encounter a stockout issue in a region.
Solution
Vertex AI trains your models by using Compute Engine resources. Vertex AI can't schedule your workload if Compute Engine is at capacity for a certain CPU or GPU in a region. This issue is unrelated to your project quota.
When Compute Engine is at capacity, Vertex AI automatically retries your CustomJob or HyperparameterTuningJob up to three times. The job fails if all retries fail.
A stockout usually occurs when you are using GPUs. If you encounter this error when using GPUs, try switching to a different GPU type. If you can use another region, try training in a different region.
Permission error when accessing another Google Cloud service
If you encounter a permission error when accessing another Google Cloud service from your training code (for example, google.api_core.exceptions.PermissionDenied: 403), then you might have one of the following issues:
- Issue
The service agent or service account running your code (either the Vertex AI Custom Code Service Agent for your project or a custom service account) doesn't have the required permission.
Solution
Learn how to give the Vertex AI Custom Code Service Agent permissions or configure a custom service account with the necessary permissions.
- Issue
The service agent or service account running your code does have the required permission, but your code is trying to access a resource in the wrong project. This is especially likely to be the problem if the error message references a project ID ending with -tp.
Solution
Due to the way Vertex AI runs your training code, this problem can occur inadvertently if you don't explicitly specify a project ID or project number in your code. Learn how to fix this problem by specifying a project ID or project number.
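A defensive sketch for this situation: fail fast if the project your code resolved looks like a Vertex AI tenant project. The helper and the CLOUD_ML_PROJECT_ID environment variable are assumptions for illustration; adjust to however your job receives its project ID:

```python
import os

def resolve_project_id(explicit_project=None):
    """Prefer an explicitly passed project ID over environment lookup,
    and reject IDs that look like Vertex AI tenant projects."""
    project = explicit_project or os.environ.get("CLOUD_ML_PROJECT_ID", "")
    if not project:
        raise ValueError("No project ID configured; pass one explicitly.")
    if project.endswith("-tp"):
        raise ValueError(
            f"'{project}' looks like a Vertex AI tenant project ID; "
            "pass your own project ID or number explicitly.")
    return project
```

Pass the resolved value explicitly to every client constructor instead of relying on each library's default project detection.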
Internal error
Issue
Your training failed because of a system error.
Solution
The issue might be transient; try resubmitting the CustomJob, HyperparameterTuningJob, or TrainingPipeline. If the error persists, contact support.
Error code 500 when using a custom container image
Issue
You see a 500 error in your logs.
Solution
This type of error is likely to be a problem with your custom container image and not a Vertex AI error.
Service account can't access Cloud Storage bucket when deploying to an endpoint
Issue
When you try to deploy a model to an endpoint and your service account doesn't have storage.objects.list access to the related Cloud Storage bucket, you might see the following error:
custom-online-prediction@TENANT_PROJECT_ID.iam.gserviceaccount.com does not have storage.objects.list access to the Cloud Storage bucket.
By default, the custom container that deploys your model uses a service account that doesn't have access to your Cloud Storage bucket.
Solution
To resolve this, try one of the following:
- Copy the file that you are trying to access from the container into the model artifacts when uploading the model. Vertex AI copies it to a location that the default service account can access, as with all other model artifacts.
- Copy the file into the container as part of the container build process.
- Specify a custom service account.
Neural Architecture Search
Known issues
- After cancelling the NAS job, the main job (the parent) stops, but some of the child trials keep showing a Running state. Ignore the child trial state that shows Running in this case. The trials have stopped, but the UI continues to show the Running state. As long as the main job has stopped, you won't be charged extra.
- After reporting rewards in the trainer, wait (sleep) for 10 minutes before the trial jobs exit.
- When using Cloud Shell to run TensorBoard, the generated output link might not work. In this case, write down the port number, open the Web Preview tool, and select the correct port number to display the plots.
- If you see error messages like the following in the trainer logs, use a machine with more RAM, because an out-of-memory (OOM) condition causes this error:
gcsfuse errors: fuse: writeMessage: no such file or directory [16 0 0 0 218 255 255 255 242 25 111 1 0 0 0 0]
- If your custom trainer isn't able to find the job-dir flag, use job_dir with an underscore rather than a hyphen. A note in tutorial-1 explains this.
- NaN error during training: there might be NaN errors in the training job, such as NaN : Tensor had NaN values. The learning rate might be too big for the suggested architecture. For more information, see Out-of-memory (OOM) and learning rate related errors.
- OOM error during training: there might be OOM (out-of-memory) errors in the training job. The batch size might be too large for the accelerator memory. For more information, see Out-of-memory (OOM) and learning rate related errors.
- Proxy-task model selection controller job dies: in the rare case that the proxy-task model selection controller job dies, you can resume the job by following these steps.
- Proxy-task search controller job dies: in the rare case that the proxy-task search controller job dies, you can resume the job by following these steps.
- Service account doesn't have permission to access Artifact Registry or a bucket: if you get an error such as Vertex AI Service Agent service-123456789@gcp-sa-aiplatform-cc.iam.gserviceaccount.com does not have permission to access Artifact Registry repository projects/my-project/locations/my-region/repositories/nas, or a similar error for bucket access, grant this service account a Storage Editor role in your project.
Vertex AI Feature Store
This section describes troubleshooting steps that you might find helpful if you run into problems with Vertex AI Feature Store.
Resource not found error when sending a streaming ingestion or online serving request
Issue
After you set up a featurestore, entity type, or feature resources, there's a delay before those resources are propagated to the FeaturestoreOnlineServingService service. Sometimes this delayed propagation might cause a resource not found error when you submit a streaming ingestion or online serving request immediately after you create a resource.
Solution
If you receive this error, wait a few minutes and then try your request again.
Batch ingestion succeeded for newly created features but online serving request returns empty values
Issue
For newly created features only, there is a delay before those features are propagated to the FeaturestoreOnlineServingService service. The features and values exist but take time to propagate, which might result in your online serving request returning empty values.
Solution
If you do see this inconsistency, wait a few minutes and then try your online serving request again.
CPU utilization is high for an online serving node
Issue
Your CPU utilization for an online serving node is high.
Solution
To mitigate this issue, increase the number of online serving nodes, either by manually increasing the node count or by enabling autoscaling. Note that even with autoscaling enabled, Vertex AI Feature Store needs time to rebalance the data when nodes are added or removed. For information about how to view feature value distribution metrics over time, see View feature value metrics.
CPU utilization is high for the hottest online serving node
Issue
The CPU utilization is high for the hottest online serving node.
Solution
Either increase the number of serving nodes or change the entity access pattern to pseudo-random. Setting the entity access pattern to pseudo-random mitigates high CPU utilization that results from frequently accessing entities that are located near each other in the featurestore. If neither solution is effective, implement a client-side cache to avoid accessing the same entities repeatedly.
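A client-side cache for repeated entity reads can be as simple as an LRU layer in front of the online serving call. In this sketch, read_entity is a stand-in for your serving client; cached results go stale, so keep the cache small or add a TTL in real use:

```python
from functools import lru_cache

def make_cached_reader(read_entity, maxsize=1024):
    """Wrap an online-serving read function with an LRU cache so that
    repeated reads of the same entity ID skip the featurestore."""
    @lru_cache(maxsize=maxsize)
    def cached(entity_id):
        return read_entity(entity_id)
    return cached
```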
Online serving latency is high when QPS is low
Issue
The period of inactivity or low activity at low QPS might result in some server-side caches expiring. This can result in high latency when traffic to online serving nodes resumes at regular or higher QPS.
Solution
To mitigate this issue, keep the connection active by sending artificial traffic of at least 5 QPS to the featurestore.
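A keep-alive loop at a fixed QPS can be sketched as follows; send_read is a stand-in for a lightweight online serving request:

```python
import time

def keep_alive(send_read, qps=5, duration_s=60):
    """Send lightweight requests at a fixed rate so that server-side
    caches stay warm during otherwise idle periods."""
    interval = 1.0 / qps
    deadline = time.monotonic() + duration_s
    sent = 0
    while time.monotonic() < deadline:
        send_read()
        sent += 1
        time.sleep(interval)
    return sent
```

In practice, run this from a small sidecar or scheduled task during known idle windows, reading a dummy entity so that the traffic doesn't skew metrics for real entities.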
Batch ingestion job fails after six hours
Issue
The batch ingestion job can fail because the read session expires after six hours.
Solution
To avoid the timeout, increase the number of workers to complete the ingestion job within the six hour time limit.
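A back-of-the-envelope way to size the worker count; the throughput figure per worker is hypothetical and should come from measuring your own jobs:

```python
import math

def workers_needed(total_rows, rows_per_worker_per_hour, limit_hours=6):
    """Minimum worker count to finish ingestion inside the
    read-session lifetime (six hours)."""
    capacity_per_worker = rows_per_worker_per_hour * limit_hours
    return math.ceil(total_rows / capacity_per_worker)
```

For example, 1.2 million rows at an assumed 50,000 rows per worker per hour needs at least 4 workers to finish within six hours.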
Resource exceeded error when exporting feature values
Issue
Exporting a high volume of data can fail with a resource exceeded error if the export job exceeds the internal quota.
Solution
To avoid this error, you can configure the time range parameters, start_time and end_time, to process smaller amounts of data at a time. For information about full export, see Full export.
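Splitting the export window into smaller start_time/end_time chunks might look like the following sketch:

```python
from datetime import datetime, timedelta

def time_chunks(start, end, chunk):
    """Yield (start_time, end_time) pairs covering [start, end) in
    windows of at most `chunk`, for use as export time-range
    parameters in successive, smaller export jobs."""
    cursor = start
    while cursor < end:
        upper = min(cursor + chunk, end)
        yield cursor, upper
        cursor = upper
```

Run one export job per pair, passing each pair as the job's start_time and end_time.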
Vertex AI Vizier
When using Vertex AI Vizier, you might encounter the following issues.
Internal error
Issue
An internal error occurs when there is a system error.
Solution
It might be transient. Try to resend the request, and if the error persists, contact support.