Troubleshooting Vertex AI

This page describes troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI.

Troubleshooting steps for some Vertex AI components are documented on separate pages.

AutoML models

This section describes troubleshooting steps that you might find helpful if you run into problems with AutoML.

Missing labels in the test, validation, or training set

Issue

When you use the default data split when training an AutoML classification model, Vertex AI might assign too few instances of a class to a particular set (test, validation, or training), which causes an error during training. This issue more frequently occurs when you have imbalanced classes or a small amount of training data.

Solution

To resolve this issue, add more training data, manually split your data to assign enough classes to every set, or remove the less frequently occurring labels from your dataset. For more information, see About data splits for AutoML models.
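
For example, if you import single-label image classification data from a CSV file, you can control the split manually by adding an ML_USE value as the first column of each row. This is a minimal sketch; the bucket path, file names, and labels are placeholders:

TRAINING,gs://my-bucket/images/daisy_001.jpg,daisy
TRAINING,gs://my-bucket/images/rose_001.jpg,rose
VALIDATION,gs://my-bucket/images/daisy_002.jpg,daisy
TEST,gs://my-bucket/images/rose_002.jpg,rose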

Vertex AI Studio

When working with Vertex AI Studio, you might receive the following errors:

Attempting to tune a model returns Internal error encountered

Issue

You encounter an Internal error encountered error when trying to tune a model.

Solution

Run the following curl command to create an empty Vertex AI dataset. Ensure that you configure your project ID in the command.

PROJECT_ID=PROJECT_ID

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://europe-west4-aiplatform.googleapis.com/ui/projects/$PROJECT_ID/locations/europe-west4/datasets \
-d '{
    "display_name": "test-name1",
    "metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/image_1.0.0.yaml",
    "saved_queries": [{"display_name": "saved_query_name", "problem_type": "IMAGE_CLASSIFICATION_MULTI_LABEL"}]
}'

After the command completes, wait five minutes and try model tuning again.

Error code: 429

Issue

You encounter the following error:

429: The online prediction request quota is exceeded for
PUBLIC_BASE_MODEL_NAME.

Solution

Try again later with backoff. If you still experience errors, contact Vertex AI support.

Error code: 410

Issue

You encounter the following error:

410: The request is missing the required authentication credential. Expected
OAuth 2.0 access token, login cookie, or other valid authentication credential.

Solution

See the Authentication overview to learn more.

Error code: 403

Issue

You encounter the following error:

403: Permission denied.

Solution

Ensure that the account accessing the API has the right permissions.

Vertex AI Pipelines

This section describes troubleshooting steps that you might find helpful if you run into problems with Vertex AI Pipelines.

You don't have permission to act as service account

Issue

When you run your Vertex AI Pipelines workflow, you might encounter the following error message:

You do not have permission to act as service account: SERVICE_ACCOUNT. (or it may not exist).

Solution

This error means that the account that launches the pipeline run isn't allowed to act as the pipeline's service account, or that the service account doesn't have access to the resources it needs.

To resolve this issue, try one of the following:

  • Add the Vertex AI Service Agent role to the service account.
  • Grant the user the iam.serviceAccounts.actAs permission on the service account.

Error Internal error happened

Issue

If your pipeline fails with an Internal error happened message, check Logs Explorer and search for the pipeline's name. You might see an error like the following:

java.lang.IllegalStateException: Failed to validate vpc
network projects/PROJECT_ID/global/networks/VPC_NETWORK.

APPLICATION_ERROR;google.cloud.servicenetworking.v1/ServicePeeringManagerV1.GetConsumerConfig;Reserved
range: 'RANGE_NAME' not found for consumer project:
'PROJECT_ID' network: 'VPC_NETWORK'.
com.google.api.tenant.error.TenantManagerException: Reserved range:
'RANGE_NAME' not found for consumer project

This error means that the VPC peering configuration for Vertex AI references a reserved IP range that has been deleted.

Solution

To resolve this issue, update the VPC peering by running the gcloud services vpc-peerings update command, specifying only reserved IP ranges that still exist.

Invalid OAuth scope or ID token audience provided

Issue

When you run your Vertex AI Pipelines workflow, you encounter the following error message:

google.auth.exceptions.RefreshError: ('invalid_scope: Invalid OAuth scope
or ID token audience provided.', {'error': 'invalid_scope',
'error_description': 'Invalid OAuth scope or ID token audience provided.'})

Solution

This error means that you haven't provided credentials in one of the pipeline's components, or that you didn't use aiplatform.init() to set credentials.

To resolve this issue, set the credentials for the relevant pipeline component, or set the environment credentials and call aiplatform.init() at the beginning of your code:

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = PATH_TO_JSON_KEY
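
The following minimal sketch shows one way to initialize the SDK explicitly with those credentials. PATH_TO_JSON_KEY, PROJECT_ID, and the region shown are placeholders:

from google.cloud import aiplatform
from google.oauth2 import service_account

# Load the service account key explicitly instead of relying only on the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
credentials = service_account.Credentials.from_service_account_file(PATH_TO_JSON_KEY)

# Initialize the Vertex AI SDK with the project, region, and credentials.
aiplatform.init(project="PROJECT_ID", location="us-central1", credentials=credentials)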

Vertex AI Pipelines components require more disk space than 100 GB

Issue

The default disk space allocated to Vertex AI Pipelines components is 100 GB and increasing the disk space isn't supported. See the Public Issue Tracker for this issue.

Solution

For a component to use more than 100 GB of disk space, convert the component to a Vertex AI custom job by using the create_custom_training_job_from_component method from Google Cloud Pipeline Components. With this operator, you can set the machine type and disk size that the component uses.

For an example of how to use this operator, see Vertex AI Pipelines: Custom training with prebuilt Google Cloud Pipeline Components, in the Convert the component to a Vertex AI Custom Job section.
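
The following is a minimal sketch of the conversion, assuming an existing KFP component named train_op and the google_cloud_pipeline_components package; the machine type and disk size shown are placeholders:

from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)

# train_op is your existing pipeline component that needs more disk space.
train_custom_job_op = create_custom_training_job_from_component(
    train_op,
    machine_type="n1-standard-16",  # example machine type
    boot_disk_size_gb=500,          # request more than the 100 GB default
)

# Use train_custom_job_op in your pipeline definition in place of train_op.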

Vertex AI networking issues

This section describes troubleshooting steps that you might find helpful if you run into problems with networking for Vertex AI.

To check whether VPC Service Controls are enabled for the service networking peering connection to your network, run the following command:

gcloud services vpc-peerings get-vpc-service-controls \
  --network YOUR_NETWORK

Workloads can't access endpoints in your VPC network when using privately used public IP ranges for Vertex AI

Issue

Privately used public IP ranges are not imported by default.

Solution

To use privately used public IP ranges, you must enable the import of privately used public IP ranges.

com.google.api.tenant.error.TenantManagerException: Reserved range: xxx not found for consumer project

Issue

You receive errors of the form com.google.api.tenant.error.TenantManagerException: Reserved range: xxx not found for consumer project when running workloads or deploying endpoints.

This occurs when you change the private services access reservations for your workloads; the updated ranges might not yet be registered with the Vertex AI API.

Solution

After you update your private services access allocations, run gcloud services vpc-peerings update for the servicenetworking.googleapis.com service so that the current set of reserved ranges is registered.
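
For example, a command of the following form re-registers the reserved ranges with the service networking connection. The network and range names are placeholders, and --ranges must list all of the reserved ranges that you want associated with the connection:

gcloud services vpc-peerings update \
    --service=servicenetworking.googleapis.com \
    --network=YOUR_NETWORK \
    --ranges=RESERVED_RANGE_NAME \
    --project=YOUR_PROJECT_ID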

Pipeline or job can't access endpoints within your peered VPC network

Issue

Your Vertex AI pipeline times out when it attempts to connect to resources in your VPC network.

Solution

Try the following to resolve the problem:

  • Ensure that you have completed all of the steps in Set up VPC Network Peering.
  • Review the configuration of your peered VPC network. Ensure that your network imports routes from the correct service networking range while your job is running.

    Go to VPC Network Peering

  • Ensure that you have a firewall rule that allows connections from this range to the target in your network.

  • If the peering connection does not import any routes while your job is running, this means the service networking configuration is not being used. This is likely because you completed the peering configuration with a network other than the default network. If this is the case, ensure that you specify your network when you launch a job. Use the fully qualified network name in the following format: projects/$PROJECT_ID/global/networks/$NETWORK_NAME.

    For more information, see the Routes overview.
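
For example, when launching a custom job with the Vertex AI SDK for Python, you can pass the fully qualified network name. This is a minimal sketch in which the project, region, bucket, display name, and worker pool configuration are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1", staging_bucket="gs://MY_BUCKET")

job = aiplatform.CustomJob(
    display_name="my-training-job",
    worker_pool_specs=WORKER_POOL_SPECS,  # your existing worker pool configuration
)

# Use the fully qualified network name so the job runs inside the peered network.
job.run(network="projects/PROJECT_ID/global/networks/NETWORK_NAME")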

Pipeline or job can't reach endpoints in networks beyond your peered VPC network

Issue

Your pipeline or job is unable to reach endpoints in networks beyond your directly peered VPC network.

Solution

By default, your peering configuration only exports routes to the local subnets in your VPC.

Additionally, transitive peering is not supported and only directly peered networks can communicate.

  • To allow Vertex AI to connect through your network and reach endpoints in other networks, you must export your network routes to your peering connection. Edit the configuration of your peered VPC network and enable Export custom routes.

Go to VPC Network Peering
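
You can also enable the export from the command line. The following is a minimal sketch that assumes the peering created for private services access has the default name servicenetworking-googleapis-com; YOUR_NETWORK is a placeholder:

gcloud compute networks peerings update servicenetworking-googleapis-com \
    --network=YOUR_NETWORK \
    --export-custom-routes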

Because transitive peering is not supported, Vertex AI does not learn routes to other peered networks and services, even with Export custom routes enabled. For information about workarounds, see Extending network reachability of Vertex AI Pipelines.

No route to host without route conflicts evident in Google Cloud console

Issue

The only routes you can see in the Google Cloud console are those known to your own VPC as well as the ranges reserved when you complete the VPC Network Peering configuration.

On rare occasions, a Vertex AI job might return a no route to host error when trying to reach an IP address that your VPC is exporting to the Vertex AI network.

This might be because Vertex AI jobs run within a networking namespace in a managed GKE cluster whose IP range conflicts with the target IP. See GKE networking fundamentals for further discussion.

Under these conditions, the workload tries to connect to the IP within its own networking namespace and throws the error if it's unable to reach it.

Solution

Craft your workload to print its local namespace IP addresses and confirm that these don't conflict with any routes that you are exporting over the peering connection. If there is a conflict, pass a list of reservedIpRanges[] in the job parameters that doesn't overlap with any ranges in your VPC network. The job uses these ranges for the workload's internal IP addresses.
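
For example, the following is a hedged sketch of a custom job request body that sets reservedIpRanges. The region, container image, and range name are placeholders, and the worker pool shown is a minimal example:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs \
  -d '{
    "displayName": "job-with-reserved-ranges",
    "jobSpec": {
      "workerPoolSpecs": [{
        "machineSpec": {"machineType": "n1-standard-4"},
        "replicaCount": 1,
        "containerSpec": {"imageUri": "TRAINING_IMAGE_URI"}
      }],
      "reservedIpRanges": ["RESERVED_RANGE_NAME"]
    }
  }'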

RANGES_EXHAUSTED, RANGES_NOT_RESERVED

Issue

Errors of the form RANGES_EXHAUSTED, RANGES_NOT_RESERVED, and RANGES_DELETED_LATER indicate a problem with the underlying VPC Network Peering configuration. These are networking errors, not errors from the Vertex AI service itself.

Solution

When faced with a RANGES_EXHAUSTED error, first determine whether the complaint is valid.

  • Visit Network Analyzer in the Google Cloud console and look for insights of the form "Summary of IP address allocation" for the VPC network. If these indicate that the allocation is at or near 100%, add a new range to the reservation.
  • Also consider the maximum number of parallel jobs that can run with a reservation of a given size.

For more information, see Service Infrastructure Validation Errors.

If the error persists, contact support.

Router status is temporarily unavailable

Issue

When you launch Vertex AI Pipelines, you receive an error message similar to the following:

Router status is temporarily unavailable. Please try again later

Solution

The error message indicates that this is a temporary condition. Try launching Vertex AI Pipelines again.

If the error persists, contact support.

Vertex AI prediction

This section describes troubleshooting steps that you might find helpful if you run into problems with Vertex AI prediction.

Exceeded retries error

Issue

You get an error such as the following when running batch prediction jobs, indicating that the machine running the custom model might not be able to complete the predictions within the time limit.

('Post request fails. Cannot get predictions. Error: Exceeded retries: Non-OK
result 504 (upstream request timeout) from server, retry=3, elapsed=600.04s.', 16)

The Vertex AI prediction service registers itself with the Google Front End (GFE) service, which proxies connections from the client to the Vertex AI Prediction API.

If the GFE doesn't receive a response from the API within 10 minutes, it times out the connection and returns a 500 HTTP response code to the client.

Solution

To resolve this issue, try either of the following:

  • Increase the number of compute nodes, or change the machine type.
  • Craft your prediction container to send periodic 102 HTTP response codes. This resets the 10-minute timer on the Google Front End service.

Project already linked to VPC

Issue

When deploying an endpoint, you might see an error message such as the following, which indicates that your Vertex AI endpoints have previously used a Virtual Private Cloud network and the resources were not appropriately cleaned up.

Currently only one VPC network per user project is supported. Your project is
already linked to "projects/YOUR_SHARED_VPC_HOST_PROJECT/global/networks/YOUR_SHARED_VPC_NETWORK".
To change the VPC network, please undeploy all Vertex AI deployment resources,
delete all endpoint resources, and then retry creating resources in 30 mins.

Solution

To resolve this issue, run the following command in Cloud Shell:

gcloud services vpc-peerings delete \
    --service=servicenetworking.googleapis.com \
    --network=YOUR_SHARED_VPC_NETWORK \
    --project=YOUR_SHARED_VPC_HOST_PROJECT

This manually disconnects your old VPC network from the Service Networking VPC.

Unexpected deployment failure or endpoint deletion

Issue

A model deployment unexpectedly fails, an endpoint is found to be deleted, or a previously deployed model has become undeployed.

Your billing account may be invalid. If it remains invalid for a long time, some resources might be removed from the projects associated with your account. For example, your endpoints and models might be deleted. Removed resources aren't recoverable.

Solution

To resolve this issue, make sure that your project is linked to an active billing account. For more information, see Billing questions.

Vertex AI custom service account issues

This section describes troubleshooting steps that you might find helpful if you run into problems with service accounts.

Model deployment fails with service account serviceAccountAdmin error

Issue

Your model deployment fails with an error such as the following:

Failed to deploy model MODEL_NAME to endpoint ENDPOINT_NAME due to the error: Failed to add IAM policy binding. Please grant SERVICE_ACC_NAME@gcp-sa-aiplatform.iam.gserviceaccount.com the iam.serviceAccountAdmin role on service account vertex-prediction-role@PROJECT_INFO.iam.gserviceaccount.com

Solution

This error means that your custom service account might not have been configured correctly. To create a custom service account with the correct IAM permissions, see Use a custom service account.

Unable to fetch identity token when using custom service account

Issue

When using a custom service account, training jobs that run on a single replica can't reach the Compute Engine metadata service endpoint that's required to retrieve an identity token.

You see an error similar to the following:

Failed to refresh jwt, retry number 0: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=...&format=full
from the Google Compute Engine Metadata service. Status: 404 Response:
\nb'Not Found\n'", <google.auth.transport.requests._Response object at
0x7fb19f058c50>)

Solution

To fetch the identity token with a custom service account, you must use iamcredentials.googleapis.com.
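
The following is a minimal sketch using the google-auth library, which calls iamcredentials.googleapis.com under the hood. It assumes that the custom service account can impersonate itself (that is, it has the Service Account Token Creator role on itself); SERVICE_ACCOUNT_EMAIL and AUDIENCE are placeholders:

import google.auth
from google.auth import impersonated_credentials
from google.auth.transport.requests import Request

# The job's own credentials (access tokens still work; only the metadata
# identity endpoint is unavailable).
source_credentials, _ = google.auth.default()

target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal="SERVICE_ACCOUNT_EMAIL",
    target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# Mint an ID token through the IAM Credentials API instead of the
# Compute Engine metadata server.
id_token_credentials = impersonated_credentials.IDTokenCredentials(
    target_credentials,
    target_audience="AUDIENCE",
    include_email=True,
)
id_token_credentials.refresh(Request())
print(id_token_credentials.token)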

Custom-trained models

This section describes troubleshooting steps that you might find helpful if you run into problems with custom-trained models.

Custom training issues

The following issues can occur during custom training. The issues apply to CustomJob and HyperparameterTuningJob resources, including those created by TrainingPipeline resources.

Error code: 400

Issue

You encounter the following error:

400 Machine type MACHINE_TYPE is not supported.

You may see this error message if the selected machine type isn't supported for Vertex AI training, or if a specific resource isn't available in the selected region.

Solution

Use only machine types that are supported for training and available in the selected region.

Replica exited with a non-zero status code

Issue

During distributed training, an error from any worker causes training to fail.

Solution

To check the stack trace for the worker, view your custom training logs in the Google Cloud console.

View the other troubleshooting topics to fix common errors and then create a new CustomJob, HyperparameterTuningJob, or TrainingPipeline resource. In many cases, the error codes are caused by problems in your training code, not by the Vertex AI service. To determine if this is the case, you can run your training code on your local machine or on Compute Engine.

Replica ran out of memory

Issue

An error can occur if a training virtual machine (VM) instance runs out of memory during training.

Solution

You can view the memory usage of your training VMs in the Google Cloud console.

Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application that run on the VM also consume resources. For machine types that have less memory, other services might consume a relatively large percentage of memory. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory.

You can optimize the memory consumption of your training application, or you can choose a larger machine type with more memory.

Insufficient resources in a region

Issue

You encounter a stockout issue in a region.

Solution

Vertex AI trains your models by using Compute Engine resources. Vertex AI can't schedule your workload if Compute Engine is at capacity for a certain CPU or GPU in a region. This issue is unrelated to your project quota.

When Compute Engine is at capacity, Vertex AI automatically retries your CustomJob or HyperparameterTuningJob up to three times. The job fails if all retries fail.

A stockout usually occurs when you are using GPUs. If you encounter this error when using GPUs, try switching to a different GPU type. If you can use another region, try training in a different region.

Permission error when accessing another Google Cloud service

If you encounter a permission error when accessing another Google Cloud service from your training code (for example, google.api_core.exceptions.PermissionDenied: 403), then the service account that your training job runs as is probably missing the IAM permissions required by that service. Grant the required roles to the custom service account that you specified for the job or, if you didn't specify one, to the Vertex AI Custom Code Service Agent for your project.

Internal error

Issue

Your training failed because of a system error.

Solution

The issue might be transient; try to resubmit the CustomJob, HyperparameterTuningJob, or TrainingPipeline. If the error persists, contact support.

Error code 500 when using a custom container image

Issue

You see a 500 error in your logs.

Solution

This type of error is likely to be a problem with your custom container image and not a Vertex AI error.

Service account can't access Cloud Storage bucket when deploying to an endpoint

Issue

When you try to deploy a model to an endpoint and your service account doesn't have storage.objects.list access to the related Cloud Storage bucket, you might see the following error:

custom-online-prediction@TENANT_PROJECT_ID.iam.gserviceaccount.com
does not have storage.objects.list access to the Cloud Storage bucket.

By default, the custom container that deploys your model uses a service account that doesn't have access to your Cloud Storage bucket.

Solution

To resolve this, try one of the following:

  • Copy the file that you are trying to access from the container into the model artifacts when you upload the model. Vertex AI copies it to a location that the default service account can access, as it does for all other model artifacts.

  • Copy the file into the container as part of the container build process.

  • Specify a custom service account.
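
For the last option, the following is a sketch of deploying with a custom service account from the gcloud CLI; the endpoint, model, region, machine type, and service account values are placeholders:

gcloud ai endpoints deploy-model ENDPOINT_ID \
    --region=LOCATION \
    --model=MODEL_ID \
    --display-name=DEPLOYED_MODEL_NAME \
    --machine-type=n1-standard-4 \
    --service-account=SA_NAME@PROJECT_ID.iam.gserviceaccount.com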

Neural Architecture Search

Known issues

  • After cancelling the NAS job, the main job (the parent) stops, but some of the child trials keep showing a Running state. Ignore the child trial state that shows Running in this case. The trials have stopped, but the UI continues to show the Running state. As long as the main job has stopped, you won't be charged extra.
  • After reporting rewards in the trainer, wait (sleep) for 10 minutes before the trial jobs exit.
  • When using Cloud Shell to run TensorBoard, the generated output link might not work. In this case, write down the port number, use the Web Preview tool, and select the correct port number to display the plots.

  • If you see error messages like the following in the trainer logs:

    gcsfuse errors: fuse: writeMessage: no such file or directory [16 0 0 0 218 255 255 255 242 25 111 1 0 0 0 0]
    

    use a machine with more RAM, because an OOM condition is causing this error.

  • If your custom trainer can't find the job directory flag, define the flag as job_dir with an underscore rather than job-dir with a hyphen. A note in tutorial-1 explains this.

  • NaN error during training: There might be NaN errors in the training job, such as NaN : Tensor had NaN values. The learning rate might be too large for the suggested architecture. For more information, see Out-of-memory (OOM) and learning rate related errors.

  • OOM error during training: There might be OOM (out-of-memory) errors in the training job. The batch size might be too large for the accelerator memory. For more information, see Out-of-memory (OOM) and learning rate related errors.

  • Proxy-task model selection controller job dies: In the rare case that the proxy-task model selection controller job dies, you can resume the job by following these steps.

  • Proxy-task search controller job dies: In the rare case that the proxy-task search controller job dies, you can resume the job by following these steps.

  • Service account does not have permission to access Artifact Registry or bucket: If you get an error such as Vertex AI Service Agent service-123456789@gcp-sa-aiplatform-cc.iam.gserviceaccount.com does not have permission to access Artifact Registry repository projects/my-project/locations/my-region/repositories/nas, or a similar error for bucket access, grant this service account a storage editor role in your project.
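
For example, commands of the following form grant the service agent from the error message access to the Artifact Registry repository and to a Cloud Storage bucket. The project, location, repository, bucket, and roles shown are assumptions; adjust them to your setup:

gcloud artifacts repositories add-iam-policy-binding nas \
    --project=my-project \
    --location=my-region \
    --member="serviceAccount:service-123456789@gcp-sa-aiplatform-cc.iam.gserviceaccount.com" \
    --role="roles/artifactregistry.reader"

gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET \
    --member="serviceAccount:service-123456789@gcp-sa-aiplatform-cc.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"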

Vertex AI Feature Store

This section describes troubleshooting steps that you might find helpful if you run into problems with Vertex AI Feature Store.

Resource not found error when sending a streaming ingestion or online serving request

Issue

After you set up a featurestore, entity type, or feature resources, there's a delay before those resources are propagated to the FeaturestoreOnlineServingService service. Sometimes this delayed propagation might cause a resource not found error when you submit a streaming ingestion or online serving request immediately after you create a resource.

Solution

If you receive this error, wait a few minutes and then try your request again.

Batch ingestion succeeded for newly created features but online serving request returns empty values

Issue

For newly created features only, there is a delay before those features are propagated to the FeaturestoreOnlineServingService service. The features and values exist but take time to propagate. This might result in your online serving request returning empty values.

Solution

If you do see this inconsistency, wait a few minutes and then try your online serving request again.

CPU utilization is high for an online serving node

Issue

Your CPU utilization for an online serving node is high.

Solution

To mitigate this issue, increase the number of online serving nodes, either by manually increasing the node count or by enabling autoscaling. Note that even if autoscaling is enabled, Vertex AI Feature Store needs time to rebalance the data when nodes are added or removed. For information about how to view feature value distribution metrics over time, see View feature value metrics.

CPU utilization is high for the hottest online serving node

Issue

The CPU utilization for your hottest online serving node is high.

Solution

To mitigate this issue, either increase the number of online serving nodes or change the entity access pattern to pseudo-random. Setting the entity access pattern to pseudo-random reduces high CPU utilization that results from frequently accessing entities that are located near each other in the featurestore. If neither solution is effective, implement a client-side cache to avoid repeatedly accessing the same entities.

Online serving latency is high when QPS is low

Issue

A period of inactivity or low activity at low QPS can cause some server-side caches to expire. This can result in high latency when traffic to online serving nodes resumes at regular or higher QPS.

Solution

To mitigate this issue, you need to keep the connection active by sending artificial traffic of at least 5 QPS to the featurestore.

Batch ingestion job fails after six hours

Issue

The batch ingestion job can fail because the read session expires after six hours.

Solution

To avoid the timeout, increase the number of workers so that the ingestion job completes within the six-hour time limit.

Resource exceeded error when exporting feature values

Issue

Exporting a high volume of data can fail with a resource exceeded error if the export job exceeds the internal quota.

Solution

To avoid this error, you can configure the time range parameters, start_time and end_time, to process smaller amounts of data at a time. For information about full export, see Full export.

Vertex AI Vizier

When using Vertex AI Vizier, you might encounter the following issues.

Internal error

Issue

An internal error occurs when there is a system error within Vertex AI Vizier.

Solution

The issue might be transient. Try resending the request, and if the error persists, contact support.