Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI.

AutoML models

Missing labels in the test, validation, or training set

When you use the default data split while training an AutoML classification model, Vertex AI might assign too few instances of a class to a particular set (test, validation, or training), which causes an error during training. This issue occurs more frequently when you have imbalanced classes or a small amount of training data. To resolve this issue, add more training data, manually split your data so that each set contains enough instances of every class, or remove the less frequently occurring labels from your dataset. For more information, see About data splits for AutoML models.

Custom-trained models

Custom training issues

The following issues can occur during custom training. The issues apply to CustomJob and HyperparameterTuningJob resources, including those created by TrainingPipeline resources.

Replica exited with a non-zero status code

During distributed training, an error from any worker causes training to fail. To check the stack trace for the worker, view your custom training logs in the Google Cloud Console.

Review the other troubleshooting topics to fix common errors, and then create a new CustomJob, HyperparameterTuningJob, or TrainingPipeline resource. In many cases, non-zero exit codes are caused by problems in your training code rather than by the Vertex AI service. To determine whether this is the case, you can run your training code on your local machine or on Compute Engine.
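For example, a minimal local run might look like the following. The package layout, module name (trainer.task), and flags are hypothetical; substitute your own entry point and arguments.

# Install your training package's dependencies, then run the trainer
# directly so that errors surface with a full stack trace.
pip install -r requirements.txt
python3 -m trainer.task --epochs=1

Running a reduced workload locally (for example, a single epoch on a small data sample) is usually enough to surface import errors, bad argument parsing, and similar failures quickly.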

Replica ran out of memory

This error occurs if a training virtual machine (VM) instance runs out of memory during training. You can view the memory usage of your training VMs in the Cloud Console.

Even when you get this error, you might not see 100% memory usage on the VM, because services other than your training application that run on the VM also consume resources. For machine types that have less memory, other services might consume a relatively large percentage of memory. For example, on an n1-standard-4 VM, services can consume up to 40% of the memory.

You can optimize the memory consumption of your training application, or you can choose a larger machine type with more memory.
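If the training VM is still running, you can also inspect memory usage directly from a terminal on the VM. The following is a minimal sketch using standard Linux tools:

# Total, used, and available memory in human-readable units.
free -h

# Running processes, sorted by memory usage.
top -o %MEM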

Insufficient resources in a region

Vertex AI trains your models by using Compute Engine resources. Vertex AI cannot schedule your workload if Compute Engine is at capacity for a certain CPU or GPU in a region. This issue is also known as a stockout, and it is unrelated to your project quota.

When Compute Engine is at capacity, Vertex AI automatically retries your CustomJob or HyperparameterTuningJob up to three times. The job fails if all retries fail.

A stockout usually occurs when you are using GPUs. If you encounter this error when using GPUs, try switching to a different GPU type. If you can use another region, try training in a different region.
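For example, when you create a CustomJob with the gcloud CLI, the region and the accelerator type are both part of the job specification, so switching either one is a matter of changing a flag value. The display name, machine type, container image, and region in this sketch are placeholders:

gcloud ai custom-jobs create \
  --region=europe-west4 \
  --display-name=my-training-job \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,container-image-uri=gcr.io/my-project/my-trainer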

Permission error when accessing another Google Cloud service

If you encounter a permission error when accessing another Google Cloud service from your training code (for example, google.api_core.exceptions.PermissionDenied: 403), the service account that your training job runs as likely lacks the IAM permissions that it needs to access that service. Grant the missing role to that service account, or configure your job to run as a custom service account that has the necessary access.

Internal error

This error occurs if training failed because of a system error. The issue might be transient; try to resubmit the CustomJob, HyperparameterTuningJob, or TrainingPipeline. If the error persists, contact support.

Vertex Feature Store

Resource not found error when sending an online serving request

Immediately after you create a featurestore, entity type, or feature resource, there is a delay before the new resource propagates to the FeaturestoreOnlineServingService. In some cases, this delayed propagation can cause a resource not found error if you submit an online serving request immediately after resource creation. If you receive this error, wait a few minutes and then retry your online serving request.
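If you script your requests, a simple retry loop can absorb this propagation delay. The following bash sketch is illustrative only; your_online_serving_request is a placeholder for whatever command or script issues your online serving request:

# Retry a hypothetical serving request up to 5 times, pausing 60
# seconds between attempts to allow propagation to complete.
for attempt in 1 2 3 4 5; do
  if your_online_serving_request; then
    break
  fi
  echo "Attempt ${attempt} failed; retrying in 60s..." >&2
  sleep 60
done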

Batch ingestion succeeded for newly created features but online serving request returns empty values

For newly created features, there is a delay before the features propagate to the FeaturestoreOnlineServingService. The features and their values exist but take time to propagate. If you see this inconsistency, wait a few minutes and then retry your online serving request.

Vertex Vizier

When using Vertex Vizier, you might encounter the following issues.

Internal error

An internal error occurs when there is a system error. It might be transient. Try resending the request; if the error persists, contact support.

Vertex AI Workbench: User-managed notebooks

Connecting to and opening JupyterLab

Nothing happens after clicking Open JupyterLab

Verify that your browser does not block pop-up tabs. JupyterLab opens in a new browser tab.

No Inverting Proxy server access to JupyterLab

Vertex AI Workbench uses a Google internal Inverting Proxy server to provide access to JupyterLab. User-managed notebooks instance settings, network configuration, and other factors can prevent access to JupyterLab. Use SSH to connect to your instance so that you can reach JupyterLab directly and investigate why access through the Inverting Proxy isn't working.

Unable to SSH into user-managed notebooks instance

User-managed notebooks instances use OS Login to enable SSH access. OS Login is enabled automatically when a user-managed notebooks instance is created, by setting the enable-oslogin metadata entry to TRUE. To enable SSH access to user-managed notebooks for specific users, complete the steps for configuring OS Login roles on user accounts.
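To confirm that the metadata entry is set on your instance's VM, you can describe the VM and search for the key. INSTANCE_NAME and ZONE are placeholders:

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE | grep -A 1 enable-oslogin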

Opening a notebook results in a 403 (Forbidden) error

There are three ways to access JupyterLab notebooks:

  • Single User
  • Service Account
  • Project Editors

The access mode is configured during user-managed notebooks instance creation and is defined in the instance's metadata:

  • Single User: proxy-mode=mail, proxy-user-mail=user@domain.com
  • Service Account: proxy-mode=service_account
  • Project Editors: proxy-mode=project_editors

If you can't access a notebook when you click Open JupyterLab, try the following:

  • Verify that the proxy-mode metadata entry is correct, as shown in the example after this list.

  • Verify that the user accessing the instance has the iam.serviceAccounts.actAs permission for the defined service account. The service account on the instance provides access to other Google Cloud services. You can use any service account within the same project, but you must have the Service Account User permission (iam.serviceAccounts.actAs) to access the instance. If no service account is specified, the Compute Engine default service account is used, and the permission is required for that account as well.
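To check the proxy-mode metadata entry, you can describe the instance's VM. INSTANCE_NAME and ZONE are placeholders:

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE | grep proxy-mode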

The following example shows how to specify a service account when you create an instance:

gcloud notebooks instances create nb-1 \
  --vm-image-family=tf2-latest-cpu \
  --metadata=proxy-mode=mail,proxy-user-mail=user@domain.com \
  --service-account=your_service_account@project_id.iam.gserviceaccount.com \
  --location=us-west1-a

When you click Open JupyterLab to open a notebook, the notebook opens in a new browser tab. If you are signed in to more than one Google account, the new tab opens with your default Google account. If you did not create your user-managed notebooks instance with your default Google account, the new browser tab shows a 403 (Forbidden) error.

Opening a notebook results in a 504 (Gateway Timeout) error

This is an indication of an internal proxy timeout or a backend server (Jupyter) timeout. This can occur when:

  • The request never reached the internal Inverting Proxy server.
  • The backend (Jupyter) returns a 504 error.

If you can't access a notebook, open a Google support case.

Opening a notebook results in a 524 (A Timeout Occurred) error

The internal Inverting Proxy server hasn't received a response from the Inverting Proxy agent for the request within the timeout period. The Inverting Proxy agent runs inside your user-managed notebooks instance as a Docker container. A 524 error usually indicates that the Inverting Proxy agent isn't connecting to the Inverting Proxy server, or that requests are taking too long on the backend server side (Jupyter). A typical cause is on the user side, for example a networking issue, or the Inverting Proxy agent or Jupyter service isn't running.

If you can't access a notebook, verify that your user-managed notebooks instance is started and try the following:

Option 1: Run the diagnostic tool to automatically check and repair user-managed notebooks core services, verify available storage, and generate useful log files. To run the tool in your instance, perform the following steps:

  1. Make sure that your instance is on version M58 or newer.

  2. Connect to your Deep Learning VM Images instance using SSH.

  3. Run the following command:

       sudo /opt/deeplearning/bin/diagnostic_tool.sh [--repair] [--bucket=$BUCKET]
       

    Note that the --repair and --bucket flags are optional. The --repair flag attempts to fix common core service errors, and the --bucket flag lets you specify a Cloud Storage bucket in which to store the generated log files.

    The output of this command displays useful status messages for user-managed notebooks core services and exports log files of its findings.
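For example, a hedged invocation that attempts repairs and exports logs to a bucket:

# BUCKET is a placeholder; confirm whether the tool expects a plain
# bucket name or a gs:// URI in your environment.
sudo /opt/deeplearning/bin/diagnostic_tool.sh --repair --bucket=$BUCKET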

Option 2: Use the following steps to check specific user-managed notebooks requirements individually.

Opening a notebook results in a 598 (Network read timeout) error

If the Inverting Proxy server hasn't heard from the Inverting Proxy agent at all for more than 10 minutes, this strongly indicates an Inverting Proxy agent or Jupyter issue.

If you can't access a notebook, verify that your instance is started, and then use the steps in Helpful procedures later on this page to verify that the Inverting Proxy agent is running, restart the Inverting Proxy agent, and restart the Jupyter service.

Notebook is unresponsive

If your user-managed notebooks instance isn't executing cells or appears to be frozen, first try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Any unsaved cell output will not persist, so you must run those cells again to regenerate the output.
  • From a terminal session in the notebook, run the command top to see if there are processes consuming the CPU.
  • From the terminal, check the amount of free disk space using the command df, or check the available RAM using the command free.
  • Shut your instance down by selecting it from the Notebook instances page and clicking Stop. Once it has stopped completely, select it and click Start.

Working with files

Downloading files from JupyterLab results in 403 (Forbidden) error

The notebook package in the M23 release of Deep Learning VM includes a bug that prevents you from downloading a file using the JupyterLab UI. You can read more about the bug at Cannot download files after JL update and Download file functionality is broken in notebook packages version 5.7.6+ (5.7.7, 5.7.8).

If you are using the M23 release of Deep Learning VM, you can resolve the issue in one of two ways:

  • Use a Safari browser. The download functionality works for Safari.

  • Downgrade your notebook package to version 5.7.5.

    To downgrade your notebook package:

    1. Connect to your Deep Learning VM using SSH. For information on connecting to a VM using SSH, see Connecting to instances.

    2. Run the following commands:

      sudo pip3 install notebook==5.7.5
      sudo service jupyter restart
      

After restarting the VM, local files cannot be referenced from the notebook terminal

Sometimes after restarting a user-managed notebooks instance, local files cannot be referenced from within a notebook terminal.

This is a known issue. To reference your local files from within a notebook terminal, first re-establish your current working directory using the following command:

cd PWD

In this command, replace PWD with your current working directory. For example, if your current working directory was /home/jupyter/, use the command cd /home/jupyter/.

After re-establishing your current working directory, your local files can be referenced from within the notebook terminal.

GPU quota has been exceeded

Determine the number of GPUs available in your project by checking the quotas page. If GPUs are not listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Requesting additional quota on the Compute Engine Resource Quotas page.
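You can also inspect a region's GPU quotas from the command line. The region in this sketch is a placeholder, and the grep pattern assumes that GPU quota metric names contain the string GPUS:

gcloud compute regions describe us-central1 | grep -B 1 -A 1 GPUS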

Creating user-managed notebooks instances

New user-managed notebooks instance is not created (insufficient permissions)

It usually takes about a minute to create a user-managed notebooks instance. If your new user-managed notebooks instance remains in the pending state indefinitely, it might be because the service account used to start the user-managed notebooks instance does not have the required Editor permission in your Google Cloud Platform (GCP) project.

You can start a user-managed notebooks instance with a custom service account that you create, or in single-user mode with a userid. If you start a user-managed notebooks instance in single-user mode, your user-managed notebooks instance begins the boot process using the Compute Engine default service account before turning control over to your userid.

To verify that a service account has the appropriate permissions, follow these steps:

Console

  1. Open the IAM page in the Cloud Console.

  2. Determine the service account used with your user-managed notebooks instance, which is one of the following:

    • A custom service account that you specified when you created your user-managed notebooks instance.

    • The Compute Engine default service account for your GCP project, which is used when you start your user-managed notebooks instance in single-user mode. The Compute Engine default service account for your GCP project is named PROJECT_NUMBER-compute@developer.gserviceaccount.com. For example: 113377992299-compute@developer.gserviceaccount.com.

  3. Verify that your service account has the Editor role.

  4. If it doesn't, edit the IAM policy to grant the Editor role to the service account.

For more information, see Granting, changing, and revoking access to resources in the IAM documentation.

gcloud

  1. If you have not already, install the gcloud command-line tool.

  2. Get the name and project number for your GCP project with the following command. Replace PROJECT_ID with the project ID for your GCP project.

    gcloud projects describe PROJECT_ID
    

    You should see output similar to the following, which displays the name (name:) and project number (projectNumber:) for your project.

    createTime: '2018-10-18T21:03:31.408Z'
    lifecycleState: ACTIVE
    name: my-project-name
    parent:
     id: '396521612403'
     type: folder
    projectId: my-project-id-1234
    projectNumber: '113377992299'
    
  3. Determine the service account used with your user-managed notebooks instance, which is one of the following:

    • A custom service account that you specified when you created your user-managed notebooks instance.

    • The Compute Engine default service account for your GCP project, which is used when you start your user-managed notebooks instance in single-user mode. The Compute Engine default service account for your GCP project is named PROJECT_NUMBER-compute@developer.gserviceaccount.com. For example: 113377992299-compute@developer.gserviceaccount.com.

  4. Add the roles/editor role to the service account with the following command. Replace PROJECT_ID with your project ID, and replace SERVICE_ACCOUNT_EMAIL with the email address of the service account for your user-managed notebooks instance.

    gcloud projects add-iam-policy-binding PROJECT_ID \
     --member serviceAccount:SERVICE_ACCOUNT_EMAIL \
     --role roles/editor
    

Creating an instance results in a "Permission denied" error

When creating a new instance, verify that the user creating the instance has the iam.serviceAccounts.actAs permission for the defined service account.

The service account on the instance provides access to other Google Cloud services. You can use any service account within the same project, but you must have the Service Account User permission (iam.serviceAccounts.actAs) to create the instance. If not specified, the Compute Engine default service account is used.
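For example, the following command grants the Service Account User role to a user on a specific service account. Both email addresses are placeholders:

gcloud iam service-accounts add-iam-policy-binding SERVICE_ACCOUNT_EMAIL \
  --member="user:USER_EMAIL" \
  --role="roles/iam.serviceAccountUser"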

The following example shows how to specify a service account when you create an instance:

gcloud notebooks instances create nb-1 \
  --vm-image-family=tf2-latest-cpu \
  --service-account=your_service_account@project_id.iam.gserviceaccount.com \
  --location=us-west1-a

To grant the Service Account User permission, see Allowing a member to impersonate a single service account.

Creating a new instance results in an "already exists" error

When creating a new instance, verify that a user-managed notebooks instance with the same name wasn't previously deleted through Compute Engine while its record still exists in the Notebooks API database.

The following example shows how to list instances using the Notebooks API and verify their state.

gcloud notebooks instances list --location=LOCATION

If an instance's state is DELETED, run the following command to delete it permanently.

gcloud notebooks instances delete INSTANCE_NAME --location=LOCATION

Upgrading user-managed notebooks instances

Unable to upgrade because unable to get instance disk information

Upgrade is not supported for single-disk user-managed notebooks instances. You might want to migrate your user data to a new user-managed notebooks instance.

Unable to upgrade because instance is not UEFI compatible

Vertex AI Workbench depends on UEFI compatibility to complete an upgrade.

User-managed notebooks instances created from some older images are not UEFI compatible, and therefore cannot be upgraded.

To verify that your instance is UEFI compatible, type the following command in either Cloud Shell or any environment where the Cloud SDK is installed.

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE | grep type

Replace the following:

  • INSTANCE_NAME: the name of your instance
  • ZONE: the zone where your instance is located
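Because the command filters for lines that contain type, the output can include other entries, such as disk types. If the instance is UEFI compatible, one line of the output should be the following (assuming the standard guest OS feature name):

- type: UEFI_COMPATIBLE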

To verify that the image that you used to create your instance is UEFI compatible, use the following command:

gcloud compute images describe-from-family VM_IMAGE_FAMILY \
  --project deeplearning-platform-release | grep type

Replace VM_IMAGE_FAMILY with the image family name that you used to create your instance.

If you determine that either your instance or image is not UEFI compatible, you can attempt to migrate your user data to a new user-managed notebooks instance. To do so, complete the following steps:

  1. Verify that the image that you want to use to create your new instance is UEFI compatible. To do so, type the following command in either Cloud Shell or any environment where the Cloud SDK is installed.

    gcloud compute images describe-from-family VM_IMAGE_FAMILY \
      --project deeplearning-platform-release --format=json | grep type
    

    Replace VM_IMAGE_FAMILY with the image family name that you want to use to create your instance.

  2. Migrate your user data to a new user-managed notebooks instance.

User-managed notebooks instance is not accessible after upgrade

User-managed notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.

If the user-managed notebooks instance is not accessible after an upgrade, there may have been a failure during the replacement of the boot disk's image. Complete the following steps to attach a new valid image to the boot disk.

  1. To store values you'll use to complete this procedure, type the following command in either Cloud Shell or any environment where the Cloud SDK is installed.

    export INSTANCE_NAME=MY_INSTANCE_NAME
    export PROJECT_ID=MY_PROJECT_ID
    export ZONE=MY_ZONE
    

    Replace the following:

    • MY_INSTANCE_NAME: the name of your instance
    • MY_PROJECT_ID: your project ID
    • MY_ZONE: the zone where your instance is located
  2. Use the following command to stop the instance:

    gcloud compute instances stop $INSTANCE_NAME \
      --project=$PROJECT_ID --zone=$ZONE
    
  3. Detach the data disk from the instance.

    gcloud compute instances detach-disk $INSTANCE_NAME --device-name=data \
      --project=$PROJECT_ID --zone=$ZONE
    
  4. Delete the instance's VM.

    gcloud compute instances delete $INSTANCE_NAME --keep-disks=all --quiet \
      --project=$PROJECT_ID --zone=$ZONE
    
  5. Use the Notebooks API to delete the user-managed notebooks instance.

    gcloud notebooks instances delete $INSTANCE_NAME \
      --project=$PROJECT_ID --location=$ZONE
    
  6. Create a new user-managed notebooks instance using the same name as your previous instance.

    gcloud notebooks instances create $INSTANCE_NAME \
      --vm-image-project="deeplearning-platform-release" \
      --vm-image-family=MY_VM_IMAGE_FAMILY \
      --instance-owners=MY_INSTANCE_OWNER \
      --machine-type=MY_MACHINE_TYPE \
      --service-account=MY_SERVICE_ACCOUNT \
      --accelerator-type=MY_ACCELERATOR_TYPE \
      --accelerator-core-count=MY_ACCELERATOR_CORE_COUNT \
      --install-gpu-driver \
      --project=$PROJECT_ID \
      --location=$ZONE
    

    Replace the following:

    • MY_VM_IMAGE_FAMILY: the image family name
    • MY_INSTANCE_OWNER: your instance owner
    • MY_MACHINE_TYPE: the machine type of your instance's VM
    • MY_SERVICE_ACCOUNT: the service account to use with this instance, or use "default"
    • MY_ACCELERATOR_TYPE: the accelerator type; for example, "NVIDIA_TESLA_K80"
    • MY_ACCELERATOR_CORE_COUNT: the core count; for example, 1

Monitoring health status of user-managed notebooks instances

docker-proxy-agent status failure

Follow these steps after a docker-proxy-agent status failure:

  1. Verify that the Inverting Proxy agent is running. If not, go to step 3.

  2. Restart the Inverting Proxy agent.

  3. Re-register with the Inverting Proxy server.

docker-service status failure

Follow these steps after a docker-service status failure:

  1. Verify that the Docker service is running.

  2. Restart the Docker service.

jupyter-service status failure

Follow these steps after a jupyter-service status failure:

  1. Verify that the Jupyter service is running.

  2. Restart the Jupyter service.

jupyter-api status failure

Follow these steps after a jupyter-api status failure:

  1. Verify that the Jupyter internal API is active.

  2. Restart the Jupyter service.

Boot disk utilization percent

The boot disk space status is unhealthy if the disk is more than 85% full.

If your boot disk space status is unhealthy, try the following:

  1. From a terminal session in the user-managed notebooks instance or using ssh to connect, check the amount of free disk space using the command df -H.

  2. Use the command find . -type f -size +100M to help you find large files that you may be able to delete, but do not delete them unless you are sure you can safely do so. If you aren't sure, you can get help from support. For another approach, see the du example after these steps.

  3. If the previous steps do not solve your problem, get support.
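As an alternative to find, you can summarize the largest directories and then drill down. This sketch assumes GNU versions of du and sort, which are standard on Deep Learning VM images:

# Show the 20 largest directories up to two levels below the current path.
sudo du -h --max-depth=2 . 2>/dev/null | sort -rh | head -20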

Data disk utilization percent

The data disk space status is unhealthy if the disk is more than 85% full.

If your data disk space status is unhealthy, try the following:

  1. From a terminal session in the user-managed notebooks instance or using ssh to connect, check the amount of free disk space using the command df -h -T /home/jupyter.

  2. Delete large files to increase the available disk space. Use the command find . -type f -size +100M to help you find large files.

  3. If the previous steps do not solve your problem, get support.

Helpful procedures

Use SSH to connect to your user-managed notebooks instance

Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Cloud SDK is installed. The -L 8080:localhost:8080 option forwards local port 8080 to port 8080 on the instance, so after connecting you can reach JupyterLab at http://localhost:8080.

gcloud compute ssh --project PROJECT_ID \
  --zone ZONE \
  INSTANCE_NAME -- -L 8080:localhost:8080

Replace the following:

  • PROJECT_ID: Your project ID
  • ZONE: The Google Cloud zone where your instance is located
  • INSTANCE_NAME: The name of your instance

Re-register with the Inverting Proxy server

To re-register the user-managed notebooks instance with the internal Inverting Proxy server, you can stop and start the VM from the Notebook instances page or you can use ssh to connect to your user-managed notebooks instance and enter:

cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh

Verify the Docker service status

To verify the Docker service status you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker status

Verify that the Inverting Proxy agent is running

To verify if the notebook Inverting Proxy agent is running, use ssh to connect to your user-managed notebooks instance and enter:

# Confirm Inverting Proxy agent Docker container is running (proxy-agent)
sudo docker ps

# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent

# Grab logs
sudo docker logs proxy-agent

Verify the Jupyter service status and collect logs

To verify the Jupyter service status you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter status

To collect Jupyter service logs:

sudo journalctl -u jupyter.service --no-pager

Verify that the Jupyter internal API is active

To verify that the Jupyter internal API is active, use ssh to connect to your user-managed notebooks instance and enter the following command. If the API is active, it returns JSON describing the available kernelspecs.

curl http://127.0.0.1:8080/api/kernelspecs

Restart the Docker service

To restart the Docker service, you can stop and start the VM from the Notebook instances page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker restart

Restart the Inverting Proxy agent

To restart the Inverting Proxy agent, you can stop and start the VM from the Notebook instances page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo docker restart proxy-agent

Restart the Jupyter service

To restart the Jupyter service, you can stop and start the VM from the Notebook instances page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter restart

Migrate your data to a new user-managed notebooks instance

  1. Copy your user data to a Cloud Storage bucket using gsutil. The following example command copies all of the files from the default directory /home/jupyter/ to a Cloud Storage directory.

    gsutil cp -R /home/jupyter/* gs://MY_DIRECTORY/
    

    Replace the following:

    • MY_DIRECTORY: the Cloud Storage directory where you want to store your instance's data
  2. Create a new user-managed notebooks instance using the Create a new instance instructions to make sure that your new instance is registered with the Notebooks API.

  3. Restore your user data on the new instance. The following example command copies all of the files from a Cloud Storage directory to the default directory /home/jupyter/.

    gsutil cp -R gs://MY_DIRECTORY/* /home/jupyter/
    

Make a copy of the user data on your user-managed notebooks instance

To store a copy of your instance's user data in Cloud Storage, complete the following steps:

  1. Use ssh to connect to your user-managed notebooks instance.

  2. Copy the contents of the instance to a Cloud Storage bucket using gsutil. The following example command copies all of the notebook (.ipynb) files from the default directory /home/jupyter/ to a Cloud Storage directory named my-bucket/legacy-notebooks.

    gsutil cp -R /home/jupyter/*.ipynb gs://my-bucket/legacy-notebooks/