Troubleshooting Vertex AI Workbench

This page describes troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI Workbench.

See also Troubleshooting Vertex AI for help using other components of Vertex AI.

Helpful procedures

This section describes procedures that you might find helpful.

Use SSH to connect to your user-managed notebooks instance

Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

gcloud compute ssh --project PROJECT_ID \
  --zone ZONE \
  INSTANCE_NAME -- -L 8080:localhost:8080

Replace the following:

PROJECT_ID: Your project ID
ZONE: The Google Cloud zone where your instance is located
INSTANCE_NAME: The name of your instance

You can also connect to your instance by opening your instance's Compute Engine detail page, and then clicking the SSH button.
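
If you connect often, the command above can be wrapped in a small helper. The following is a minimal sketch, not part of the official tooling: the function name `workbench_ssh_cmd` and its argument order are assumptions made for illustration. It only prints the command so that you can review it before running it.

```shell
# Hypothetical helper: prints the gcloud command that opens an SSH session
# and forwards local port 8080 to port 8080 on the instance.
workbench_ssh_cmd() {
  local project="$1" zone="$2" instance="$3"
  # Refuse to build a command with a missing value.
  if [ -z "$project" ] || [ -z "$zone" ] || [ -z "$instance" ]; then
    echo "usage: workbench_ssh_cmd PROJECT_ID ZONE INSTANCE_NAME" >&2
    return 1
  fi
  printf 'gcloud compute ssh --project %s --zone %s %s -- -L 8080:localhost:8080\n' \
    "$project" "$zone" "$instance"
}
```

For example, `workbench_ssh_cmd my-project us-central1-a my-instance` prints the full command for those (made-up) values. Because of the `-L 8080:localhost:8080` option, requests to port 8080 on your local machine are forwarded to port 8080 on the instance while the session is open.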

Re-register with the Inverting Proxy server

To re-register the user-managed notebooks instance with the internal Inverting Proxy server, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh

Verify the Docker service status

To verify the Docker service status, use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker status

Verify that the Inverting Proxy agent is running

To verify that the notebook Inverting Proxy agent is running, use ssh to connect to your user-managed notebooks instance and enter:

# Confirm Inverting Proxy agent Docker container is running (proxy-agent)
sudo docker ps

# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent

# Grab logs
sudo docker logs proxy-agent
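
The full `docker inspect` output is long, so it can help to print only the two fields mentioned above. The following is a minimal sketch that uses Docker's `--format` option for that. The function name `proxy_agent_healthy` is an assumption made for illustration, not part of the instance image.

```shell
# Returns 0 only when the state string reads "running true".
# The expected input is the string printed by:
#   sudo docker inspect --format '{{.State.Status}} {{.State.Running}}' proxy-agent
proxy_agent_healthy() {
  [ "$1" = "running true" ]
}

# Example, to run on the instance:
#   state="$(sudo docker inspect --format '{{.State.Status}} {{.State.Running}}' proxy-agent)"
#   if proxy_agent_healthy "$state"; then
#     echo "proxy-agent is running"
#   else
#     echo "unexpected proxy-agent state: $state"
#   fi
```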

Verify the Jupyter service status and collect logs

To verify the Jupyter service status, use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter status

To collect Jupyter service logs:

sudo journalctl -u jupyter.service --no-pager

Verify that the Jupyter internal API is active

The Jupyter API should always run on port 8080. You can verify this by inspecting the instance's syslogs for an entry similar to:

Jupyter Server ... running at:
http://localhost:8080

To verify that the Jupyter internal API is active, you can also use ssh to connect to your user-managed notebooks instance and enter:

curl http://127.0.0.1:8080/api/kernelspecs

You can also measure the time it takes for the API to respond in case the requests are taking too long:

time curl -v http://127.0.0.1:8080/api/status
time curl -v http://127.0.0.1:8080/api/kernels
time curl -v http://127.0.0.1:8080/api/connections

To run these commands in your Vertex AI Workbench instance, open JupyterLab, and create a new terminal.
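
As an alternative to wrapping each request in `time`, curl can report the HTTP status code and total request time itself through its `--write-out` option. The following is a minimal sketch. The helper names `time_endpoint` and `is_slow` are assumptions made for illustration, and the 1-second threshold in the example is arbitrary.

```shell
# Prints "<http_code> <seconds>" for a single request to the given URL.
time_endpoint() {
  curl --silent --output /dev/null --max-time 30 \
    --write-out '%{http_code} %{time_total}\n' "$1"
}

# Returns 0 when the measured time (in seconds) exceeds the limit.
is_slow() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !((t + 0) > (limit + 0)) }'
}

# Example, to run on the instance:
#   for path in status kernels connections; do
#     read -r code secs < <(time_endpoint "http://127.0.0.1:8080/api/$path")
#     if is_slow "$secs" 1; then echo "/api/$path is slow: ${secs}s (HTTP $code)"; fi
#   done
```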

Restart the Docker service

To restart the Docker service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker restart

Restart the Inverting Proxy agent

To restart the Inverting Proxy agent, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo docker restart proxy-agent

Restart the Jupyter service

To restart the Jupyter service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter restart
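
After a restart, the service can take a moment before it accepts requests. The following is a minimal sketch of a polling loop that waits until the Jupyter internal API responds. The function name `wait_for_api` and the retry values in the example are assumptions made for illustration.

```shell
# Polls a URL until it answers successfully or the attempts run out.
# Arguments: URL, number of attempts, seconds to sleep after a failed attempt.
wait_for_api() {
  local url="$1" attempts="$2" delay="$3" i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl return a non-zero status on HTTP errors.
    if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example, to run on the instance:
#   sudo service jupyter restart
#   wait_for_api http://127.0.0.1:8080/api/status 12 5 || echo "Jupyter API did not come back"
```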

Restart the Notebooks Collection Agent

The Notebooks Collection Agent service runs a Python process in the background that verifies the status of the Vertex AI Workbench instance's core services.

To restart the Notebooks Collection Agent service, you can stop and start the VM from the Google Cloud console or you can use ssh to connect to your Vertex AI Workbench instance and enter:

sudo systemctl stop notebooks-collection-agent.service

followed by:

sudo systemctl start notebooks-collection-agent.service

To run these commands in your Vertex AI Workbench instance, open JupyterLab, and create a new terminal.

Modify the Notebooks Collection Agent script

To access and edit the script, open a terminal in your instance or use ssh to connect to your Vertex AI Workbench instance, and enter:

nano /opt/deeplearning/bin/notebooks_collection_agent.py

After editing the file, remember to save it.

Then, you must restart the Notebooks Collection Agent service.

Verify the instance can resolve the required DNS domains

To verify that the instance can resolve the required DNS domains, you can use ssh to connect to your user-managed notebooks instance and enter:

host notebooks.googleapis.com
host *.notebooks.cloud.google.com
host *.notebooks.googleusercontent.com
host *.kernels.googleusercontent.com

or:

curl --silent --output /dev/null "https://notebooks.cloud.google.com"; echo $?

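If the instance has Dataproc enabled, you can verify that the instance resolves *.kernels.googleusercontent.com by running the following command. Set the PROJECT_NUMBER and REGION shell variables to your project number and region first: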

curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://${PROJECT_NUMBER}-dot-${REGION}.kernels.googleusercontent.com/api/kernelspecs | jq .

To run these commands in your Vertex AI Workbench instance, open JupyterLab, and create a new terminal.
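
The `curl ...; echo $?` command above prints curl's exit status, which is 0 on success. The following is a minimal sketch that maps a few of curl's documented exit codes to a short explanation. The function name `explain_curl_exit` is an assumption made for illustration, and the list of codes is deliberately incomplete.

```shell
# Translates a curl exit status into a short, human-readable hint.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host (DNS)" ;;
    7)  echo "failed to connect to host" ;;
    28) echo "operation timed out" ;;
    35) echo "TLS/SSL handshake failed" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

# Example, to run on the instance:
#   curl --silent --output /dev/null "https://notebooks.cloud.google.com"
#   explain_curl_exit "$?"
```

An exit status of 6 points at DNS resolution, while 7 or 28 suggest that the name resolved but the network path is blocked or slow.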

Make a copy of the user data on an instance

To store a copy of an instance's user data in Cloud Storage, complete the following steps.

Create a Cloud Storage bucket (optional)

In the same project where your instance is located, create a Cloud Storage bucket where you can store your user data. If you already have a Cloud Storage bucket, skip this step.

Create a Cloud Storage bucket:
```
gcloud storage buckets create gs://BUCKET_NAME
```
Replace BUCKET_NAME with a bucket name that meets the bucket naming requirements.

Copy your user data

In your instance's JupyterLab interface, select File > New > Terminal to open a terminal window. For user-managed notebooks instances, you can instead connect to your instance's terminal by using SSH.
Use the gcloud CLI to copy your user data to a Cloud Storage bucket. The following example command copies all of the files from your instance's /home/jupyter/ directory to a directory in a Cloud Storage bucket.
```
gcloud storage cp /home/jupyter/* gs://BUCKET_NAME/PATH --recursive
```
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket
- PATH: the path to the directory where you want to copy your files, for example: copy/jupyter/
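
When the bucket name and path come from variables, it is easy to end up with a missing or doubled slash in the destination. The following is a minimal sketch that joins the two values into a `gs://` URI. The function name `gcs_destination` is an assumption made for illustration.

```shell
# Joins a bucket name and a path into a gs:// URI with exactly one
# slash between them. A leading "gs://" on the bucket name is accepted.
gcs_destination() {
  local bucket="${1#gs://}" path="$2"
  bucket="${bucket%/}"   # drop one trailing slash from the bucket name
  path="${path#/}"       # drop one leading slash from the path
  printf 'gs://%s/%s\n' "$bucket" "$path"
}

# Example, to run on the instance:
#   gcloud storage cp /home/jupyter/* "$(gcs_destination my-bucket copy/jupyter/)" --recursive
```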

Investigate an instance stuck in provisioning by using gcpdiag

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

This gcpdiag runbook investigates potential causes for a Vertex AI Workbench instance to get stuck in provisioning status, including the following areas:

Status: Checks the instance's current status to ensure that it is stuck in provisioning and not stopped or active.
Instance's Compute Engine VM boot disk image: Checks whether the instance was created with a custom container, an official workbench-instances image, Deep Learning VM Images, or unsupported images that might cause the instance to get stuck in provisioning status.
Custom scripts: Checks whether the instance is using custom startup or post-startup scripts that change the default Jupyter port or break dependencies that might cause the instance to get stuck in provisioning status.
Environment version: Checks whether the instance is using the latest environment version by checking its upgradability. Earlier versions might cause the instance to get stuck in provisioning status.
Instance's Compute Engine VM performance: Checks the VM's current performance to ensure that it isn't impaired by high CPU usage, insufficient memory, or disk space issues that might disrupt normal operations.
Instance's Compute Engine serial port or system logging: Checks whether the instance has serial port logs, which are analyzed to ensure that Jupyter is listening on 127.0.0.1:8080.
Instance's Compute Engine SSH and terminal access: Checks whether the instance's Compute Engine VM is running so that the user can connect over SSH and open a terminal to verify that space usage in /home/jupyter is lower than 85%. If no space is left, this might cause the instance to get stuck in provisioning status.
External IP turned off: Checks whether external IP access is turned off. An incorrect networking configuration can cause the instance to get stuck in provisioning status.
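
One of the checks above relies on space usage in /home/jupyter staying below 85%. If you already have terminal access, you can check this yourself. The following is a minimal sketch that reads the usage percentage from `df -P`. The helper names `parse_df_percent` and `below_limit` are assumptions made for illustration and are not part of gcpdiag.

```shell
# Reads `df -P` output for one path on stdin and prints the usage
# percentage from the second line, without the % sign.
parse_df_percent() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Returns 0 when usage is below the limit (85 when no limit is given).
below_limit() {
  [ "$1" -lt "${2:-85}" ]
}

# Example, to run on the instance:
#   used="$(df -P /home/jupyter | parse_df_percent)"
#   below_limit "$used" || echo "/home/jupyter is ${used}% full"
```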

Google Cloud console

Complete and then copy the following command.

GOOGLE_AUTH_TOKEN=GOOGLE_AUTH_TOKEN \
  gcpdiag runbook vertex/workbench-instance-stuck-in-provisioning \
    --parameter project_id=PROJECT_ID \
    --parameter instance_name=INSTANCE_NAME \
    --parameter zone=ZONE \
    --auto --reason=REASON

Open the Google Cloud console and activate Cloud Shell.

Paste the copied command.
Run the gcpdiag command, which downloads the gcpdiag Docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

Copy and run the following command on your local workstation.

curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag

Execute the gcpdiag command.

./gcpdiag runbook vertex/workbench-instance-stuck-in-provisioning \
    --parameter project_id=PROJECT_ID \
    --parameter instance_name=INSTANCE_NAME \
    --parameter zone=ZONE

View available parameters for this runbook.

Replace the following:

PROJECT_ID: The ID of the project containing the resource.
INSTANCE_NAME: The name of the target Vertex AI Workbench instance within your project.
ZONE: The zone in which your target Vertex AI Workbench instance is located.

Useful flags:

--universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
--parameter or -p: Runbook parameters

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.