Troubleshooting Vertex AI Workbench

This page describes troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI Workbench.

See also Troubleshooting Vertex AI for help using other components of Vertex AI.

Vertex AI Workbench instances

This section describes troubleshooting steps for Vertex AI Workbench instances.

Connecting to and opening JupyterLab

This section describes troubleshooting steps for connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

Can't access the terminal in a Vertex AI Workbench instance

Issue

If you're unable to access the terminal or can't find the terminal window in the launcher, it could be because your Vertex AI Workbench instance doesn't have terminal access enabled.

Solution

You must create a new Vertex AI Workbench instance with the Terminal access option enabled. This option can't be changed after instance creation.

502 error when opening JupyterLab

Issue

A 502 error might mean that your Vertex AI Workbench instance isn't ready yet.

Solution

Wait a few minutes, refresh the Google Cloud console browser tab, and try again.

Notebook is unresponsive

Issue

Your Vertex AI Workbench instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • Reset your instance.

Unable to connect with Vertex AI Workbench instance using SSH

Issue

You're unable to connect to your instance by using SSH through a terminal window.

Vertex AI Workbench instances use OS Login to enable SSH access. When you create an instance, Vertex AI Workbench enables OS Login by default by setting the metadata key enable-oslogin to TRUE. If you're unable to use SSH to connect to your instance, verify that this metadata key is still set to TRUE.

Solution

Connecting to a Vertex AI Workbench instance by using the Google Cloud console isn't supported. If you're unable to connect to your instance by using SSH through a terminal window, set the metadata key enable-oslogin to TRUE by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.
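
For example, the following is a hedged sketch of the gcloud approach (INSTANCE_NAME and LOCATION are placeholders; flag availability can vary by gcloud CLI version):

# Set the enable-oslogin metadata key back to TRUE on an existing instance.
gcloud workbench instances update INSTANCE_NAME \
  --location=LOCATION \
  --metadata=enable-oslogin=TRUE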

GPU quota has been exceeded

Issue

You're unable to create a Vertex AI Workbench instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Request a higher quota limit.
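
For example, the following is one quick way to check regional GPU quotas from the CLI (us-central1 is a placeholder region); the quotas page in the console shows the same values:

# List GPU-related quota metrics, limits, and usage for a region.
gcloud compute regions describe us-central1 \
  --format="yaml(quotas)" | grep -B1 -A1 GPUS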

Creating Vertex AI Workbench instances

This section describes how to troubleshoot issues related to creating Vertex AI Workbench instances.

Instance stays in pending state indefinitely

Issue

After creating a Vertex AI Workbench instance, it stays in the pending state indefinitely. An error like the following might appear in the serial logs:

Could not resolve host: notebooks.googleapis.com

Solution

Your instance can't connect to the Notebooks API server due to a DNS configuration or other network issue. To resolve the issue, check your DNS and network configurations. For more information, see Network configuration options.
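
As a quick check, you can confirm from a VM on the same network (or from the instance over SSH) that the Notebooks API endpoint resolves and is reachable. These are generic connectivity checks, not an exhaustive network test:

# Confirm DNS resolution and HTTPS reachability of the Notebooks API endpoint.
nslookup notebooks.googleapis.com
curl -sI https://notebooks.googleapis.com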

Unable to create an instance within a Shared VPC network

Issue

Attempting to create an instance within a Shared VPC network results in an error message like the following:

Required 'compute.subnetworks.use' permission for
'projects/network-administration/regions/us-central1/subnetworks/v'

Solution

The issue is that the Notebooks Service Account is attempting to create the instance without the correct permissions.

To ensure that the Notebooks Service Account can create a Vertex AI Workbench instance within a Shared VPC network, ask your administrator to grant the Notebooks Service Account the Compute Network User role (roles/compute.networkUser) on the host project. For more information about granting roles, see Manage access.

This predefined role contains the permissions that the Notebooks Service Account needs to create a Vertex AI Workbench instance within a Shared VPC network. To see the exact permissions that are required, see the following Required permissions section:

Required permissions

The following permissions are required for the Notebooks Service Account to create a Vertex AI Workbench instance within a Shared VPC network:

  • To use subnetworks: compute.subnetworks.use

Your administrator might also be able to give the Notebooks Service Account these permissions with custom roles or other predefined roles.
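
If your administrator prefers the CLI, the following is a hedged sketch of the grant. The service account address shown is an assumption; substitute the account that appears in your error message or on the IAM page:

# Grant the Compute Network User role on the Shared VPC host project to the
# Notebooks Service Account (the address shown is illustrative).
gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com" \
  --role="roles/compute.networkUser"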

Can't create a Vertex AI Workbench instance with a custom container

Issue

There isn't an option to use a custom container when creating a Vertex AI Workbench instance in the Google Cloud console.

Solution

Adding a custom container to a Vertex AI Workbench instance isn't supported, and the Google Cloud console doesn't provide an option to add one.

Adding a conda environment is recommended instead of using a custom container.

Although you can add a custom container to a Vertex AI Workbench instance by using the Notebooks API, this capability isn't supported.

Mount shared storage button doesn't appear

Issue

The Mount shared storage button isn't in the File Browser tab of the JupyterLab interface.

Solution

The storage.buckets.list permission is required for the Mount shared storage button to appear in the JupyterLab interface of your Vertex AI Workbench instance. Ask your administrator to grant your Vertex AI Workbench instance's service account the storage.buckets.list permission on the project.
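
One hedged way to grant only that permission is to create a custom role that contains storage.buckets.list and bind it to the instance's service account; the role ID and placeholders below are illustrative:

# Create a custom role with only storage.buckets.list, then bind it to the
# instance's service account at the project level.
gcloud iam roles create bucketLister \
  --project=PROJECT_ID \
  --permissions=storage.buckets.list

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
  --role="projects/PROJECT_ID/roles/bucketLister"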

599 error when using Dataproc

Issue

Attempting to create a Dataproc-enabled instance results in an error message like the following:

HTTP 599: Unknown (Error from Gateway: [Timeout while connecting]
Exception while attempting to connect to Gateway server url.
Ensure gateway url is valid and the Gateway instance is running.)

Solution

In your DNS configuration, add a DNS entry for the *.googleusercontent.com domain.

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in Vertex AI Workbench instances.

Unable to edit underlying virtual machine

Issue

When you try to edit the underlying virtual machine (VM) of a Vertex AI Workbench instance, you might get an error message similar to the following:

Current principal doesn't have permission to mutate this resource.

Solution

This error occurs because you can't edit the underlying VM of an instance by using the Google Cloud console or the Compute Engine API.

To edit a Vertex AI Workbench instance's underlying VM, use the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.
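
For example, the following is a minimal sketch of changing the underlying VM's machine type (INSTANCE_NAME, LOCATION, and the machine type are placeholders; you might need to stop the instance first):

# Change the machine type of the VM that backs the instance.
gcloud workbench instances update INSTANCE_NAME \
  --location=LOCATION \
  --machine-type=n1-standard-8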

pip packages aren't available after adding conda environment

Issue

Your pip packages aren't available after you add a conda-based kernel.

Solution

To resolve the issue, see Add a conda environment and try the following:

  • Check that you used the DL_ANACONDA_ENV_HOME variable and that it contains the name of your environment.

  • Check that pip is located in a path similar to /opt/conda/envs/ENVIRONMENT/bin/pip. You can run the which pip command to get the path, as shown in the example after this list.
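
The following is a minimal sketch of those checks from a JupyterLab terminal, assuming conda is installed at /opt/conda (ENVIRONMENT is a placeholder):

# Activate the environment, then confirm that pip resolves to the
# environment's copy rather than the base environment's.
source /opt/conda/bin/activate ENVIRONMENT
which pip        # expect /opt/conda/envs/ENVIRONMENT/bin/pip
pip --version    # should report a path under /opt/conda/envs/ENVIRONMENT/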

Unable to access or copy data of an instance with single user access

Issue

The data on an instance with single user access is inaccessible.

For Vertex AI Workbench instances that are set up with single user access, only the specified single user (the owner) can access the data on the instance.

Solution

To access or copy the data when you aren't the owner of the instance, open a support case.

Unexpected shutdown

Issue

Your Vertex AI Workbench instance shuts down unexpectedly.

Solution

If your instance shuts down unexpectedly, this could be because idle shutdown was initiated.

If you enabled idle shutdown, your instance shuts down when there is no kernel activity for the specified time period. For example, running a cell or new output printing to a notebook is activity that resets the idle timeout timer. CPU usage doesn't reset the idle timeout timer.

Managed notebooks

This section describes troubleshooting steps for managed notebooks.

Connecting to and opening JupyterLab

This section describes troubleshooting issues with connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

Unable to connect with managed notebooks instance using SSH

Issue

There isn't an option to connect with managed notebooks instances by using SSH.

Solution

SSH access to managed notebooks instances isn't available.

Can't access the terminal in a managed notebooks instance

Issue

If you're unable to access the terminal or can't find the terminal window in the launcher, it could be because your managed notebooks instance doesn't have terminal access enabled.

Solution

You must create a new managed notebooks instance with the Terminal access option enabled. This option can't be changed after instance creation.

502 error when opening JupyterLab

Issue

A 502 error might mean that your managed notebooks instance isn't ready yet.

Solution

Wait a few minutes, refresh the Google Cloud console browser tab, and try again.

Opening a notebook results in a 524 (A Timeout Occurred) error

Issue

A 524 error is usually an indication that the Inverting Proxy agent isn't connecting to the Inverting Proxy server or the requests are taking too long on the backend server side (Jupyter). Common causes of this error include networking issues, the Inverting Proxy agent isn't running, or the Jupyter service isn't running.

Solution

Verify that your managed notebooks instance is started.

Notebook is unresponsive

Issue

Your managed notebooks instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • Reset your instance.

Migrating to Vertex AI Workbench instances

This section describes methods for diagnosing and resolving issues with migrating from a managed notebooks instance to a Vertex AI Workbench instance.

Can't find a kernel that was in the managed notebooks instance

Issue

A kernel that was in your managed notebooks instance doesn't appear in the Vertex AI Workbench instance that you migrated to.

Custom containers appear as kernels in managed notebooks. The Vertex AI Workbench migration tool doesn't support custom container migration.

Solution

To resolve this issue, add a conda environment to your Vertex AI Workbench instance.

Different version of framework in migrated instance

Issue

A framework that was in your managed notebooks instance was a different version than the one in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances provide a default set of framework versions. The migration tool doesn't add framework versions from your original managed notebooks instance. See default migration tool behaviors.

Solution

To add a specific version of a framework, add a conda environment to your Vertex AI Workbench instance.

GPUs aren't migrated to the new Vertex AI Workbench instance

Issue

GPUs that were in your managed notebooks instance aren't in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances support a default set of GPUs. If the GPUs in your original managed notebooks instance aren't available, your instance is migrated without any GPUs.

Solution

After migration, you can add GPUs to your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.
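
For example, the following is a hedged sketch of adding one GPU (placeholders throughout; the accelerator flag names follow the gcloud workbench reference and may vary by CLI version, and you might need to stop the instance first):

# Attach an accelerator to the migrated instance.
gcloud workbench instances update INSTANCE_NAME \
  --location=LOCATION \
  --accelerator-type=NVIDIA_TESLA_T4 \
  --accelerator-core-count=1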

Migrated instance's machine type is different

Issue

The machine type of your managed notebooks instance is different from the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances don't support all machine types. If the machine type in your original managed notebooks instance isn't available, your instance is migrated to the e2-standard-4 machine type.

Solution

After migration, you can change the machine type of your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.

GPU quota has been exceeded

Issue

You are unable to create a managed notebooks instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Request a higher quota limit.

Using container images

This section describes troubleshooting issues with using container images.

Container image doesn't appear as a kernel in JupyterLab

Issue

Container images that don't have a valid kernelspec don't successfully load as kernels in JupyterLab.

Solution

Make sure that your container meets our requirements. For more information, see the custom container requirements.
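
As a quick local check, you can list the kernelspecs that the image exposes, assuming the image has jupyter on its PATH and lets you override its default command (IMAGE_URI is a placeholder):

# If this prints no kernelspecs, JupyterLab has no kernel to load from the image.
docker run --rm IMAGE_URI jupyter kernelspec list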

Notebook disconnects on long-running job

Issue

If you see the following error message when running a job in a notebook, the cause might be a request that takes too long to load, or high CPU or memory utilization, either of which can make the Jupyter service unresponsive.

{"log":"2021/06/29 18:10:33 failure fetching a VM ID: compute: Received 500
`internal error`\n","stream":"stderr","time":"2021-06-29T18:10:33.383650241Z"}
{"log":"2021/06/29 18:38:26 Websocket failure: failed to read a websocket
message from the server: read tcp [::1]:40168-\u003e[::1]:8080: use of closed
network connection\n","stream":"stderr","time":"2021-06-29T18:38:26.057622824Z"}

Solution

This issue is caused by running a long-running job within a notebook. To run a job that might take a long time to complete, it's recommended to use the executor.

Using the executor

This section describes troubleshooting issues with using the executor.

Package installations not available to the executor

Issue

The executor runs your notebook code in a separate environment from the kernel where you run your notebook file's code. Because of this, some of the packages you installed might not be available in the executor's environment.

Solution

To resolve this issue, see Ensure package installations are available to the executor.

401 or 403 errors when running the notebook code using the executor

Issue

A 401 or 403 error when you run the executor can mean that the executor isn't able to access resources.

Solution

See the following for possible causes:

  • The executor runs your notebook code in a tenant project separate from your managed notebooks instance's project. Therefore, when you access resources through code run by the executor, the executor might not connect to the correct Google Cloud project by default. To resolve this issue, use explicit project selection.

  • By default, your managed notebooks instance can have access to resources that exist in the same project, and therefore, when you run your notebook file's code manually, these resources don't need additional authentication. However, because the executor runs in a separate tenant project, it does not have the same default access. To resolve this issue, authenticate access using service accounts.

  • The executor can't use end-user credentials to authenticate access to resources, for example, the gcloud auth login command. To resolve this issue, authenticate access using service accounts.

exited with a non-zero status of 127 error when using the executor

Issue

An exited with a non-zero status of 127 error, or "command not found" error, can happen when you use the executor to run code on a custom container that doesn't have the nbexecutor extension installed.

Solution

To ensure that your custom container has the nbexecutor extension, you can create a derivative container image from a Deep Learning Containers image. Deep Learning Containers images include the nbexecutor extension.
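
The following is a hedged sketch of building such a derivative image; the base image tag, the extra package, and the target repository are illustrative assumptions:

# Derive a custom image from a Deep Learning Containers base image, which
# already includes the nbexecutor extension, then push it to your registry.
cat > Dockerfile <<'EOF'
FROM gcr.io/deeplearning-platform-release/base-cpu:latest
RUN pip install --no-cache-dir pandas
EOF
docker build -t gcr.io/PROJECT_ID/executor-image:latest .
docker push gcr.io/PROJECT_ID/executor-image:latest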

Invalid service networking configuration error message

Issue

This error might look like the following:

Invalid Service Networking configuration. Couldn't find free blocks in allocated IP ranges.
Please use a valid range using: /24 mask or below (/23,/22, etc).

This means that no free blocks were found in the allocated IP ranges of your network.

Solution

Use a subnet mask of /24 or lower (for example, /23 or /22). Create a bigger allocated IP address range and attach this range by modifying the private service connection for servicenetworking-googleapis-com.
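
The following is a hedged sketch of that change (the range name, prefix length, and network name are placeholders):

# Reserve a larger allocated range, then attach it to the existing private
# service connection for servicenetworking.googleapis.com.
gcloud compute addresses create larger-range \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=22 \
  --network=NETWORK_NAME

gcloud services vpc-peerings update \
  --service=servicenetworking.googleapis.com \
  --network=NETWORK_NAME \
  --ranges=larger-range \
  --force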

For more information, see Set up a network.

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in managed notebooks instances.

Unable to access or copy data of an instance with single user access

Issue

The data on an instance with single user access is inaccessible.

Solution

For managed notebooks instances that are set up with single user access, only the specified single user (the owner) can access the data on the instance.

To access or copy the data when you aren't the owner of the instance, open a support case.

Unexpected shutdown

Issue

Your Vertex AI Workbench instance shuts down unexpectedly.

Solution

If your instance shuts down unexpectedly, this could be because idle shutdown was initiated.

If you enabled idle shutdown, your instance shuts down when there is no kernel activity for the specified time period. For example, running a cell or new output printing to a notebook is activity that resets the idle timeout timer. CPU usage doesn't reset the idle timeout timer.

Restore instance

Issue

Restoring a managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub.

Recover data from an instance

Issue

Recovering data from a managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub.

Creating managed notebooks instances

This section describes troubleshooting issues with creating managed notebooks instances.

Error: Problem while creating a connection

Issue

You encounter this error while creating an instance:

We encountered a problem while creating a connection.

Service 'servicenetworking.googleapis.com' requires at least
one allocated range to have minimal size; please make sure
at least one allocated range will have prefix length at most '24'.

Solution

Create an allocated IP range of at least /24 (that is, a prefix length of 24 or lower) and attach this range by modifying the private service connection for the servicenetworking-googleapis-com connection.

Creating an instance results in a resource availability error

Issue

You're unable to create an instance because of a resource availability error.

This error can look like the following:

Creating notebook INSTANCE_NAME: ZONE does not have
enough resources available to fulfill the request.
Retry later or try another zone in your configurations.

Resource errors occur when you request new resources in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to new resource requests in the zone and don't affect existing resources. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, try the following:

  • Create an instance with a different machine type.
  • Create the instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, try to create an instance with fewer GPUs, disks, or vCPUs, or with less memory.

Starting an instance results in a resource availability error

Issue

You're unable to start an instance because of a resource availability error.

This error can look like the following:

The zone ZONE_NAME doesn't have enough resources available to fulfill
the request. '(resource type:compute)'.

Resource errors occur when you try to start an instance in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to the resources you specified in your request at the time you sent the request, not to all resources in the zone. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, try the following:

  • Change the machine type of your instance.
  • Migrate your files and data to an instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, start a different instance with fewer GPUs, disks, or vCPUs, or with less memory.

No route to host on outbound connections from managed notebooks

Issue

Typically, the only routes you can see in the Google Cloud console are those known to your own VPC as well as the ranges reserved when you complete the VPC Network Peering configuration.

Managed notebooks instances reside in a Google-managed network and run a modified version of Jupyter in a Docker networking namespace within the instance.

The Docker network interface and Linux bridge on this instance may select a local IP that conflicts with IP ranges being exported over the peering by your VPC. These are typically in the 172.16.0.0/16 and 192.168.10.0/24 ranges, respectively.

In these circumstances, outbound connections from the instance to these ranges will fail with a complaint that is some variation of No route to host despite VPC routes being correctly shared.

Solution

Invoke ifconfig in a terminal session and ensure that no IP addresses on any virtual interfaces in the instance conflict with IP ranges that your VPC is exporting to the peering connection.
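
For example, from a terminal on the instance:

# Show the IPv4 addresses on all interfaces (including the Docker bridge) so
# that you can compare them against the ranges exported over the peering.
ifconfig -a | grep "inet "

# Alternatively, if ifconfig isn't installed:
ip -4 addr show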

User-managed notebooks

This section describes troubleshooting steps for user-managed notebooks.

Connecting to and opening JupyterLab

This section describes troubleshooting issues with connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

No Inverting Proxy server access to JupyterLab

Issue

You are unable to access JupyterLab.

Vertex AI Workbench uses a Google internal Inverting Proxy server to provide access to JupyterLab. User-managed notebooks instance settings, network configuration, and other factors can prevent access to JupyterLab.

Solution

Use SSH to connect to JupyterLab and learn more about why you might not have access through the Inverting Proxy.

Unable to connect with user-managed notebooks instance using SSH

Issue

You're unable to connect to your instance by using SSH through a terminal window.

User-managed notebooks instances use OS Login to enable SSH access. When you create an instance, Vertex AI Workbench enables OS Login by default by setting the metadata key enable-oslogin to TRUE. If you're unable to use SSH to connect to your instance, this metadata key might need to be set to TRUE.

Solution

To enable SSH access for user-managed notebooks for users, complete the steps for configuring OS Login roles on user accounts.

Opening a user-managed notebooks instance results in a 403 (Forbidden) error

Issue

A 403 (Forbidden) error when opening a user-managed notebooks instance often means that there is an access issue.

Solution

To troubleshoot access issues, consider the three ways that access can be granted to a user-managed notebooks instance:

  • Single user
  • Service account
  • Project editors

The access mode is configured during user-managed notebooks instance creation and is defined in the instance's metadata, which you can inspect as shown after the following list:

  • Single user: proxy-mode=mail, proxy-user-mail=user@domain.com
  • Service account: proxy-mode=service_account
  • Project editors: proxy-mode=project_editors
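
To inspect these metadata entries on the underlying VM, one quick check is the following (INSTANCE_NAME and ZONE are placeholders):

# The default YAML output lists each metadata key/value pair, so grep shows
# the proxy-* entries.
gcloud compute instances describe INSTANCE_NAME --zone=ZONE | grep -A 1 "key: proxy"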

If you can't access a notebook when you click Open JupyterLab, verify that the access mode metadata matches the way that you're trying to access the instance.

The following example shows how to create an instance with single user access for a specific user and a specified service account:

gcloud notebooks instances create nb-1 \
  --vm-image-family=tf-latest-cpu \
  --metadata=proxy-mode=mail,proxy-user-mail=user@domain.com \
  --service-account=your_service_account@project_id.iam.gserviceaccount.com \
  --location=us-west1-a

When you click Open JupyterLab to open a notebook, the notebook opens in a new browser tab. If you are signed in to more than one Google Account, the new tab opens with your default Google Account. If you didn't create your user-managed notebooks instance with your default Google Account, the new browser tab will show a 403 (Forbidden) error.

No JupyterLab access, single user mode enabled

Issue

You are unable to access JupyterLab.

Solution

If a user is unable to access JupyterLab and the instance's access to JupyterLab is set to Single user only, try the following:

  1. On the User-managed notebooks page of the Google Cloud console, click the name of your instance to open the Notebook details page.

  2. Next to View VM details, click View in Compute Engine.

  3. On the VM details page, click Edit.

  4. In the Metadata section, verify that the proxy-mode metadata entry is set to mail.

  5. Verify that the proxy-user-mail metadata entry is set to a valid user email address, not a service account.

  6. Click Save.

  7. On the User-managed notebooks page of the Google Cloud console, initialize the updated metadata by stopping your instance and starting the instance back up again.
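
Alternatively, the following is a hedged gcloud sketch of the same metadata change (INSTANCE_NAME, ZONE, and USER_EMAIL are placeholders); stop and start the instance afterward, as in the last step:

# Set single user access mode metadata directly on the underlying VM.
gcloud compute instances add-metadata INSTANCE_NAME --zone=ZONE \
  --metadata=proxy-mode=mail,proxy-user-mail=USER_EMAIL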

Opening a notebook results in a 504 (Gateway Timeout) error

Issue

A 504 error indicates an internal proxy timeout or a backend server (Jupyter) timeout. This can happen when:

  • The request never reached the internal Inverting Proxy server.
  • The backend (Jupyter) returned a 504 error.

Solution

Open a Google support case.

Opening a notebook results in a 524 (A Timeout Occurred) error

Issue

The internal Inverting Proxy server hasn't received a response from the Inverting Proxy agent for the request within the timeout period. The Inverting Proxy agent runs inside your user-managed notebooks instance as a Docker container. A 524 error is usually an indication that the Inverting Proxy agent isn't connecting to the Inverting Proxy server or that requests are taking too long on the backend server side (Jupyter). A typical cause is on the user side, for example, a networking issue or the Inverting Proxy agent service isn't running.

Solution

If you can't access a notebook, verify that your user-managed notebooks instance is started and try the following:

Option 1: Run the diagnostic tool to automatically check and repair user-managed notebooks core services, verify available storage, and generate useful log files. To run the tool in your instance, perform the following steps:

  1. Make sure that your instance is on version M58 or newer.

  2. Connect to your Deep Learning VM Images instance using SSH.

  3. Run the following command:

    sudo /opt/deeplearning/bin/diagnostic_tool.sh [--repair] [--bucket=$BUCKET]
    

    Note that the --repair and --bucket flags are optional. The --repair flag attempts to fix common core service errors, and the --bucket flag lets you specify a Cloud Storage bucket for storing the created log files.

    The output of this command displays status messages for user-managed notebooks core services and exports log files of its findings.

Option 2: Check specific user-managed notebooks requirements individually. For the relevant commands, such as verifying that the Inverting Proxy agent, Docker service, and Jupyter service are running, see the Helpful procedures section of this page.

Opening a notebook results in a 598 (Network read timeout) error

Issue

The Inverting Proxy server hasn't heard from the Inverting Proxy agent at all for more than 10 minutes. This is a strong indication of an Inverting Proxy agent issue.

Solution

If you can't access a notebook, verify that your user-managed notebooks instance is started, and then verify that the Inverting Proxy agent is running and restart it if necessary. For the relevant commands, see the Helpful procedures section of this page.

Notebook is unresponsive

Issue

Your user-managed notebooks instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Any unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • From a terminal session in the notebook, run the command top to see if there are processes consuming the CPU.
  • From the terminal, check the amount of free disk space using the command df, or check the available RAM using the command free.
  • Shut your instance down by selecting it from the User-managed notebooks page and clicking Stop. After it has stopped completely, select it and click Start.

Migrating to Vertex AI Workbench instances

This section describes methods for diagnosing and resolving issues with migrating from a user-managed notebooks instance to a Vertex AI Workbench instance.

Can't find R, Beam, or other kernels that were in the user-managed notebooks instance

Issue

A kernel that was in your user-managed notebooks instance doesn't appear in the Vertex AI Workbench instance that you migrated to.

Some kernels, such as the R and Beam kernels, aren't available in Vertex AI Workbench instances by default. Migration of those kernels isn't supported.

Solution

To resolve this issue, add a conda environment to your Vertex AI Workbench instance.

Can't set up a Dataproc Hub instance in the Vertex AI Workbench instance

Issue

Dataproc Hub isn't supported in Vertex AI Workbench instances.

Solution

Continue to use Dataproc Hub in user-managed notebooks instances.

Different version of framework in migrated instance

Issue

A framework that was in your user-managed notebooks instance was a different version than the one in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances provide a default set of framework versions. The migration tool doesn't add framework versions from your original user-managed notebooks instance. See default migration tool behaviors.

Solution

To add a specific version of a framework, add a conda environment to your Vertex AI Workbench instance.

GPUs aren't migrated to the new Vertex AI Workbench instance

Issue

GPUs that were in your user-managed notebooks instance aren't in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances support a default set of GPUs. If the GPUs in your original user-managed notebooks instance aren't available, your instance is migrated without any GPUs.

Solution

After migration, you can add GPUs to your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.

Migrated instance's machine type is different

Issue

The machine type of your user-managed notebooks instance is different from the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances don't support all machine types. If the machine type in your original user-managed notebooks instance isn't available, your instance is migrated to the e2-standard-4 machine type.

Solution

After migration, you can change the machine type of your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.

Working with files

This section describes troubleshooting issues with files for user-managed notebooks instances.

File downloading disabled but user can still download files

Issue

For Dataproc Hub user-managed notebooks instances, disabling file downloading from the JupyterLab user interface isn't supported. User-managed notebooks instances that use the Dataproc Hub framework permit file downloading even if you don't select Enable file downloading from JupyterLab UI when you create the instance.

Solution

Dataproc Hub user-managed notebooks instances don't support restricting file downloads.

Downloaded files are truncated or don't complete downloading

Issue

When you download files from your user-managed notebooks instance, a timeout setting on the proxy-forwarding agent limits the connection time for the download to complete. If the download takes too long, this can truncate your downloaded file or prevent it from being downloaded.

Solution

To download the file, copy your file to Cloud Storage, and then download the file from Cloud Storage.

Consider migrating your files and data to a new user-managed notebooks instance.

After restarting VM, local files can't be referenced from notebook terminal

Issue

Sometimes after restarting a user-managed notebooks instance, local files can't be referenced from within a notebook terminal.

Solution

This is a known issue. To reference your local files from within a notebook terminal, first re-establish your current working directory using the following command:

cd PWD

In this command, replace PWD with your current working directory. For example, if your current working directory was /home/jupyter/, use the command cd /home/jupyter/.

After re-establishing your current working directory, your local files can be referenced from within the notebook terminal.

Creating user-managed notebooks instances

This section describes troubleshooting issues with creating user-managed notebooks instances.

GPU quota has been exceeded

Issue

You are unable to create a user-managed notebooks instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Request a higher quota limit.

Instance stays in pending state indefinitely

Issue

After creating a user-managed notebooks instance, it stays in the pending state indefinitely. An error like the following might appear in the serial logs:

Could not resolve host: notebooks.googleapis.com

Solution

Your instance can't connect to the Notebooks API server due to a DNS configuration or other network issue. To resolve the issue, check your DNS and network configurations. For more information, see Network configuration options.

New user-managed notebooks instance isn't created (insufficient permissions)

Issue

It usually takes about a minute to create a user-managed notebooks instance. If your new user-managed notebooks instance remains in the pending state indefinitely, it might be because the service account used to start the user-managed notebooks instance doesn't have the required Editor permission in your Google Cloud project.

You can start a user-managed notebooks instance with a custom service account that you create or in single-user mode with a user ID. If you start a user-managed notebooks instance in single-user mode, then your instance begins the boot process using the Compute Engine default service account before turning control over to your user ID.

Solution

To verify that a service account has the appropriate permissions, follow these steps:

Console

  1. Open the IAM page in the Google Cloud console.

    Open the IAM page

  2. Determine the service account used with your user-managed notebooks instance, which is one of the following:

    • A custom service account that you specified when you created your user-managed notebooks instance.

    • The Compute Engine default service account for your Google Cloud project, which is used when you start your user-managed notebooks instance in single-user mode. The Compute Engine default service account for your Google Cloud project is named PROJECT_NUMBER-compute@developer.gserviceaccount.com. For example: 113377992299-compute@developer.gserviceaccount.com.

  3. Verify that your service account is in the Editor role.

  4. If not, edit the service account and add it to the Editor role.

For more information, see Granting, changing, and revoking access to resources in the IAM documentation.

gcloud

  1. If you haven't already, install the Google Cloud CLI.

  2. Get the name and project number for your Google Cloud project with the following command. Replace PROJECT_ID with the project ID for your Google Cloud project.

    gcloud projects describe PROJECT_ID
    

    You should see output similar to the following, which displays the name (name:) and project number (projectNumber:) for your project.

    createTime: '2018-10-18T21:03:31.408Z'
    lifecycleState: ACTIVE
    name: my-project-name
    parent:
     id: '396521612403'
     type: folder
    projectId: my-project-id-1234
    projectNumber: '113377992299'
    
  3. Determine the service account used with your user-managed notebooks instance, which is one of the following:

    • A custom service account that you specified when you created your user-managed notebooks instance.

    • The Compute Engine default service account for your Google Cloud project, which is used when you start your user-managed notebooks instance in single-user mode. The Compute Engine default service account for your Google Cloud project is named PROJECT_NUMBER-compute@developer.gserviceaccount.com. For example: 113377992299-compute@developer.gserviceaccount.com.

  4. Add the roles/editor role to the service account with the following command. Replace project-name with the name of your project, and replace service-account-id with the service account ID for your user-managed notebooks instance.

    gcloud projects add-iam-policy-binding project-name \
     --member serviceAccount:service-account-id \
     --role roles/editor
    

Creating an instance results in a Permission denied error

Issue

The service account on the instance provides access to other Google Cloud services. You can use any service account within the same project, but you must have the Service Account User permission (iam.serviceAccounts.actAs) to create the instance. If not specified, the Compute Engine default service account is used.

Solution

When creating an instance, verify that the user creating the instance has the iam.serviceAccounts.actAs permission for the defined service account.

The following example shows how to specify a service account when you create an instance:

gcloud notebooks instances create nb-1 \
  --vm-image-family=tf-latest-cpu \
  --service-account=your_service_account@project_id.iam.gserviceaccount.com \
  --location=us-west1-a

To grant the Service Account User role, see Manage access to service accounts.
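
For example, the following is a hedged sketch of granting that role to the user who creates the instance (the service account address matches the example above; USER_EMAIL is a placeholder):

# Grant the Service Account User role on the service account itself.
gcloud iam service-accounts add-iam-policy-binding \
  your_service_account@project_id.iam.gserviceaccount.com \
  --member="user:USER_EMAIL" \
  --role="roles/iam.serviceAccountUser"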

Creating an instance results in an already exists error

Issue

Creating an instance results in an already exists error. This can occur when a user-managed notebooks instance with the same name was previously deleted by Compute Engine but still exists in the Notebooks API database.

Solution

The following example shows how to list instances using the Notebooks API and verify their state.

gcloud notebooks instances list --location=LOCATION

If an instance's state is DELETED, run the following command to delete it permanently.

gcloud notebooks instances delete INSTANCE_NAME --location=LOCATION

Unable to create an instance in a Shared VPC

Issue

You're unable to create an instance in a Shared VPC.

Solution

If you are using Shared VPC, you must add the host and the service projects to the service perimeter. In the host project, you must also grant the Compute Network User role (roles/compute.networkUser) to the Notebooks Service Agent from the service project. For more information, see Managing service perimeters.

Creating an instance results in a resource availability error

Issue

You're unable to create an instance because of a resource availability error.

This error can look like the following:

Creating notebook INSTANCE_NAME: ZONE does not have enough
resources available to fulfill the request. Retry later or try another zone in
your configurations.

Resource errors occur when you request new resources in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to new resource requests in the zone and don't affect existing resources. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, you can try the following:

  • Create an instance with a different machine type.
  • Create the instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, try to create an instance with fewer GPUs, disks, or vCPUs, or with less memory.

Starting an instance results in a resource availability error

Issue

You're unable to start an instance because of a resource availability error.

This error can look like the following:

The zone ZONE_NAME doesn't have enough resources available to fulfill
the request. '(resource type:compute)'.

Resource errors occur when you try to start an instance in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to the resources you specified in your request at the time you sent the request, not to all resources in the zone. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, you can try the following:

  • Change the machine type of your instance.
  • Migrate your files and data to an instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, start a different instance with fewer GPUs, disks, or vCPUs, or with less memory.

Upgrading user-managed notebooks instances

This section describes troubleshooting issues with upgrading user-managed notebooks instances.

Unable to upgrade because unable to get instance disk information

Issue

Upgrade isn't supported for single-disk user-managed notebooks instances.

Solution

You might want to migrate your user data to a new user-managed notebooks instance.

Unable to upgrade because instance isn't UEFI compatible

Issue

Vertex AI Workbench depends on UEFI compatibility to complete an upgrade.

User-managed notebooks instances created from some older images are not UEFI compatible, and therefore can't be upgraded.

Solution

To verify that your instance is UEFI compatible, type the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE | grep type

Replace the following:

  • INSTANCE_NAME: the name of your instance
  • ZONE: the zone where your instance is located

To verify that the image that you used to create your instance is UEFI compatible, use the following command:

gcloud compute images describe VM_IMAGE_FAMILY \
  --project deeplearning-platform-release | grep type

Replace VM_IMAGE_FAMILY with the image family name that you used to create your instance.

If you determine that either your instance or image isn't UEFI compatible, you can attempt to migrate your user data to a new user-managed notebooks instance. To do so, complete the following steps:

  1. Verify that the image that you want to use to create your new instance is UEFI compatible. To do so, type the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

    gcloud compute images describe VM_IMAGE_FAMILY \
      --project deeplearning-platform-release --format=json | grep type
    

    Replace VM_IMAGE_FAMILY with the image family name that you want to use to create your instance.

  2. Migrate your user data to a new user-managed notebooks instance.

User-managed notebooks instance isn't accessible after upgrade

Issue

If the user-managed notebooks instance isn't accessible after an upgrade, there might have been a failure during the replacement of the boot disk image.

User-managed notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.

Solution

Complete the following steps to attach a new valid image to the boot disk.

  1. To store values you'll use to complete this procedure, type the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

    export INSTANCE_NAME=MY_INSTANCE_NAME
    export PROJECT_ID=MY_PROJECT_ID
    export ZONE=MY_ZONE
    

    Replace the following:

    • MY_INSTANCE_NAME: the name of your instance
    • MY_PROJECT_ID: your project ID
    • MY_ZONE: the zone where your instance is located
  2. Use the following command to stop the instance:

    gcloud compute instances stop $INSTANCE_NAME \
      --project=$PROJECT_ID --zone=$ZONE
    
  3. Detach the data disk from the instance.

    gcloud compute instances detach-disk $INSTANCE_NAME --device-name=data \
      --project=$PROJECT_ID --zone=$ZONE
    
  4. Delete the instance's VM.

    gcloud compute instances delete $INSTANCE_NAME --keep-disks=all --quiet \
      --project=$PROJECT_ID --zone=$ZONE
    
  5. Use the Notebooks API to delete the user-managed notebooks instance.

    gcloud notebooks instances delete $INSTANCE_NAME \
      --project=$PROJECT_ID --location=$ZONE
    
  6. Create a user-managed notebooks instance using the same name as your previous instance.

    gcloud notebooks instances create $INSTANCE_NAME \
      --vm-image-project="deeplearning-platform-release" \
      --vm-image-family=MY_VM_IMAGE_FAMILY \
      --instance-owners=MY_INSTANCE_OWNER \
      --machine-type=MY_MACHINE_TYPE \
      --service-account=MY_SERVICE_ACCOUNT \
      --accelerator-type=MY_ACCELERATOR_TYPE \
      --accelerator-core-count=MY_ACCELERATOR_CORE_COUNT \
      --install-gpu-driver \
      --project=$PROJECT_ID \
      --location=$ZONE
    

    Replace the following:

    • MY_VM_IMAGE_FAMILY: the image family name
    • MY_INSTANCE_OWNER: your instance owner
    • MY_MACHINE_TYPE: the machine type of your instance's VM
    • MY_SERVICE_ACCOUNT: the service account to use with this instance, or use "default"
    • MY_ACCELERATOR_TYPE: the accelerator type; for example, "NVIDIA_TESLA_T4"
    • MY_ACCELERATOR_CORE_COUNT: the core count; for example, 1

Monitoring health status of user-managed notebooks instances

This section describes how to troubleshoot issues with monitoring health status errors.

docker-proxy-agent status failure

Follow these steps after a docker-proxy-agent status failure:

  1. Verify that the Inverting Proxy agent is running. If not, go to step 3.

  2. Restart the Inverting Proxy agent.

  3. Re-register with the Inverting Proxy server.

docker-service status failure

Follow these steps after a docker-service status failure:

  1. Verify that the Docker service is running.

  2. Restart the Docker service.

jupyter-service status failure

Follow these steps after a jupyter-service status failure:

  1. Verify that the Jupyter service is running.

  2. Restart the Jupyter service.

jupyter-api status failure

Follow these steps after a jupyter-api status failure:

  1. Verify that the Jupyter internal API is active.

  2. Restart the Jupyter service.

Boot disk utilization percent

The boot disk space status is unhealthy if the disk space is greater than 85% full.

If your boot disk space status is unhealthy, try the following:

  1. From a terminal session in the user-managed notebooks instance or using ssh to connect, check the amount of free disk space using the command df -H.

  2. Use the command find . -type f -size +100M to help you find large files that you might be able to delete, but don't delete them unless you are sure you can safely do so. If you aren't sure, you can get help from support.

  3. If the previous steps don't solve your problem, get support.

Data disk utilization percent

The data disk space status is unhealthy if the disk space is greater than 85% full.

If your data disk space status is unhealthy, try the following:

  1. From a terminal session in the user-managed notebooks instance or using ssh to connect, check the amount of free disk space using the command df -h -T /home/jupyter.

  2. Delete large files to increase the available disk space. Use the command find . -type f -size +100M to help you find large files.

  3. If the previous steps don't solve your problem, get support.

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in user-managed notebooks instances.

Restore instance

Issue

Restoring a user-managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub or make a snapshot of the disk.

Recover data from an instance

Issue

Recovering data from a user-managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub or make a snapshot of the disk.

Unable to increase shared memory

Issue

You can't increase shared memory on an existing user-managed notebooks instance.

Solution

You can specify a shared memory size when you create a new user-managed notebooks instance by using the container-custom-params metadata key, with a value like the following:

--shm-size=SHARED_MEMORY_SIZEgb

Replace SHARED_MEMORY_SIZE with the size that you want in GB.
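
For example, the following is a hedged sketch of creating a container-based instance with a 16 GB shared memory size (the container image and other values are illustrative placeholders):

# Pass the shared memory size to the container through instance metadata.
gcloud notebooks instances create nb-shm \
  --container-repository=gcr.io/deeplearning-platform-release/base-cpu \
  --metadata=container-custom-params="--shm-size=16gb" \
  --location=us-west1-a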

Helpful procedures

This section describes procedures that you might find helpful.

Use SSH to connect to your user-managed notebooks instance

Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

gcloud compute ssh --project PROJECT_ID \
  --zone ZONE \
  INSTANCE_NAME -- -L 8080:localhost:8080

Replace the following:

  • PROJECT_ID: Your project ID
  • ZONE: The Google Cloud zone where your instance is located
  • INSTANCE_NAME: The name of your instance

Re-register with the Inverting Proxy server

To re-register the user-managed notebooks instance with the internal Inverting Proxy server, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh

Verify the Docker service status

To verify the Docker service status you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker status

Verify that the Inverting Proxy agent is running

To verify if the notebook Inverting Proxy agent is running, use ssh to connect to your user-managed notebooks instance and enter:

# Confirm Inverting Proxy agent Docker container is running (proxy-agent)
sudo docker ps

# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent

# Grab logs
sudo docker logs proxy-agent

Verify the Jupyter service status and collect logs

To verify the Jupyter service status you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter status

To collect Jupyter service logs:

sudo journalctl -u jupyter.service --no-pager

Verify that the Jupyter internal API is active

To verify that the Jupyter internal API is active you can use ssh to connect to your user-managed notebooks instance and enter:

curl http://127.0.0.1:8080/api/kernelspecs

Restart the Docker service

To restart the Docker service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker restart

Restart the Inverting Proxy agent

To restart the Inverting Proxy agent, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo docker restart proxy-agent

Restart the Jupyter service

To restart the Jupyter service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter restart

Make a copy of the user data on an instance

To store a copy of an instance's user data in Cloud Storage, complete the following steps.

Create a Cloud Storage bucket (optional)

In the same project where your instance is located, create a Cloud Storage bucket where you can store your user data. If you already have a Cloud Storage bucket, skip this step.

  • Create a Cloud Storage bucket:

    gcloud storage buckets create gs://BUCKET_NAME

    Replace BUCKET_NAME with a bucket name that meets the bucket naming requirements.

Copy your user data

  1. In your instance's JupyterLab interface, select File > New > Terminal to open a terminal window. For user-managed notebooks instances, you can instead connect to your instance's terminal by using SSH.

  2. Use the gsutil tool to copy your user data to a Cloud Storage bucket. The following example command copies all of the files from your instance's /home/jupyter/ directory to a directory in a Cloud Storage bucket.

    gsutil cp -R /home/jupyter/* gs://BUCKET_NAMEPATH
    

    Replace the following:

    • BUCKET_NAME: the name of your Cloud Storage bucket
    • PATH: the path to the directory where you want to copy your files, for example: /copy/jupyter/