Troubleshooting Vertex AI Workbench

This page describes troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI Workbench.

See also Troubleshooting Vertex AI for help using other components of Vertex AI.

Vertex AI Workbench instances

This section describes troubleshooting steps for Vertex AI Workbench instances.

Connecting to and opening JupyterLab

This section describes troubleshooting steps for connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

Can't access the terminal in a Vertex AI Workbench instance

Issue

If you're unable to access the terminal or can't find the terminal window in the launcher, it could be because your Vertex AI Workbench instance doesn't have terminal access enabled.

Solution

You must create a new Vertex AI Workbench instance with the Terminal access option enabled. This option can't be changed after instance creation.

502 error when opening JupyterLab

Issue

A 502 error might mean that your Vertex AI Workbench instance isn't ready yet.

Solution

Wait a few minutes, refresh the Google Cloud console browser tab, and try again.

Notebook is unresponsive

Issue

Your Vertex AI Workbench instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • Reset your instance.

Unable to connect with Vertex AI Workbench instance using SSH

Issue

You're unable to connect to your instance by using SSH through a terminal window.

Vertex AI Workbench instances use OS Login to enable SSH access. When you create an instance, Vertex AI Workbench enables OS Login by default by setting the metadata key enable-oslogin to TRUE. If you're unable to use SSH to connect to your instance, verify that this metadata key is still set to TRUE.

Solution

Connecting to a Vertex AI Workbench instance by using the Google Cloud console isn't supported. If you're unable to connect to your instance by using SSH through a terminal window, set the metadata key enable-oslogin to TRUE by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.
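
For example, the following is a hedged sketch of the gcloud approach (INSTANCE_NAME and LOCATION are placeholders; flag availability can vary by gcloud CLI version):

# Set the enable-oslogin metadata key back to TRUE on an existing instance.
gcloud workbench instances update INSTANCE_NAME \
  --location=LOCATION \
  --metadata=enable-oslogin=TRUE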

GPU quota has been exceeded

Issue

You're unable to create a Vertex AI Workbench instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Request a higher quota limit.
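
For example, the following is one quick way to check regional GPU quotas from the CLI (us-central1 is a placeholder region); the quotas page in the console shows the same values:

# List GPU-related quota metrics, limits, and usage for a region.
gcloud compute regions describe us-central1 \
  --format="yaml(quotas)" | grep -B1 -A1 GPUS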

Creating Vertex AI Workbench instances

This section describes how to troubleshoot issues related to creating Vertex AI Workbench instances.

Instance stays in pending state indefinitely

Issue

After creating a Vertex AI Workbench instance, it stays in the pending state indefinitely. An error like the following might appear in the serial logs:

Could not resolve host: notebooks.googleapis.com

Solution

Your instance can't connect to the Notebooks API server due to a DNS configuration or other network issue. To resolve the issue, check your DNS and network configurations. For more information, see Network configuration options.
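
As a quick check, you can confirm from a VM on the same network (or from the instance over SSH) that the Notebooks API endpoint resolves and is reachable. These are generic connectivity checks, not an exhaustive network test:

# Confirm DNS resolution and HTTPS reachability of the Notebooks API endpoint.
nslookup notebooks.googleapis.com
curl -sI https://notebooks.googleapis.com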

Unable to create an instance within a Shared VPC network

Issue

Attempting to create an instance within a Shared VPC network results in an error message like the following:

Required 'compute.subnetworks.use' permission for
'projects/network-administration/regions/us-central1/subnetworks/v'

Solution

The issue is that the Notebooks Service Account is attempting to create the instance without the correct permissions.

To ensure that the Notebooks Service Account can create a Vertex AI Workbench instance within a Shared VPC network, ask your administrator to grant the Notebooks Service Account the Compute Network User role (roles/compute.networkUser) on the host project. For more information about granting roles, see Manage access.

This predefined role contains the permissions that the Notebooks Service Account needs to create a Vertex AI Workbench instance within a Shared VPC network. To see the exact permissions that are required, see the following Required permissions section:

Required permissions

The following permissions are required for the Notebooks Service Account to create a Vertex AI Workbench instance within a Shared VPC network:

  • To use subnetworks: compute.subnetworks.use

Your administrator might also be able to give the Notebooks Service Account these permissions with custom roles or other predefined roles.
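
If your administrator prefers the CLI, the following is a hedged sketch of the grant. The service account address shown is an assumption; substitute the account that appears in your error message or on the IAM page:

# Grant the Compute Network User role on the Shared VPC host project to the
# Notebooks Service Account (the address shown is illustrative).
gcloud projects add-iam-policy-binding HOST_PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-notebooks.iam.gserviceaccount.com" \
  --role="roles/compute.networkUser"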

Can't create a Vertex AI Workbench instance with a custom container

Issue

There isn't an option to use a custom container when creating a Vertex AI Workbench instance in the Google Cloud console.

Solution

Adding a custom container to a Vertex AI Workbench instance isn't supported, and the Google Cloud console doesn't provide an option to add one.

Adding a conda environment is recommended instead of using a custom container.

Although you can add a custom container to a Vertex AI Workbench instance by using the Notebooks API, this capability isn't supported.

Mount shared storage button doesn't appear

Issue

The Mount shared storage button isn't in the File Browser tab of the JupyterLab interface.

Solution

The storage.buckets.list permission is required for the Mount shared storage button to appear in the JupyterLab interface of your Vertex AI Workbench instance. Ask your administrator to grant your Vertex AI Workbench instance's service account the storage.buckets.list permission on the project.
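
One hedged way to grant only that permission is to create a custom role that contains storage.buckets.list and bind it to the instance's service account; the role ID and placeholders below are illustrative:

# Create a custom role with only storage.buckets.list, then bind it to the
# instance's service account at the project level.
gcloud iam roles create bucketLister \
  --project=PROJECT_ID \
  --permissions=storage.buckets.list

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
  --role="projects/PROJECT_ID/roles/bucketLister"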

599 error when using Dataproc

Issue

Attempting to create a Dataproc-enabled instance results in an error message like the following:

HTTP 599: Unknown (Error from Gateway: [Timeout while connecting]
Exception while attempting to connect to Gateway server url.
Ensure gateway url is valid and the Gateway instance is running.)

Solution

In your DNS configuration, add a DNS entry for the *.googleusercontent.com domain.

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in Vertex AI Workbench instances.

Unable to edit underlying virtual machine

Issue

When you try to edit the underlying virtual machine (VM) of a Vertex AI Workbench instance, you might get an error message similar to the following:

Current principal doesn't have permission to mutate this resource.

Solution

This error occurs because you can't edit the underlying VM of an instance by using the Google Cloud console or the Compute Engine API.

To edit a Vertex AI Workbench instance's underlying VM, use the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.
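
For example, the following is a minimal sketch of changing the underlying VM's machine type (INSTANCE_NAME, LOCATION, and the machine type are placeholders; you might need to stop the instance first):

# Change the machine type of the VM that backs the instance.
gcloud workbench instances update INSTANCE_NAME \
  --location=LOCATION \
  --machine-type=n1-standard-8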

pip packages aren't available after adding conda environment

Issue

Your pip packages aren't available after you add a conda-based kernel.

Solution

To resolve the issue, see Add a conda environment and try the following:

  • Check that you used the DL_ANACONDA_ENV_HOME variable and that it contains the name of your environment.

  • Check that pip is located in a path similar to /opt/conda/envs/ENVIRONMENT/bin/pip. You can run the which pip command to get the path, as shown in the example after this list.
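
The following is a minimal sketch of those checks from a JupyterLab terminal, assuming conda is installed at /opt/conda (ENVIRONMENT is a placeholder):

# Activate the environment, then confirm that pip resolves to the
# environment's copy rather than the base environment's.
source /opt/conda/bin/activate ENVIRONMENT
which pip        # expect /opt/conda/envs/ENVIRONMENT/bin/pip
pip --version    # should report a path under /opt/conda/envs/ENVIRONMENT/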

Unable to access or copy data of an instance with single user access

Issue

The data on an instance with single user access is inaccessible.

For Vertex AI Workbench instances that are set up with single user access, only the specified single user (the owner) can access the data on the instance.

Solution

To access or copy the data when you aren't the owner of the instance, open a support case.

Unexpected shutdown

Issue

Your Vertex AI Workbench instance shuts down unexpectedly.

Solution

If your instance shuts down unexpectedly, this could be because idle shutdown was initiated.

If you enabled idle shutdown, your instance shuts down when there is no kernel activity for the specified time period. For example, running a cell or new output printing to a notebook is activity that resets the idle timeout timer. CPU usage doesn't reset the idle timeout timer.

Managed notebooks

This section describes troubleshooting steps for managed notebooks.

Connecting to and opening JupyterLab

This section describes troubleshooting issues with connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

Unable to connect with managed notebooks instance using SSH

Issue

There isn't an option to connect with managed notebooks instances by using SSH.

Solution

SSH access to managed notebooks instances isn't available.

Can't access the terminal in a managed notebooks instance

Issue

If you're unable to access the terminal or can't find the terminal window in the launcher, it could be because your managed notebooks instance doesn't have terminal access enabled.

Solution

You must create a new managed notebooks instance with the Terminal access option enabled. This option can't be changed after instance creation.

502 error when opening JupyterLab

Issue

A 502 error might mean that your managed notebooks instance isn't ready yet.

Solution

Wait a few minutes, refresh the Google Cloud console browser tab, and try again.

Opening a notebook results in a 524 (A Timeout Occurred) error

Issue

A 524 error is usually an indication that the Inverting Proxy agent isn't connecting to the Inverting Proxy server or the requests are taking too long on the backend server side (Jupyter). Common causes of this error include networking issues, the Inverting Proxy agent isn't running, or the Jupyter service isn't running.

Solution

Verify that your managed notebooks instance is started.

Notebook is unresponsive

Issue

Your managed notebooks instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • Reset your instance.

Migrating to Vertex AI Workbench instances

This section describes methods for diagnosing and resolving issues with migrating from a managed notebooks instance to a Vertex AI Workbench instance.

Can't find a kernel that was in the managed notebooks instance

Issue

A kernel that was in your managed notebooks instance doesn't appear in the Vertex AI Workbench instance that you migrated to.

Custom containers appear as kernels in managed notebooks. The Vertex AI Workbench migration tool doesn't support custom container migration.

Solution

To resolve this issue, add a conda environment to your Vertex AI Workbench instance.

Different version of framework in migrated instance

Issue

A framework that was in your managed notebooks instance was a different version than the one in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances provide a default set of framework versions. The migration tool doesn't add framework versions from your original managed notebooks instance. See default migration tool behaviors.

Solution

To add a specific version of a framework, add a conda environment to your Vertex AI Workbench instance.

GPUs aren't migrated to the new Vertex AI Workbench instance

Issue

GPUs that were in your managed notebooks instance aren't in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances support a default set of GPUs. If the GPUs in your original managed notebooks instance aren't available, your instance is migrated without any GPUs.

Solution

After migration, you can add GPUs to your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.
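
For example, the following is a hedged sketch of adding one GPU (placeholders throughout; the accelerator flag names follow the gcloud workbench reference and may vary by CLI version, and you might need to stop the instance first):

# Attach an accelerator to the migrated instance.
gcloud workbench instances update INSTANCE_NAME \
  --location=LOCATION \
  --accelerator-type=NVIDIA_TESLA_T4 \
  --accelerator-core-count=1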

Migrated instance's machine type is different

Issue

The machine type of your managed notebooks instance is different from the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances don't support all machine types. If the machine type in your original managed notebooks instance isn't available, your instance is migrated to the e2-standard-4 machine type.

Solution

After migration, you can change the machine type of your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.

GPU quota has been exceeded

Issue

You are unable to create a managed notebooks instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Request a higher quota limit.

Using container images

This section describes troubleshooting issues with using container images.

Container image doesn't appear as a kernel in JupyterLab

Issue

Container images that don't have a valid kernelspec don't successfully load as kernels in JupyterLab.

Solution

Make sure that your container meets our requirements. For more information, see the custom container requirements.
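
As a quick local check, you can list the kernelspecs that the image exposes, assuming the image has jupyter on its PATH and lets you override its default command (IMAGE_URI is a placeholder):

# If this prints no kernelspecs, JupyterLab has no kernel to load from the image.
docker run --rm IMAGE_URI jupyter kernelspec list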

Notebook disconnects on long-running job

Issue

If you see the following error message when running a job in a notebook, the cause might be a request that takes too long to load, or high CPU or memory utilization, either of which can make the Jupyter service unresponsive.

{"log":"2021/06/29 18:10:33 failure fetching a VM ID: compute: Received 500
`internal error`\n","stream":"stderr","time":"2021-06-29T18:10:33.383650241Z"}
{"log":"2021/06/29 18:38:26 Websocket failure: failed to read a websocket
message from the server: read tcp [::1]:40168-\u003e[::1]:8080: use of closed
network connection\n","stream":"stderr","time":"2021-06-29T18:38:26.057622824Z"}

Solution

This issue is caused by running a long-running job within a notebook. To run a job that might take a long time to complete, it's recommended to use the executor.

Using the executor

This section describes troubleshooting issues with using the executor.

Package installations not available to the executor

Issue

The executor runs your notebook code in a separate environment from the kernel where you run your notebook file's code. Because of this, some of the packages you installed might not be available in the executor's environment.

Solution

To resolve this issue, see Ensure package installations are available to the executor.

401 or 403 errors when running the notebook code using the executor

Issue

A 401 or 403 error when you run the executor can mean that the executor isn't able to access resources.

Solution

See the following for possible causes:

  • The executor runs your notebook code in a tenant project separate from your managed notebooks instance's project. Therefore, when you access resources through code run by the executor, the executor might not connect to the correct Google Cloud project by default. To resolve this issue, use explicit project selection.

  • By default, your managed notebooks instance can have access to resources that exist in the same project, and therefore, when you run your notebook file's code manually, these resources don't need additional authentication. However, because the executor runs in a separate tenant project, it does not have the same default access. To resolve this issue, authenticate access using service accounts.

  • The executor can't use end-user credentials to authenticate access to resources, for example, the gcloud auth login command. To resolve this issue, authenticate access using service accounts.

exited with a non-zero status of 127 error when using the executor

Issue

An exited with a non-zero status of 127 error, or "command not found" error, can happen when you use the executor to run code on a custom container that doesn't have the nbexecutor extension installed.

Solution

To ensure that your custom container has the nbexecutor extension, you can create a derivative container image from a Deep Learning Containers image. Deep Learning Containers images include the nbexecutor extension.
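
The following is a hedged sketch of building such a derivative image; the base image tag, the extra package, and the target repository are illustrative assumptions:

# Derive a custom image from a Deep Learning Containers base image, which
# already includes the nbexecutor extension, then push it to your registry.
cat > Dockerfile <<'EOF'
FROM gcr.io/deeplearning-platform-release/base-cpu:latest
RUN pip install --no-cache-dir pandas
EOF
docker build -t gcr.io/PROJECT_ID/executor-image:latest .
docker push gcr.io/PROJECT_ID/executor-image:latest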

Invalid service networking configuration error message

Issue

This error might look like the following:

Invalid Service Networking configuration. Couldn't find free blocks in allocated IP ranges.
Please use a valid range using: /24 mask or below (/23,/22, etc).

This means that no free blocks were found in the allocated IP ranges of your network.

Solution

Use a subnet mask of /24 or lower (for example, /23 or /22). Create a bigger allocated IP address range and attach this range by modifying the private service connection for servicenetworking-googleapis-com.
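
The following is a hedged sketch of that change (the range name, prefix length, and network name are placeholders):

# Reserve a larger allocated range, then attach it to the existing private
# service connection for servicenetworking.googleapis.com.
gcloud compute addresses create larger-range \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=22 \
  --network=NETWORK_NAME

gcloud services vpc-peerings update \
  --service=servicenetworking.googleapis.com \
  --network=NETWORK_NAME \
  --ranges=larger-range \
  --force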

For more information, see Set up a network.

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in managed notebooks instances.

Unable to access or copy data of an instance with single user access

Issue

The data on an instance with single user access is inaccessible.

Solution

For managed notebooks instances that are set up with single user access, only the specified single user (the owner) can access the data on the instance.

To access or copy the data when you aren't the owner of the instance, open a support case.

Unexpected shutdown

Issue

Your Vertex AI Workbench instance shuts down unexpectedly.

Solution

If your instance shuts down unexpectedly, this could be because idle shutdown was initiated.

If you enabled idle shutdown, your instance shuts down when there is no kernel activity for the specified time period. For example, running a cell or new output printing to a notebook is activity that resets the idle timeout timer. CPU usage doesn't reset the idle timeout timer.

Restore instance

Issue

Restoring a managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub.

Recover data from an instance

Issue

Recovering data from a managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub.

Creating managed notebooks instances

This section describes troubleshooting issues with creating managed notebooks instances.

Error: Problem while creating a connection

Issue

You encounter this error while creating an instance:

We encountered a problem while creating a connection.

Service 'servicenetworking.googleapis.com' requires at least
one allocated range to have minimal size; please make sure
at least one allocated range will have prefix length at most '24'.

Solution

Create an allocated IP range of at least /24 (that is, a prefix length of 24 or lower) and attach this range by modifying the private service connection for the servicenetworking-googleapis-com connection.

Creating an instance results in a resource availability error

Issue

You're unable to create an instance because of a resource availability error.

This error can look like the following:

Creating notebook INSTANCE_NAME: ZONE does not have
enough resources available to fulfill the request.
Retry later or try another zone in your configurations.

Resource errors occur when you request new resources in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to new resource requests in the zone and don't affect existing resources. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, try the following:

  • Create an instance with a different machine type.
  • Create the instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, try to create an instance with fewer GPUs, disks, or vCPUs, or with less memory.

Starting an instance results in a resource availability error

Issue

You're unable to start an instance because of a resource availability error.

This error can look like the following:

The zone ZONE_NAME doesn't have enough resources available to fulfill
the request. '(resource type:compute)'.

Resource errors occur when you try to start an instance in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to the resources you specified in your request at the time you sent the request, not to all resources in the zone. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, try the following:

  • Change the machine type of your instance.
  • Migrate your files and data to an instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, start a different instance with fewer GPUs, disks, or vCPUs, or with less memory.

No route to host on outbound connections from managed notebooks

Issue

Typically, the only routes you can see in the Google Cloud console are those known to your own VPC as well as the ranges reserved when you complete the VPC Network Peering configuration.

Managed notebooks instances reside in a Google-managed network and run a modified version of Jupyter in a Docker networking namespace within the instance.

The Docker network interface and Linux bridge on this instance may select a local IP that conflicts with IP ranges being exported over the peering by your VPC. These are typically in the 172.16.0.0/16 and 192.168.10.0/24 ranges, respectively.

In these circumstances, outbound connections from the instance to these ranges will fail with a complaint that is some variation of No route to host despite VPC routes being correctly shared.

Solution

Invoke ifconfig in a terminal session and ensure that no IP addresses on any virtual interfaces in the instance conflict with IP ranges that your VPC is exporting to the peering connection.
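
For example, from a terminal on the instance:

# Show the IPv4 addresses on all interfaces (including the Docker bridge) so
# that you can compare them against the ranges exported over the peering.
ifconfig -a | grep "inet "

# Alternatively, if ifconfig isn't installed:
ip -4 addr show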

User-managed notebooks

This section describes troubleshooting steps for user-managed notebooks.

Connecting to and opening JupyterLab

This section describes troubleshooting issues with connecting to and opening JupyterLab.

Nothing happens after clicking Open JupyterLab

Issue

When you click Open JupyterLab, nothing happens.

Solution

Verify that your browser doesn't block new tabs from opening automatically. JupyterLab opens in a new browser tab.

No Inverting Proxy server access to JupyterLab

Issue

You are unable to access JupyterLab.

Vertex AI Workbench uses a Google internal Inverting Proxy server to provide access to JupyterLab. User-managed notebooks instance settings, network configuration, and other factors can prevent access to JupyterLab.

Solution

Use SSH to connect to JupyterLab and learn more about why you might not have access through the Inverting Proxy.

Unable to connect with user-managed notebooks instance using SSH

Issue

You're unable to connect to your instance by using SSH through a terminal window.

User-managed notebooks instances use OS Login to enable SSH access. When you create an instance, Vertex AI Workbench enables OS Login by default by setting the metadata key enable-oslogin to TRUE. If you're unable to use SSH to connect to your instance, this metadata key might need to be set to TRUE.

Solution

To enable SSH access for user-managed notebooks for users, complete the steps for configuring OS Login roles on user accounts.

Opening a user-managed notebooks instance results in a 403 (Forbidden) error

Issue

A 403 (Forbidden) error when opening a user-managed notebooks instance often means that there is an access issue.

Solution

To troubleshoot access issues, consider the three ways that access can be granted to a user-managed notebooks instance:

  • Single user
  • Service account
  • Project editors

The access mode is configured during user-managed notebooks instance creation and is defined in the instance's metadata, which you can inspect as shown after the following list:

  • Single user: proxy-mode=mail, proxy-user-mail=user@domain.com
  • Service account: proxy-mode=service_account
  • Project editors: proxy-mode=project_editors
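
To inspect these metadata entries on the underlying VM, one quick check is the following (INSTANCE_NAME and ZONE are placeholders):

# The default YAML output lists each metadata key/value pair, so grep shows
# the proxy-* entries.
gcloud compute instances describe INSTANCE_NAME --zone=ZONE | grep -A 1 "key: proxy"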

If you can't access a notebook when you click Open JupyterLab, verify that the access mode metadata matches the way that you're trying to access the instance.

The following example shows how to create an instance with single user access for a specific user and a specified service account:

gcloud notebooks instances create nb-1 \
  --vm-image-family=tf-latest-cpu \
  --metadata=proxy-mode=mail,proxy-user-mail=user@domain.com \
  --service-account=your_service_account@project_id.iam.gserviceaccount.com \
  --location=us-west1-a

When you click Open JupyterLab to open a notebook, the notebook opens in a new browser tab. If you are signed in to more than one Google Account, the new tab opens with your default Google Account. If you didn't create your user-managed notebooks instance with your default Google Account, the new browser tab will show a 403 (Forbidden) error.

No JupyterLab access, single user mode enabled

Issue

You are unable to access JupyterLab.

Solution

If a user is unable to access JupyterLab and the instance's access to JupyterLab is set to Single user only, try the following:

  1. On the User-managed notebooks page of the Google Cloud console, click the name of your instance to open the Notebook details page.

  2. Next to View VM details, click View in Compute Engine.

  3. On the VM details page, click Edit.

  4. In the Metadata section, verify that the proxy-mode metadata entry is set to mail.

  5. Verify that the proxy-user-mail metadata entry is set to a valid user email address, not a service account.

  6. Click Save.

  7. On the User-managed notebooks page of the Google Cloud console, initialize the updated metadata by stopping your instance and starting the instance back up again.
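
Alternatively, the following is a hedged gcloud sketch of the same metadata change (INSTANCE_NAME, ZONE, and USER_EMAIL are placeholders); stop and start the instance afterward, as in the last step:

# Set single user access mode metadata directly on the underlying VM.
gcloud compute instances add-metadata INSTANCE_NAME --zone=ZONE \
  --metadata=proxy-mode=mail,proxy-user-mail=USER_EMAIL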

Opening a notebook results in a 504 (Gateway Timeout) error

Issue

A 504 error indicates an internal proxy timeout or a backend server (Jupyter) timeout. This can happen when:

  • The request never reached the internal Inverting Proxy server.
  • The backend (Jupyter) returned a 504 error.

Solution

Open a Google support case.

Opening a notebook results in a 524 (A Timeout Occurred) error

Issue

The internal Inverting Proxy server hasn't received a response from the Inverting Proxy agent for the request within the timeout period. The Inverting Proxy agent runs inside your user-managed notebooks instance as a Docker container. A 524 error is usually an indication that the Inverting Proxy agent isn't connecting to the Inverting Proxy server or that requests are taking too long on the backend server side (Jupyter). A typical cause is on the user side, for example, a networking issue or the Inverting Proxy agent service isn't running.

Solution

If you can't access a notebook, verify that your user-managed notebooks instance is started and try the following:

Option 1: Run the diagnostic tool to automatically check and repair user-managed notebooks core services, verify available storage, and generate useful log files. To run the tool in your instance, perform the following steps:

  1. Make sure that your instance is on version M58 or newer.

  2. Connect to your Deep Learning VM Images instance using SSH.

  3. Run the following command:

    sudo /opt/deeplearning/bin/diagnostic_tool.sh [--repair] [--bucket=$BUCKET]
    

    Note that the --repair and --bucket flags are optional. The --repair flag attempts to fix common core service errors, and the --bucket flag lets you specify a Cloud Storage bucket for storing the created log files.

    The output of this command displays status messages for user-managed notebooks core services and exports log files of its findings.

Option 2: Check specific user-managed notebooks requirements individually. For the relevant commands, such as verifying that the Inverting Proxy agent, Docker service, and Jupyter service are running, see the Helpful procedures section of this page.

Opening a notebook results in a 598 (Network read timeout) error

Issue

The Inverting Proxy server hasn't heard from the Inverting Proxy agent at all for more than 10 minutes. This is a strong indication of an Inverting Proxy agent issue.

Solution

If you can't access a notebook, verify that your user-managed notebooks instance is started, and then verify that the Inverting Proxy agent is running and restart it if necessary. For the relevant commands, see the Helpful procedures section of this page.

Notebook is unresponsive

Issue

Your user-managed notebooks instance isn't running cells or appears to be frozen.

Solution

First try restarting the kernel by clicking Kernel from the top menu and then Restart Kernel. If that doesn't work, you can try the following:

  • Refresh the JupyterLab browser page. Any unsaved cell output doesn't persist, so you must run those cells again to regenerate the output.
  • From a terminal session in the notebook, run the command top to see if there are processes consuming the CPU.
  • From the terminal, check the amount of free disk space using the command df, or check the available RAM using the command free.
  • Shut your instance down by selecting it from the User-managed notebooks page and clicking Stop. After it has stopped completely, select it and click Start.

Migrating to Vertex AI Workbench instances

This section describes methods for diagnosing and resolving issues with migrating from a user-managed notebooks instance to a Vertex AI Workbench instance.

Can't find R, Beam, or other kernels that were in the user-managed notebooks instance

Issue

A kernel that was in your user-managed notebooks instance doesn't appear in the Vertex AI Workbench instance that you migrated to.

Some kernels, such as the R and Beam kernels, aren't available in Vertex AI Workbench instances by default. Migration of those kernels isn't supported.

Solution

To resolve this issue, add a conda environment to your Vertex AI Workbench instance.

Can't set up a Dataproc Hub instance in the Vertex AI Workbench instance

Issue

Dataproc Hub isn't supported in Vertex AI Workbench instances.

Solution

Continue to use Dataproc Hub in user-managed notebooks instances.

Different version of framework in migrated instance

Issue

A framework that was in your user-managed notebooks instance was a different version than the one in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances provide a default set of framework versions. The migration tool doesn't add framework versions from your original user-managed notebooks instance. See default migration tool behaviors.

Solution

To add a specific version of a framework, add a conda environment to your Vertex AI Workbench instance.

GPUs aren't migrated to the new Vertex AI Workbench instance

Issue

GPUs that were in your user-managed notebooks instance aren't in the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances support a default set of GPUs. If the GPUs in your original user-managed notebooks instance aren't available, your instance is migrated without any GPUs.

Solution

After migration, you can add GPUs to your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.

Migrated instance's machine type is different

Issue

The machine type of your user-managed notebooks instance is different from the Vertex AI Workbench instance that you migrated to.

Vertex AI Workbench instances don't support all machine types. If the machine type in your original user-managed notebooks instance isn't available, your instance is migrated to the e2-standard-4 machine type.

Solution

After migration, you can change the machine type of your Vertex AI Workbench instance by using the projects.locations.instances.patch method in the Notebooks API or the gcloud workbench instances update command in the Google Cloud SDK.

Working with files

This section describes troubleshooting issues with files for user-managed notebooks instances.

File downloading disabled but user can still download files

Issue

For Dataproc Hub user-managed notebooks instances, disabling file downloading from the JupyterLab user interface isn't supported. User-managed notebooks instances that use the Dataproc Hub framework permit file downloading even if you don't select Enable file downloading from JupyterLab UI when you create the instance.

Solution

Dataproc Hub user-managed notebooks instances don't support restricting file downloads.

Downloaded files are truncated or don't complete downloading

Issue

When you download files from your user-managed notebooks instance, a timeout setting on the proxy-forwarding agent limits the connection time for the download to complete. If the download takes too long, this can truncate your downloaded file or prevent it from being downloaded.

Solution

To download the file, copy your file to Cloud Storage, and then download the file from Cloud Storage.

Consider migrating your files and data to a new user-managed notebooks instance.

After restarting VM, local files can't be referenced from notebook terminal

Issue

Sometimes after restarting a user-managed notebooks instance, local files can't be referenced from within a notebook terminal.

Solution

This is a known issue. To reference your local files from within a notebook terminal, first re-establish your current working directory using the following command:

cd PWD

In this command, replace PWD with your current working directory. For example, if your current working directory was /home/jupyter/, use the command cd /home/jupyter/.

After re-establishing your current working directory, your local files can be referenced from within the notebook terminal.

Creating user-managed notebooks instances

This section describes troubleshooting issues with creating user-managed notebooks instances.

GPU quota has been exceeded

Issue

You are unable to create a user-managed notebooks instance with GPUs.

Solution

Determine the number of GPUs available in your project by checking the quotas page. If GPUs aren't listed on the quotas page, or you require additional GPU quota, you can request a quota increase. See Request a higher quota limit.

Instance stays in pending state indefinitely

Issue

After creating a user-managed notebooks instance, it stays in the pending state indefinitely. An error like the following might appear in the serial logs:

Could not resolve host: notebooks.googleapis.com

Solution

Your instance can't connect to the Notebooks API server due to a DNS configuration or other network issue. To resolve the issue, check your DNS and network configurations. For more information, see Network configuration options.

New user-managed notebooks instance isn't created (insufficient permissions)

Issue

It usually takes about a minute to create a user-managed notebooks instance. If your new user-managed notebooks instance remains in the pending state indefinitely, it might be because the service account used to start the user-managed notebooks instance doesn't have the required Editor permission in your Google Cloud project.

You can start a user-managed notebooks instance with a custom service account that you create or in single-user mode with a user ID. If you start a user-managed notebooks instance in single-user mode, then your instance begins the boot process using the Compute Engine default service account before turning control over to your user ID.

Solution

To verify that a service account has the appropriate permissions, follow these steps:

Console

  1. Open the IAM page in the Google Cloud console.

    Open the IAM page

  2. Determine the service account used with your user-managed notebooks instance, which is one of the following:

    • A custom service account that you specified when you created your user-managed notebooks instance.

    • The Compute Engine default service account for your Google Cloud project, which is used when you start your user-managed notebooks instance in single-user mode. The Compute Engine default service account for your Google Cloud project is named PROJECT_NUMBER-compute@developer.gserviceaccount.com. For example: 113377992299-compute@developer.gserviceaccount.com.

  3. Verify that your service account is in the Editor role.

  4. If not, edit the service account and add it to the Editor role.

For more information, see Granting, changing, and revoking access to resources in the IAM documentation.

gcloud

  1. If you haven't already, install the Google Cloud CLI.

  2. Get the name and project number for your Google Cloud project with the following command. Replace PROJECT_ID with the project ID for your Google Cloud project.

    gcloud projects describe PROJECT_ID
    

    You should see output similar to the following, which displays the name (name:) and project number (projectNumber:) for your project.

    createTime: '2018-10-18T21:03:31.408Z'
    lifecycleState: ACTIVE
    name: my-project-name
    parent:
     id: '396521612403'
     type: folder
    projectId: my-project-id-1234
    projectNumber: '113377992299'
    
  3. Determine the service account used with your user-managed notebooks instance, which is one of the following:

    • A custom service account that you specified when you created your user-managed notebooks instance.

    • The Compute Engine default service account for your Google Cloud project, which is used when you start your user-managed notebooks instance in single-user mode. The Compute Engine default service account for your Google Cloud project is named PROJECT_NUMBER-compute@developer.gserviceaccount.com. For example: 113377992299-compute@developer.gserviceaccount.com.

  4. Add the roles/editor role to the service account with the following command. Replace project-name with the name of your project, and replace service-account-id with the service account ID for your user-managed notebooks instance.

    gcloud projects add-iam-policy-binding project-name \
     --member serviceAccount:service-account-id \
     --role roles/editor
    

Creating an instance results in a Permission denied error

Issue

The service account on the instance provides access to other Google Cloud services. You can use any service account within the same project, but you must have the Service Account User permission (iam.serviceAccounts.actAs) to create the instance. If not specified, the Compute Engine default service account is used.

Solution

When creating an instance, verify that the user creating the instance has the iam.serviceAccounts.actAs permission for the defined service account.

The following example shows how to specify a service account when you create an instance:

gcloud notebooks instances create nb-1 \
  --vm-image-family=tf-latest-cpu \
  --service-account=your_service_account@project_id.iam.gserviceaccount.com \
  --location=us-west1-a

To grant the Service Account User role, see Manage access to service accounts.
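
For example, the following is a hedged sketch of granting that role to the user who creates the instance (the service account address matches the example above; USER_EMAIL is a placeholder):

# Grant the Service Account User role on the service account itself.
gcloud iam service-accounts add-iam-policy-binding \
  your_service_account@project_id.iam.gserviceaccount.com \
  --member="user:USER_EMAIL" \
  --role="roles/iam.serviceAccountUser"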

Creating an instance results in an already exists error

Issue

Creating an instance results in an already exists error. This can occur when a user-managed notebooks instance with the same name was previously deleted by Compute Engine but still exists in the Notebooks API database.

Solution

The following example shows how to list instances using the Notebooks API and verify their state.

gcloud notebooks instances list --location=LOCATION

If an instance's state is DELETED, run the following command to delete it permanently.

gcloud notebooks instances delete INSTANCE_NAME --location=LOCATION

Unable to create an instance in a Shared VPC

Issue

You're unable to create an instance in a Shared VPC.

Solution

If you are using Shared VPC, you must add the host and the service projects to the service perimeter. In the host project, you must also grant the Compute Network User role (roles/compute.networkUser) to the Notebooks Service Agent from the service project. For more information, see Managing service perimeters.

Creating an instance results in a resource availability error

Issue

You're unable to create an instance because of a resource availability error.

This error can look like the following:

Creating notebook INSTANCE_NAME: ZONE does not have enough
resources available to fulfill the request. Retry later or try another zone in
your configurations.

Resource errors occur when you request new resources in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to new resource requests in the zone and don't affect existing resources. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, you can try the following:

  • Create an instance with a different machine type.
  • Create the instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, try to create an instance with fewer GPUs, disks, or vCPUs, or with less memory.

Starting an instance results in a resource availability error

Issue

You're unable to start an instance because of a resource availability error.

This error can look like the following:

The zone ZONE_NAME doesn't have enough resources available to fulfill
the request. '(resource type:compute)'.

Resource errors occur when you try to start an instance in a zone that can't accommodate your request due to the current unavailability of Compute Engine resources, such as GPUs or CPUs.

Resource errors only apply to the resources you specified in your request at the time you sent the request, not to all resources in the zone. Resource errors aren't related to your Compute Engine quota. Resource errors are temporary and can change frequently based on fluctuating demand.

Solution

To proceed, you can try the following:

  • Change the machine type of your instance.
  • Migrate your files and data to an instance in a different zone.
  • Attempt the request again later.
  • Reduce the amount of resources that you're requesting. For example, start a different instance with fewer GPUs, disks, or vCPUs, or with less memory.

Upgrading user-managed notebooks instances

This section describes troubleshooting issues with upgrading user-managed notebooks instances.

Unable to upgrade because unable to get instance disk information

Issue

Upgrade isn't supported for single-disk user-managed notebooks instances.

Solution

You might want to migrate your user data to a new user-managed notebooks instance.

Unable to upgrade because instance isn't UEFI compatible

Issue

Vertex AI Workbench depends on UEFI compatibility to complete an upgrade.

User-managed notebooks instances created from some older images are not UEFI compatible, and therefore can't be upgraded.

Solution

To verify that your instance is UEFI compatible, type the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

gcloud compute instances describe INSTANCE_NAME \
  --zone=ZONE | grep type

Replace the following:

  • INSTANCE_NAME: the name of your instance
  • ZONE: the zone where your instance is located

To verify that the image that you used to create your instance is UEFI compatible, use the following command:

gcloud compute images describe VM_IMAGE_FAMILY \
  --project deeplearning-platform-release | grep type

Replace VM_IMAGE_FAMILY with the image family name that you used to create your instance.

If you determine that either your instance or image isn't UEFI compatible, you can attempt to migrate your user data to a new user-managed notebooks instance. To do so, complete the following steps:

  1. Verify that the image that you want to use to create your new instance is UEFI compatible. To do so, type the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

    gcloud compute images describe VM_IMAGE_FAMILY \
      --project deeplearning-platform-release --format=json | grep type
    

    Replace VM_IMAGE_FAMILY with the image family name that you want to use to create your instance.

  2. Migrate your user data to a new user-managed notebooks instance.

User-managed notebooks instance isn't accessible after upgrade

Issue

If the user-managed notebooks instance isn't accessible after an upgrade, there might have been a failure during the replacement of the boot disk image.

User-managed notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.

Solution

Complete the following steps to attach a new valid image to the boot disk.

  1. To store values you'll use to complete this procedure, type the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

    export INSTANCE_NAME=MY_INSTANCE_NAME
    export PROJECT_ID=MY_PROJECT_ID
    export ZONE=MY_ZONE
    

    Replace the following:

    • MY_INSTANCE_NAME: the name of your instance
    • MY_PROJECT_ID: your project ID
    • MY_ZONE: the zone where your instance is located
  2. Use the following command to stop the instance:

    gcloud compute instances stop $INSTANCE_NAME \
      --project=$PROJECT_ID --zone=$ZONE
    
  3. Detach the data disk from the instance.

    gcloud compute instances detach-disk $INSTANCE_NAME --device-name=data \
      --project=$PROJECT_ID --zone=$ZONE
    
  4. Delete the instance's VM.

    gcloud compute instances delete $INSTANCE_NAME --keep-disks=all --quiet \
      --project=$PROJECT_ID --zone=$ZONE
    
  5. Use the Notebooks API to delete the user-managed notebooks instance.

    gcloud notebooks instances delete $INSTANCE_NAME \
      --project=$PROJECT_ID --location=$ZONE
    
  6. Create a user-managed notebooks instance using the same name as your previous instance.

    gcloud notebooks instances create $INSTANCE_NAME \
      --vm-image-project="deeplearning-platform-release" \
      --vm-image-family=MY_VM_IMAGE_FAMILY \
      --instance-owners=MY_INSTANCE_OWNER \
      --machine-type=MY_MACHINE_TYPE \
      --service-account=MY_SERVICE_ACCOUNT \
      --accelerator-type=MY_ACCELERATOR_TYPE \
      --accelerator-core-count=MY_ACCELERATOR_CORE_COUNT \
      --install-gpu-driver \
      --project=$PROJECT_ID \
      --location=$ZONE
    

    Replace the following:

    • MY_VM_IMAGE_FAMILY: the image family name
    • MY_INSTANCE_OWNER: your instance owner
    • MY_MACHINE_TYPE: the machine type of your instance's VM
    • MY_SERVICE_ACCOUNT: the service account to use with this instance, or use "default"
    • MY_ACCELERATOR_TYPE: the accelerator type; for example, "NVIDIA_TESLA_T4"
    • MY_ACCELERATOR_CORE_COUNT: the core count; for example, 1

Monitoring health status of user-managed notebooks instances

This section describes how to troubleshoot issues with monitoring health status errors.

docker-proxy-agent status failure

Follow these steps after a docker-proxy-agent status failure:

  1. Verify that the Inverting Proxy agent is running. If not, go to step 3.

  2. Restart the Inverting Proxy agent.

  3. Re-register with the Inverting Proxy server.

docker-service status failure

Follow these steps after a docker-service status failure:

  1. Verify that the Docker service is running.

  2. Restart the Docker service.

jupyter-service status failure

Follow these steps after a jupyter-service status failure:

  1. Verify that the Jupyter service is running.

  2. Restart the Jupyter service.

jupyter-api status failure

Follow these steps after a jupyter-api status failure:

  1. Verify that the Jupyter internal API is active.

  2. Restart the Jupyter service.

Boot disk utilization percent

The boot disk space status is unhealthy if the disk space is greater than 85% full.

If your boot disk space status is unhealthy, try the following:

  1. From a terminal session in the user-managed notebooks instance or using ssh to connect, check the amount of free disk space using the command df -H.

  2. Use the command find . -type f -size +100M to help you find large files that you might be able to delete, but don't delete them unless you are sure you can safely do so. If you aren't sure, you can get help from support.

  3. If the previous steps don't solve your problem, get support.

Data disk utilization percent

The data disk space status is unhealthy if the disk space is greater than 85% full.

If your data disk space status is unhealthy, try the following:

  1. From a terminal session in the user-managed notebooks instance or using ssh to connect, check the amount of free disk space using the command df -h -T /home/jupyter.

  2. Delete large files to increase the available disk space. Use the command find . -type f -size +100M to help you find large files.

  3. If the previous steps don't solve your problem, get support.

Unable to install third-party JupyterLab extension

Issue

Attempting to install a third-party JupyterLab extension results in an Error: 500 message.

Solution

Third-party JupyterLab extensions aren't supported in user-managed notebooks instances.

Restore instance

Issue

Restoring a user-managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub or make a snapshot of the disk.

Recover data from an instance

Issue

Recovering data from a user-managed notebooks instance after it's been deleted isn't supported.

Solution

To back up the data on your instance, you can save your notebooks to GitHub or make a snapshot of the disk.

Unable to increase shared memory

Issue

You can't increase shared memory on an existing user-managed notebooks instance.

Solution

You can specify a shared memory size when you create a new user-managed notebooks instance by using the container-custom-params metadata key, with a value like the following:

--shm-size=SHARED_MEMORY_SIZEgb

Replace SHARED_MEMORY_SIZE with the size that you want in GB.
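
For example, the following is a hedged sketch of creating a container-based instance with a 16 GB shared memory size (the container image and other values are illustrative placeholders):

# Pass the shared memory size to the container through instance metadata.
gcloud notebooks instances create nb-shm \
  --container-repository=gcr.io/deeplearning-platform-release/base-cpu \
  --metadata=container-custom-params="--shm-size=16gb" \
  --location=us-west1-a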

Helpful procedures

This section describes procedures that you might find helpful.

Use SSH to connect to your user-managed notebooks instance

Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.

gcloud compute ssh --project PROJECT_ID \
  --zone ZONE \
  INSTANCE_NAME -- -L 8080:localhost:8080

Replace the following:

  • PROJECT_ID: Your project ID
  • ZONE: The Google Cloud zone where your instance is located
  • INSTANCE_NAME: The name of your instance

Re-register with the Inverting Proxy server

To re-register the user-managed notebooks instance with the internal Inverting Proxy server, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh

Verify the Docker service status

To verify the Docker service status you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker status

Verify that the Inverting Proxy agent is running

To verify if the notebook Inverting Proxy agent is running, use ssh to connect to your user-managed notebooks instance and enter:

# Confirm Inverting Proxy agent Docker container is running (proxy-agent)
sudo docker ps

# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent

# Grab logs
sudo docker logs proxy-agent

Verify the Jupyter service status and collect logs

To verify the Jupyter service status you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter status

To collect Jupyter service logs:

sudo journalctl -u jupyter.service --no-pager

Verify that the Jupyter internal API is active

To verify that the Jupyter internal API is active you can use ssh to connect to your user-managed notebooks instance and enter:

curl http://127.0.0.1:8080/api/kernelspecs

Restart the Docker service

To restart the Docker service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service docker restart

Restart the Inverting Proxy agent

To restart the Inverting Proxy agent, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo docker restart proxy-agent

Restart the Jupyter service

To restart the Jupyter service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:

sudo service jupyter restart

Make a copy of the user data on an instance

To store a copy of an instance's user data in Cloud Storage, complete the following steps.

Create a Cloud Storage bucket (optional)

In the same project where your instance is located, create a Cloud Storage bucket where you can store your user data. If you already have a Cloud Storage bucket, skip this step.

  • Create a Cloud Storage bucket:

    gcloud storage buckets create gs://BUCKET_NAME

    Replace BUCKET_NAME with a bucket name that meets the bucket naming requirements.

Copy your user data

  1. In your instance's JupyterLab interface, select File > New > Terminal to open a terminal window. For user-managed notebooks instances, you can instead connect to your instance's terminal by using SSH.

  2. Use the gsutil tool to copy your user data to a Cloud Storage bucket. The following example command copies all of the files from your instance's /home/jupyter/ directory to a directory in a Cloud Storage bucket.

    gsutil cp -R /home/jupyter/* gs://BUCKET_NAMEPATH
    

    Replace the following:

    • BUCKET_NAME: the name of your Cloud Storage bucket
    • PATH: the path to the directory where you want to copy your files, for example: /copy/jupyter/