RUNNING or ERROR state.
Rebooting a persistent resource lets you recover from errors that the persistent
resource can't recover from on its own. You can also reboot a persistent
resource to manually obtain more up-to-date clusters. This page shows you how to
reboot a persistent resource by using the Google Cloud console, the Google Cloud
CLI, and the REST API.
Required roles
To get the permission that you need to reboot a persistent resource,
ask your administrator to grant you the
Vertex AI Administrator (roles/aiplatform.admin) IAM role on your project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the aiplatform.persistentResources.update permission,
which is required to reboot a persistent resource.
You might also be able to get this permission with custom roles or other predefined roles.
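An administrator can grant this role from the command line. The following is a minimal sketch that wraps the gcloud projects add-iam-policy-binding command in a shell function. The function name grant_vertex_ai_admin and the member value user:EMAIL_ADDRESS are illustrative assumptions, not names from this page.

```shell
#!/usr/bin/env bash
# Sketch: grant the Vertex AI Administrator role on a project.
# grant_vertex_ai_admin is a hypothetical helper name used for illustration.
grant_vertex_ai_admin() {
  local project_id="$1"   # project that owns the persistent resource
  local member="$2"       # for example "user:EMAIL_ADDRESS" (assumed format)
  gcloud projects add-iam-policy-binding "$project_id" \
    --member="$member" \
    --role="roles/aiplatform.admin"
}

# Example (not run here):
# grant_vertex_ai_admin PROJECT_ID "user:EMAIL_ADDRESS"
```

The caller needs permission to change the IAM policy of the project for this command to succeed.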
Reboot a persistent resource
Select one of the following tabs for instructions on how to reboot a persistent resource. Before you reboot, make sure that no training jobs are running on the persistent resource.
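One possible way to check for running training jobs is to list custom jobs in the same region and filter on their state. The sketch below assumes that the jobs were submitted as custom jobs and that the filter keys state and jobSpec.persistentResourceId match the fields returned by the API. Treat the filter expression as an assumption to verify against your own job listing. The function name list_running_jobs is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the names of running custom jobs that target a persistent resource.
# The --filter expression is an assumption; adjust it if your jobs use other fields.
list_running_jobs() {
  local project_id="$1" location="$2" resource_id="$3"
  gcloud ai custom-jobs list \
    --project="$project_id" \
    --region="$location" \
    --filter="state=JOB_STATE_RUNNING AND jobSpec.persistentResourceId=${resource_id}" \
    --format="value(name)"
}

# Example (not run here):
# list_running_jobs PROJECT_ID LOCATION PERSISTENT_RESOURCE_ID
```

If the command prints nothing, no matching jobs were found under these assumptions.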
Console
To reboot a persistent resource in the Google Cloud console, do the following:
In the Google Cloud console, go to the Persistent resources page.
Next to the name of the persistent resource that you want to reboot, click the vertical ellipsis.
Click Reboot.
Click Confirm.
gcloud
Before using any of the command data below, make the following replacements:
- PROJECT_ID: The project ID of the persistent resource that you want to reboot.
- LOCATION: The region of the persistent resource that you want to reboot.
- PERSISTENT_RESOURCE_ID: The ID of the persistent resource that you want to reboot.
Execute the following command:
Linux, macOS, or Cloud Shell
gcloud ai persistent-resources reboot PERSISTENT_RESOURCE_ID \
    --project=PROJECT_ID \
    --region=LOCATION
Windows (PowerShell)
gcloud ai persistent-resources reboot PERSISTENT_RESOURCE_ID `
    --project=PROJECT_ID `
    --region=LOCATION
Windows (cmd.exe)
gcloud ai persistent-resources reboot PERSISTENT_RESOURCE_ID ^
    --project=PROJECT_ID ^
    --region=LOCATION
You should receive a response similar to the following:
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Request to reboot the PersistentResource [projects/sample-project/locations/us-central1/persistentResources/test-persistent-resource] has been sent.

You may view the status of your persistent resource with the command

  $ gcloud ai persistent-resources describe projects/sample-project/locations/us-central1/persistentResources/test-persistent-resource
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: The project ID of the persistent resource that you want to reboot.
- LOCATION: The region of the persistent resource that you want to reboot.
- PERSISTENT_RESOURCE_ID: The ID of the persistent resource that you want to reboot.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/persistentResources/PERSISTENT_RESOURCE_ID:reboot
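One way to send this request is with curl. The following sketch assumes that the gcloud CLI is installed and authenticated so that it can supply an access token, that the request body is empty, and that the regional endpoint follows the LOCATION-aiplatform.googleapis.com pattern. The function names reboot_url and reboot_persistent_resource are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: send the reboot request with curl.

# Build the :reboot URL from the three placeholders described above.
# Assumes the regional endpoint pattern LOCATION-aiplatform.googleapis.com.
reboot_url() {
  local project_id="$1" location="$2" resource_id="$3"
  printf 'https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/persistentResources/%s:reboot\n' \
    "$location" "$project_id" "$location" "$resource_id"
}

# POST an empty body; the access token comes from the gcloud CLI.
reboot_persistent_resource() {
  curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d "" \
    "$(reboot_url "$1" "$2" "$3")"
}

# Example (not run here):
# reboot_persistent_resource PROJECT_ID LOCATION PERSISTENT_RESOURCE_ID
```

Keeping the URL construction in its own function makes it easy to print and inspect the URL before sending anything.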
You should receive a JSON response similar to the following:
{
  "name": "projects/123456789012/locations/us-central1/persistentResources/test-persistent-resource/operations/1234567890123456789",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.RebootPersistentResourceOperationMetadata",
    "genericMetadata": {
      "createTime": "2024-03-18T17:31:54.955004Z",
      "updateTime": "2024-03-18T17:31:55.204817Z",
      "state": "RUNNING",
      "worksOn": [
        "projects/123456789012/locations/us-central1/persistentResources/test-persistent-resource"
      ]
    },
    "progressMessage": "Waiting for persistent resource shut down."
  }
}
Rebooting a persistent resource is a long-running operation, during which the
persistent resource can't be deleted. The operation contains a progressMessage
field that is populated with an error status if an error occurs. After the
operation indicates "done": true, check the status of the persistent resource.
If the persistent resource is in the RUNNING state, the reboot succeeded and the
resource is ready to run training jobs.
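The wait-then-check sequence described above can be sketched as a small polling loop. This is an illustration under stated assumptions, not a definitive implementation. It assumes that the operation can be fetched with a GET request on the operation name returned by the reboot call, that a plain-text match on "done": true is good enough (a JSON parser would be more robust), and that a 30-second interval with 40 attempts is acceptable. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wait for the reboot operation to finish, then print the resource state.

# Succeed if the operation JSON on stdin contains "done": true.
operation_done() {
  grep -Eq '"done"[[:space:]]*:[[:space:]]*true'
}

# Fetch the operation once. operation_name is the "name" field of the reboot response.
get_operation() {
  local location="$1" operation_name="$2"
  curl -sS \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://${location}-aiplatform.googleapis.com/v1/${operation_name}"
}

# Poll until the operation is done, then print the persistent resource state.
wait_for_reboot() {
  local project_id="$1" location="$2" resource_id="$3" operation_name="$4"
  local max_attempts="${5:-40}" attempt=0
  until get_operation "$location" "$operation_name" | operation_done; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Timed out waiting for the reboot operation." >&2
      return 1
    fi
    sleep 30   # polling interval is an arbitrary choice
  done
  gcloud ai persistent-resources describe "$resource_id" \
    --project="$project_id" --region="$location" --format="value(state)"
}

# Example (not run here):
# wait_for_reboot PROJECT_ID LOCATION PERSISTENT_RESOURCE_ID OPERATION_NAME
```

If the printed state is RUNNING, the resource is ready for training jobs. Any other state, such as ERROR, calls for inspecting the progressMessage field of the operation.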
Limitations
The following are limitations for rebooting a persistent resource:
- In some cases, it's possible to lose capacity of scarce resources when rebooting a persistent resource. Full resource retention is not guaranteed.
- Reboot is not available on Ray on Vertex AI.
- Persistent resources containing autoscaled worker pools reboot with the minimum replica count.