Manage cluster

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

Once a cluster is deployed, you can manage its full lifecycle using the following REST API endpoints.

  • List: Views all active clusters in your project.
  • Get: Retrieves detailed information for a specific cluster.
  • Update: Modifies an existing cluster configuration.
  • Delete: Permanently removes a cluster and its resources.

Authentication

alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'

List clusters:

    gcurl -X GET https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters
    

The list method supports the following optional query parameters to control pagination.

  • pageSize (integer, optional): The maximum number of clusters to return in the response. The service may return fewer than this value, even if more items exist. If unspecified, a default page size will be used.
  • pageToken (string, optional): A token received from a previous list call. Provide this token to retrieve the subsequent page of results.
The following example requests the first page of results with a maximum of 5 clusters per page.

    gcurl "https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters?pageSize=5"
    

A successful response contains a clusters array with up to five cluster resources and may also include a nextPageToken string.

Get a cluster:

    gcurl -X GET https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID
    

Update a cluster:

UPDATE_PAYLOAD specifies the local path to a JSON file that defines the full ModelDevelopmentCluster you want to update to.

For example, to update the node count of a pool of a CPU-only cluster, use the following JSON payload:

{
  "display_name": "DISPLAY_NAME",
  "network": {
    "network": "projects/PROJECT_ID/global/networks/NETWORK",
    "subnetwork": "projects/PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK"
  },
  "node_pools": [
    {
      "id": "cpu",
      "machine_spec": {
        "machine_type": "n2-standard-8"
      },
      "scaling_spec": {
        "min_node_count": UPDATED_MIN_NODE_COUNT,
        "max_node_count": UPDATED_MAX_NODE_COUNT
      },
      "zone": "ZONE",
      "enable_public_ips": true,
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 120
      }
    },
    {
      "id": "login",
      "machine_spec": {
        "machine_type": "n2-standard-8",
      },
      "scaling_spec": {
        "min_node_count": 1,
        "max_node_count": 1
      },
         "zone": "ZONE",
         "enable_public_ips": true,
      "boot_disk": {
        "boot_disk_type": "pd-standard",
        "boot_disk_size_gb": 120
      }
    },
  ],
  "orchestrator_spec": {
      "slurm_spec": {
      "home_directory_storage": "projects/PROJECT_ID/locations/ZONE/instances/FILESTORE",
      "partitions": [
         {
          "id": "cpu",
          "node_pool_ids": [
            "cpu"
          ]
        }
      ],
      "login_node_pool_id": "login"
    }
  }
}

gcurl -X PATCH -d @UPDATE_PAYLOAD https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID
Query Parameters
  • updateMask (string, optional): A FieldMask that specifies which fields of the Model Development cluster resource to update. Only the fields listed in the updateMask are changed.

    The following fields within the ModelDevelopmentCluster resource can be specified in the updateMask:

    • node_pools
    • orchestrator_spec.slurm_spec.partitions
    • orchestrator_spec.slurm_spec.login_node_pool_id
    • orchestrator_spec.slurm_spec.prolog_bash_scripts
    • orchestrator_spec.slurm_spec.epilog_bash_scripts

The command below updates both the node pool configuration and the Slurm partitions.

gcurl -X PATCH -d @update-payload.json
https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID?updateMask=orchestrator_spec.slurm_spec.partitions,node_pools
Important note on repeated fields

For repeated fields, such as node_pools, prolog_bash_scripts, and epilog_bash_scripts, the API only supports a full replacement operation. The user must provide the entire, expected list of items in the request payload to replace the existing list completely.

Post-update behavior and cluster status

A successful request returns a Long Running Operation (LRO). You can then monitor the status of this operation using the following command:

gcurl https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID
Once the update operation completes successfully, most modifications take effect immediately (or within the 15s Slurm sync time) and require no further action. Only changes to the underlying cluster compute infrastructure—including modifications to storage sources and startup scripts—require the resources to be recreated.

Delete a cluster:

  gcurl -X DELETE https://REGION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/REGION/modelDevelopmentClusters/CLUSTER_ID
  

This command returns a Long-Running Operation on success, which you can then monitor using the operations describe command.

gcurl https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/operations/OPERATION_ID