Run custom training jobs on a persistent resource

This page shows you how to run a custom training job on a persistent resource by using the Vertex AI API or the Google Cloud CLI.

Normally, when you create a custom training job, you need to specify the compute resources that Vertex AI creates for the job and that the job runs on. After you create a persistent resource, you can instead configure the custom training job to run on one or more resource pools of that persistent resource. Running a custom training job on a persistent resource significantly reduces the job startup time that's otherwise needed to create compute resources.

Required roles

To get the permission that you need to run custom training jobs on a persistent resource, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project. For more information about granting roles, see Manage access.

This predefined role contains the aiplatform.customJobs.create permission, which is required to run custom training jobs on a persistent resource.

You might also be able to get this permission with custom roles or other predefined roles.
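For reference, granting this role with the Google Cloud CLI might look like the following. This is a minimal sketch: PROJECT_ID and USER_EMAIL are placeholders for your project ID and the user to grant the role to.

```bash
# Minimal sketch; PROJECT_ID and USER_EMAIL are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/aiplatform.user"
```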

Create a training job that runs on a persistent resource

To create a custom training job that runs on a persistent resource, make the following modifications to the standard instructions for creating a custom training job:

gcloud

  • Specify the --persistent-resource-id flag and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
  • Specify the --worker-pool-spec flag such that the values for machine-type and disk-type exactly match those of a corresponding resource pool from the persistent resource. Specify one --worker-pool-spec for single-node training and multiple for distributed training.
  • Specify a replica-count less than or equal to the replica-count or max-replica-count of the corresponding resource pool (see the sample command after this list).
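For example, a job that targets a single resource pool of the persistent resource might look like the following command. This is a minimal sketch: the region, display name, persistent resource ID, container image URI, and machine and disk types are placeholders, and the machine-type and disk-type values are assumed to match an existing resource pool on the persistent resource.

```bash
# Minimal sketch: all values below are placeholders; machine-type and
# disk-type are assumed to match a resource pool of the persistent resource.
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=my-training-job \
  --persistent-resource-id=my-persistent-resource \
  --worker-pool-spec=machine-type=n1-standard-4,disk-type=pd-ssd,replica-count=1,container-image-uri=us-docker.pkg.dev/my-project/my-repo/trainer:latest
```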

REST

  • Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
  • Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each worker pool match exactly those of a corresponding resource pool from the persistent resource. Specify one machine_spec for single-node training and multiple for distributed training.
  • Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool (see the sample request after this list).
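A request to the customJobs.create REST endpoint might look like the following. This is a minimal sketch: LOCATION, PROJECT_ID, the display name, persistent resource ID, and image URI are placeholders, and the machineSpec and diskSpec values are assumed to match an existing resource pool on the persistent resource.

```bash
# Minimal sketch; LOCATION, PROJECT_ID, and the values in the request body
# are placeholders. machineSpec and diskSpec are assumed to match a resource
# pool of the persistent resource.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs" \
  -d '{
    "displayName": "my-training-job",
    "jobSpec": {
      "persistentResourceId": "my-persistent-resource",
      "workerPoolSpecs": [
        {
          "machineSpec": { "machineType": "n1-standard-4" },
          "diskSpec": { "bootDiskType": "pd-ssd", "bootDiskSizeGb": 100 },
          "replicaCount": 1,
          "containerSpec": {
            "imageUri": "us-docker.pkg.dev/PROJECT_ID/my-repo/trainer:latest"
          }
        }
      ]
    }
  }'
```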

What's next