This page shows you how to run a custom training job on a persistent resource by using the Vertex AI API or the Google Cloud CLI.
Normally, when you create a custom training job, you need to specify compute resources that the job creates and runs on. After you create a persistent resource, you can instead configure the custom training job to run on one or more resource pools of that persistent resource. Running a custom training job on a persistent resource significantly reduces the job startup time that's otherwise needed for compute resource creation.
Required roles
To get the permission that you need to run custom training jobs on a persistent resource, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project.
For more information about granting roles, see Manage access.
This predefined role contains the aiplatform.customJobs.create permission, which is required to run custom training jobs on a persistent resource.
You might also be able to get this permission with custom roles or other predefined roles.
Create a training job that runs on a persistent resource
To create a custom training job that runs on a persistent resource, make the following modifications to the standard instructions for creating a custom training job:
gcloud
- Specify the --persistent-resource-id flag and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the --worker-pool-spec flag such that the values for machine-type and disk-type match exactly with a corresponding resource pool from the persistent resource. Specify one --worker-pool-spec for single node training and multiple for distributed training.
- Specify a replica-count less than or equal to the replica-count or max-replica-count of the corresponding resource pool.
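The flags above can be sketched as a small Python helper that assembles the gcloud command line. This is only an illustration: the helper name build_custom_job_command, the region, the display name, the resource ID, and the container image URI are placeholder assumptions, and the worker pool keys follow the names used in the list above. It emits one --worker-pool-spec per pool, so passing several pools corresponds to distributed training.

```python
import shlex


def build_custom_job_command(region, display_name, persistent_resource_id, worker_pools):
    """Return the argv list for a `gcloud ai custom-jobs create` call (hypothetical helper)."""
    if not worker_pools:
        raise ValueError("at least one worker pool is required")
    argv = [
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
        # Points the job at an existing persistent resource.
        f"--persistent-resource-id={persistent_resource_id}",
    ]
    # One --worker-pool-spec flag per pool: one for single node, several for distributed.
    for pool in worker_pools:
        spec = ",".join(f"{key}={value}" for key, value in pool.items())
        argv.append(f"--worker-pool-spec={spec}")
    return argv


# All values below are placeholders, not values from this page.
command = build_custom_job_command(
    region="us-central1",
    display_name="example-job",
    persistent_resource_id="my-persistent-resource",
    worker_pools=[{
        "machine-type": "n1-standard-4",   # must match the resource pool
        "disk-type": "pd-ssd",             # must match the resource pool
        "replica-count": 1,                # must not exceed the pool's replica count
        "container-image-uri": "gcr.io/my-project/my-trainer:latest",
    }],
)
print(shlex.join(command))
```

Printing the command rather than running it lets you review the flags before you submit the job.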
REST
- Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool match exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single node training and multiple for distributed training.
- Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.
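As a rough sketch of the REST modifications above, the following Python snippet builds a request body and checks it against a resource pool before you send it. Several things here are assumptions rather than facts from this page: the camelCase JSON field names (the list above uses snake_case), the helper names build_custom_job_body and fits_resource_pool, the simplified dictionary that describes the resource pool (including its flat maxReplicaCount key), and every sample value.

```python
import json


def build_custom_job_body(display_name, persistent_resource_id, worker_pool_specs):
    """Assemble a request body for a custom job on a persistent resource (hypothetical helper)."""
    return {
        "displayName": display_name,
        "jobSpec": {
            "persistentResourceId": persistent_resource_id,
            "workerPoolSpecs": worker_pool_specs,
        },
    }


def fits_resource_pool(worker_pool_spec, resource_pool, replicas_in_use=0):
    """Check the machine and disk match and that the replica count fits the free capacity."""
    same_machine = worker_pool_spec["machineSpec"] == resource_pool["machineSpec"]
    same_disk = worker_pool_spec["diskSpec"] == resource_pool["diskSpec"]
    # Use the autoscaling maximum when present, otherwise the fixed replica count.
    capacity = resource_pool.get("maxReplicaCount", resource_pool["replicaCount"])
    # Replicas held by other jobs on the same pool are not available.
    free = capacity - replicas_in_use
    return same_machine and same_disk and worker_pool_spec["replicaCount"] <= free


# Hypothetical, simplified description of one resource pool.
pool = {
    "machineSpec": {"machineType": "n1-standard-4"},
    "diskSpec": {"bootDiskType": "pd-ssd", "bootDiskSizeGb": 100},
    "replicaCount": 4,
}

# Worker pool for the job; machineSpec and diskSpec mirror the pool exactly.
spec = {
    "machineSpec": {"machineType": "n1-standard-4"},
    "diskSpec": {"bootDiskType": "pd-ssd", "bootDiskSizeGb": 100},
    "replicaCount": 2,
    "containerSpec": {"imageUri": "gcr.io/my-project/my-trainer:latest"},
}

if not fits_resource_pool(spec, pool, replicas_in_use=1):
    raise ValueError("worker pool spec does not fit the resource pool")

body = build_custom_job_body("example-job", "my-persistent-resource", [spec])
print(json.dumps(body, indent=2))
```

The local check only mirrors the rules in the list above; the service remains the authority on whether a job fits, so treat a passing check as a sanity test and not a guarantee.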
What's next
- Learn about persistent resources.
- Create and use a persistent resource.
- Get information about a persistent resource.
- Delete a persistent resource.