Run custom training jobs on a persistent resource

This page shows you how to run a custom training job on a persistent resource by using the Google Cloud CLI, the Vertex AI SDK for Python, or the REST API.

Normally, when you create a custom training job, you need to specify the compute resources that are created for the job and that the job runs on. After you create a persistent resource, you can instead configure the custom training job to run on one or more resource pools of that persistent resource. Running a custom training job on a persistent resource significantly reduces the job startup time that's otherwise needed for compute resource creation.

Required roles

To get the permission that you need to run custom training jobs on a persistent resource, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project. For more information about granting roles, see Manage access.

This predefined role contains the aiplatform.customJobs.create permission, which is required to run custom training jobs on a persistent resource.

You might also be able to get this permission with custom roles or other predefined roles.
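
For example, an administrator can grant the role with the Google Cloud CLI. This is a minimal sketch; the project ID and the user email shown are placeholders that you replace with your own values.

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/aiplatform.user"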

Create a training job that runs on a persistent resource

To create a custom training job that runs on a persistent resource, make the following modifications to the standard instructions for creating a custom training job:

gcloud

  • Specify the --persistent-resource-id flag and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
  • Specify the --worker-pool-spec flag so that the values for machine-type and disk-type exactly match those of a corresponding resource pool from the persistent resource. Specify one --worker-pool-spec for single-node training and multiple for distributed training.
  • Specify a replica-count less than or equal to the replica-count or max-replica-count of the corresponding resource pool. A minimal example command follows this list.
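
For example, a command might look like the following sketch. The uppercase values are placeholders, the container-image-uri key is illustrative of one way to supply the training container, and the machine type you specify must match a resource pool of the persistent resource.

gcloud ai custom-jobs create \
    --region=LOCATION \
    --display-name=JOB_NAME \
    --persistent-resource-id=PERSISTENT_RESOURCE_ID \
    --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI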

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from typing import Optional

from google.cloud import aiplatform


def create_custom_job_on_persistent_resource_sample(
    project: str,
    location: str,
    staging_bucket: str,
    display_name: str,
    container_uri: str,
    persistent_resource_id: str,
    service_account: Optional[str] = None,
) -> None:
    # Initialize the Vertex AI SDK with your project, region, and staging bucket.
    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )

    # The machine_spec must match a resource pool of the persistent resource.
    worker_pool_specs = [{
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": "NVIDIA_TESLA_K80",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": container_uri,
            "command": [],
            "args": [],
        },
    }]

    # Setting persistent_resource_id runs the job on the persistent resource
    # instead of provisioning new compute resources for the job.
    custom_job = aiplatform.CustomJob(
        display_name=display_name,
        worker_pool_specs=worker_pool_specs,
        persistent_resource_id=persistent_resource_id,
    )

    custom_job.run(service_account=service_account)
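
For example, you might call the sample as follows. All of the values shown are hypothetical placeholders; substitute your own project, bucket, container image, and persistent resource ID.

create_custom_job_on_persistent_resource_sample(
    project="my-project",  # hypothetical project ID
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # hypothetical staging bucket
    display_name="my-custom-job",
    container_uri="us-docker.pkg.dev/my-project/my-repo/my-training-image",  # hypothetical image
    persistent_resource_id="my-persistent-resource",  # hypothetical persistent resource ID
)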

REST

  • Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
  • Specify the worker_pool_specs parameter so that the values of machine_spec and disk_spec for each worker pool exactly match those of a corresponding resource pool from the persistent resource. Specify one machine_spec for single-node training and multiple for distributed training.
  • Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, minus the replicas used by any other jobs running on that resource pool. A sketch of the request body follows this list.
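
For example, a request body sent to the customJobs create method might look like the following sketch. The uppercase values are placeholders, the field names assume the camelCase JSON form of the parameters listed above, and the machineSpec must match a resource pool of the persistent resource.

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "persistentResourceId": "PERSISTENT_RESOURCE_ID",
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "n1-standard-4"
        },
        "replicaCount": 1,
        "containerSpec": {
          "imageUri": "CUSTOM_CONTAINER_IMAGE_URI"
        }
      }
    ]
  }
}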

What's next