Configure compute resources for custom training

When you perform custom training, your training code runs on one or more virtual machine (VM) instances. You can configure what types of VM to use for training: using VMs with more compute resources can speed up training and let you work with larger datasets, but they can also incur greater training costs.

In some cases, you can additionally use GPUs to accelerate training. GPUs incur additional costs.

You can also optionally customize the type and size of your training VMs' boot disks.

This document describes the different compute resources that you can use for custom training and how to configure them.

Where to specify compute resources

Specify configuration details within a WorkerPoolSpec. Depending on how you perform custom training, put this WorkerPoolSpec in one of the following API fields:

If you are performing distributed training, you can use different settings for each worker pool.

Machine types

In your WorkerPoolSpec, you must specify one of the following machine types in the machineSpec.machineType field. Each replica in the worker pool runs on a separate VM that has the specified machine type.

  • a2-ultragpu-1g*
  • a2-ultragpu-2g*
  • a2-ultragpu-4g*
  • a2-ultragpu-8g*
  • a2-highgpu-1g*
  • a2-highgpu-2g*
  • a2-highgpu-4g*
  • a2-highgpu-8g*
  • a2-megagpu-16g*
  • a3-highgpu-8g*
  • e2-standard-4
  • e2-standard-8
  • e2-standard-16
  • e2-standard-32
  • e2-highmem-2
  • e2-highmem-4
  • e2-highmem-8
  • e2-highmem-16
  • e2-highcpu-16
  • e2-highcpu-32
  • n2-standard-4
  • n2-standard-8
  • n2-standard-16
  • n2-standard-32
  • n2-standard-48
  • n2-standard-64
  • n2-standard-80
  • n2-highmem-2
  • n2-highmem-4
  • n2-highmem-8
  • n2-highmem-16
  • n2-highmem-32
  • n2-highmem-48
  • n2-highmem-64
  • n2-highmem-80
  • n2-highcpu-16
  • n2-highcpu-32
  • n2-highcpu-48
  • n2-highcpu-64
  • n2-highcpu-80
  • n1-standard-4
  • n1-standard-8
  • n1-standard-16
  • n1-standard-32
  • n1-standard-64
  • n1-standard-96
  • n1-highmem-2
  • n1-highmem-4
  • n1-highmem-8
  • n1-highmem-16
  • n1-highmem-32
  • n1-highmem-64
  • n1-highmem-96
  • n1-highcpu-16
  • n1-highcpu-32
  • n1-highcpu-64
  • n1-highcpu-96
  • c2-standard-4
  • c2-standard-8
  • c2-standard-16
  • c2-standard-30
  • c2-standard-60
  • m1-ultramem-40
  • m1-ultramem-80
  • m1-ultramem-160
  • m1-megamem-96
  • g2-standard-4*
  • g2-standard-8*
  • g2-standard-12*
  • g2-standard-16*
  • g2-standard-24*
  • g2-standard-32*
  • g2-standard-48*
  • g2-standard-96*
  • cloud-tpu*

* Machine types marked with asterisks in the preceding list must be used with certain GPUs or TPUs. See the following sections of this guide.

To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types. To learn about the cost of using each machine type for custom training, read Pricing.

The following examples highlight where you specify a machine type when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, specify a machine type for each worker pool on the Compute and pricing step, in the Machine type field.

gcloud

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.cloud.aiplatform.v1.AcceleratorType;
import com.google.cloud.aiplatform.v1.ContainerSpec;
import com.google.cloud.aiplatform.v1.CustomJob;
import com.google.cloud.aiplatform.v1.CustomJobSpec;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.WorkerPoolSpec;
import java.io.IOException;

// Create a custom job to run machine learning training code in Vertex AI
public class CreateCustomJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";

    // Vertex AI runs your training application in a Docker container image. A Docker container
    // image is a self-contained software package that includes code and all dependencies. Learn
    // more about preparing your training application at
    // https://cloud.google.com/vertex-ai/docs/training/overview#prepare_your_training_application
    String containerImageUri = "CONTAINER_IMAGE_URI";
    createCustomJobSample(project, displayName, containerImageUri);
  }

  static void createCustomJobSample(String project, String displayName, String containerImageUri)
      throws IOException {
    JobServiceSettings settings =
        JobServiceSettings.newBuilder()
            .setEndpoint("us-central1-aiplatform.googleapis.com:443")
            .build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (JobServiceClient client = JobServiceClient.create(settings)) {
      MachineSpec machineSpec =
          MachineSpec.newBuilder()
              .setMachineType("n1-standard-4")
              .setAcceleratorType(AcceleratorType.NVIDIA_TESLA_K80)
              .setAcceleratorCount(1)
              .build();

      ContainerSpec containerSpec =
          ContainerSpec.newBuilder().setImageUri(containerImageUri).build();

      WorkerPoolSpec workerPoolSpec =
          WorkerPoolSpec.newBuilder()
              .setMachineSpec(machineSpec)
              .setReplicaCount(1)
              .setContainerSpec(containerSpec)
              .build();

      CustomJobSpec customJobSpecJobSpec =
          CustomJobSpec.newBuilder().addWorkerPoolSpecs(workerPoolSpec).build();

      CustomJob customJob =
          CustomJob.newBuilder()
              .setDisplayName(displayName)
              .setJobSpec(customJobSpecJobSpec)
              .build();
      LocationName parent = LocationName.of(project, location);
      CustomJob response = client.createCustomJob(parent, customJob);
      System.out.format("response: %s\n", response);
      System.out.format("Name: %s\n", response.getName());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_K80',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Python

To learn how to install or update the Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

For more context, read the guide to creating a CustomJob.

GPUs

If you have written your training code to use GPUs, then you may configure your worker pool to use one or more GPUs on each VM. To use GPUs, you must use an A2, N1, or G2 machine type. Additionally, using smaller machines types like n1-highmem-2 with GPUs might cause logging to fail for some workloads because of CPU constraints. If your training job stops returning logs, consider selecting a larger machine type.

Vertex AI supports the following types of GPU for custom training:

  • NVIDIA_H100_80GB
  • NVIDIA_A100_80GB
  • NVIDIA_TESLA_A100 (NVIDIA A100 40GB)
  • NVIDIA_TESLA_K80
  • NVIDIA_TESLA_P4
  • NVIDIA_TESLA_P100
  • NVIDIA_TESLA_T4
  • NVIDIA_TESLA_V100
  • NVIDIA_L4

To learn more about the technical specification for each type of GPU, read the Compute Engine short documentation about GPUs for compute workloads. To learn about the cost of using each machine type for custom training, read Pricing.

In your WorkerPoolSpec, specify the type of GPU that you want to use in the machineSpec.acceleratorType field and number of GPUs that you want each VM in the worker pool to use in the machineSpec.acceleratorCount field. However, your choices for these fields must meet the following restrictions:

  • The type of GPU that you choose must be available in the location where you are performing custom training. Not all types of GPU are available in all regions. Learn about regional availability.

  • You can only use certain numbers of GPUs in your configuration. For example, you can use 2 or 4 NVIDIA_TESLA_T4 GPUs on a VM, but not 3. To see what acceleratorCount values are valid for each type of GPU, see the following compatibility table.

  • You must make sure that your GPU configuration provides sufficient virtual CPUs and memory to the machine type that you use it with. For example, if you use the n1-standard-32 machine type in your worker pool, then each VM has 32 virtual CPUs and 120 GB of memory. Since each NVIDIA_TESLA_V100 GPU can provide up to 12 virtual CPUs and 76 GB of memory, you must use at least 4 GPUs for each n1-standard-32 VM to support its requirements. (2 GPUs provide insufficient resources, and you can't specify 3 GPUs.)

    The following compatibility table accounts for this requirement.

    Note the following additional limitations on using GPUs for custom training that differ from using GPUs with Compute Engine:

    • A configuration with 8 NVIDIA_TESLA_K80 GPUs only provides up to 208 GB of memory in all regions and zones.
    • A configuration with 4 NVIDIA_TESLA_P100 GPUs only provides up to 64 virtual CPUS and up to 208 GB of memory in all regions and zones.

The following compatibility table lists the valid values for machineSpec.acceleratorCount depending on your choices for machineSpec.machineType and machineSpec.acceleratorType:

Valid numbers of GPUs for each machine type
Machine type NVIDIA_A100_80GB NVIDIA_TESLA_A100 NVIDIA_TESLA_K80 NVIDIA_TESLA_P4 NVIDIA_TESLA_P100 NVIDIA_TESLA_T4 NVIDIA_TESLA_V100 NVIDIA_L4
a2-ultragpu-1g 1
a2-ultragpu-2g 2
a2-ultragpu-4g 4
a2-ultragpu-8g 8
a2-highgpu-1g 1
a2-highgpu-2g 2
a2-highgpu-4g 4
a2-highgpu-8g 8
a2-megagpu-16g 16
n1-standard-4 1, 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 1, 2, 4, 8
n1-standard-8 1, 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 1, 2, 4, 8
n1-standard-16 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 2, 4, 8
n1-standard-32 4, 8 2, 4 2, 4 2, 4 4, 8
n1-standard-64 4 4 8
n1-standard-96 4 4 8
n1-highmem-2 1, 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 1, 2, 4, 8
n1-highmem-4 1, 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 1, 2, 4, 8
n1-highmem-8 1, 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 1, 2, 4, 8
n1-highmem-16 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 2, 4, 8
n1-highmem-32 4, 8 2, 4 2, 4 2, 4 4, 8
n1-highmem-64 4 4 8
n1-highmem-96 4 4 8
n1-highcpu-16 2, 4, 8 1, 2, 4 1, 2, 4 1, 2, 4 2, 4, 8
n1-highcpu-32 4, 8 2, 4 2, 4 2, 4 4, 8
n1-highcpu-64 8 4 4 4 8
n1-highcpu-96 4 4 8
g2-standard-4 1
g2-standard-8 1
g2-standard-12 1
g2-standard-16 1
g2-standard-24 2
g2-standard-32 1
g2-standard-48 4
g2-standard-96 8

The following examples highlight where you can specify GPUs when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify GPUs for each worker pool on the Compute and pricing step. First specify a Machine type. Then, you can specify GPU details in the Accelerator type and Accelerator count fields.

gcloud

To specify GPUs using the Google Cloud CLI tool, you must use a config.yaml file. For example:

config.yaml

workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
    acceleratorType: ACCELERATOR_TYPE
    acceleratorCount: ACCELERATOR_COUNT
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_K80',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response:\n', JSON.stringify(response));
}
createCustomJob();

Python

To learn how to install or update the Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)

For more context, read the guide to creating a CustomJob.

TPUs

To use Tensor Processing Units (TPUs) for custom training on Vertex AI, you can configure a worker pool to use a TPU VM.

When you use a TPU VM in Vertex AI, you must only use a single worker pool for custom training, and you must configure this worker pool to use only one replica.

To use TPU VMs in your worker pool, you must use one of the following configurations:

  • To configure a TPU VM with TPU V2, specify the following fields in the WorkerPoolSpec:

    • Set machineSpec.machineType to cloud-tpu.
    • Set machineSpec.acceleratorType to TPU_V2.
    • Set machineSpec.acceleratorCount to 8 for single TPU or 32 or multiple of 32 for TPU Pods.
    • Set replicaCount to 1.
  • To configure a TPU VM with TPU V3, specify the following fields in the WorkerPoolSpec:

    • Set machineSpec.machineType to cloud-tpu.
    • Set machineSpec.acceleratorType to TPU_V3.
    • Set machineSpec.acceleratorCount to 8 for single TPU or 32+ for TPU Pods.
    • Set replicaCount to 1.

The following example highlights how to specify a TPU VM when you create a CustomJob:

gcloud

To specify a TPU VM using the gcloud CLI tool, you must use a config.yaml file. For example:

config.yaml

workerPoolSpecs:
  machineSpec:
    machineType: cloud-tpu
    acceleratorType: TPU_V2
    acceleratorCount: 8
  replicaCount: 1
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

For more context, read the guide to creating a CustomJob.

Boot disk options

You can optionally customize the boot disks for your training VMs. All VMs in a worker pool use the same type and size of boot disk.

  • To customize the type of boot disk that each training VM uses, specify the diskSpec.bootDiskType field in your WorkerPoolSpec.

    You can set this field to pd-standard in order to use a standard persistent disk backed by a standard hard drive, or you can set it to pd-ssd to use an SSD persistent disk backed by a solid-state drive. The default value is pd-ssd.

    Using pd-ssd might improve performance if your training code reads and writes to disk. Learn about disk types.

  • To customize the size (in GB) of the boot disk that each training VM uses, specify the diskSpec.bootDiskSizeGb field in your WorkerPoolSpec.

    You can set this field to an integer between 100 and 64,000, inclusive. The default value is 100.

    You might want to increase the boot disk size if your training code writes a lot of temporary data to disk. Note that any data you write to the boot disk is temporary, and you can't retrieve it after training completes.

Changing the type and size of your boot disks affects custom training pricing.

The following examples highlight where you can specify boot disk options when you create a CustomJob:

Console

In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify boot disk options for each worker pool on the Compute and pricing step, in the Disk type drop-down list and the Disk size (GB) field.

gcloud

To specify boot disk options using the Google Cloud CLI tool, you must use a config.yaml file. For example:

config.yaml

workerPoolSpecs:
  machineSpec:
    machineType: MACHINE_TYPE
  diskSpec:
    bootDiskType: DISK_TYPE
    bootDiskSizeGb: DISK_SIZE
  replicaCount: REPLICA_COUNT
  containerSpec:
    imageUri: CUSTOM_CONTAINER_IMAGE_URI

Then run a command like the following:

gcloud ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

For more context, read the guide to creating a CustomJob.

What's next