When you perform custom training, your training code runs on one or more virtual machine (VM) instances. You can configure what types of VM to use for training: using VMs with more compute resources can speed up training and let you work with larger datasets, but they can also incur greater training costs.
In some cases, you can additionally use GPUs to accelerate training. GPUs incur additional costs.
You can also optionally customize the type and size of your training VMs' boot disks.
This document describes the different compute resources that you can use for custom training and how to configure them.
Where to specify compute resources
Specify configuration details within a WorkerPoolSpec. Depending on how you perform custom training, put this WorkerPoolSpec in one of the following API fields:
- If you are creating a CustomJob resource, specify the WorkerPoolSpec in CustomJob.jobSpec.workerPoolSpecs.
  If you are using the Google Cloud CLI, then you can use the --worker-pool-spec flag or the --config flag on the gcloud ai custom-jobs create command to specify worker pool options.
  Learn more about creating a CustomJob.
- If you are creating a HyperparameterTuningJob resource, specify the WorkerPoolSpec in HyperparameterTuningJob.trialJobSpec.workerPoolSpecs.
  If you are using the gcloud CLI, then you can use the --config flag on the gcloud ai hp-tuning-jobs create command to specify worker pool options.
  Learn more about creating a HyperparameterTuningJob.
- If you are creating a TrainingPipeline resource without hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.workerPoolSpecs.
  Learn more about creating a custom TrainingPipeline.
- If you are creating a TrainingPipeline with hyperparameter tuning, specify the WorkerPoolSpec in TrainingPipeline.trainingTaskInputs.trialJobSpec.workerPoolSpecs.
If you are performing distributed training, you can use different settings for each worker pool.
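As a rough illustration of the field paths listed above, the following Python sketch nests a list of worker pool specs at the location each resource type expects. The helper name, the dictionary keys used to pick a resource type, and the sample values are hypothetical; only the field paths come from this guide, and replicaCount is assumed to be the REST field that corresponds to the replica count.

```python
import json

# Field paths copied from the list above. The dictionary keys are
# hypothetical labels used only by this sketch.
WORKER_POOL_SPEC_PATHS = {
    "CustomJob": ["jobSpec", "workerPoolSpecs"],
    "HyperparameterTuningJob": ["trialJobSpec", "workerPoolSpecs"],
    "TrainingPipeline": ["trainingTaskInputs", "workerPoolSpecs"],
    "TrainingPipeline+tuning": [
        "trainingTaskInputs",
        "trialJobSpec",
        "workerPoolSpecs",
    ],
}


def place_worker_pool_specs(resource_kind, worker_pool_specs):
    """Return a partial request body with the specs nested at the right path."""
    path = WORKER_POOL_SPEC_PATHS[resource_kind]
    body = {}
    cursor = body
    # Create one nested dict per intermediate field in the path.
    for key in path[:-1]:
        cursor[key] = {}
        cursor = cursor[key]
    cursor[path[-1]] = list(worker_pool_specs)
    return body


# Illustrative worker pool; the values are placeholders.
pool = {"machineSpec": {"machineType": "n1-standard-4"}, "replicaCount": 1}
print(json.dumps(place_worker_pool_specs("CustomJob", [pool]), indent=2))
```

For distributed training, you would pass more than one entry in the list, each with its own settings.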
Machine types
In your WorkerPoolSpec, you must specify one of the following machine types in the machineSpec.machineType field. Each replica in the worker pool runs on a separate VM that has the specified machine type.
a2-ultragpu-1g*
a2-ultragpu-2g*
a2-ultragpu-4g*
a2-ultragpu-8g*
a2-highgpu-1g*
a2-highgpu-2g*
a2-highgpu-4g*
a2-highgpu-8g*
a2-megagpu-16g*
e2-standard-4
e2-standard-8
e2-standard-16
e2-standard-32
e2-highmem-2
e2-highmem-4
e2-highmem-8
e2-highmem-16
e2-highcpu-16
e2-highcpu-32
n2-standard-4
n2-standard-8
n2-standard-16
n2-standard-32
n2-standard-48
n2-standard-64
n2-standard-80
n2-highmem-2
n2-highmem-4
n2-highmem-8
n2-highmem-16
n2-highmem-32
n2-highmem-48
n2-highmem-64
n2-highmem-80
n2-highcpu-16
n2-highcpu-32
n2-highcpu-48
n2-highcpu-64
n2-highcpu-80
n1-standard-4
n1-standard-8
n1-standard-16
n1-standard-32
n1-standard-64
n1-standard-96
n1-highmem-2
n1-highmem-4
n1-highmem-8
n1-highmem-16
n1-highmem-32
n1-highmem-64
n1-highmem-96
n1-highcpu-16
n1-highcpu-32
n1-highcpu-64
n1-highcpu-96
c2-standard-4
c2-standard-8
c2-standard-16
c2-standard-30
c2-standard-60
m1-ultramem-40
m1-ultramem-80
m1-ultramem-160
m1-megamem-96
g2-standard-4*
g2-standard-8*
g2-standard-12*
g2-standard-16*
g2-standard-24*
g2-standard-32*
g2-standard-48*
g2-standard-96*
cloud-tpu*
* Machine types marked with asterisks in the preceding list must be used with certain GPUs or TPUs. See the following sections of this guide.
To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types. To learn about the cost of using each machine type for custom training, read Pricing.
The following examples highlight where you specify a machine type when you create a CustomJob:
Console
In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, specify a machine type for each worker pool on the Compute and pricing step, in the Machine type field.
gcloud
gcloud ai custom-jobs create \
--region=LOCATION \
--display-name=JOB_NAME \
--worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Vertex AI SDK for Python
To learn how to install the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.
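The following is a minimal sketch, not the official sample, of where the machine type goes when you use the Vertex AI SDK for Python. It assumes the google-cloud-aiplatform package is installed and that a Cloud Storage staging bucket is available; the helper function names and all argument values are hypothetical placeholders.

```python
def build_worker_pool_spec(machine_type, replica_count, image_uri):
    """Build one worker pool spec using the SDK's snake_case field names."""
    return {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri},
    }


def create_custom_job(project, location, staging_bucket, display_name, spec):
    """Create and run a CustomJob (hypothetical wrapper around the SDK)."""
    # Imported here so that the spec builder above works without the SDK.
    from google.cloud import aiplatform

    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )
    job = aiplatform.CustomJob(
        display_name=display_name,
        worker_pool_specs=[spec],
    )
    job.run()
    return job


# Placeholder values; replace them with your own before calling
# create_custom_job().
spec = build_worker_pool_spec(
    machine_type="n1-standard-4",
    replica_count=1,
    image_uri="CUSTOM_CONTAINER_IMAGE_URI",
)
print(spec)
```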
For more context, read the guide to creating a CustomJob.
GPUs
If you have written your training code to use GPUs, then you can configure your worker pool to use one or more GPUs on each VM. To use GPUs, you must use an A2, N1, or G2 machine type.
Vertex AI supports the following types of GPU for custom training:
NVIDIA_A100_80GB
NVIDIA_TESLA_A100 (NVIDIA A100 40GB)
NVIDIA_TESLA_K80
NVIDIA_TESLA_P4
NVIDIA_TESLA_P100
NVIDIA_TESLA_T4
NVIDIA_TESLA_V100
NVIDIA_L4
To learn more about the technical specifications for each type of GPU, read the Compute Engine documentation about GPUs for compute workloads. To learn about the cost of using each type of GPU for custom training, read Pricing.
In your WorkerPoolSpec, specify the type of GPU that you want to use in the machineSpec.acceleratorType field and the number of GPUs that you want each VM in the worker pool to use in the machineSpec.acceleratorCount field. However, your choices for these fields must meet the following restrictions:
- The type of GPU that you choose must be available in the location where you are performing custom training. Not all types of GPU are available in all regions. Learn about regional availability.
- You can only use certain numbers of GPUs in your configuration. For example, you can use 2 or 4 NVIDIA_TESLA_T4 GPUs on a VM, but not 3. To see what acceleratorCount values are valid for each type of GPU, see the following compatibility table.
- You must make sure that your GPU configuration provides sufficient virtual CPUs and memory to the machine type that you use it with. For example, if you use the n1-standard-32 machine type in your worker pool, then each VM has 32 virtual CPUs and 120 GB of memory. Since each NVIDIA_TESLA_V100 GPU can provide up to 12 virtual CPUs and 76 GB of memory, you must use at least 4 GPUs for each n1-standard-32 VM to support its requirements. (2 GPUs provide insufficient resources, and you can't specify 3 GPUs.)
  The following compatibility table accounts for this requirement.
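The arithmetic in the n1-standard-32 example can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes that the virtual CPUs and memory a GPU configuration can support scale linearly with the number of GPUs, which does not hold for every GPU type. Treat the compatibility table as the authoritative source.

```python
def min_valid_gpu_count(
    machine_vcpus, machine_memory_gb, gpu_vcpus, gpu_memory_gb, allowed_counts
):
    """Return the smallest allowed GPU count that covers the machine type.

    Assumes the supported vCPUs and memory grow linearly with the GPU count.
    Returns None when no allowed count is large enough.
    """
    for count in sorted(allowed_counts):
        enough_cpu = count * gpu_vcpus >= machine_vcpus
        enough_memory = count * gpu_memory_gb >= machine_memory_gb
        if enough_cpu and enough_memory:
            return count
    return None


# n1-standard-32: 32 vCPUs and 120 GB of memory.
# NVIDIA_TESLA_V100: up to 12 vCPUs and 76 GB of memory per GPU.
# The allowed counts (1, 2, 4, 8) are an assumption for this illustration.
result = min_valid_gpu_count(32, 120, 12, 76, allowed_counts=(1, 2, 4, 8))
print(result)  # 4, because 2 GPUs cover only 24 vCPUs and 3 is not allowed
```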
Note the following additional limitations on using GPUs for custom training that differ from using GPUs with Compute Engine:
- A configuration with 8 NVIDIA_TESLA_K80 GPUs only provides up to 208 GB of memory in all regions and zones.
- A configuration with 4 NVIDIA_TESLA_P100 GPUs only provides up to 64 virtual CPUs and up to 208 GB of memory in all regions and zones.
The following compatibility table lists the valid values for machineSpec.acceleratorCount depending on your choices for machineSpec.machineType and machineSpec.acceleratorType:
Valid numbers of GPUs for each machine type

Machine type | NVIDIA_A100_80GB | NVIDIA_TESLA_A100 | NVIDIA_TESLA_K80 | NVIDIA_TESLA_P4 | NVIDIA_TESLA_P100 | NVIDIA_TESLA_T4 | NVIDIA_TESLA_V100 | NVIDIA_L4
---|---|---|---|---|---|---|---|---
a2-ultragpu-1g | 1 | | | | | | |
a2-ultragpu-2g | 2 | | | | | | |
a2-ultragpu-4g | 4 | | | | | | |
a2-ultragpu-8g | 8 | | | | | | |
a2-highgpu-1g | | 1 | | | | | |
a2-highgpu-2g | | 2 | | | | | |
a2-highgpu-4g | | 4 | | | | | |
a2-highgpu-8g | | 8 | | | | | |
a2-megagpu-16g | | 16 | | | | | |
n1-standard-4 | | | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
n1-standard-8 | | | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
n1-standard-16 | | | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
n1-standard-32 | | | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
n1-standard-64 | | | | 4 | | 4 | 8 |
n1-standard-96 | | | | 4 | | 4 | 8 |
n1-highmem-2 | | | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
n1-highmem-4 | | | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
n1-highmem-8 | | | 1, 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4, 8 |
n1-highmem-16 | | | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
n1-highmem-32 | | | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
n1-highmem-64 | | | | 4 | | 4 | 8 |
n1-highmem-96 | | | | 4 | | 4 | 8 |
n1-highcpu-16 | | | 2, 4, 8 | 1, 2, 4 | 1, 2, 4 | 1, 2, 4 | 2, 4, 8 |
n1-highcpu-32 | | | 4, 8 | 2, 4 | 2, 4 | 2, 4 | 4, 8 |
n1-highcpu-64 | | | 8 | 4 | 4 | 4 | 8 |
n1-highcpu-96 | | | | 4 | | 4 | 8 |
g2-standard-4 | | | | | | | | 1
g2-standard-8 | | | | | | | | 1
g2-standard-12 | | | | | | | | 1
g2-standard-16 | | | | | | | | 1
g2-standard-24 | | | | | | | | 2
g2-standard-32 | | | | | | | | 1
g2-standard-48 | | | | | | | | 4
g2-standard-96 | | | | | | | | 8
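One way to catch an invalid GPU count before you submit a job is to check it against the compatibility table. The following sketch hard-codes only a few N1 entries copied by hand from the table, so it is deliberately incomplete. The function name and the error messages are hypothetical; the machineType, acceleratorType, and acceleratorCount keys are the machineSpec fields named in this guide.

```python
# Hand-copied subset of the compatibility table; extend it as needed.
VALID_GPU_COUNTS = {
    ("n1-standard-16", "NVIDIA_TESLA_T4"): (1, 2, 4),
    ("n1-standard-32", "NVIDIA_TESLA_T4"): (2, 4),
    ("n1-standard-32", "NVIDIA_TESLA_V100"): (4, 8),
}


def build_machine_spec(machine_type, accelerator_type, accelerator_count):
    """Return a machineSpec dict, or raise if the combination is not listed."""
    key = (machine_type, accelerator_type)
    if key not in VALID_GPU_COUNTS:
        raise KeyError(f"No entry for {machine_type} with {accelerator_type}")
    valid_counts = VALID_GPU_COUNTS[key]
    if accelerator_count not in valid_counts:
        raise ValueError(
            f"{accelerator_count} is not valid for {machine_type} with "
            f"{accelerator_type}; valid values: {valid_counts}"
        )
    return {
        "machineType": machine_type,
        "acceleratorType": accelerator_type,
        "acceleratorCount": accelerator_count,
    }


print(build_machine_spec("n1-standard-32", "NVIDIA_TESLA_T4", 2))
```

Calling build_machine_spec("n1-standard-32", "NVIDIA_TESLA_T4", 3) raises ValueError, which mirrors the rule that you can use 2 or 4 of these GPUs but not 3.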
The following examples highlight where you can specify GPUs when you create a CustomJob:
Console
In the Google Cloud console, you can't create a CustomJob directly. However, you can create a TrainingPipeline that creates a CustomJob. When you create a TrainingPipeline in the Google Cloud console, you can specify GPUs for each worker pool on the Compute and pricing step. First specify a Machine type. Then, you can specify GPU details in the Accelerator type and Accelerator count fields.
gcloud
To specify GPUs using the Google Cloud CLI tool, you must use a config.yaml
file. For example: