Specifying machine types or scale tiers

When running a training job on AI Platform Training, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.

Specifying your configuration

How you specify your cluster configuration depends on how you plan to run your training job:

gcloud

Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want; by convention, the name is config.yaml.

The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1

Provide the path to the YAML file in the --config flag when running the gcloud ai-platform jobs submit training command:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

Alternatively, you may specify cluster configuration details with command-line flags, rather than in a configuration file. Learn more about how to use these flags.

The following example shows how to submit a training job with a configuration similar to the previous example's, but without using a configuration file:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --scale-tier custom \
        --master-machine-type n1-highcpu-16 \
        --worker-machine-type n1-highcpu-16 \
        --parameter-server-machine-type n1-highmem-8 \
        --worker-count 9 \
        --parameter-server-count 3 \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

See more details on how to run a training job.

Python

Specify the scale tier identifier and machine types in the TrainingInput object in your job configuration.

The following example shows how to build a Job representation for a job with a custom processing cluster.

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-highcpu-16',
    'workerType': 'n1-highcpu-16',
    'parameterServerType': 'n1-highmem-8',
    'evaluatorType': 'n1-highcpu-16',
    'workerCount': 9,
    'parameterServerCount': 3,
    'evaluatorCount': 1,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '2.11',
    'pythonVersion': '3.7'}

job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown, to match the names in the Job and TrainingInput resources.
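
Once the dictionaries are built, the job can be submitted with the Google API Python client. The following sketch assumes the google-api-python-client library and a placeholder project ID; the submission call itself is commented out because it requires application default credentials:

```python
# Build the job specification as shown above. The bucket paths and project ID
# below are placeholders for illustration.
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-highcpu-16',
    'workerType': 'n1-highcpu-16',
    'parameterServerType': 'n1-highmem-8',
    'workerCount': 9,
    'parameterServerCount': 3,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'region': 'us-central1',
}

job_spec = {'jobId': 'my_training_job', 'trainingInput': training_inputs}
project_id = 'projects/my-project-id'

# Submission with the google-api-python-client library (not executed here,
# since it needs credentials and network access):
#
#   from googleapiclient import discovery
#   ml = discovery.build('ml', 'v1')
#   request = ml.projects().jobs().create(body=job_spec, parent=project_id)
#   response = request.execute()
```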

Scale tiers

Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.

AI Platform Training does not support distributed training or training with accelerators for scikit-learn or XGBoost code. If your training job runs scikit-learn or XGBoost code, you must set the scale tier to either BASIC or CUSTOM.
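
For example, a config.yaml for a scikit-learn or XGBoost training job could be as minimal as the following sketch, which sets only the scale tier:

```yaml
trainingInput:
  scaleTier: BASIC
```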

The following list describes the AI Platform Training scale tiers:
BASIC

A single worker instance. This tier is suitable for learning how to use AI Platform Training and for experimenting with new models using small datasets.

Compute Engine machine name: n1-standard-4

STANDARD_1

One master instance, plus four workers and three parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-highcpu-8, workers: n1-highcpu-8, parameter servers: n1-standard-4

PREMIUM_1

One master instance, plus 19 workers and 11 parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-highcpu-16, workers: n1-highcpu-16, parameter servers: n1-highmem-8

BASIC_GPU

A single worker instance with a single NVIDIA Tesla K80 GPU. To learn more about graphics processing units (GPUs), see the section on training with GPUs. Only use this scale tier if you are training with TensorFlow or using a custom container.

Compute Engine machine name: n1-standard-8 with one k80 GPU

BASIC_TPU

A master VM and a Cloud TPU with eight TPU v2 cores. See how to use TPUs for your training job. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-standard-4, workers: Cloud TPU (8 TPU v2 cores)

CUSTOM

The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:

  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting. See the machine types described below.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes. Only specify workers if you are training with TensorFlow or using custom containers.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers. Only specify parameter servers if you are training with TensorFlow or using custom containers.
  • You may set TrainingInput.evaluatorCount to specify the number of evaluators to use. If you specify one or more evaluators, you must also set TrainingInput.evaluatorType to specify the type of machine to use for your evaluators. Only specify evaluators if you are training with TensorFlow or using custom containers.

Machine types for the custom scale tier

Use a custom scale tier for finer control over the processing cluster that you use to train your model. Specify the configuration in the TrainingInput object in your job configuration. If you're using the gcloud ai-platform jobs submit training command to submit your training job, you can use the same identifiers:

  • Set the scale tier (scaleTier) to CUSTOM.

  • Set values for the number of workers (workerCount), parameter servers (parameterServerCount), and evaluators (evaluatorCount) that you need.

    AI Platform Training only supports distributed training when you train with TensorFlow or use a custom container. If your training job runs scikit-learn or XGBoost code, do not specify workers, parameter servers, or evaluators.

  • Set the machine type for your master worker (masterType). If you have chosen to use workers, parameter servers, or evaluators, then set machine types for them in the workerType, parameterServerType, and evaluatorType fields respectively.

    You can specify different machine types for masterType, workerType, parameterServerType, and evaluatorType, but you can't use different machine types for individual instances. For example, you can use an n1-highmem-8 machine type for your parameter servers, but you can't set some parameter servers to use n1-highmem-8 and some to use n1-highcpu-16.

  • If you need just one worker with a custom configuration (not a full cluster), you should specify a custom scale tier with a machine type for the master only. That gives you just the single worker. Here's an example config.yaml file:

    trainingInput:
      scaleTier: CUSTOM
      masterType: n1-highcpu-16
    

Compute Engine machine types

You can use the names of certain Compute Engine predefined machine types for your job's masterType, workerType, parameterServerType, and evaluatorType. If you are training with TensorFlow or using custom containers, you can optionally use various types of GPUs with these machine types.

The following list contains the Compute Engine machine type identifiers that you can use for your training job:

  • e2-standard-4
  • e2-standard-8
  • e2-standard-16
  • e2-standard-32
  • e2-highmem-2
  • e2-highmem-4
  • e2-highmem-8
  • e2-highmem-16
  • e2-highcpu-16
  • e2-highcpu-32
  • n2-standard-4
  • n2-standard-8
  • n2-standard-16
  • n2-standard-32
  • n2-standard-48
  • n2-standard-64
  • n2-standard-80
  • n2-highmem-2
  • n2-highmem-4
  • n2-highmem-8
  • n2-highmem-16
  • n2-highmem-32
  • n2-highmem-48
  • n2-highmem-64
  • n2-highmem-80
  • n2-highcpu-16
  • n2-highcpu-32
  • n2-highcpu-48
  • n2-highcpu-64
  • n2-highcpu-80
  • n1-standard-4
  • n1-standard-8
  • n1-standard-16
  • n1-standard-32
  • n1-standard-64
  • n1-standard-96
  • n1-highmem-2
  • n1-highmem-4
  • n1-highmem-8
  • n1-highmem-16
  • n1-highmem-32
  • n1-highmem-64
  • n1-highmem-96
  • n1-highcpu-16
  • n1-highcpu-32
  • n1-highcpu-64
  • n1-highcpu-96
  • c2-standard-4
  • c2-standard-8
  • c2-standard-16
  • c2-standard-30
  • c2-standard-60
  • m1-ultramem-40
  • m1-ultramem-80
  • m1-ultramem-160
  • m1-megamem-96
  • a2-highgpu-1g* (preview)
  • a2-highgpu-2g* (preview)
  • a2-highgpu-4g* (preview)
  • a2-highgpu-8g* (preview)
  • a2-megagpu-16g* (preview)

To learn about the technical specifications of each machine type, read the Compute Engine documentation about machine types.

Legacy machine types

Instead of using Compute Engine machine types for your job, you can specify legacy machine type names. These machine types provide the same vCPU and memory resources as equivalent Compute Engine machine types, but they have additional configuration limitations:

  • You cannot customize GPU usage using an acceleratorConfig. However, some legacy machine types include GPUs. See the following table.

  • If your training job configuration uses multiple machines, you cannot mix Compute Engine machine types with legacy machine types. Your master worker, workers, parameter servers, and evaluators must all use machine types from one group or the other.

    For example, if you configure masterType to be n1-highcpu-32 (a Compute Engine machine type), you cannot set workerType to complex_model_m (a legacy machine type), but you can set it to n1-highcpu-16 (another Compute Engine machine type).
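
The mixing rule above can be illustrated with a small check. The following helper is hypothetical (it is not part of the AI Platform API) and the legacy-type list is taken from the table below:

```python
# Hypothetical helper (not part of the AI Platform API) that checks whether a
# TrainingInput dictionary mixes Compute Engine machine types with legacy ones.
LEGACY_TYPES = {
    'standard', 'large_model', 'complex_model_s', 'complex_model_m',
    'complex_model_l', 'standard_gpu', 'complex_model_m_gpu',
    'complex_model_l_gpu', 'standard_p100', 'complex_model_m_p100',
    'standard_v100', 'large_model_v100', 'complex_model_m_v100',
    'complex_model_l_v100',
}

def machine_type_groups(training_input):
    """Return the set of machine-type groups used by a TrainingInput dict."""
    fields = ('masterType', 'workerType', 'parameterServerType', 'evaluatorType')
    groups = set()
    for field in fields:
        machine_type = training_input.get(field)
        if machine_type:
            groups.add('legacy' if machine_type in LEGACY_TYPES else 'compute_engine')
    return groups

def mixes_machine_type_groups(training_input):
    """True if the configuration mixes legacy and Compute Engine machine types."""
    return len(machine_type_groups(training_input)) > 1

# The first configuration mixes groups and is therefore invalid:
bad = {'masterType': 'n1-highcpu-32', 'workerType': 'complex_model_m'}
ok = {'masterType': 'n1-highcpu-32', 'workerType': 'n1-highcpu-16'}
```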

The following list describes the legacy machine types:
standard

A basic machine configuration suitable for training simple models with small to moderate datasets.

Compute Engine machine name: n1-standard-4

large_model

A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).

Compute Engine machine name: n1-highmem-8

complex_model_s

A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.

Compute Engine machine name: n1-highcpu-8

complex_model_m

A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.

Compute Engine machine name: n1-highcpu-16

complex_model_l

A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.

Compute Engine machine name: n1-highcpu-32

standard_gpu

A machine equivalent to standard that also includes a single NVIDIA Tesla K80 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8 with one k80 GPU

complex_model_m_gpu

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla K80 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-k80x4

complex_model_l_gpu

A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla K80 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-32-k80x8

standard_p100

A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8-p100x1

complex_model_m_p100

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-p100x4

standard_v100

A machine equivalent to standard that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8-v100x1

large_model_v100

A machine equivalent to large_model that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-highmem-8-v100x1

complex_model_m_v100

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-v100x4

complex_model_l_v100

A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-32-v100x8

Training with GPUs and TPUs

Some scale tiers and legacy machine types include graphics processing units (GPUs). If you use a Compute Engine machine type, you can also attach your choice of several types of GPUs. To learn more, read about training with GPUs.
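
When you attach GPUs to a Compute Engine machine type, you do so with an acceleratorConfig in the corresponding worker pool's config field. The following config.yaml sketch (the GPU count and machine type are illustrative) attaches two NVIDIA Tesla K80 GPUs to the master:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-8
  masterConfig:
    acceleratorConfig:
      count: 2
      type: NVIDIA_TESLA_K80
```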

To perform training with Tensor Processing Units (TPUs), you must use the BASIC_TPU scale tier or the cloud_tpu machine type. The cloud_tpu machine type has special configuration options: you can use it together with either Compute Engine machine types or legacy machine types, and you can configure it to use 8 TPU v2 cores or 8 TPU v3 cores. Read about how to use TPUs for your training job.
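
As a sketch of the cloud_tpu configuration (the core count and machine types here are illustrative), a config.yaml requesting a TPU worker might look like this:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-4
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      count: 8
      type: TPU_V3
```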

What's next