Specifying machine types or scale tiers

When you run a training job on AI Platform, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.

Specifying your configuration

How you specify your cluster configuration depends on how you plan to run your training job:

gcloud

Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want. By convention the name is config.yaml.

The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  workerCount: 9
  parameterServerCount: 3

Provide the path to the YAML file in the --config flag when running the gcloud ai-platform jobs submit training command:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

Alternatively, you may specify cluster configuration details with command-line flags, rather than in a configuration file. Learn more about how to use these flags.

The following example shows how to submit a training job with the same configuration as the previous example, but without using a configuration file:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --scale-tier custom \
        --master-machine-type n1-highcpu-16 \
        --worker-machine-type n1-highcpu-16 \
        --parameter-server-machine-type n1-highmem-8 \
        --worker-count 9 \
        --parameter-server-count 3 \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value
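
If you keep your cluster settings in a TrainingInput-style dictionary, the mapping between its fields and the command-line flags can be sketched in Python as follows. This is only an illustration: `FIELD_TO_FLAG` and `cluster_flags` are hypothetical names, the sketch covers only the six cluster fields used in the examples above, and it lowercases the scale tier only because the example above passes `custom` in lowercase. Other tier spellings are not covered.

```python
import shlex

# TrainingInput field name -> gcloud flag, limited to the cluster settings
# used in the examples above (an illustrative subset, not a full mapping).
FIELD_TO_FLAG = {
    "scaleTier": "--scale-tier",
    "masterType": "--master-machine-type",
    "workerType": "--worker-machine-type",
    "parameterServerType": "--parameter-server-machine-type",
    "workerCount": "--worker-count",
    "parameterServerCount": "--parameter-server-count",
}


def cluster_flags(training_input):
    """Return the cluster-related gcloud flags for a TrainingInput-style dict."""
    flags = []
    for field, flag in FIELD_TO_FLAG.items():
        if field not in training_input:
            continue
        value = str(training_input[field])
        if field == "scaleTier":
            # Assumption: follow the lowercase spelling used in the example.
            value = value.lower()
        flags += [flag, value]
    return flags


training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-highcpu-16",
    "workerType": "n1-highcpu-16",
    "parameterServerType": "n1-highmem-8",
    "workerCount": 9,
    "parameterServerCount": 3,
}
flags = cluster_flags(training_input)
# Print the flags as a shell-safe string that could be appended to the command.
print(" ".join(shlex.quote(part) for part in flags))
```

Fields that are absent from the dictionary produce no flag, so a master-only configuration yields just the `--scale-tier` and `--master-machine-type` flags.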

See more details on how to run a training job.

Python

Specify the scale tier identifier and machine types in the TrainingInput object in your job configuration.

The following example shows how to build a Job representation for a job with a custom processing cluster.

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-highcpu-16',
    'workerType': 'n1-highcpu-16',
    'parameterServerType': 'n1-highmem-8',
    'workerCount': 9,
    'parameterServerCount': 3,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '1.14',
    'pythonVersion': '3.5',
}

# my_job_name is a placeholder: set it to the ID you want for this job.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown, to match the names in the Job and TrainingInput resources.
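
Because a misspelled key does not match any field, it can help to check your dictionary before you submit the job. The following is a minimal sketch of such a check. `KNOWN_TRAINING_INPUT_KEYS` and `unexpected_keys` are hypothetical names, and the key set contains only the fields that appear in the example on this page, not every field of the TrainingInput resource.

```python
# Keys taken from the example on this page. The real TrainingInput resource
# has more fields, so extend this set to match the fields you actually use.
KNOWN_TRAINING_INPUT_KEYS = {
    "scaleTier", "masterType", "workerType", "parameterServerType",
    "workerCount", "parameterServerCount", "packageUris", "pythonModule",
    "args", "region", "jobDir", "runtimeVersion", "pythonVersion",
}


def unexpected_keys(training_inputs):
    """Return the keys that are not in the known set, sorted by name."""
    return sorted(set(training_inputs) - KNOWN_TRAINING_INPUT_KEYS)


candidate = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-highcpu-16",
    "workertype": "n1-highcpu-16",  # wrong case: should be workerType
}
print(unexpected_keys(candidate))
```

An empty list means every key matched a known name. The check is case-sensitive, so `workertype` is reported.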

Scale tiers

Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.

AI Platform Training does not support distributed training or training with accelerators for scikit-learn or XGBoost code. If your training job runs scikit-learn or XGBoost code, you must set the scale tier to either BASIC or CUSTOM.
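
As a rough illustration of this restriction, the following sketch checks a scale tier against the framework that your training code uses. The framework labels (`scikit-learn`, `xgboost`) and the function name `tier_allowed` are assumptions made for this example. They are not identifiers defined by AI Platform.

```python
# Frameworks that, per the restriction above, cannot use distributed
# training or accelerators. The labels are hypothetical, chosen for this sketch.
SINGLE_MACHINE_FRAMEWORKS = {"scikit-learn", "xgboost"}

# Scale tiers that remain available to those frameworks.
SINGLE_MACHINE_TIERS = {"BASIC", "CUSTOM"}


def tier_allowed(framework, scale_tier):
    """Return True if the scale tier is permitted for the given framework."""
    if framework.lower() in SINGLE_MACHINE_FRAMEWORKS:
        return scale_tier in SINGLE_MACHINE_TIERS
    # Other cases (TensorFlow, custom containers) are not restricted here.
    return True


for framework, tier in [("xgboost", "BASIC"), ("scikit-learn", "STANDARD_1")]:
    print(framework, tier, tier_allowed(framework, tier))
```

The sketch checks only the tier name. A scikit-learn or XGBoost job that uses CUSTOM must still avoid workers, parameter servers, and accelerators.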

Below are the scale tier identifiers:

AI Platform scale tier
BASIC

A single worker instance. This tier is suitable for learning how to use AI Platform and for experimenting with new models using small datasets.

Compute Engine machine name: n1-standard-4

STANDARD_1

One master instance, plus four workers and three parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-highcpu-8, workers: n1-highcpu-8, parameter servers: n1-standard-4

PREMIUM_1

One master instance, plus 19 workers and 11 parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-highcpu-16, workers: n1-highcpu-16, parameter servers: n1-highmem-8

BASIC_GPU

A single worker instance with a single NVIDIA Tesla K80 GPU. To learn more about graphics processing units (GPUs), see the section on training with GPUs. Only use this scale tier if you are training with TensorFlow or using a custom container.

Compute Engine machine name: n1-standard-8 with one K80 GPU

BASIC_TPU

A master VM and a Cloud TPU with eight TPU v2 cores. See how to use TPUs for your training job. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-standard-4, workers: Cloud TPU (8 TPU v2 cores)

CUSTOM

The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:

  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting. See the machine types described below.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes. Only specify workers if you are training with TensorFlow or using custom containers.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers. Only specify parameter servers if you are training with TensorFlow or using custom containers.
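
The guidelines above can be expressed as a small validation routine. The following is a sketch under the assumption that your configuration is a Python dictionary keyed by TrainingInput field names. The function name `custom_tier_problems` is hypothetical, and the check covers only the three rules listed above.

```python
def custom_tier_problems(training_input):
    """Return a list of problems found in a CUSTOM-tier configuration."""
    problems = []
    if training_input.get("scaleTier") != "CUSTOM":
        # The rules below apply only to the CUSTOM tier.
        return problems
    if not training_input.get("masterType"):
        problems.append("masterType is required")
    if training_input.get("workerCount", 0) > 0 and not training_input.get("workerType"):
        problems.append("workerType is required when workerCount is 1 or more")
    if (training_input.get("parameterServerCount", 0) > 0
            and not training_input.get("parameterServerType")):
        problems.append(
            "parameterServerType is required when parameterServerCount is 1 or more")
    return problems


# A configuration that requests workers but omits their machine type.
incomplete = {"scaleTier": "CUSTOM", "masterType": "n1-highcpu-16", "workerCount": 9}
for problem in custom_tier_problems(incomplete):
    print(problem)
```

A master-only configuration passes the check, because masterType is the only required setting.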

Machine types for the custom scale tier

Use a custom scale tier for finer control over the processing cluster that you use to train your model. Specify the configuration in the TrainingInput object in your job configuration. If you're using the gcloud ai-platform jobs submit training command to submit your training job, you can use the same identifiers:

  • Set the scale tier (scaleTier) to CUSTOM.

  • Set values for the number of parameter servers (parameterServerCount) and workers (workerCount) that you need.

    AI Platform Training only supports distributed training when you train with TensorFlow or use a custom container. If your training job runs scikit-learn or XGBoost code, do not specify workers or parameter servers.

  • Set the machine type for your master worker (masterType). If you have chosen to use parameter servers or workers, set machine types for them in the parameterServerType and workerType fields.

    You can specify different machine types for masterType, parameterServerType, and workerType, but you can't use different machine types for individual instances. For example, you can use an n1-highmem-8 machine type for your parameter servers, but you can't set some parameter servers to use n1-highmem-8 and some to use n1-highcpu-16.

  • If you need just one worker with a custom configuration (not a full cluster), you should specify a custom scale tier with a machine type for the master only. That gives you just the single worker. Here's an example config.yaml file:

    trainingInput:
      scaleTier: CUSTOM
      masterType: n1-highcpu-16
    

Compute Engine machine types

You can use the names of certain Compute Engine predefined machine types for your job's masterType, workerType, and parameterServerType. If you are training with TensorFlow or using custom containers, you can optionally use various types of GPUs with these machine types.

The following list contains the Compute Engine machine type identifiers that you can use for your training job:

  • n1-standard-4
  • n1-standard-8
  • n1-standard-16
  • n1-standard-32
  • n1-standard-64
  • n1-standard-96
  • n1-highmem-2
  • n1-highmem-4
  • n1-highmem-8
  • n1-highmem-16
  • n1-highmem-32
  • n1-highmem-64
  • n1-highmem-96
  • n1-highcpu-16
  • n1-highcpu-32
  • n1-highcpu-64
  • n1-highcpu-96

To learn more, read about the virtual CPU (vCPU) and memory resources provided by Compute Engine machine types or see the comparison table at the end of this page.

Legacy machine types

Instead of using Compute Engine machine types for your job, you can specify legacy machine type names. These machine types provide the same vCPU and memory resources as equivalent Compute Engine machine types, but they have additional configuration limitations:

  • You cannot customize GPU usage using an acceleratorConfig. However, some legacy machine types include GPUs. See the following table.

  • If your training job configuration uses multiple machines, you cannot mix Compute Engine machine types with legacy machine types. Your master worker, parameter servers, and workers must all use machine types from one group or the other.

    For example, if you configure masterType to be n1-highcpu-32 (a Compute Engine machine type), you cannot set workerType or parameterServerType to complex_model_m (a legacy machine type), but you can set them to n1-highcpu-16 (another Compute Engine machine type).
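
The mixing rule can be checked mechanically. The following sketch assumes that Compute Engine machine types can be recognized by their `n1-` prefix and takes the legacy names from the table of legacy machine types on this page. The function name `mixes_machine_groups` is hypothetical. The `cloud_tpu` machine type is skipped because it can be combined with either group.

```python
# Legacy machine type names, as listed in the legacy machine types table.
LEGACY_MACHINE_TYPES = {
    "standard", "large_model", "complex_model_s", "complex_model_m",
    "complex_model_l", "standard_gpu", "complex_model_m_gpu",
    "complex_model_l_gpu", "standard_p100", "complex_model_m_p100",
    "standard_v100", "large_model_v100", "complex_model_m_v100",
    "complex_model_l_v100",
}


def mixes_machine_groups(training_input):
    """Return True if the config mixes Compute Engine and legacy machine types."""
    groups = set()
    for field in ("masterType", "workerType", "parameterServerType"):
        name = training_input.get(field)
        if name is None or name == "cloud_tpu":
            continue  # unset, or usable with either group
        if name in LEGACY_MACHINE_TYPES:
            groups.add("legacy")
        elif name.startswith("n1-"):
            # Assumption: every Compute Engine type on this page starts with n1-.
            groups.add("compute_engine")
    return len(groups) > 1


mixed = {"masterType": "n1-highcpu-32", "workerType": "complex_model_m"}
consistent = {"masterType": "n1-highcpu-32", "workerType": "n1-highcpu-16"}
print(mixes_machine_groups(mixed), mixes_machine_groups(consistent))
```

Names that belong to neither group are ignored by this sketch, so pair it with a separate check that each name is a supported machine type.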

The following table describes the legacy machine types:

Legacy machine types
standard

A basic machine configuration suitable for training simple models with small to moderate datasets.

Compute Engine machine name: n1-standard-4

large_model

A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).

Compute Engine machine name: n1-highmem-8

complex_model_s

A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.

Compute Engine machine name: n1-highcpu-8

complex_model_m

A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.

Compute Engine machine name: n1-highcpu-16

complex_model_l

A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.

Compute Engine machine name: n1-highcpu-32

standard_gpu

A machine equivalent to standard that also includes a single NVIDIA Tesla K80 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8 with one K80 GPU

complex_model_m_gpu

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla K80 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-k80x4

complex_model_l_gpu

A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla K80 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-32-k80x8

standard_p100

A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8-p100x1

complex_model_m_p100

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-p100x4

standard_v100

A machine equivalent to standard that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8-v100x1

large_model_v100

A machine equivalent to large_model that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-highmem-8-v100x1

complex_model_m_v100

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-v100x4

complex_model_l_v100

A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-32-v100x8

Training with GPUs and TPUs

Some scale tiers and legacy machine types include graphics processing units (GPUs). You can also attach your own choice of several GPUs if you use a Compute Engine machine type. To learn more, read about training with GPUs.

To perform training with Tensor Processing Units (TPUs), you must use the BASIC_TPU scale tier or the cloud_tpu machine type. The cloud_tpu machine type has special configuration options: you can use it together with either Compute Engine machine types or with legacy machine types, and you can configure it to use 8 TPU v2 cores or 8 TPU v3 cores. Read about how to use TPUs for your training job.

Comparing machine types

The following tables provide information that you can use to compare the Compute Engine machine types and the legacy machine types available for training when you set your scale tier to CUSTOM.

The exact specifications of the machine types are subject to change at any time.

If your training job uses TensorFlow or custom containers, you can use machine types with accelerators. Otherwise, do not use machine types with accelerators.

Machine type name     Machine type category  Accelerators                Virtual CPUs  Memory (GB)
n1-standard-4         Compute Engine         customizable                4             15
n1-standard-8         Compute Engine         customizable                8             30
n1-standard-16        Compute Engine         customizable                16            60
n1-standard-32        Compute Engine         customizable                32            120
n1-standard-64        Compute Engine         customizable                64            240
n1-standard-96        Compute Engine         customizable                96            360
n1-highmem-2          Compute Engine         customizable                2             13
n1-highmem-4          Compute Engine         customizable                4             26
n1-highmem-8          Compute Engine         customizable                8             52
n1-highmem-16         Compute Engine         customizable                16            104
n1-highmem-32         Compute Engine         customizable                32            208
n1-highmem-64         Compute Engine         customizable                64            416
n1-highmem-96         Compute Engine         customizable                96            624
n1-highcpu-16         Compute Engine         customizable                16            14.4
n1-highcpu-32         Compute Engine         customizable                32            28.8
n1-highcpu-64         Compute Engine         customizable                64            57.6
n1-highcpu-96         Compute Engine         customizable                96            86.4
cloud_tpu             TPU                    8 (TPU v2 or TPU v3 cores)
standard              legacy                 -                           4             15
large_model           legacy                 -                           8             52
complex_model_s       legacy                 -                           8             7.2
complex_model_m       legacy                 -                           16            14.4
complex_model_l       legacy                 -                           32            28.8
standard_gpu          legacy                 1 (K80 GPU)                 8             30
complex_model_m_gpu   legacy                 4 (K80 GPU)                 16            60
complex_model_l_gpu   legacy                 8 (K80 GPU)                 32            120
standard_p100         legacy                 1 (P100 GPU)                8             30
complex_model_m_p100  legacy                 4 (P100 GPU)                16            60
standard_v100         legacy                 1 (V100 GPU)                8             30
large_model_v100      legacy                 1 (V100 GPU)                16            52
complex_model_m_v100  legacy                 4 (V100 GPU)                16            60
complex_model_l_v100  legacy                 8 (V100 GPU)                32            120
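
To get a feel for the size of a custom cluster, you can total the vCPUs and memory from the table above. The following sketch does this for the Compute Engine machine types only. `MACHINE_SPECS` and `cluster_totals` are hypothetical names, and the figures are copied from the table, so they are subject to the same caveat about changing specifications.

```python
# (virtual CPUs, memory in GB) per Compute Engine machine type, copied from
# the comparison table on this page. These values may change over time.
MACHINE_SPECS = {
    "n1-standard-4": (4, 15), "n1-standard-8": (8, 30),
    "n1-standard-16": (16, 60), "n1-standard-32": (32, 120),
    "n1-standard-64": (64, 240), "n1-standard-96": (96, 360),
    "n1-highmem-2": (2, 13), "n1-highmem-4": (4, 26),
    "n1-highmem-8": (8, 52), "n1-highmem-16": (16, 104),
    "n1-highmem-32": (32, 208), "n1-highmem-64": (64, 416),
    "n1-highmem-96": (96, 624), "n1-highcpu-16": (16, 14.4),
    "n1-highcpu-32": (32, 28.8), "n1-highcpu-64": (64, 57.6),
    "n1-highcpu-96": (96, 86.4),
}


def cluster_totals(training_input):
    """Return (total vCPUs, total memory in GB) for a CUSTOM-tier cluster."""
    # Always one master, plus any workers and parameter servers requested.
    nodes = [(training_input["masterType"], 1)]
    if training_input.get("workerCount", 0) > 0:
        nodes.append((training_input["workerType"], training_input["workerCount"]))
    if training_input.get("parameterServerCount", 0) > 0:
        nodes.append((training_input["parameterServerType"],
                      training_input["parameterServerCount"]))
    total_vcpus = sum(MACHINE_SPECS[name][0] * count for name, count in nodes)
    total_memory = sum(MACHINE_SPECS[name][1] * count for name, count in nodes)
    return total_vcpus, round(total_memory, 1)


cluster = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-highcpu-16",
    "workerType": "n1-highcpu-16",
    "parameterServerType": "n1-highmem-8",
    "workerCount": 9,
    "parameterServerCount": 3,
}
vcpus, memory_gb = cluster_totals(cluster)
print(vcpus, memory_gb)
```

For the custom cluster used in the examples earlier on this page (one master and nine workers on n1-highcpu-16, three parameter servers on n1-highmem-8), this gives 184 vCPUs and 300 GB of memory.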
