Specifying machine types or scale tiers

When you run a training job on AI Platform, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.

Specifying your configuration

How you specify your cluster configuration depends on how you plan to run your training job:

gcloud

Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want. By convention the name is config.yaml.

The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
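
If you generate job configurations from a script, you can write a file like this one with a few lines of Python. The following is a minimal sketch rather than part of the AI Platform tooling: the helper name render_config is hypothetical, and it only handles flat string and integer values such as those shown above.

```python
def render_config(training_input):
    """Render a flat TrainingInput dict as config.yaml text.

    Hypothetical helper: it handles only simple string and integer
    values, which is all the example above needs.
    """
    lines = ['trainingInput:']
    for key, value in training_input.items():
        lines.append('  {}: {}'.format(key, value))
    return '\n'.join(lines) + '\n'


custom_cluster = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerType': 'complex_model_m',
    'parameterServerType': 'large_model',
    'workerCount': 9,
    'parameterServerCount': 3,
}

config_text = render_config(custom_cluster)
print(config_text)

# To save the result as a configuration file:
# with open('config.yaml', 'w') as f:
#     f.write(config_text)
```

For nested values, a YAML library is a better choice than string formatting.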

Provide the path to the YAML file in the --config flag when running the gcloud ai-platform jobs submit training command:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

Alternatively, you may specify cluster configuration details with command-line flags, rather than in a configuration file. Learn more about how to use these flags.

The following example shows how to submit a training job with the same configuration as the previous example, but without using a configuration file:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --scale-tier custom \
        --master-machine-type complex_model_m \
        --worker-machine-type complex_model_m \
        --parameter-server-machine-type large_model \
        --worker-count 9 \
        --parameter-server-count 3 \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

See more details on how to run a training job.

Python

Specify the scale tier identifier and machine types in the TrainingInput object in your job configuration.

The following example shows how to build a Job representation for a job with a custom processing cluster.

# Placeholder job name; replace it with your own.
my_job_name = 'my_training_job'

training_inputs = {'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerType': 'complex_model_m',
    'parameterServerType': 'large_model',
    'workerCount': 9,
    'parameterServerCount': 3,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '1.14',
    'pythonVersion': '3.5'}

job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown, to match the names in the Job and TrainingInput resources.
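
Once you have a Job representation, you send it to the AI Platform Training API. One way to do this is with the Google API Client Library for Python. The sketch below assumes that the google-api-python-client package is installed and that credentials are available in your environment. The project ID and job name are placeholders, and build_job_spec and submit_job are hypothetical helper names.

```python
def build_job_spec(job_id, training_input):
    """Wrap a TrainingInput dict in a Job representation."""
    return {'jobId': job_id, 'trainingInput': training_input}


def submit_job(project_id, job_spec):
    """Send the job to AI Platform Training and return the API response.

    Assumes google-api-python-client is installed and credentials are
    configured. The import is local so that building a job spec does
    not require the library.
    """
    from googleapiclient import discovery

    ml = discovery.build('ml', 'v1')
    request = ml.projects().jobs().create(
        parent='projects/{}'.format(project_id), body=job_spec)
    return request.execute()


# Placeholder values. A real job also needs the package, module,
# region, and runtime fields shown in the example above.
example_spec = build_job_spec(
    'my_training_job',
    {'scaleTier': 'CUSTOM', 'masterType': 'complex_model_m'})

# response = submit_job('my-project-id', example_spec)
```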

Scale tiers

Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.

AI Platform Training does not support distributed training or training with accelerators for scikit-learn or XGBoost code. If your training job runs scikit-learn or XGBoost code, you must set the scale tier to either BASIC or CUSTOM.
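
If you submit jobs for several frameworks from the same script, you can check this restriction before submitting. The following is a minimal sketch: the function name and the framework labels are hypothetical and are not identifiers used by AI Platform.

```python
# Scale tiers permitted for scikit-learn and XGBoost training code.
SKLEARN_XGBOOST_TIERS = {'BASIC', 'CUSTOM'}


def tier_allowed(framework, scale_tier):
    """Return True if scale_tier can be used with the given framework.

    framework is a label chosen for this sketch: 'tensorflow',
    'custom-container', 'scikit-learn', or 'xgboost'.
    """
    if framework in ('scikit-learn', 'xgboost'):
        return scale_tier in SKLEARN_XGBOOST_TIERS
    return True


print(tier_allowed('scikit-learn', 'BASIC'))    # True
print(tier_allowed('xgboost', 'STANDARD_1'))    # False
print(tier_allowed('tensorflow', 'PREMIUM_1'))  # True
```

Note that passing this check with CUSTOM is not sufficient on its own: a scikit-learn or XGBoost job must also avoid workers, parameter servers, and accelerators.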

Below are the scale tier identifiers:

AI Platform scale tier
BASIC

A single worker instance. This tier is suitable for learning how to use AI Platform and for experimenting with new models using small datasets.

Compute Engine machine name: n1-standard-4

STANDARD_1

One master instance, plus four workers and three parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-highcpu-8, workers: n1-highcpu-8, parameter servers: n1-standard-4

PREMIUM_1

One master instance, plus 19 workers and 11 parameter servers. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-highcpu-16, workers: n1-highcpu-16, parameter servers: n1-highmem-8

BASIC_GPU

A single worker instance with a single NVIDIA Tesla K80 GPU. To learn more about graphics processing units (GPUs), see the section on training with GPUs. Only use this scale tier if you are training with TensorFlow or using a custom container.

Compute Engine machine name: n1-standard-8 with one K80 GPU

BASIC_TPU

A master VM and a Cloud TPU with eight TPU v2 cores. See how to use TPUs for your training job. Only use this scale tier if you are training with TensorFlow or using custom containers.

Compute Engine machine name, master: n1-standard-4, workers: Cloud TPU (8 TPU v2 cores)

CUSTOM

The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:

  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting. See the machine types described below.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes. Only specify workers if you are training with TensorFlow or using custom containers.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers. Only specify parameter servers if you are training with TensorFlow or using custom containers.
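
The guidelines above can be expressed as a small local check that you run before submitting a job. This is a sketch only: check_custom_tier is a hypothetical helper, and the service may apply further validation of its own.

```python
def check_custom_tier(training_input):
    """Return a list of problems with a CUSTOM-tier TrainingInput dict.

    An empty list means the dict follows the guidelines above.
    """
    problems = []
    if training_input.get('scaleTier') != 'CUSTOM':
        problems.append('scaleTier must be CUSTOM')
    if not training_input.get('masterType'):
        problems.append('masterType is required')
    if (training_input.get('workerCount', 0) > 0
            and not training_input.get('workerType')):
        problems.append('workerType is required when workerCount > 0')
    if (training_input.get('parameterServerCount', 0) > 0
            and not training_input.get('parameterServerType')):
        problems.append(
            'parameterServerType is required when parameterServerCount > 0')
    return problems


complete = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerType': 'complex_model_m',
    'parameterServerType': 'large_model',
    'workerCount': 9,
    'parameterServerCount': 3,
}
incomplete = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerCount': 9,
}

print(check_custom_tier(complete))    # []
print(check_custom_tier(incomplete))  # ['workerType is required when workerCount > 0']
```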

Machine types for the custom scale tier

Use a custom scale tier for finer control over the processing cluster that you use to train your model. Specify the configuration in the TrainingInput object in your job configuration. If you're using the gcloud ai-platform jobs submit training command to submit your training job, you can use the same identifiers:

  • Set the scale tier (scaleTier) to CUSTOM.

  • Set values for the number of parameter servers (parameterServerCount) and workers (workerCount) that you need.

    AI Platform Training only supports distributed training when you train with TensorFlow or use a custom container. If your training job runs scikit-learn or XGBoost code, do not specify workers or parameter servers.

  • Set the machine type for your master worker (masterType). If you have chosen to use parameter servers or workers, set machine types for them in the parameterServerType and workerType fields.

    You can specify different machine types for masterType, parameterServerType, and workerType, but you can't use different machine types for individual instances. For example, you can use a large_model machine type for your parameter servers, but you can't set some parameter servers to use large_model and some to use complex_model_m.

  • If you need just one worker with a custom configuration (not a full cluster), you should specify a custom scale tier with a machine type for the master only. That gives you just the single worker. Here's an example config.yaml file:

    trainingInput:
      scaleTier: CUSTOM
      masterType: complex_model_m
    

Below are the machine type identifiers:

AI Platform machine name
standard

A basic machine configuration suitable for training simple models with small to moderate datasets.

Compute Engine machine name: n1-standard-4

large_model

A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).

Compute Engine machine name: n1-highmem-8

complex_model_s

A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.

Compute Engine machine name: n1-highcpu-8

complex_model_m

A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.

Compute Engine machine name: n1-highcpu-16

complex_model_l

A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.

Compute Engine machine name: n1-highcpu-32

standard_gpu

A machine equivalent to standard that also includes a single NVIDIA Tesla K80 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8 with one K80 GPU

complex_model_m_gpu

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla K80 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-k80x4

complex_model_l_gpu

A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla K80 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-32-k80x8

standard_p100

A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8-p100x1

complex_model_m_p100

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-p100x4

standard_v100

A machine equivalent to standard that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-8-v100x1

large_model_v100

A machine equivalent to large_model that also includes a single NVIDIA Tesla V100 GPU. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-highmem-8-v100x1

complex_model_m_v100

A machine equivalent to complex_model_m that also includes four NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-16-v100x4

complex_model_l_v100

A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla V100 GPUs. Only use this machine type if you are training with TensorFlow or using custom containers.

Compute Engine machine name: n1-standard-32-v100x8

cloud_tpu

A TPU VM including one Cloud TPU with eight TPU v2 cores by default. Only use this machine type if you are training with TensorFlow or using custom containers.

When you configure your training job, you can optionally use an acceleratorConfig with the cloud_tpu machine type. If you do this, your Cloud TPU can use TPU v2 cores or TPU v3 cores (beta). Learn more about this type of configuration.

Compute Engine machine types

You can also use the names of certain Compute Engine predefined machine types instead of the AI Platform machine types listed above. This provides more flexibility when allocating computing resources for your training job. If you are training with TensorFlow or using custom containers, this also allows you to customize how your job uses GPUs.

Below are the Compute Engine machine type identifiers you can use directly:

  • n1-standard-4
  • n1-standard-8
  • n1-standard-16
  • n1-standard-32
  • n1-standard-64
  • n1-standard-96
  • n1-highmem-2
  • n1-highmem-4
  • n1-highmem-8
  • n1-highmem-16
  • n1-highmem-32
  • n1-highmem-64
  • n1-highmem-96
  • n1-highcpu-16
  • n1-highcpu-32
  • n1-highcpu-64
  • n1-highcpu-96

To learn more, read about the resources provided by Compute Engine machine types.

If your training job configuration uses multiple machines, you cannot mix AI Platform machine types with Compute Engine machine types. Your master worker, parameter servers, and workers must all use machine types from one group or the other.

For example, if you configure masterType to be n1-highcpu-32 (a Compute Engine machine type), you cannot set workerType or parameterServerType to complex_model_m (an AI Platform machine type), but you can set them to n1-highcpu-16 (another Compute Engine machine type).
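
You can check for this kind of mix before you submit a job. The following sketch uses hypothetical helper names and relies on one assumption: every Compute Engine machine type listed above starts with "n1-", and any other name is treated as an AI Platform machine type.

```python
def machine_group(machine_type):
    """Classify a machine type name by the group it belongs to.

    Assumption for this sketch: names that start with 'n1-' are
    Compute Engine machine types; all others are AI Platform types.
    """
    if machine_type.startswith('n1-'):
        return 'compute_engine'
    return 'ai_platform'


def uses_single_group(training_input):
    """Return True if all configured machine types share one group."""
    keys = ('masterType', 'workerType', 'parameterServerType')
    groups = {machine_group(training_input[key])
              for key in keys if key in training_input}
    return len(groups) <= 1


mixed = {'masterType': 'n1-highcpu-32', 'workerType': 'complex_model_m'}
consistent = {'masterType': 'n1-highcpu-32', 'workerType': 'n1-highcpu-16'}

print(uses_single_group(mixed))       # False
print(uses_single_group(consistent))  # True
```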

Training with GPUs and TPUs

Some scale tiers and AI Platform machine types include graphics processing units (GPUs). You can also attach your own choice of several GPUs if you use a Compute Engine machine type. To learn more, read about training with GPUs.
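
As an illustration of attaching a GPU to a Compute Engine machine type, the following sketch adds an accelerator configuration to a TrainingInput dictionary. The field names masterConfig, workerConfig, and parameterServerConfig, and the accelerator type identifier NVIDIA_TESLA_K80, are assumptions based on the TrainingInput API reference and do not appear on this page, so verify them against the GPU guide before relying on them. The helper name attach_gpus is hypothetical.

```python
import copy


def attach_gpus(training_input, replica, gpu_type, count):
    """Return a copy of training_input with GPUs attached to one replica type.

    replica is 'master', 'worker', or 'parameterServer'. The resulting
    field names (for example masterConfig) are assumptions; check them
    against the current TrainingInput reference.
    """
    updated = copy.deepcopy(training_input)
    updated[replica + 'Config'] = {
        'acceleratorConfig': {'type': gpu_type, 'count': count},
    }
    return updated


# One K80 GPU on an n1-standard-8 master, similar in shape to the
# standard_gpu machine type.
gpu_training_inputs = attach_gpus(
    {'scaleTier': 'CUSTOM', 'masterType': 'n1-standard-8'},
    'master', 'NVIDIA_TESLA_K80', 1)

print(gpu_training_inputs)
```

Which GPU types and counts are valid depends on the machine type and region, so treat the values here as placeholders.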

To learn more about Tensor Processing Units (TPUs), see how to use TPUs for your training job.

Comparing machine types

The following tables provide information that you can use to compare the AI Platform machine types and the Compute Engine machine types available for training when you set your scale tier to CUSTOM.

The exact specifications of the machine types are subject to change at any time.

AI Platform machine types

If your training job uses TensorFlow or custom containers, you can use machine types with accelerators. Otherwise, do not use machine types with accelerators.

Machine type          Accelerators      Virtual CPUs  Memory (GB)
standard              -                 4             15
large_model           -                 8             52
complex_model_s       -                 8             7.2
complex_model_m       -                 16            14.4
complex_model_l       -                 32            28.8
standard_gpu          1 (K80 GPU)       8             30
complex_model_m_gpu   4 (K80 GPU)       16            60
complex_model_l_gpu   8 (K80 GPU)       32            120
standard_p100         1 (P100 GPU)      8             30
complex_model_m_p100  4 (P100 GPU)      16            60
standard_v100         1 (V100 GPU)      8             30
large_model_v100      1 (V100 GPU)      16            52
complex_model_m_v100  4 (V100 GPU)      16            60
complex_model_l_v100  8 (V100 GPU)      32            120
cloud_tpu             8 (TPU v2 cores)
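
To get a sense of the overall size of a custom cluster, you can add up the figures in the table above. The following sketch does this for the machine types without accelerators. The helper name cluster_totals is hypothetical, and the figures are copied from the table, so they are subject to the same changes.

```python
# (virtual CPUs, memory in GB) copied from the table above.
AI_PLATFORM_SPECS = {
    'standard': (4, 15),
    'large_model': (8, 52),
    'complex_model_s': (8, 7.2),
    'complex_model_m': (16, 14.4),
    'complex_model_l': (32, 28.8),
}


def cluster_totals(training_input, specs=AI_PLATFORM_SPECS):
    """Return (total vCPUs, total memory in GB) for a CUSTOM cluster."""
    replicas = [
        (training_input['masterType'], 1),
        (training_input.get('workerType'),
         training_input.get('workerCount', 0)),
        (training_input.get('parameterServerType'),
         training_input.get('parameterServerCount', 0)),
    ]
    cpus = 0
    memory = 0.0
    for machine_type, count in replicas:
        if machine_type and count:
            type_cpus, type_memory = specs[machine_type]
            cpus += type_cpus * count
            memory += type_memory * count
    return cpus, memory


custom_cluster = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerType': 'complex_model_m',
    'parameterServerType': 'large_model',
    'workerCount': 9,
    'parameterServerCount': 3,
}

cpus, memory = cluster_totals(custom_cluster)
print(cpus, round(memory, 1))  # 184 300.0
```

Here the master and nine workers contribute 10 x 16 = 160 virtual CPUs and 10 x 14.4 = 144 GB, and the three parameter servers contribute 3 x 8 = 24 virtual CPUs and 3 x 52 = 156 GB.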

Compute Engine machine types

Compute Engine machine types do not use accelerators by default, but if your training job uses TensorFlow or custom containers, you can attach GPUs to these virtual machines in your training configuration.

Machine type    Virtual CPUs  Memory (GB)
n1-standard-4   4             15
n1-standard-8   8             30
n1-standard-16  16            60
n1-standard-32  32            120
n1-standard-64  64            240
n1-standard-96  96            360
n1-highmem-2    2             13
n1-highmem-4    4             26
n1-highmem-8    8             52
n1-highmem-16   16            104
n1-highmem-32   32            208
n1-highmem-64   64            416
n1-highmem-96   96            624
n1-highcpu-16   16            14.4
n1-highcpu-32   32            28.8
n1-highcpu-64   64            57.6
n1-highcpu-96   96            86.4
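
If you know roughly how many virtual CPUs and how much memory each machine needs, you can use the table above to shortlist a machine type. The following sketch picks the listed type with the fewest virtual CPUs, and then the least memory, that meets both minimums. The helper name smallest_fit is hypothetical, and the smallest machine is not necessarily the cheapest, so check the pricing page before deciding.

```python
# (virtual CPUs, memory in GB) copied from the table above.
COMPUTE_ENGINE_SPECS = {
    'n1-standard-4': (4, 15),
    'n1-standard-8': (8, 30),
    'n1-standard-16': (16, 60),
    'n1-standard-32': (32, 120),
    'n1-standard-64': (64, 240),
    'n1-standard-96': (96, 360),
    'n1-highmem-2': (2, 13),
    'n1-highmem-4': (4, 26),
    'n1-highmem-8': (8, 52),
    'n1-highmem-16': (16, 104),
    'n1-highmem-32': (32, 208),
    'n1-highmem-64': (64, 416),
    'n1-highmem-96': (96, 624),
    'n1-highcpu-16': (16, 14.4),
    'n1-highcpu-32': (32, 28.8),
    'n1-highcpu-64': (64, 57.6),
    'n1-highcpu-96': (96, 86.4),
}


def smallest_fit(min_cpus, min_memory_gb, specs=COMPUTE_ENGINE_SPECS):
    """Return the smallest listed machine type meeting both minimums.

    Candidates are ordered by virtual CPUs, then by memory. Returns
    None if no listed machine type is large enough.
    """
    candidates = [
        (cpus, memory, name)
        for name, (cpus, memory) in specs.items()
        if cpus >= min_cpus and memory >= min_memory_gb
    ]
    return min(candidates)[2] if candidates else None


print(smallest_fit(8, 40))   # n1-highmem-8
print(smallest_fit(16, 10))  # n1-highcpu-16
print(smallest_fit(128, 1))  # None
```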
