Specifying Machine Types or Scale Tiers

When running a training job on AI Platform, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.

Specifying your configuration

How you specify your cluster configuration depends on how you plan to run your training job:

gcloud

Create a YAML configuration file representing the TrainingInput object, and specify the scale tier identifier and machine types in the configuration file. You can name this file whatever you want; by convention, the name is config.yaml.

The following example shows the contents of the configuration file, config.yaml, for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m

Provide the path to the YAML file in the --config flag when running the gcloud ai-platform jobs submit training command:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

Alternatively, if you install the gcloud beta component (run gcloud components install beta to install or update it), you can specify cluster configuration details with command-line flags rather than in a configuration file. Learn more about how to use these flags.

The following example shows how to submit a training job with the same configuration as the previous example, but without using a configuration file:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --scale-tier custom \
        --master-machine-type complex_model_m \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

See more details on how to run a training job.

Python

Specify the scale tier identifier and machine types in the TrainingInput object in your job configuration.

The following example shows how to build a Job representation for a job with a custom processing cluster.

training_inputs = {'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '1.13',
    'pythonVersion': '3.5'}

job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown, to match the names in the Job and TrainingInput resources.
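Putting this together, the following sketch builds the job specification and checks its shape before submission. The project ID, job name, and package path are placeholders, and the actual submission call (which mirrors the projects.jobs.create REST method) is commented out so the sketch can run without credentials:

```python
# Build the job specification for a custom processing cluster.
# Placeholder values: replace the bucket paths, job name, and project ID.
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '1.13',
    'pythonVersion': '3.5',
}
job_spec = {'jobId': 'my_job_name', 'trainingInput': training_inputs}

# Sanity-check the keys AI Platform expects before submitting.
assert job_spec['trainingInput']['scaleTier'] == 'CUSTOM'
assert 'masterType' in job_spec['trainingInput']

# To actually submit (requires google-api-python-client and credentials):
# from googleapiclient import discovery
# ml = discovery.build('ml', 'v1')
# request = ml.projects().jobs().create(
#     parent='projects/my-project-id', body=job_spec)
# response = request.execute()
```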

Scale tiers

Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.

Below are the scale tier identifiers:

AI Platform scale tier
BASIC

A single worker instance. This tier is suitable for learning how to use AI Platform and for experimenting with new models using small datasets.

Compute Engine machine name: n1-standard-4

CUSTOM

The CUSTOM tier is not a set tier; rather, it enables you to use your own machine specification.

  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only supported setting for scikit-learn and XGBoost. See the machine types described below.

Machine types for the custom scale tier

Use a custom scale tier for finer control over the machine type that you use to train your model. Specify the configuration in the TrainingInput object in your job configuration. If you're using the gcloud ai-platform jobs submit training command to submit your training job, you can use the same identifiers:

  • Set the scale tier (scaleTier) to CUSTOM.
  • Set the machine type for your master worker (masterType).
  • Since distributed training is not supported for scikit-learn and XGBoost, do not specify parameterServerType or workerType.
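The rules above can be checked programmatically before a job is submitted. The following is a minimal sketch; the validate_training_input helper is hypothetical and not part of any AI Platform client library:

```python
def validate_training_input(training_input):
    """Check a TrainingInput dict against the custom-tier rules above.

    Hypothetical helper, not part of the AI Platform API.
    """
    if training_input.get('scaleTier') != 'CUSTOM':
        raise ValueError("scaleTier must be 'CUSTOM' to set machine types")
    if 'masterType' not in training_input:
        raise ValueError('masterType is required with the CUSTOM tier')
    # Distributed training is not supported for scikit-learn and XGBoost,
    # so these keys must not be present.
    for key in ('workerType', 'parameterServerType'):
        if key in training_input:
            raise ValueError(key + ' is not supported for scikit-learn and XGBoost')
    return True

validate_training_input({'scaleTier': 'CUSTOM', 'masterType': 'complex_model_m'})
```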

Here's an example config.yaml file:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m

Below are the machine type identifiers:

AI Platform machine name
standard

A basic machine configuration suitable for training simple models with small to moderate datasets.

Compute Engine machine name: n1-standard-4

large_model

A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).

Compute Engine machine name: n1-highmem-8

complex_model_s

A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.

Compute Engine machine name: n1-highcpu-8

complex_model_m

A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.

Compute Engine machine name: n1-highcpu-16

complex_model_l

A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.

Compute Engine machine name: n1-highcpu-32
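The name mappings listed above can be collected into a simple lookup table, which is convenient when translating a configuration between the two naming schemes. A minimal sketch:

```python
# Compute Engine equivalents of the AI Platform machine names listed above.
AI_PLATFORM_TO_COMPUTE_ENGINE = {
    'standard': 'n1-standard-4',
    'large_model': 'n1-highmem-8',
    'complex_model_s': 'n1-highcpu-8',
    'complex_model_m': 'n1-highcpu-16',
    'complex_model_l': 'n1-highcpu-32',
}

print(AI_PLATFORM_TO_COMPUTE_ENGINE['complex_model_m'])  # n1-highcpu-16
```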

Compute Engine machine types

You can also use the names of certain Compute Engine predefined machine types instead of the AI Platform machine types listed above. This provides more flexibility when allocating computing resources for your training job.

Below are the Compute Engine machine type identifiers you can use directly:

  • n1-standard-4
  • n1-standard-8
  • n1-standard-16
  • n1-standard-32
  • n1-standard-64
  • n1-standard-96
  • n1-highmem-2
  • n1-highmem-4
  • n1-highmem-8
  • n1-highmem-16
  • n1-highmem-32
  • n1-highmem-64
  • n1-highmem-96
  • n1-highcpu-16
  • n1-highcpu-32
  • n1-highcpu-64
  • n1-highcpu-96

To learn more, read about the resources provided by Compute Engine machine types.

Comparing machine types

The following tables provide information that you can use to compare the AI Platform machine types and the Compute Engine machine types available for training when you set your scale tier to CUSTOM.

The exact specifications of the machine types are subject to change at any time.

AI Platform machine types

Machine type      Virtual CPUs   Memory (GB)
standard          4              15
large_model       8              52
complex_model_s   8              7.2
complex_model_m   16             14.4
complex_model_l   32             28.8

Compute Engine machine types

Machine type     Virtual CPUs   Memory (GB)
n1-standard-4    4              15
n1-standard-8    8              30
n1-standard-16   16             60
n1-standard-32   32             120
n1-standard-64   64             240
n1-standard-96   96             360
n1-highmem-2     2              13
n1-highmem-4     4              26
n1-highmem-8     8              52
n1-highmem-16    16             104
n1-highmem-32    32             208
n1-highmem-64    64             416
n1-highmem-96    96             624
n1-highcpu-16    16             14.4
n1-highcpu-32    32             28.8
n1-highcpu-64    64             57.6
n1-highcpu-96    96             86.4
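The "roughly twice the cores and double the memory" relationship between the complex_model tiers can be read directly off the AI Platform table above. A quick check in Python, using the specs from that table:

```python
# (vCPUs, memory in GB) for the AI Platform machine types, from the table above.
AI_PLATFORM_SPECS = {
    'standard': (4, 15),
    'large_model': (8, 52),
    'complex_model_s': (8, 7.2),
    'complex_model_m': (16, 14.4),
    'complex_model_l': (32, 28.8),
}

# Each complex_model step doubles both the vCPU count and the memory.
s, m, l = (AI_PLATFORM_SPECS[k] for k in
           ('complex_model_s', 'complex_model_m', 'complex_model_l'))
assert m[0] == 2 * s[0] and abs(m[1] - 2 * s[1]) < 1e-9
assert l[0] == 2 * m[0] and abs(l[1] - 2 * m[1]) < 1e-9
```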
