Training Concepts

Cloud Machine Learning Engine enables you to easily run your TensorFlow training applications in the cloud. This page describes that capability and some of the key concepts you'll need to understand to make the most of your model training. If you'd rather get right into the training process without detailed descriptions, start by working through the steps enumerated in the basic training how-to.

How training works

Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your trainer on computing resources in the cloud. Here's an overview of the process:

  1. You create a TensorFlow application that defines your computation graph and trains your model. Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment.
  2. You get your training and verification data into a source that Cloud ML Engine can access. This usually means putting it in Google Cloud Storage, Cloud Bigtable, or another Google Cloud Platform storage service associated with the same Google Cloud Platform project that you're using for Cloud ML Engine.
  3. When your application is ready to run, it must be packaged and transferred to a Google Cloud Storage bucket that your project can access. This is automated when you use the gcloud command-line tool to run a training job.
  4. The Cloud ML Engine training service sets up resources for your job. It allocates one or more virtual machines (sometimes called training instances) based on your job configuration. Each training instance is set up by:
    • Applying the standard machine image for the version of Cloud ML Engine your job uses.
    • Loading your trainer package and installing it with pip.
    • Installing any additional packages that you specify as dependencies.
  5. The training service runs your trainer, passing the command-line arguments you specify when you create the training job.
  6. You can get information about your running job in three ways:
    • On Stackdriver Logging.
    • By requesting job details or running log streaming with the gcloud command-line tool.
    • By programmatically making status requests to the training service.
  7. When your trainer succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.

If you run a distributed TensorFlow job with Cloud ML Engine, you'll specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs step 4 above on each. Your running trainer on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:

  • Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. The process overview above says that the training service runs until your trainer succeeds or encounters an unrecoverable error. In the distributed case, the status of the master replica signals the overall job status.

    If you are running a single-process job, the sole replica is the master for the job.

  • One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your trainer.

  • One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
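
As a rough sketch of how a trainer might discover which role it has been given: distributed TensorFlow conventionally describes the cluster and the current task in a TF_CONFIG environment variable that holds JSON. The snippet below assumes that variable is present on each replica with "cluster" and "task" entries. The helper name get_replica_role and the sample host names are hypothetical.

```python
import json
import os


def get_replica_role(environ=None):
    """Return (task_type, task_index) for this replica.

    Assumes a TF_CONFIG-style JSON document. Falls back to a single
    master replica when the variable is absent (single-process job).
    """
    environ = os.environ if environ is None else environ
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    task = tf_config.get("task", {})
    return task.get("type", "master"), int(task.get("index", 0))


# Hypothetical configuration for the first worker in a small cluster.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "master": ["master-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
            "ps": ["ps-0:2222"],
        },
        "task": {"type": "worker", "index": 0},
    })
}

role, index = get_replica_role(sample_env)
print(role, index)
```

A trainer could branch on the returned role, for example so that only the master exports the final model.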

The typical case

There are steps in the description above where you might expect a machine learning service to intervene or control processing, but where Cloud ML Engine doesn't. The training service is designed to have as little impact on your trainer as possible. This means you can focus on the TensorFlow code that makes the model you want instead of being confined by a rigid structure. In practice, Cloud ML Engine neither knows nor depends on the details of your application's implementation.

Although the training service imposes almost no restrictions on your trainer's architecture, there is still guidance worth following. Most machine learning trainers:

  • Provide a way to get training data and evaluation data.
  • Process data instances in batches.
  • Use evaluation data to test the accuracy of the model (how often it predicts the right value).
  • Provide a way to output checkpoints at intervals in the process to get a snapshot of the model's progress.
  • Provide a way to export the trained model when the trainer finishes.

Packaging your trainer

If you've never made a Python package before, this process can feel daunting. The good news is that you can rely on the gcloud command-line tool to do the heavy lifting for you. This section covers some of the specifics in more detail. You'll find detailed instructions in the packaging how-to.


There are two kinds of dependencies that your trainer might have: standard dependencies and custom dependencies. Standard dependencies are packages that you import that are available in the Python Package Index (PyPI). These are well-known libraries that can be installed with a simple pip command. Custom dependencies are generally other packages that you developed yourself, or that were developed in-house by someone else. Here's how to work with both kinds:

Including standard dependencies

You can specify your package's standard dependencies as part of its setup.py script. The Cloud ML Engine training service uses pip to install your trainer package on the training instances that it allocates for your job. When pip installs your package, it also installs all of the dependencies listed in that script.

Including custom dependencies

You can specify your trainer's custom dependencies by passing their paths as part of your job configuration. Like your trainer package, any included custom dependencies must be in a Google Cloud Storage location. The service also uses pip to install custom dependencies, so they can list standard dependencies of their own in their setup.py scripts.

Parameters for cloud training

There are two kinds of parameters that you provide when creating a training job: job configuration parameters, and training application parameters. This section describes both of these types of parameters.

Parameter formats

You pass your parameters to the training service by setting the members of the Job resource in a JSON request string. The training parameters are defined in the TrainingInput object.

If you use the gcloud command-line tool to create training jobs, the most common training parameters are defined as flags of the gcloud ml-engine jobs submit training command. You can pass the remaining parameters in a YAML configuration file. That file, called config.yaml by convention, mirrors the structure of the JSON representation of the Job resource. Pass the path to your configuration file to gcloud ml-engine jobs submit training with the --config flag. For example, if the path to your configuration file is config.yaml, set --config=config.yaml.
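
To make the request shape more concrete, here is a minimal sketch that assembles a Job resource as a Python dictionary and serializes it to JSON. The field names are assumed to match the TrainingInput object, so check them against the API reference before relying on them. The job name, bucket, package path, and argument values are placeholders.

```python
import json

# Placeholder values; substitute your own bucket, package, and module.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "complex_model_m",
    "workerType": "complex_model_m",
    "workerCount": 4,
    "parameterServerType": "large_model",
    "parameterServerCount": 2,
    "packageUris": ["gs://bucket/path/to/package.tar.gz"],
    "pythonModule": "trainer.task",
    "args": ["--my_first_arg", "first_arg_value"],
    "region": "us-central1",
    "jobDir": "gs://bucket/path/to/dir",
    "runtimeVersion": "1.0",
}

# The Job resource wraps the training input together with the job ID.
job_spec = {"jobId": "job123", "trainingInput": training_input}

request_body = json.dumps(job_spec, indent=2, sort_keys=True)
print(request_body)
```

The same nesting applies to a YAML configuration file, with trainingInput as the top-level key.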

Job configuration parameters

The Cloud ML Engine training service needs information to set up resources in the cloud and deploy your trainer application on each node in the processing cluster.

Job ID

You must give your training job a name following these rules:

  • It must be unique within your Google Cloud Platform project.
  • It may only contain mixed-case letters, digits, and underscores.
  • It must start with a letter.
  • It must be no more than 128 characters long.

You can use whatever job naming convention you want. If you don't run very many jobs, the name you choose may not be very important. If you run a lot of jobs, you may need to find your job ID in large lists. It's a good idea to make your job IDs easy to distinguish from one another.

A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name, because all jobs for a model are grouped together in ascending order.
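
The naming convention and the rules listed above can be sketched as a small helper. The function name make_job_id and the timestamp format are illustrative choices, not part of the service. Uniqueness within your project can't be checked locally, so the timestamp is only a practical way to avoid collisions.

```python
import re
from datetime import datetime, timezone

# Letters, digits, and underscores only; must start with a letter;
# at most 128 characters in total.
JOB_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def make_job_id(base_name, now=None):
    """Append a sortable UTC timestamp to base_name and validate the result."""
    now = now or datetime.now(timezone.utc)
    job_id = "{}_{}".format(base_name, now.strftime("%Y%m%d_%H%M%S"))
    if not JOB_ID_PATTERN.fullmatch(job_id):
        raise ValueError("Invalid job ID: {!r}".format(job_id))
    return job_id


print(make_job_id("census_model", datetime(2018, 1, 15, 9, 30, 0)))
# census_model_20180115_093000
```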

Scale tier

You must tell Cloud ML Engine the number and type of machines to run your training job on. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.

To specify a scale tier, add it to the TrainingInput object in your job configuration. If you're using the gcloud command to submit your training job, you can use the same identifiers, defined below.

Below are the scale tier definitions:

BASIC: A single worker instance. This tier is suitable for learning how to use Cloud ML Engine and for experimenting with new models using small datasets.
Compute Engine machine name: n1-standard-4

STANDARD_1: One master instance, plus four workers and three parameter servers.
Compute Engine machine name, master: n1-highcpu-8, workers: n1-highcpu-8, parameter servers: n1-standard-4

PREMIUM_1: One master instance, plus 19 workers and 11 parameter servers.
Compute Engine machine name, master: n1-highcpu-16, workers: n1-highcpu-16, parameter servers: n1-highmem-8

BASIC_GPU: A single worker instance with a single NVIDIA Tesla K80 GPU. To learn more about graphics processing units (GPUs), see the section on training with GPUs.
Compute Engine machine name: n1-standard-8 with one K80 GPU

CUSTOM: The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:
  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting. See the machine types described below.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers.
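
The guidelines above can be expressed as a small pre-submission check. This is only a sketch: validate_custom_tier is a hypothetical helper that operates on a dictionary shaped like the TrainingInput object, and the service performs its own validation regardless.

```python
def validate_custom_tier(training_input):
    """Return a list of problems with a CUSTOM-tier TrainingInput dict."""
    problems = []
    if training_input.get("scaleTier") != "CUSTOM":
        return problems  # The rules below apply only to the CUSTOM tier.
    if not training_input.get("masterType"):
        problems.append("masterType is required")
    # A non-zero count requires the matching machine type.
    for count_key, type_key in (
        ("workerCount", "workerType"),
        ("parameterServerCount", "parameterServerType"),
    ):
        count = int(training_input.get(count_key, 0))
        if count > 0 and not training_input.get(type_key):
            problems.append("{} requires {}".format(count_key, type_key))
    return problems


config = {"scaleTier": "CUSTOM", "masterType": "complex_model_m", "workerCount": 4}
print(validate_custom_tier(config))
# ['workerCount requires workerType']
```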

Machine types for the custom scale tier

If you want finer control over the processing cluster that you use to train your model, you can set the scale tier to CUSTOM and set values for the number of parameter servers and workers that you want to use, along with the type of machine to use for each. You can specify a different machine type for the master, the parameter servers, and the workers, but you can't use different machine types for individual instances within a given role. For example, you can use a large_model machine type for your parameter servers, but you can't set some parameter servers to use large_model and some to use complex_model_m. As with scale tiers, the values for the available machine types are defined in the Cloud ML Engine API, as part of the definition of the TrainingInput object.

Below are the machine type definitions:

standard: A basic machine configuration suitable for training simple models with small to moderate datasets.
Compute Engine machine name: n1-standard-4

large_model: A machine with a lot of memory, especially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).
Compute Engine machine name: n1-highmem-8

complex_model_s: A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.
Compute Engine machine name: n1-highcpu-8

complex_model_m: A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.
Compute Engine machine name: n1-highcpu-16

complex_model_l: A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.
Compute Engine machine name: n1-highcpu-32

standard_gpu: A machine equivalent to standard that also includes a single NVIDIA Tesla K80 GPU.
Compute Engine machine name: n1-standard-8 with one K80 GPU

complex_model_m_gpu: A machine equivalent to complex_model_m that also includes four NVIDIA Tesla K80 GPUs.
Compute Engine machine name: n1-standard-16-k80x4

complex_model_l_gpu: A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla K80 GPUs.
Compute Engine machine name: n1-standard-32-k80x8

standard_p100: A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. The availability of these GPUs is in Beta launch stage.
Compute Engine machine name: n1-standard-8-p100x1

complex_model_m_p100: A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. The availability of these GPUs is in Beta launch stage.
Compute Engine machine name: n1-standard-16-p100x4

To learn more about graphics processing units (GPUs), see the section on training with GPUs.

Comparing machine types

The exact specifications of the machine types are subject to change at any time. The following table gives information that you can use to compare the machine types in terms of relative capability.

Machine type                 Compute Engine machine name  Virtual CPUs  GPUs      Memory (GB)
standard                     n1-standard-4                4             -         15
large_model                  n1-highmem-8                 8             -         52
complex_model_s              n1-highcpu-8                 8             -         7.2
complex_model_m              n1-highcpu-16                16            -         14.4
complex_model_l              n1-highcpu-32                32            -         28.8
standard_gpu                 n1-standard-8                8             1 (K80)   30
complex_model_m_gpu          n1-standard-16               16            4 (K80)   60
complex_model_l_gpu          n1-standard-32               32            8 (K80)   120
standard_p100 (Beta)         n1-standard-8                8             1 (P100)  30
complex_model_m_p100 (Beta)  n1-standard-16               16            4 (P100)  60

Package URIs

Your trainer must be made into a Python package and copied to a Google Cloud Storage bucket before you can run it on Cloud ML Engine. You pass the URI of your package to the training service as an element of the package URI list. The URI of a Cloud Storage location takes this form:

gs://bucket/path/to/package.tar.gz

Your package is an element of a package URI list, rather than a single string, because you can specify other packages as dependencies. Each URI you include is the path to another package, formatted as a tarball (*.tar.gz) or as a wheel. The training service installs each package (using pip install) on every virtual machine it allocates for your training job.

Python module

Your trainer package can contain multiple modules (Python files). You must identify the module that contains your application entry point. The training service runs that module by invoking Python, just as you would run it locally.

When you make your trainer application into a Python package, you create a namespace. For example, if you create a package named my_trainer, and your main module is called task.py, you specify that module with the name my_trainer.task.


Hyperparameter tuning

If you want to use hyperparameter tuning, you must include configuration details when you create your training job. A discussion of hyperparameter tuning is given in its feature overview. Instructions on configuring and using it are given in its how-to guide.


Regions and zones

Google Cloud Platform uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you run a training job on Cloud ML Engine, you specify the region that you want it to run in.

If you store your training dataset on Google Cloud Storage, you should run your training job in the same region as the bucket you're using. If you must run your job in a different region from your data bucket, your job may take longer.

To see the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.

Using job-dir as a common output directory

Although Cloud ML Engine doesn't intervene in your input and output, it does provide a mechanism for specifying the output directory for a job. You can set a job directory when you configure your job. When you do, the Cloud ML Engine training service:

  • Validates the directory for you so that you can fix any problems before your job runs.
  • Passes the path to your application as a command-line argument named --job-dir.

If you want to use this feature, you need to account for the --job-dir argument in your application. Capture the argument value when you parse your other parameters and use it when saving your trainer's output. See the section on application parameters below.
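
A minimal sketch of capturing --job-dir with the standard argparse module follows. The --train-steps parameter and the default value "output" are hypothetical, and parse_known_args is used here only so that unrecognized arguments don't stop the trainer.

```python
import argparse


def parse_args(argv=None):
    """Parse trainer arguments, including the service-supplied --job-dir."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        default="output",
        help="Directory (local or gs://) for checkpoints and exports.",
    )
    # Hypothetical application-specific parameter.
    parser.add_argument("--train-steps", type=int, default=1000)
    # parse_known_args tolerates extra arguments this parser doesn't define.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--job-dir", "gs://bucket/path/to/dir", "--train-steps", "500"])
print(args.job_dir, args.train_steps)
```

Note that argparse exposes --job-dir as the attribute job_dir, which you can then use when building output paths.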

Runtime version

You should specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Specify a version that gives you the functionality you need. If you run the training job locally as well as in the cloud, make sure the local and cloud jobs use the same runtime version.

If you don't specify a runtime version, your training job uses the default runtime version: 1.0.

Training application parameters

You can send data to your application when it runs in the cloud by specifying command-line arguments for your main module. Assemble the list of arguments and include it in your training configuration.

The training service accepts the arguments as a list of strings with the following format:

['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']

Each expression that you would enter in the command-line invocation of your trainer is a member of the list, given in order.
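
One way to see the correspondence, as a small sketch: splitting the command line you would type locally yields the list format shown above. The standard shlex module is used here for illustration; you can equally write the list by hand.

```python
import shlex

# The arguments you might type when running the trainer locally.
local_invocation = "--my_first_arg first_arg_value --my_second_arg second_arg_value"

# Each whitespace-separated expression becomes one list member, in order.
training_args = shlex.split(local_invocation)
print(training_args)
# ['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']
```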

When you use the gcloud command-line tool to submit your training job, you give the arguments as you would when running your application at the command line. After all the gcloud-specific arguments, specify an empty -- argument, followed by all your own arguments, also known as USER_ARGS:

gcloud ml-engine jobs submit training job123 \
    --package-path=gs://bucket/path/to/package.tar.gz \
    --module-name=trainer.task \
    --job-dir=gs://bucket/path/to/dir \
    --region=us-central1 \
    -- \
    --user_first_arg=first_arg_value

  • The empty -- argument marks the end of the gcloud-specific arguments and the start of the USER_ARGS that you want to pass to your application.
  • Arguments specific to Cloud ML Engine, such as --module-name, --runtime-version, and --job-dir, must come before the empty -- argument. The Cloud ML Engine service interprets these arguments.
  • The --job-dir argument, if specified, must come before the empty -- argument, because Cloud ML Engine uses the --job-dir argument to validate the path.
  • Your application must handle the --job-dir argument too, if specified. Even though the argument comes before the empty --, the --job-dir is also passed to your application as a command-line argument.

Input data

Every machine learning model starts with known data. There are only two restrictions on the data that you can use in your trainer running on Cloud ML Engine:

  • It must be in a format that you can read and feed to your TensorFlow code.
  • It must be in a location that your code can access. This typically means that it should be stored with one of Google Cloud Platform's storage or big data services.

Output data

It is common for trainers to output data: checkpoints during training and a saved model when training is complete. You can output other data as needed by your application. As with input data, it is easiest to save your outputs to a Google Cloud Storage bucket in the same Google Cloud Platform project as your training job.

Building training jobs that are resilient to VM restarts

Google Cloud VMs may be restarted occasionally. Make sure that your training job is resilient to these restarts by saving model checkpoints regularly and by configuring your job to restore the most recent checkpoint.

You usually save model checkpoints in the Cloud Storage path that you specify with the --job-dir argument in the gcloud ml-engine jobs submit training command.

The TensorFlow Estimator API implements checkpoint functionality for you. If your model is wrapped in an Estimator, you do not need to worry about restart events on your VMs.

If it is not feasible for you to wrap your model in a TensorFlow Estimator and you want your training jobs to be resilient to restart events, you must write the checkpoint saving and restoration functionality into your trainer yourself. TensorFlow provides useful resources for this in the tf.train module, such as tf.train.Saver and tf.train.latest_checkpoint.
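
The save-and-restore pattern itself can be sketched without TensorFlow. The example below is plain Python, not the tf.train API: it writes JSON snapshots to a local directory, and on startup resumes from the newest one. The file naming scheme, the toy "weight" state, and the function names are all hypothetical. A real trainer would write to the Cloud Storage job directory using a library that can read and write gs:// paths, which os.listdir and open cannot.

```python
import json
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^ckpt-(\d+)\.json$")


def latest_checkpoint(checkpoint_dir):
    """Return (step, path) of the newest checkpoint, or (0, None) if none exist."""
    best_step, best_path = 0, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_step, best_path


def train(checkpoint_dir, total_steps, save_every=10):
    """Run (or resume) a toy training loop, checkpointing at intervals."""
    step, path = latest_checkpoint(checkpoint_dir)
    state = {"weight": 0.0}
    if path:  # Resume from the most recent snapshot after a restart.
        with open(path) as f:
            state = json.load(f)
    while step < total_steps:
        step += 1
        state["weight"] += 0.1  # Stand-in for a real training step.
        if step % save_every == 0 or step == total_steps:
            target = os.path.join(checkpoint_dir, "ckpt-{}.json".format(step))
            with open(target, "w") as f:
                json.dump(state, f)
    return step, state


workdir = tempfile.mkdtemp()
train(workdir, total_steps=25)                # First run ends at step 25.
step, state = train(workdir, total_steps=40)  # "Restarted" run resumes from 25.
print(step, round(state["weight"], 1))
```

The second call does not repeat the first 25 steps, which is the behavior you want after a VM restart.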

Training with GPUs

You can run your training jobs on Cloud ML Engine with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.

As with many other aspects of running training jobs, the Cloud ML Engine training service doesn't provide any special interface for working with GPUs. You can specify GPU-enabled machines to run your job, and the service allocates them for you. You assign TensorFlow Ops to GPUs in your trainer code. When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically (as always): the service runs a single replica of your trainer code on each machine.

Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.

See how to use GPUs for your training job.
