Training Concepts

Cloud Machine Learning Engine enables you to easily run your TensorFlow training applications in the cloud. This page describes that capability and some of the key concepts you'll need to understand to make the most of your model training. If you'd rather get right into the training process without detailed descriptions, start by working through the steps enumerated in the basic training how-to.

How training works

Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your trainer on computing resources in the cloud. Here's an overview of the process:

  1. You create a TensorFlow application that defines your computation graph and trains your model. Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment.
  2. You get your training and verification data into a source that Cloud ML Engine can access. This usually means putting it in Google Cloud Storage, Cloud Bigtable, or another Google Cloud Platform storage service associated with the same Google Cloud Platform project that you're using for Cloud ML Engine.
  3. When your application is ready to run, it must be packaged and transferred to a Google Cloud Storage bucket that your project can access. This is automated when you use the gcloud command-line tool to run a training job.
  4. The Cloud ML Engine training service sets up resources for your job. It allocates one or more virtual machines (sometimes called training instances) based on your job configuration. Each training instance is set up by:
    • Applying the standard machine image for the version of Cloud ML Engine your job uses.
    • Loading your trainer package and installing it with pip.
    • Installing any additional packages that you specify as dependencies.
  5. The training service runs your trainer, passing the command-line arguments you specify when you create the training job.
  6. You can get information about your running job in three ways:
    • By viewing its logs in Stackdriver Logging.
    • By requesting job details or streaming logs with the gcloud command-line tool.
    • By programmatically making status requests to the training service.
  7. When your trainer succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.

If you run a distributed TensorFlow job with Cloud ML Engine, you'll specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs step 4 above on each. Your running trainer on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:

  • Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. The overview above says that the training service runs until "your trainer" succeeds or encounters an unrecoverable error; in the distributed case, it is the status of the master replica that signals the overall job status.

    If you are running a single-process job, the sole replica is the master for the job.

  • One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your trainer.

  • One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
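
As a minimal sketch of how a trainer might discover which role it was given: this assumes the replica's task is described in a JSON environment variable named TF_CONFIG with a "task" entry holding "type" and "index". That is the convention used by distributed TensorFlow and is not described on this page, so treat it as an assumption. The function name and the fallback to master are illustrative choices.

```python
import json
import os


def replica_role(environ=None):
    """Return (task_type, task_index) for this replica.

    Assumes a TF_CONFIG environment variable shaped like
    {"cluster": {...}, "task": {"type": "worker", "index": 1}}.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    # A single-process job has one replica, which acts as the master.
    return task.get("type", "master"), int(task.get("index", 0))
```

A trainer could call replica_role() once at startup and branch on the returned type, for example letting only the master export the final model.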

The typical case

There are steps in the description above where you might expect a machine learning service to intervene or control processing, but Cloud ML Engine doesn't. The training service is designed to have as little impact on your trainer as possible, so you can focus on the TensorFlow code that makes the model you want instead of being confined by a rigid structure. In practice, Cloud ML Engine neither knows nor depends on how your application is implemented.

While it's true that the training service imposes almost no restriction on your trainer's architecture, that doesn't mean that there isn't any guidance to follow. Most machine learning trainers:

  • Provide a way to get training data and evaluation data.
  • Process data instances in batches.
  • Use evaluation data to test the accuracy of the model (how often it predicts the right value).
  • Provide a way to output checkpoints at intervals in the process to get a snapshot of the model's progress.
  • Provide a way to export the trained model when the trainer finishes.
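
The shape described in the list above can be sketched without any TensorFlow at all. The skeleton below is illustrative only: the one-number threshold "model", the function names, and the JSON checkpoint files are hypothetical stand-ins for a real computation graph and TensorFlow's own checkpoint and export formats.

```python
import json
import os
import tempfile


def batches(instances, batch_size):
    """Yield successive slices of `instances`, each at most `batch_size` long."""
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


def accuracy(threshold, examples):
    """Fraction of (value, label) pairs the threshold rule classifies correctly."""
    correct = sum((value >= threshold) == label for value, label in examples)
    return correct / len(examples)


def train(train_data, eval_data, output_dir, batch_size=2):
    """Toy training loop: the 'model' is one threshold, nudged batch by batch."""
    threshold = 0.0
    for step, batch in enumerate(batches(train_data, batch_size), start=1):
        # "Training": move the threshold halfway toward the mean of the batch.
        batch_mean = sum(value for value, _ in batch) / len(batch)
        threshold += 0.5 * (batch_mean - threshold)
        # Checkpoint after every batch so progress can be inspected later.
        path = os.path.join(output_dir, f"checkpoint-{step}.json")
        with open(path, "w") as f:
            json.dump({"step": step, "threshold": threshold}, f)
    # Export the final model, then score it on the evaluation data.
    with open(os.path.join(output_dir, "model.json"), "w") as f:
        json.dump({"threshold": threshold}, f)
    return accuracy(threshold, eval_data)


if __name__ == "__main__":
    data = [(0.0, False), (1.0, True), (0.2, False), (0.8, True)]
    with tempfile.TemporaryDirectory() as out:
        print("evaluation accuracy:", train(data, data, out))
```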

Packaging your trainer

If you've never made a Python package before, this process can feel daunting. The good news is that you can rely on the gcloud command-line tool to do the heavy lifting for you. This section covers some of the specifics in more detail. You'll find detailed instructions in the packaging how-to.


There are two kinds of dependencies that your trainer might have: standard dependencies and custom dependencies. Standard dependencies are packages that you import that are available in the Python Package Index (PyPI). These are well-known libraries that can be installed with a simple pip command. Custom dependencies are generally other packages that you developed yourself, or that were developed in-house by someone else. Here's how to work with both kinds:

Including standard dependencies

You can specify your package's standard dependencies as part of its setup.py script. The Cloud ML Engine training service uses pip to install your trainer package on the training instances that it allocates for your job, and pip also installs every dependency listed in that script.
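
For illustration, here is roughly what that declaration could look like, assuming the script in question is a setuptools-based setup.py. The package name, version, and the dependency "some-pypi-package" are placeholders, not values from this page. The metadata is built in a function so the sketch can be read and checked without running a build.

```python
# Hypothetical dependency list; replace with the PyPI packages your trainer imports.
REQUIRED_PACKAGES = ["some-pypi-package>=1.0"]


def setup_kwargs():
    """Keyword arguments for setuptools.setup() describing the trainer package."""
    return {
        "name": "trainer",
        "version": "0.1",
        "packages": ["trainer"],
        # pip reads install_requires and installs these packages from PyPI
        # when it installs the trainer package itself.
        "install_requires": REQUIRED_PACKAGES,
    }


# In the real setup.py you would finish with:
#     from setuptools import setup
#     setup(**setup_kwargs())
```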

Including custom dependencies

You can specify your trainer's custom dependencies by passing their paths as part of your job configuration. Like your trainer package, any custom dependencies you include must be in a Google Cloud Storage location. The service also uses pip to install custom dependencies, so they can declare standard dependencies of their own in their setup.py scripts.

Parameters for cloud training

There are two kinds of parameters that you provide when creating a training job: job configuration parameters, and training application parameters. This section describes both of these types of parameters.

Parameter formats

You pass your parameters to the training service by setting the members of the Job resource in a JSON request string. The training parameters are defined in the TrainingInput object. If you use the gcloud command-line tool to create training jobs, the most common training parameters are defined as flags of the gcloud ml-engine jobs submit training command. You can pass the remaining parameters in a YAML configuration file. That file, called config.yaml by convention, mirrors the structure of the JSON representation of the Job resource.
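
To make the structure concrete, here is a hedged sketch that assembles a Job as a Python dictionary and serializes it to JSON. The helper name build_job is hypothetical. The field names (jobId, trainingInput, scaleTier, packageUris, pythonModule, region, args, jobDir) follow the camel-case style of the TrainingInput members mentioned on this page, but they should be checked against the API reference before use.

```python
import json


def build_job(job_id, package_uris, python_module, region,
              args=None, scale_tier="BASIC", job_dir=None):
    """Return a dict mirroring the JSON form of a Job and its TrainingInput."""
    training_input = {
        "scaleTier": scale_tier,
        "packageUris": list(package_uris),
        "pythonModule": python_module,
        "region": region,
        "args": list(args or []),
    }
    # jobDir is optional, so only include it when one was given.
    if job_dir is not None:
        training_input["jobDir"] = job_dir
    return {"jobId": job_id, "trainingInput": training_input}


if __name__ == "__main__":
    job = build_job(
        "job123",
        ["gs://bucket/path/to/package.tar.gz"],
        "trainer.task",
        "us-central1",
        args=["--my_first_arg", "first_arg_value"],
        job_dir="gs://bucket/path/to/dir",
    )
    print(json.dumps(job, indent=2))
```

A config.yaml file for the gcloud tool would carry the same trainingInput keys, written as YAML instead of JSON.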

Job configuration parameters

The Cloud ML Engine training service needs information to set up resources in the cloud and deploy your trainer application on each node in the processing cluster.

Job ID

You must give your training job a name following these rules:

  • It must be unique within your Google Cloud Platform project.
  • It may only contain mixed-case letters, digits, and underscores.
  • It must start with a letter.
  • It must be no more than 128 characters long.

You can use whatever job naming convention you want. If you don't run very many jobs, the name you choose may not be very important. If you run a lot of jobs, you may need to find your job ID in large lists. It's a good idea to make your job IDs easy to distinguish from one another.

A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name, because all jobs for a model are grouped together in ascending order.
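
One possible way to apply this convention is sketched below. The function name and the timestamp layout are illustrative choices; the regular expression encodes the naming rules listed above. Uniqueness within the project is not checked here, although a timestamp with one-second resolution makes collisions unlikely.

```python
import re
from datetime import datetime, timezone

# A letter first, then letters, digits, or underscores; 128 characters at most.
JOB_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def make_job_id(base_name, now=None):
    """Append a sortable UTC timestamp to base_name and validate the result."""
    now = now or datetime.now(timezone.utc)
    job_id = f"{base_name}_{now:%Y%m%d_%H%M%S}"
    if not JOB_ID_PATTERN.fullmatch(job_id):
        raise ValueError(f"invalid job ID: {job_id!r}")
    return job_id
```

Because the timestamp runs from year down to second, sorting job IDs by name also sorts each model's jobs chronologically.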

Scale tier

You must tell Cloud ML Engine the number and type of machines to run your training job on. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. The specific cluster configuration of each tier is not fixed: it may change as the availability of cloud resources changes over time. Instead, each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines will be allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in ML training units, also increases. The supported scale tiers are defined as part of the Cloud ML Engine API, and you use the same identifiers that are defined there when you specify your training input for the gcloud command. To learn more about graphics processing units (GPUs), see the section on training with GPUs.

For convenience, the scale tier definitions are duplicated here:

Scale tier    Description
BASIC         A single worker instance. This tier is suitable for learning how to use Cloud ML Engine and for experimenting with new models using small datasets.
STANDARD_1    Many workers and a few parameter servers.
PREMIUM_1     A large number of workers with many parameter servers.
BASIC_GPU     A single worker instance with a single NVIDIA Tesla K80 GPU.
CUSTOM        The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:
  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting. See the machine types described below.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers.
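
The CUSTOM tier guidelines above can be expressed as a small check. This is a sketch that assumes the training input is held in a plain dictionary keyed by the TrainingInput member names; the function name is hypothetical.

```python
def validate_custom_tier(training_input):
    """Return a list of problems found in a CUSTOM-tier training input dict."""
    problems = []
    # masterType is the only setting that is always required.
    if not training_input.get("masterType"):
        problems.append("masterType is required")
    # A positive count for a role requires a machine type for that role.
    for count_key, type_key in (
        ("workerCount", "workerType"),
        ("parameterServerCount", "parameterServerType"),
    ):
        count = int(training_input.get(count_key, 0))
        if count > 0 and not training_input.get(type_key):
            problems.append(f"{type_key} is required when {count_key} is 1 or more")
    return problems
```

An empty list means the configuration satisfies the three guidelines; it does not confirm that the machine type names themselves are valid.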

Machine types for the custom scale tier

If you want finer control over the processing cluster that you use to train your model, you can set the scale tier to CUSTOM and set values for the number of parameter servers and workers that you want to use along with the type of machine to use for each. You can specify a different machine type for the master, the parameter servers, and the workers, but all instances that share a role must use the same machine type. For example, you can use a large_model machine type for your parameter servers, but you can't set some parameter servers to use large_model and some to use complex_model_m. As with scale tiers, the values for the available machine types are defined in the Cloud ML Engine API, as part of the definition of the TrainingInput object.

The machine types are duplicated here for convenience:

Machine type          Description
standard              A basic machine configuration suitable for training simple models with small to moderate datasets.
large_model           A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).
complex_model_s       A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.
complex_model_m       A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.
complex_model_l       A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.
standard_gpu          A machine equivalent to standard that also includes a single NVIDIA Tesla K80 GPU.
complex_model_m_gpu   A machine equivalent to complex_model_m that also includes four NVIDIA Tesla K80 GPUs.
complex_model_l_gpu   A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla K80 GPUs.
standard_p100         A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. The availability of these GPUs is in Alpha launch stage.
complex_model_m_p100  A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. The availability of these GPUs is in Alpha launch stage.

Comparing machine types

Even though the exact specifications of the machine types are subject to change at any time, you can compare them in terms of relative capability. The following table uses rough "t-shirt" sizing to describe the machine types.

Machine type          CPU  GPUs      Memory  ML units
standard              XS   -         M       1
large_model           S    -         XL      3
complex_model_s       S    -         S       2
complex_model_m       M    -         M       3
complex_model_l       L    -         L       6
standard_gpu          XS   1 (K80)   M       3
complex_model_m_gpu   M    4 (K80)   M       12
complex_model_l_gpu   L    8 (K80)   L       24
standard_p100         XS   1 (P100)  M       No charge while in Alpha
complex_model_m_p100  M    4 (P100)  M       No charge while in Alpha

Each increase in size constitutes roughly double capacity in the area being measured. Possible sizes are (in increasing order): XS, S, M, L, XL.
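
As a rough illustration of how the ML units column might be used to compare cluster shapes, the sketch below adds up the units for every machine in a custom cluster. It assumes that the units of the individual machines simply add together, which this page does not state, and it copies the values from the table above, which may change. The P100 machine types are left out because the table lists no charge for them during Alpha.

```python
# ML units per machine, copied from the comparison table (subject to change).
ML_UNITS = {
    "standard": 1,
    "large_model": 3,
    "complex_model_s": 2,
    "complex_model_m": 3,
    "complex_model_l": 6,
    "standard_gpu": 3,
    "complex_model_m_gpu": 12,
    "complex_model_l_gpu": 24,
}


def cluster_ml_units(master_type, worker_type=None, worker_count=0,
                     parameter_server_type=None, parameter_server_count=0):
    """Sum ML units over one master plus the given workers and parameter servers."""
    total = ML_UNITS[master_type]
    if worker_count:
        total += worker_count * ML_UNITS[worker_type]
    if parameter_server_count:
        total += parameter_server_count * ML_UNITS[parameter_server_type]
    return total
```

For example, under these assumptions a complex_model_m master with four complex_model_m_gpu workers and two large_model parameter servers comes to 3 + 4 × 12 + 2 × 3 = 57 units.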

Package URIs

Your trainer must be made into a Python package and copied to a Google Cloud Storage bucket before you can run it on Cloud ML Engine. You pass the URI of your package to the training service as an element of the package URI list. The URI of a Cloud Storage location takes this form:

gs://bucket_name/path/to/package.tar.gz

Your package is an element of a package URI list, rather than a single string, because you can specify other packages as dependencies. Each URI you include is the path to another package, formatted as a tarball (*.tar.gz) or as a wheel. The training service installs each package (using pip install) on every virtual machine it allocates for your training job.
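
A small pre-submission check along these lines can catch malformed entries early. This is only a sketch: the function name is hypothetical, it assumes Cloud Storage URIs start with gs:// followed by a bucket name, and it assumes wheels carry the usual .whl extension.

```python
from urllib.parse import urlparse


def check_package_uris(package_uris):
    """Raise ValueError unless every URI is a gs:// path to a tarball or wheel."""
    for uri in package_uris:
        parsed = urlparse(uri)
        # The bucket name lands in netloc, the object path in path.
        if parsed.scheme != "gs" or not parsed.netloc:
            raise ValueError(f"not a Cloud Storage URI: {uri}")
        if not parsed.path.endswith((".tar.gz", ".whl")):
            raise ValueError(f"not a tarball or wheel: {uri}")
    return list(package_uris)
```

The check looks only at the form of each URI. It does not confirm that the object exists or that your project can read it.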

Python module

Your trainer package can contain multiple modules (Python files). You must identify the module that contains your application entry point. The training service runs that module by invoking Python, just as you would run it locally.

When you make your trainer application into a Python package, you create a namespace. For example, if you create a package named my_trainer, and your main module is called task.py, you specify that module with the name my_trainer.task.


Hyperparameter tuning

If you want to use hyperparameter tuning, you must include configuration details when you create your training job. Hyperparameter tuning is discussed in its feature overview, and instructions for configuring and using it are given in its how-to guide.


Region

Google Cloud Platform uses zones and regions to define the geographic locations of physical computing resources. Cloud ML Engine uses regions to designate its processing. When you run a training job, you specify the region that you want it to run in.

If you store your training dataset on Google Cloud Storage, you should run your training job in the same region as the bucket you're using. If you must run your job in a different region from your data bucket, your job may take longer.

Using job-dir as a common output directory

Although Cloud ML Engine doesn't intervene in your input and output, it does provide a mechanism for specifying the output directory for a job. You can set a job directory when you configure your job. When you do, the Cloud ML Engine training service:

  • Validates the directory for you so that you can fix any problems before your job runs.
  • Passes the path to your trainer as a command-line argument named --job-dir.

If you want to use this feature, you need to account for the --job-dir argument in your trainer. Capture the argument value when you parse your other parameters and use it when saving your trainer's output.
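
A minimal sketch of capturing the argument with argparse from the standard library follows. The --my_first_arg flag stands in for your own application parameters and is purely illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse trainer arguments, including the service-supplied --job-dir."""
    parser = argparse.ArgumentParser()
    # Supplied by the training service when a job directory is configured.
    parser.add_argument("--job-dir", default=None,
                        help="Directory for checkpoints and exported models.")
    # Hypothetical application parameter.
    parser.add_argument("--my_first_arg", default=None)
    # parse_known_args tolerates flags this trainer does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

argparse exposes the value as args.job_dir (the hyphen becomes an underscore), which the trainer can then use as the base path when writing checkpoints and the exported model.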

Runtime version

You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version.

Training application parameters

You can send data to your trainer when it runs in the cloud by specifying command-line arguments for your main module. Assemble the list of arguments and include it in your training configuration.

The training service accepts the arguments as a list of strings with the following format:

['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']

Each expression that you would enter in the command-line invocation of your trainer is a member of the list, given in order.
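
If your parameters live in a dictionary, a helper like the following could flatten them into that list form. The helper name is hypothetical, and it relies on Python dictionaries preserving insertion order (Python 3.7 and later).

```python
def to_arg_list(params):
    """Flatten {"name": value} pairs into ["--name", "value", ...] in order."""
    args = []
    for name, value in params.items():
        # Every element must be a string, so convert values explicitly.
        args.extend([f"--{name}", str(value)])
    return args
```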

When you use the gcloud command-line tool to submit your training job, you give the arguments as you would when running your application at the command line. Enter your arguments after all of the command flags, preceded by two hyphens:

gcloud ml-engine jobs submit training job123 \
    --packages=gs://bucket/path/to/package.tar.gz \
    --module-name=trainer.task \
    --job-dir=gs://bucket/path/to/dir \
    --region=us-central1 \
    -- \
    --my_first_arg=first_arg_value \
    --my_second_arg=second_arg_value

Input data

Every machine learning model starts with known data. There are only two limitations to the data that you can use in your trainer running on Cloud ML Engine:

  • It must be in a format that you can read and feed to your TensorFlow code.
  • It must be in a location that your code can access. This typically means that it should be stored with one of Google Cloud Platform's storage or big data services.

Output data

It is common for trainers to output data: checkpoints during training and a saved model when training is complete. You can output other data as needed by your application. As with input data, it is easiest to save your outputs to a Google Cloud Storage bucket in the same Google Cloud Platform project as your training job.

Training with GPUs

You can run your training jobs on Cloud ML Engine with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.

As with many other aspects of running training jobs, the Cloud ML Engine training service doesn't provide any special interface for working with GPUs. You can specify GPU-enabled machines to run your job, and the service allocates them for you. You assign TensorFlow Ops to GPUs in your trainer code. When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically (as always): the service runs a single replica of your trainer code on each machine.

Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.
