Cloud Machine Learning Engine enables you to easily run your TensorFlow training applications in the cloud. This page describes that capability and some of the key concepts you'll need to understand to make the most of your model training. If you'd rather get right into the training process without detailed descriptions, start by working through the steps enumerated in the basic training how-to.
How training works
Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your trainer on computing resources in the cloud. Here's an overview of the process:
- You create a TensorFlow application that defines your computation graph and trains your model. Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment.
- You get your training and verification data into a source that Cloud ML Engine can access. This usually means putting it in Google Cloud Storage, Cloud Bigtable, or another Google Cloud Platform storage service associated with the same Google Cloud Platform project that you're using for Cloud ML Engine.
- When your application is ready to run, it must be packaged and transferred
to a Google Cloud Storage bucket that your project can access. This is
automated when you use the
gcloud command-line tool to run a training job.
- The Cloud ML Engine training service sets up resources for your
job. It allocates one or more virtual machines (sometimes called training
instances) based on your job configuration. Each training instance is set up by:
- Applying the standard machine image for the version of Cloud ML Engine your job uses.
- Loading your trainer package and installing it with
- Installing any additional packages that you specify as dependencies.
- The training service runs your trainer, passing the command-line arguments you specify when you create the training job.
- You can get information about your running job in three ways:
- On Stackdriver Logging
- By requesting job details or running log streaming with the
- By programmatically making status requests to the training service.
- When your trainer succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.
If you run a distributed TensorFlow job with Cloud ML Engine, you'll specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs step 4 above on each. Your running trainer on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:
- Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. The process overview above says that the training service runs until "your trainer" succeeds or encounters an unrecoverable error; in the distributed case, the status of the master replica signals the overall job status. If you are running a single-process job, the sole replica is the master for the job.
- One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your trainer.
- One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
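As a rough illustration of how a replica can find out which role it was given: the training service describes the cluster to each replica through the TF_CONFIG environment variable, which holds a JSON string. The sketch below assumes that the string carries a `task` object with `type` and `index` fields and a `cluster` object keyed by role. The helper name `describe_replica` and the sample host names are made up for this example.

```python
import json
import os


def describe_replica(environ=None):
    """Return (task_type, task_index, cluster) parsed from TF_CONFIG.

    Falls back to a single "master" task when TF_CONFIG is absent,
    matching the rule that a sole replica is the master for the job.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    return (
        task.get("type", "master"),
        int(task.get("index", 0)),
        config.get("cluster", {}),
    )


# Made-up example value; real host names are assigned by the service.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "master": ["master-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
            "ps": ["ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}

task_type, task_index, cluster = describe_replica(sample_env)
print(task_type, task_index, sorted(cluster))
```

A trainer could branch on `task_type` so that, for example, only the master exports the final model.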
The typical case
Several steps in the description above are places where you might expect a machine learning service to intervene or control processing, but Cloud ML Engine doesn't. The training service is designed to have as little impact on your trainer as possible, so you can focus on the TensorFlow code that makes the model you want instead of being confined by a rigid structure. In practice, Cloud ML Engine neither knows nor depends on your application's implementation.
Although the training service imposes almost no restrictions on your trainer's architecture, there is still guidance worth following. Most machine learning trainers:
- Provide a way to get training data and evaluation data.
- Process data instances in batches.
- Use evaluation data to test the accuracy of the model (how often it predicts the right value).
- Provide a way to output checkpoints at intervals in the process to get a snapshot of the model's progress.
- Provide a way to export the trained model when the trainer finishes.
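The outline below sketches how those five responsibilities can fit together. It is deliberately framework-free so that it runs anywhere: the "model" is a trivial stand-in that predicts the most common label, and every function name and file name is hypothetical. A real trainer would do the same jobs with TensorFlow.

```python
import json
import os
import tempfile


def batches(examples, batch_size):
    """Yield successive fixed-size slices of the example list."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def accuracy(model, examples):
    """Fraction of examples whose label matches the model's prediction."""
    correct = sum(1 for _, label in examples if label == model["prediction"])
    return correct / len(examples)


def train(train_data, eval_data, output_dir, batch_size=2, checkpoint_every=2):
    # Stand-in "model": counts labels and predicts the most common one.
    model = {"counts": {}, "prediction": None}
    for step, batch in enumerate(batches(train_data, batch_size), start=1):
        for _, label in batch:
            model["counts"][label] = model["counts"].get(label, 0) + 1
        model["prediction"] = max(model["counts"], key=model["counts"].get)
        if step % checkpoint_every == 0:
            # Periodic snapshot of training progress.
            path = os.path.join(output_dir, "checkpoint-%d.json" % step)
            with open(path, "w") as f:
                json.dump(model, f)
    # Final export once training finishes.
    with open(os.path.join(output_dir, "export.json"), "w") as f:
        json.dump(model, f)
    return model, accuracy(model, eval_data)


train_data = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
eval_data = [(4, "a"), (5, "b")]
out_dir = tempfile.mkdtemp()
model, eval_accuracy = train(train_data, eval_data, out_dir)
print(eval_accuracy, sorted(os.listdir(out_dir)))
```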
Packaging your trainer
If you've never made a Python package before, this process can feel daunting.
The good news is that you can rely on the
gcloud command-line tool to do the
heavy lifting for you. This section covers some of the specifics in more detail.
You'll find detailed instructions in the
There are two kinds of dependencies that your trainer might have: standard
dependencies and custom dependencies. Standard dependencies are packages that
you import that are available in the
Python Package Index
(PyPI). These are well-known libraries that can be installed with a simple
command. Custom dependencies are generally other packages that you developed
yourself, or that were developed in-house by someone else. Here's how to work
with both kinds:
Including standard dependencies
You can specify your package's standard dependencies as part of its
script. The Cloud ML Engine training service uses
pip to install your
trainer package on training instances that it allocates for your job. A
install includes installing all of the dependencies listed in the
Including custom dependencies
You can specify your trainer's custom dependencies by passing their paths as
part of your job configuration. Like your trainer package, any included custom
dependencies must be in a Google Cloud Storage location. The service also uses
pip to install custom dependencies, so they can have standard dependencies of
their own in their
Parameters for cloud training
There are two kinds of parameters that you provide when creating a training job: job configuration parameters, and training application parameters. This section describes both of these types of parameters.
You pass your parameters to the training service by setting the members of the
Job resource in a JSON request string. The training parameters are defined in
object. If you use the
gcloud command-line tool to create training jobs, the
most common training parameters are defined as flags of the
gcloud ml-engine jobs
submit training command. You can pass the remaining parameters in a YAML
configuration file. That file, called
config.yaml by convention, mirrors the
structure of the JSON representation of the Job resource.
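To make the shape of that request more concrete, here is a minimal sketch that assembles a Job-like dictionary and serializes it to JSON. The camelCase field names (`jobId`, `trainingInput`, `scaleTier`, `packageUris`, `pythonModule`, `region`, `jobDir`, `args`) and the `BASIC` tier name are recalled from the API reference rather than taken from this page, so check them there before relying on them. The helper `build_training_job` is hypothetical.

```python
import json


def build_training_job(job_id, package_uris, python_module, region,
                       scale_tier="BASIC", job_dir=None, args=None):
    """Assemble a dict shaped like the Job resource's JSON representation."""
    training_input = {
        "scaleTier": scale_tier,
        "packageUris": list(package_uris),
        "pythonModule": python_module,
        "region": region,
    }
    # Optional members are only included when a value was supplied.
    if job_dir is not None:
        training_input["jobDir"] = job_dir
    if args:
        training_input["args"] = list(args)
    return {"jobId": job_id, "trainingInput": training_input}


job = build_training_job(
    job_id="job123",
    package_uris=["gs://bucket/path/to/package.tar.gz"],
    python_module="trainer.task",
    region="us-central1",
    job_dir="gs://bucket/path/to/dir",
    args=["--my_first_arg", "first_arg_value"],
)
request_body = json.dumps(job, indent=2, sort_keys=True)
print(request_body)
```

Because the YAML configuration file mirrors this structure, the same nesting under `trainingInput` would apply there.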
Job configuration parameters
The Cloud ML Engine training service needs information to set up resources in the cloud and deploy your trainer application on each node in the processing cluster.
You must give your training job a name following these rules:
- It must be unique within your Google Cloud Platform project.
- It may only contain mixed-case letters, digits, and underscores.
- It must start with a letter.
- It must be no more than 128 characters long.
You can use whatever job naming convention you want. If you don't run very many jobs, the name you choose may not be very important. If you run a lot of jobs, you may need to find your job ID in large lists. It's a good idea to make your job IDs easy to distinguish from one another.
A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name—all jobs for a model are grouped together in ascending order.
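One possible way to apply that convention and check the naming rules locally is sketched below. The helper names and the timestamp layout are illustrative choices, not requirements of the service, and uniqueness within your project can only be confirmed by the service itself.

```python
import re
from datetime import datetime, timezone

# Letters, digits, underscores; starts with a letter; at most 128 characters.
JOB_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_job_name(name):
    """Check the character, first-letter, and length rules for a job name."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


def make_job_name(base_name, now=None):
    """Append a sortable UTC timestamp to a per-model base name."""
    now = now or datetime.now(timezone.utc)
    name = "%s_%s" % (base_name, now.strftime("%Y%m%d_%H%M%S"))
    if not is_valid_job_name(name):
        raise ValueError("invalid job name: %r" % name)
    return name


example = make_job_name(
    "census", datetime(2017, 3, 1, 9, 30, 0, tzinfo=timezone.utc))
print(example)  # census_20170301_093000
```

Putting the year first keeps lexical order and chronological order the same, which is what makes sorting by name useful.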
You must tell Cloud ML Engine the number and type of machines to run
your training job on. To make the process easier, you can pick from a set of
predefined cluster specifications called scale tiers. The specific cluster
configuration of each tier is not fixed: it may change as the availability of
cloud resources changes over time. Instead, each scale tier is defined in terms
of its suitability for certain types of jobs. Generally, the more advanced the
tier, the more machines will be allocated to the cluster, and the more powerful
the specifications of each virtual machine. As you increase the complexity of
the scale tier, the hourly cost of training jobs, measured in
ML training units, also increases. The
supported scale tiers are defined as part of the
Cloud ML Engine API,
and you use the same identifiers that are defined there when you specify your
training input for the
For convenience, the scale tier definitions are duplicated here:
||A single worker instance. This tier is suitable for learning how to use Cloud ML Engine and for experimenting with new models using small datasets.|
||Many workers and a few parameter servers.|
||A large number of workers with many parameter servers.|
||A single worker instance with a GPU.|
Machine types for the custom scale tier
If you want finer control over the processing cluster that you use to train your
model, you can set the scale tier to
CUSTOM and set values for the number of
parameter servers and workers that you want to use along with the type of
machine to use for each. You can specify a different machine type for the master
worker, the parameter servers, and the workers, but you can't use different
machine types for individual instances within a given type. For example, you can
use a large_model machine type for your parameter servers, but you can't set
some parameter servers to use large_model and some to use complex_model_m. As
with scale tiers, the values for the available machine types are
defined in the Cloud ML Engine API (as part of the definition of the
The machine types are duplicated here for convenience:
||A basic machine configuration suitable for training simple models with small to moderate datasets.|
||A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).|
||A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.|
||A machine with roughly twice the number of cores and roughly double the
||A machine with roughly twice the number of cores and roughly double the
||A machine equivalent to standard that also includes a GPU that you can use in your trainer.|
||A machine equivalent to
||A machine equivalent to
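The CUSTOM tier settings described in this section could be assembled along the following lines. This is a sketch under assumptions: the field names (`masterType`, `workerType`, `workerCount`, `parameterServerType`, `parameterServerCount`) are recalled from the API reference and should be verified there, and `custom_cluster` is a hypothetical helper. Taking a single machine type string per role reflects the rule that instances of the same type cannot use different machine types.

```python
def custom_cluster(master_type, worker_type=None, worker_count=0,
                   ps_type=None, ps_count=0):
    """Build the cluster-related fields for a CUSTOM scale tier."""
    if worker_count and not worker_type:
        raise ValueError("worker_type is required when worker_count > 0")
    if ps_count and not ps_type:
        raise ValueError("ps_type is required when ps_count > 0")
    spec = {"scaleTier": "CUSTOM", "masterType": master_type}
    # Worker and parameter server fields are added only when requested.
    if worker_count:
        spec.update(workerType=worker_type, workerCount=worker_count)
    if ps_count:
        spec.update(parameterServerType=ps_type, parameterServerCount=ps_count)
    return spec


cluster = custom_cluster(
    master_type="complex_model_m",
    worker_type="complex_model_m", worker_count=4,
    ps_type="large_model", ps_count=2,
)
print(cluster)
```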
Comparing machine types
Even though the exact specifications of the machine types are subject to change at any time, you can compare them in terms of relative capability. The following table uses rough "t-shirt" sizing to describe the machine types.
|Machine type||CPU||GPUs||Memory||ML units|
Each increase in size constitutes roughly double capacity in the area being measured. Possible sizes are (in increasing order): XS, S, M, L, XL, XXL.
Your trainer must be made into a Python package and copied to a Google Cloud Storage bucket before you can run it on Cloud ML Engine. You pass the URI of your package to the training service as an element of the package URI list. The URI of a Cloud Storage location takes this form:
Your package is an element of a package URI list, rather than a single string, because you can specify other packages as dependencies. Each URI you include is the path to another package, formatted as a tarball (*.tar.gz) or as a wheel. The training service installs each package (using pip install) on every virtual machine it allocates for your training job.
Your trainer package can contain multiple modules (Python files). You must identify the module that contains your application entry point. The training service runs that module by invoking Python, just as you would run it locally.
When you make your trainer application into a Python package, you create a
namespace. For example, if you create a package named
my_trainer, and your
main module is called
task.py, you specify that package with the name
If you want to use hyperparameter tuning, you must include configuration details when you create your training job. A discussion of hyperparameter tuning is given in its feature overview. Instructions on configuring and using it are given in its how-to guide.
Google Cloud Platform uses zones and regions to define the geographic locations of physical computing resources. Cloud ML Engine uses regions to designate its processing. When you run a training job, you specify the region that you want it to run in.
If you store your training dataset on Google Cloud Storage, you should run your training job in the same region as the bucket you're using. If you must run your job in a different region from your data bucket, your job may take longer.
Using job-dir as a common output directory
Although Cloud ML Engine doesn't intervene in your input and output, it does provide a mechanism for specifying the output directory for a job. You can set a job directory when you configure your job. When you do, the Cloud ML Engine training service:
- Validates the directory for you so that you can fix any problems before your job runs.
- Passes the path to your trainer as a command-line argument named
If you want to use this feature, you need to account for the
argument in your trainer. Capture the argument value when you parse your other
parameters and use it when saving your trainer's output.
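A minimal sketch of capturing the job directory with Python's standard argparse module follows. It assumes the flag is spelled `--job-dir`, as in the gcloud flag of the same name. The `--train-steps` argument, the default values, and the `checkpoints` subdirectory are hypothetical.

```python
import argparse


def parse_trainer_args(argv=None):
    """Parse the trainer's command-line arguments, including the job dir."""
    parser = argparse.ArgumentParser()
    # The service passes the job directory on the command line when one is set.
    parser.add_argument("--job-dir", default="output",
                        help="Local or Cloud Storage path for trainer output")
    # Hypothetical application argument, included for illustration.
    parser.add_argument("--train-steps", type=int, default=1000)
    return parser.parse_args(argv)


args = parse_trainer_args(["--job-dir", "gs://bucket/path/to/dir"])
# argparse exposes --job-dir as the attribute job_dir.
checkpoint_path = args.job_dir.rstrip("/") + "/checkpoints"
print(checkpoint_path)
```

Giving the argument a local default keeps the same trainer runnable in a development environment where no job directory is passed.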
You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version.
Training application parameters
You can send data to your trainer when it runs in the cloud by specifying command-line arguments for your main module. Assemble the list of arguments and include it in your training configuration.
The training service accepts the arguments as a list of strings with the following format:
['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']
Each expression that you would enter in the command-line invocation of your trainer is a member of the list, given in order.
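If you keep your trainer parameters in a dictionary, a small helper can flatten them into that list format. This is only a sketch; `to_arg_list` is a hypothetical name, and it assumes every parameter takes exactly one value.

```python
def to_arg_list(params):
    """Flatten a mapping of argument names to values into a list of strings."""
    arg_list = []
    for name, value in params.items():
        # Each flag and its value become separate, consecutive list members.
        arg_list.extend(["--" + name, str(value)])
    return arg_list


trainer_args = to_arg_list({
    "my_first_arg": "first_arg_value",
    "my_second_arg": "second_arg_value",
})
print(trainer_args)
```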
When you use the
gcloud command-line tool to submit your training job, you
give the arguments as you would when running your application at the command
line. Enter your arguments after all of the command flags, preceded by two hyphens (--):
gcloud ml-engine jobs submit training job123 \
    --package-path=gs://bucket/path/to/package.tar.gz \
    --module-name=trainer.task \
    --job-dir=gs://bucket/path/to/dir \
    --region=us-central1 \
    -- \
    --my_first_arg=first_arg_value \
    --my_second_arg=second_arg_value
Every machine learning model starts with known data. There are only two limitations to the data that you can use in your trainer running on Cloud ML Engine:
- It must be in a format that you can read and feed to your TensorFlow code.
- It must be in a location that your code can access. This typically means that it should be stored with one of Google Cloud Platform's storage or big data services.
It is common for trainers to output data: checkpoints during training and a saved model when training is complete. You can output other data as needed by your application. As with input data, it is easiest to save your outputs to a Google Cloud Storage bucket in the same Google Cloud Platform project as your training job.
Training with GPUs
You can run your training jobs on Cloud ML Engine with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.
As with many other aspects of running training jobs, the Cloud ML Engine training service doesn't provide any special interface for working with GPUs. You can specify GPU-enabled machines to run your job, and the service allocates them for you. You assign TensorFlow Ops to GPUs in your trainer code. When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically (as always): the service runs a single replica of your trainer code on each machine.
Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.