Cloud Machine Learning Engine enables you to easily run your TensorFlow training applications in the cloud. This page describes that capability and some of the key concepts you'll need to understand to make the most of your model training. If you'd rather get right into the training process without detailed descriptions, start by working through the steps enumerated in the basic training how-to.
How training works
Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your trainer on computing resources in the cloud. Here's an overview of the process:
- You create a TensorFlow application that defines your computation graph and trains your model. Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment.
- You get your training and verification data into a source that Cloud ML Engine can access. This usually means putting it in Google Cloud Storage, Cloud Bigtable, or another Google Cloud Platform storage service associated with the same Google Cloud Platform project that you're using for Cloud ML Engine.
- When your application is ready to run, it must be packaged and transferred to a Google Cloud Storage bucket that your project can access. This is automated when you use the gcloud command-line tool to run a training job.
- The Cloud ML Engine training service sets up resources for your job. It allocates one or more virtual machines (sometimes called training instances) based on your job configuration. Each training instance is set up by:
  - Applying the standard machine image for the version of Cloud ML Engine your job uses.
  - Loading your trainer package and installing it with pip.
  - Installing any additional packages that you specify as dependencies.
- The training service runs your trainer, passing the command-line arguments you specify when you create the training job.
- You can get information about your running job in three ways:
  - On Stackdriver Logging.
  - By requesting job details or running log streaming with the gcloud command-line tool.
  - By programmatically making status requests to the training service.
- When your trainer succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.
If you run a distributed TensorFlow job with Cloud ML Engine, you'll specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs step 4 above on each. Your running trainer on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:
Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. The list above says that the training service runs until "your trainer" succeeds or encounters an unrecoverable error. In the distributed case, it is the status of the master replica that signals the overall job status.
If you are running a single-process job, the sole replica is the master for the job.
One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your trainer.
One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
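As a rough sketch of how a replica might discover which of these roles it has: distributed TensorFlow conventionally describes the cluster and the current task in a `TF_CONFIG` environment variable that holds JSON. This page does not name that mechanism, so treat the use of `TF_CONFIG`, the helper name `replica_role`, and the sample host names as assumptions for illustration only.

```python
import json
import os


def replica_role(environ=None):
    """Return (task_type, task_index) for the current replica.

    Assumes the cluster description lives in a TF_CONFIG environment
    variable (an assumption; this page does not name the mechanism).
    Falls back to ("master", 0), matching a single-process job where
    the sole replica is the master.
    """
    environ = os.environ if environ is None else environ
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    task = tf_config.get("task", {})
    return task.get("type", "master"), int(task.get("index", 0))


if __name__ == "__main__":
    # Hypothetical value describing the second worker in a small cluster.
    sample = {"TF_CONFIG": json.dumps({
        "cluster": {
            "master": ["master-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
            "ps": ["ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })}
    print(replica_role(sample))
    print(replica_role({}))
```

A trainer could branch on the returned task type, for example letting only the master export the final model.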
The typical case
There are steps in the description above where you might assume that a machine learning service would intervene or control processing but where Cloud ML Engine doesn't. The training service is designed to have as little impact on your trainer as possible. This means you can focus on the TensorFlow code that makes the model you want instead of being confined by a rigid structure. Essentially, Cloud ML Engine doesn't know or take interest in your application's implementation.
While it's true that the training service imposes almost no restriction on your trainer's architecture, that doesn't mean that there isn't any guidance to follow. Most machine learning trainers:
- Provide a way to get training data and evaluation data.
- Process data instances in batches.
- Use evaluation data to test the accuracy of the model (how often it predicts the right value).
- Provide a way to output checkpoints at intervals in the process to get a snapshot of the model's progress.
- Provide a way to export the trained model when the trainer finishes.
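The structure in the list above can be sketched as a skeleton. This is a toy, framework-free illustration rather than TensorFlow code: the "model" is a single threshold, and the function names, file names, and JSON checkpoint format are all hypothetical stand-ins for whatever your trainer actually uses.

```python
import json
import os
import tempfile


def batches(instances, batch_size):
    """Yield successive batches of data instances."""
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


def accuracy(model, eval_data):
    """Fraction of evaluation instances that the model labels correctly."""
    correct = sum(1 for x, label in eval_data
                  if int(x > model["threshold"]) == label)
    return correct / len(eval_data)


def train(train_data, eval_data, output_dir, batch_size=2, checkpoint_every=2):
    # Toy model: predict 1 when x exceeds a learned threshold.
    model = {"threshold": 0.0, "seen": 0}
    for step, batch in enumerate(batches(train_data, batch_size), start=1):
        # Toy "training": keep a running mean of every x value seen so far.
        for x, _label in batch:
            model["seen"] += 1
            model["threshold"] += (x - model["threshold"]) / model["seen"]
        if step % checkpoint_every == 0:
            # Snapshot the model's progress at intervals.
            path = os.path.join(output_dir, "checkpoint-%d.json" % step)
            with open(path, "w") as f:
                json.dump(model, f)
    # Export the trained model, then test it against the evaluation data.
    with open(os.path.join(output_dir, "export.json"), "w") as f:
        json.dump(model, f)
    return model, accuracy(model, eval_data)


if __name__ == "__main__":
    data = [(x, int(x > 4.5)) for x in range(10)]
    with tempfile.TemporaryDirectory() as out:
        model, acc = train(data[::2] + data[1::2], data, out)
        print(sorted(os.listdir(out)), acc)
```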
Packaging your trainer
If you've never made a Python package before, this process can feel daunting.
The good news is that you can rely on the
gcloud command-line tool to do the
heavy lifting for you. This section covers some of the specifics in more detail.
You'll find detailed instructions in the
There are two kinds of dependencies that your trainer might have: standard
dependencies and custom dependencies. Standard dependencies are packages that
you import that are available in the
Python Package Index
(PyPI). These are well-known libraries that can be installed with a simple
command. Custom dependencies are generally other packages that you developed
yourself, or that were developed in-house by someone else. Here's how to work
with both kinds:
Including standard dependencies
You can specify your package's standard dependencies as part of its setup.py
script. The Cloud ML Engine training service uses pip to install your
trainer package on training instances that it allocates for your job. A
pip install includes installing all of the dependencies listed in that script.
Including custom dependencies
You can specify your trainer's custom dependencies by passing their paths as
part of your job configuration. Like your trainer package, any included custom
dependencies must be in a Google Cloud Storage location. The service also uses
pip to install custom dependencies, so they can have standard dependencies of
their own in their setup.py scripts.
Parameters for cloud training
There are two kinds of parameters that you provide when creating a training job: job configuration parameters, and training application parameters. This section describes both of these types of parameters.
You pass your parameters to the training service by setting the members of the
Job resource in a JSON request string. The training parameters are defined in
the TrainingInput object. If you use the gcloud command-line tool to create
training jobs, the most common training parameters are defined as flags of the
gcloud ml-engine jobs submit training command. You can pass the remaining
parameters in a YAML configuration file. That file, called config.yaml by
convention, mirrors the structure of the JSON representation of the Job
resource. Pass the path to your configuration file to
gcloud ml-engine jobs submit training as the --config argument. So, if the path
to your configuration file is config.yaml, you would set --config=config.yaml.
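To make the shape of the request concrete, here is a minimal sketch that assembles a Job resource as a Python dictionary and serializes it to JSON. The field names (`jobId`, `trainingInput`, `scaleTier`, `packageUris`, `pythonModule`, `region`, `jobDir`, `args`) and the `BASIC` tier value follow my understanding of the Cloud ML Engine REST API and should be checked against the API reference. The helper name `build_training_job` is hypothetical, and the bucket paths are the placeholder values used elsewhere on this page.

```python
import json


def build_training_job(job_id, package_uris, python_module, region,
                       job_dir=None, args=None, scale_tier="BASIC"):
    """Assemble a Job resource body as a plain dictionary."""
    training_input = {
        "scaleTier": scale_tier,
        "packageUris": list(package_uris),
        "pythonModule": python_module,
        "region": region,
    }
    # Optional members are only included when a value is supplied.
    if job_dir is not None:
        training_input["jobDir"] = job_dir
    if args:
        training_input["args"] = list(args)
    return {"jobId": job_id, "trainingInput": training_input}


if __name__ == "__main__":
    job = build_training_job(
        job_id="job123",
        package_uris=["gs://bucket/path/to/package.tar.gz"],
        python_module="trainer.task",
        region="us-central1",
        job_dir="gs://bucket/path/to/dir",
        args=["--user_first_arg", "first_arg_value"],
    )
    print(json.dumps(job, indent=2))
```

A config.yaml file would carry the same `trainingInput` structure, expressed in YAML instead of JSON.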
Job configuration parameters
The Cloud ML Engine training service needs information to set up resources in the cloud and deploy your trainer application on each node in the processing cluster.
You must give your training job a name following these rules:
- It must be unique within your Google Cloud Platform project.
- It may only contain mixed-case letters, digits, and underscores.
- It must start with a letter.
- It must be no more than 128 characters long.
You can use whatever job naming convention you want. If you don't run very many jobs, the name you choose may not be very important. If you run a lot of jobs, you may need to find your job ID in large lists. It's a good idea to make your job IDs easy to distinguish from one another.
A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name—all jobs for a model are grouped together in ascending order.
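A minimal sketch of that convention, combined with a local check of the naming rules listed above. The function names and the timestamp format are illustrative choices, and uniqueness within your project can only be confirmed by the service itself.

```python
import re
from datetime import datetime, timezone

# Starts with a letter; only letters, digits, and underscores; 128 characters max.
_JOB_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_job_id(job_id):
    """Check the character and length rules (not uniqueness)."""
    return _JOB_ID_PATTERN.fullmatch(job_id) is not None


def make_job_id(base_name, now=None):
    """Append a sortable UTC date/time string to a per-model base name."""
    now = now or datetime.now(timezone.utc)
    job_id = "%s_%s" % (base_name, now.strftime("%Y%m%d_%H%M%S"))
    if not is_valid_job_id(job_id):
        raise ValueError("invalid job ID: %r" % job_id)
    return job_id


if __name__ == "__main__":
    print(make_job_id("census"))
```

Because the timestamp runs from most to least significant field, sorting job IDs by name also sorts each model's jobs chronologically.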
You must tell Cloud ML Engine the number and type of machines to run your training job on. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Google may optimize the configuration of the scale tiers for different jobs over time, based on customer feedback and the availability of cloud resources. Each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines are allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in training units, also increases. See the pricing page to calculate the cost of your job.
Below are the scale tier definitions:
|Cloud ML Engine scale tier|Description|Compute Engine machine type|
|---|---|---|
|BASIC|A single worker instance. This tier is suitable for learning how to use Cloud ML Engine and for experimenting with new models using small datasets.|n1-standard-4|
|STANDARD_1|One master instance, plus four workers and three parameter servers.|master: n1-highcpu-8, workers: n1-highcpu-8, parameter servers: n1-standard-4|
|PREMIUM_1|One master instance, plus 19 workers and 11 parameter servers.|master: n1-highcpu-16, workers: n1-highcpu-16, parameter servers: n1-highmem-8|
|BASIC_GPU|A single worker instance with a single NVIDIA Tesla K80 GPU. To learn more about graphics processing units (GPUs), see the section on training with GPUs.|n1-standard-8 with one k80 GPU|
|CUSTOM|The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster as described in the section on machine types for the custom scale tier.|Specified in your job configuration|
Machine types for the custom scale tier
If you want finer control over the processing cluster that you use to train your
model, you can set the scale tier to
CUSTOM and set values for the number of
parameter servers and workers that you want to use along with the type of
machine to use for each. You can specify a different machine type for the master
worker, the parameter servers, and the workers, but you can't use different
machine types for individual instances within a given type. For example, you can
use a large_model machine type for your parameter servers, but you can't set
some parameter servers to use large_model and others to use a different
machine type.
As with scale tiers, the values for the available machine types are
defined in the Cloud ML Engine API, as part of the definition of the
TrainingInput object.
Below are the machine type definitions:
|Cloud ML Engine machine type|Description|Compute Engine machine type|
|---|---|---|
|standard|A basic machine configuration suitable for training simple models with small to moderate datasets.|n1-standard-4|
|large_model|A machine with a lot of memory, especially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).|n1-highmem-8|
|complex_model_s|A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.|n1-highcpu-8|
|complex_model_m|A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.|n1-highcpu-16|
|complex_model_l|A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.|n1-highcpu-32|
|standard_gpu|A machine equivalent to standard that also includes a single NVIDIA Tesla K80 GPU.|n1-standard-8 with one k80 GPU|
|complex_model_m_gpu|A machine equivalent to complex_model_m that also includes four NVIDIA Tesla K80 GPUs.|n1-standard-16-k80x4|
|complex_model_l_gpu|A machine equivalent to complex_model_l that also includes eight NVIDIA Tesla K80 GPUs.|n1-standard-32-k80x8|
|standard_p100|A machine equivalent to standard that also includes a single NVIDIA Tesla P100 GPU. The availability of these GPUs is in Beta launch stage.|n1-standard-8-p100x1|
|complex_model_m_p100|A machine equivalent to complex_model_m that also includes four NVIDIA Tesla P100 GPUs. The availability of these GPUs is in Beta launch stage.|n1-standard-16-p100x4|
To learn more about graphics processing units (GPUs), see the section on training with GPUs.
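The custom tier settings described above can be sketched as the training input portion of a job configuration. The field names (`masterType`, `workerType`, `workerCount`, `parameterServerType`, `parameterServerCount`) reflect my understanding of the Cloud ML Engine API and should be verified against its reference. The helper name `custom_training_input` is hypothetical. Note that the function takes one machine type per task type, which mirrors the rule that instances of a given type can't use different machine types.

```python
def custom_training_input(master_type, worker_type=None, worker_count=0,
                          parameter_server_type=None, parameter_server_count=0):
    """Build the cluster portion of a training input for the CUSTOM tier."""
    if worker_count > 0 and not worker_type:
        raise ValueError("worker_type is required when worker_count > 0")
    if parameter_server_count > 0 and not parameter_server_type:
        raise ValueError(
            "parameter_server_type is required when parameter_server_count > 0")

    training_input = {"scaleTier": "CUSTOM", "masterType": master_type}
    if worker_count > 0:
        # One machine type applies to every worker.
        training_input["workerType"] = worker_type
        training_input["workerCount"] = worker_count
    if parameter_server_count > 0:
        # One machine type applies to every parameter server.
        training_input["parameterServerType"] = parameter_server_type
        training_input["parameterServerCount"] = parameter_server_count
    return training_input


if __name__ == "__main__":
    print(custom_training_input(
        master_type="complex_model_m",
        worker_type="complex_model_m", worker_count=4,
        parameter_server_type="large_model", parameter_server_count=2,
    ))
```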
Comparing machine types
Even though the exact specifications of the machine types are subject to change at any time, you can compare them in terms of relative capability. The following table uses rough "t-shirt" sizing to describe the machine types.
|Machine type|CPU|GPUs|Memory|
|---|---|---|---|
|standard_p100 (Beta)|XS|1 (P100)|M|
|complex_model_m_p100 (Beta)|M|4 (P100)|M|
Each increase in size constitutes roughly double capacity in the area being measured. Possible sizes are (in increasing order): XS, S, M, L, XL.
Your trainer must be made into a Python package and copied to a Google Cloud Storage bucket before you can run it on Cloud ML Engine. You pass the URI of your package to the training service as an element of the package URI list. The URI of a Cloud Storage location takes this form:
    gs://bucket_name/path/to/package.tar.gz
Your package is an element of a package URI list, rather than a single string, because you can specify other packages as dependencies. Each URI you include is the path to another package, formatted as a tarball (*.tar.gz) or as a wheel. The training service installs each package (using pip install) on every virtual machine it allocates for your training job.
Your trainer package can contain multiple modules (Python files). You must identify the module that contains your application entry point. The training service runs that module by invoking Python, just as you would run it locally.
When you make your trainer application into a Python package, you create a
namespace. For example, if you create a package named my_trainer, and your
main module is called task.py, you specify that module with the name
my_trainer.task.
If you want to use hyperparameter tuning, you must include configuration details when you create your training job. A discussion of hyperparameter tuning is given in its feature overview. Instructions on configuring and using it are given in its how-to guide.
Google Cloud Platform uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you run a training job on Cloud ML Engine, you specify the region that you want it to run in.
If you store your training dataset on Google Cloud Storage, you should run your training job in the same region as the bucket you're using. If you must run your job in a different region from your data bucket, your job may take longer.
To see the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.
Using job-dir as a common output directory
Although Cloud ML Engine doesn't intervene in your input and output, it does provide a mechanism for specifying the output directory for a job. You can set a job directory when you configure your job. When you do, the Cloud ML Engine training service:
- Validates the directory for you so that you can fix any problems before your job runs.
- Passes the path to your application as a command-line argument named --job-dir.
If you want to use this feature, you need to account for the --job-dir
argument in your application. Capture the argument value when you parse your
other parameters and use it when saving your trainer's output. See the section
on application parameters below.
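A minimal sketch of capturing the job directory with Python's standard `argparse` module. The `--train-steps` flag and the `output_path` helper are hypothetical examples of "your other parameters" and of how the value might be used, not names that Cloud ML Engine defines.

```python
import argparse


def parse_args(argv=None):
    """Parse trainer arguments, including the service-supplied --job-dir."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", default=None,
                        help="Directory for checkpoints and exported models.")
    # Hypothetical application-specific parameter.
    parser.add_argument("--train-steps", type=int, default=100)
    # parse_known_args tolerates arguments that this parser doesn't declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


def output_path(job_dir, *parts):
    """Join path parts under job_dir using forward slashes, as in gs:// URIs."""
    return "/".join([job_dir.rstrip("/")] + list(parts))


if __name__ == "__main__":
    args = parse_args(["--job-dir", "gs://bucket/path/to/dir", "--train-steps", "5"])
    print(output_path(args.job_dir, "export"))
```

`argparse` exposes `--job-dir` as `args.job_dir`, and it accepts both the `--job-dir value` and `--job-dir=value` forms.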
You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version.
Training application parameters
You can send data to your application when it runs in the cloud by specifying command-line arguments for your main module. Assemble the list of arguments and include it in your training configuration.
The training service accepts the arguments as a list of strings with the following format:
['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']
Each expression that you would enter in the command-line invocation of your trainer is a member of the list, given in order.
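One way to see the correspondence between a command line and that list is to split the command line the way a shell would. This sketch uses the standard library's `shlex.split`. The helper name `to_arg_list` is hypothetical.

```python
import shlex


def to_arg_list(command_line_args):
    """Split a command-line argument string into a list of strings."""
    return shlex.split(command_line_args)


if __name__ == "__main__":
    print(to_arg_list(
        "--my_first_arg first_arg_value --my_second_arg second_arg_value"))
    # Quoted values stay together as a single list member.
    print(to_arg_list('--description "two words"'))
```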
When you use the gcloud command-line tool to submit your training job, you
give the arguments as you would when running your application at the command
line. After all the gcloud specific arguments, specify an empty -- argument,
then all your own arguments, also known as USER_ARGS:
    gcloud ml-engine jobs submit training job123 \
        --package-path=gs://bucket/path/to/package.tar.gz \
        --module-name=trainer.task \
        --job-dir=gs://bucket/path/to/dir \
        --region=us-central1 \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value
- The empty -- argument marks the end of the gcloud specific arguments and the start of the USER_ARGS that you want to pass to your application.
- Arguments specific to Cloud ML Engine, such as --job-dir, must come before the empty -- argument. The Cloud ML Engine service interprets these arguments.
- The --job-dir argument, if specified, must come before the empty -- argument, because Cloud ML Engine uses the --job-dir argument to validate the path.
- Your application must handle the --job-dir argument too, if specified. Even though the argument comes before the empty --, the --job-dir is also passed to your application as a command-line argument.
Every machine learning model starts with known data. There are only two limitations to the data that you can use in your trainer running on Cloud ML Engine:
- It must be in a format that you can read and feed to your TensorFlow code.
- It must be in a location that your code can access. This typically means that it should be stored with one of Google Cloud Platform's storage or big data services.
It is common for trainers to output data: checkpoints during training and a saved model when training is complete. You can output other data as needed by your application. As with input data, it is easiest to save your outputs to a Google Cloud Storage bucket in the same Google Cloud Platform project as your training job.
Building training jobs that are resilient to VM restarts
Google Cloud VMs may be restarted occasionally. You should ensure that your training job is resilient to these restarts, by saving model checkpoints regularly, and by configuring your job to restore the most recent checkpoint.
You usually save model checkpoints in the Cloud Storage path that
you specify with the --job-dir argument in the
gcloud ml-engine jobs submit training command.
The TensorFlow Estimator API implements checkpoint functionality for you. If your model is wrapped in an Estimator, you do not need to worry about restart events on your VMs.
If it is not feasible for you to wrap your model in a TensorFlow Estimator and you want your training jobs to be resilient to restart events, you must write the checkpoint saving and restoration functionality into your model yourself. TensorFlow provides useful resources for this in the tf.train module.
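The save-and-restore pattern can be sketched without any framework. The example below is not TensorFlow code: it uses JSON files in a local directory as a stand-in for real checkpoints under your job directory, and the file naming scheme, the `fail_at` parameter (which simulates a restart), and the function names are all hypothetical. A real trainer would use the checkpoint utilities in the tf.train module instead.

```python
import json
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)\.json$")


def latest_checkpoint(checkpoint_dir):
    """Return (step, path) of the newest checkpoint, or (0, None) if none."""
    best = (0, None)
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best[0]:
            best = (int(match.group(1)), os.path.join(checkpoint_dir, name))
    return best


def train(checkpoint_dir, total_steps, save_every=10, fail_at=None):
    """Run a toy training loop that resumes from the newest checkpoint."""
    state = {"step": 0, "total": 0}
    _step, path = latest_checkpoint(checkpoint_dir)
    if path:
        with open(path) as f:
            state = json.load(f)  # Restore progress after a restart.
    for step in range(state["step"] + 1, total_steps + 1):
        if step == fail_at:
            raise RuntimeError("simulated VM restart")
        state = {"step": step, "total": state["total"] + step}
        if step % save_every == 0 or step == total_steps:
            # Write to a temporary name, then rename, so a restart in the
            # middle of a save never leaves a partial checkpoint behind.
            tmp = os.path.join(checkpoint_dir, "tmp.json")
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, os.path.join(checkpoint_dir,
                                         "checkpoint-%d.json" % step))
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as job_dir:
        try:
            train(job_dir, total_steps=30, fail_at=25)
        except RuntimeError:
            pass  # The "VM" restarted; progress after step 20 is lost.
        print(train(job_dir, total_steps=30))
```

The second call picks up from the last saved step rather than from the beginning, which is the behavior you want after a VM restart.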
Training with GPUs
You can run your training jobs on Cloud ML Engine with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.
As with many other aspects of running training jobs, the Cloud ML Engine training service doesn't provide any special interface for working with GPUs. You can specify GPU-enabled machines to run your job, and the service allocates them for you. You assign TensorFlow Ops to GPUs in your trainer code. When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically (as always): the service runs a single replica of your trainer code on each machine.
Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.
See how to use GPUs for your training job.