Using Cloud Machine Learning Engine, you can run your TensorFlow training applications in the cloud. This page describes the key concepts you need in order to make the most of your model training. If you'd rather get right into the training process, see how to start a training job.
How training works
Your training application, implemented in Python and TensorFlow, is the core of the training process. Cloud ML Engine runs your training job on computing resources in the cloud. Here's an overview of the process:
- You create a TensorFlow application that defines your computation graph and trains your model. Cloud ML Engine has almost no specific requirements of your application during the training process, so you build it as you would to run locally in your development environment.
- You get your training and verification data into a source that Cloud ML Engine can access. This usually means putting it in Cloud Storage, Cloud Bigtable, or another Google Cloud Platform storage service associated with the same GCP project that you're using for Cloud ML Engine.
- When your application is ready to run, you must package it and transfer it to a Cloud Storage bucket that your project can access. This is automated when you use the gcloud command-line tool to run a training job.
- The Cloud ML Engine training service sets up resources for your job. It allocates one or more virtual machines (called training instances) based on your job configuration. Each training instance is set up by:
  - Applying the standard machine image for the version of Cloud ML Engine your job uses.
  - Loading your application package and installing it.
  - Installing any additional packages that you specify as dependencies.
- The training service runs your application, passing through any command-line arguments you specify when you create the training job.
- You can get information about your running job in the following ways:
  - On Stackdriver Logging.
  - By requesting job details or running log streaming with the gcloud command-line tool.
  - By programmatically making status requests to the training service.
- When your training job succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.
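The programmatic status check in the list above can be sketched as a polling loop. This is a minimal sketch rather than the service's client library: `get_state` is a hypothetical callable that you would implement to return the job's current state string from the training service, and the terminal state names used here are assumptions that should be checked against the job reference.

```python
import time

# Assumed terminal states; check the service's job reference for the full list.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return it.

    get_state is a hypothetical zero-argument callable that returns the
    job's current state as a string.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Wait before asking again so the service is not flooded with requests.
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.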
If you run a distributed TensorFlow job with Cloud ML Engine, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify and performs step 4 above on each. Your running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:
- Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. The training service runs until your job succeeds or encounters an unrecoverable error. In the distributed case, it is the status of the master replica that signals the overall job status. If you are running a single-process job, the sole replica is the master for the job.
- One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
- One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
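One way a replica can discover its role is the `TF_CONFIG` environment variable, which distributed TensorFlow uses to describe the cluster and the current task. The sketch below assumes that the variable is set on each replica and holds JSON with a `task` object containing `type` and `index`. This page does not state those details, so treat them as assumptions. `replica_role` is a hypothetical helper name.

```python
import json
import os


def replica_role(environ=os.environ):
    """Return (task_type, task_index) for this replica.

    Falls back to ("master", 0) when TF_CONFIG is absent, matching the
    single-process case where the sole replica is the master.
    """
    raw = environ.get("TF_CONFIG")
    if not raw:
        return ("master", 0)
    # Assumed layout: {"cluster": {...}, "task": {"type": ..., "index": ...}}
    task = json.loads(raw).get("task", {})
    return (task.get("type", "master"), int(task.get("index", 0)))
```

Your application can branch on the returned task type, for example to export the final model only from the master.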
A typical machine learning application
The Cloud ML Engine training service is designed to have as little impact on your application as possible. This means you can focus on your TensorFlow code to define the model you want instead of being confined by a rigid structure.
Most machine learning applications:
- Provide a way to get training data and evaluation data.
- Process data instances in batches.
- Use evaluation data to test the accuracy of the model (how often it predicts the right value).
- Provide a way to output checkpoints at intervals in the process to get a snapshot of the model's progress.
- Provide a way to export the trained model when the application finishes.
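Two of the items above, batching and accuracy evaluation, can be illustrated without any framework. The helpers below are a framework-agnostic sketch with hypothetical names (`batches`, `accuracy`). A real TensorFlow application would normally use TensorFlow's own input and metric utilities instead.

```python
def batches(instances, batch_size):
    """Yield consecutive slices of at most batch_size instances."""
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


def accuracy(predict, eval_data):
    """Return the fraction of (features, label) pairs that predict gets right."""
    if not eval_data:
        return 0.0
    correct = sum(1 for features, label in eval_data if predict(features) == label)
    return correct / len(eval_data)
```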
Packaging your application
Before you can run your training application with Cloud Machine Learning Engine, you must package your application and any additional dependencies you require, and upload the package to a Cloud Storage bucket that your Google Cloud Platform project can access.
The gcloud command-line tool automates much of the process. Specifically, you can use gcloud ml-engine jobs submit training to upload your application package and submit your training job.
See the detailed instructions on packaging a training application.
Submitting your training job
Cloud Machine Learning Engine provides model training as an asynchronous (batch) service.
You can submit a training job by running gcloud ml-engine jobs submit training from the command line or by sending a request to the API.
See the detailed instructions on starting a training job.
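If you script your submissions, the command line can be assembled in Python. The sketch below only builds the argument list and does not run anything. `build_submit_command` is a hypothetical helper, and the flags shown (`--job-dir`, `--package-path`, `--module-name`, `--region`, `--runtime-version`) are the ones assumed here, so check them against the gcloud reference for your SDK version.

```python
def build_submit_command(job_name, job_dir, package_path, module_name,
                         region, runtime_version=None, user_args=()):
    """Return the argv list for submitting a training job with gcloud."""
    cmd = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--job-dir", job_dir,
        "--package-path", package_path,
        "--module-name", module_name,
        "--region", region,
    ]
    if runtime_version is not None:
        cmd += ["--runtime-version", runtime_version]
    if user_args:
        # Everything after a bare "--" is passed through to your application.
        cmd += ["--"] + list(user_args)
    return cmd
```

You could then run the result with `subprocess.run(cmd, check=True)` on a machine where the gcloud tool is installed and authenticated.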
You must give your training job a name that obeys these rules:
- It must be unique within your Google Cloud Platform project.
- It may only contain mixed-case letters, digits, and underscores.
- It must start with a letter.
- It must be no more than 128 characters long.
You can use whatever job naming convention you want. If you don't run very many jobs, the name you choose may not be very important. If you run a lot of jobs, you may need to find your job ID in large lists. It's a good idea to make your job IDs easy to distinguish from one another.
A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name, because all jobs for a model are then grouped together in ascending order.
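The naming rules and the base-name-plus-timestamp convention can be combined in a small helper. This is a minimal sketch with hypothetical function names. It checks the character, first-letter, and length rules locally, but it cannot check uniqueness within your project.

```python
import re
from datetime import datetime, timezone

# A letter followed by up to 127 letters, digits, or underscores (128 total).
_JOB_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_job_name(name):
    """Check the character, first-letter, and length rules for a job name."""
    return _JOB_NAME_RE.fullmatch(name) is not None


def make_job_name(base, now=None):
    """Append a sortable date/time string to a per-model base name."""
    now = now or datetime.now(timezone.utc)
    name = "{}_{}".format(base, now.strftime("%Y%m%d_%H%M%S"))
    if not is_valid_job_name(name):
        raise ValueError("invalid job name: {!r}".format(name))
    return name
```

Because the timestamp is zero-padded and ordered from year to second, sorting by name also sorts each model's jobs chronologically.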
When running a training job on Cloud ML Engine you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.
See the detailed definitions of scale tiers and machine types.
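When you choose the custom tier, the machine specification becomes part of the job's training input. The sketch below builds that structure as a plain dictionary. The field names (`scaleTier`, `masterType`, `workerType`, `workerCount`, `parameterServerType`, `parameterServerCount`) and the `CUSTOM` tier name are assumptions here and should be checked against the scale tier reference. The machine type strings are supplied by the caller.

```python
def custom_training_input(master_type, worker_type=None, worker_count=0,
                          ps_type=None, ps_count=0):
    """Build a training-input dict for a custom cluster (field names assumed)."""
    if worker_count and not worker_type:
        raise ValueError("worker_type is required when worker_count > 0")
    if ps_count and not ps_type:
        raise ValueError("ps_type is required when ps_count > 0")
    training_input = {"scaleTier": "CUSTOM", "masterType": master_type}
    if worker_count:
        training_input["workerType"] = worker_type
        training_input["workerCount"] = worker_count
    if ps_count:
        training_input["parameterServerType"] = ps_type
        training_input["parameterServerCount"] = ps_count
    return training_input
```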
If you want to use hyperparameter tuning, you must include configuration details when you create your training job. See the conceptual guide to hyperparameter tuning and how to use hyperparameter tuning.
Regions and zones
GCP uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you run a training job on Cloud ML Engine, you specify the region that you want it to run in.
If you store your training dataset on Cloud Storage, you should run your training job in the same region as the Cloud Storage bucket you're using for the training data. If you must run your job in a different region from your data bucket, your job may take longer.
To see the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.
Using job-dir as a common output directory
You can specify the output directory for your job by setting a job directory when you configure the job. When you submit the job, Cloud ML Engine does the following:
- Validates the directory so that you can fix any problems before the job runs.
- Passes the path to your application as the --job-dir command-line argument.
You need to account for the --job-dir argument in your application. Capture the argument value when you parse your other parameters and use it when saving your application's output. See the guide to starting a training job.
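Capturing the argument can be as simple as adding it to your existing argument parser. The following is a minimal sketch using Python's standard `argparse` module. `--train-steps` is a hypothetical application parameter included only to show `--job-dir` being parsed alongside your own flags.

```python
import argparse


def parse_args(argv=None):
    """Parse --job-dir together with the application's own parameters."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", default=None,
                        help="Directory for checkpoints and exported models.")
    # Hypothetical application parameter, shown for illustration only.
    parser.add_argument("--train-steps", type=int, default=1000)
    # parse_known_args tolerates extra arguments the parser does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

argparse exposes the value as `args.job_dir`, which you can then use as the base path for everything your application writes.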
You should specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Specify a version that gives you the functionality you need. If you run the training job locally as well as in the cloud, make sure the local and cloud jobs use the same runtime version.
If you don't specify a runtime version, your training job uses the default runtime version: 1.0.
The data that you can use in your training job must obey the following rules to run on Cloud ML Engine:
- The data must be in a format that you can read and feed to your TensorFlow code.
- The data must be in a location that your code can access. This typically means that it should be stored with one of the GCP storage or big data services.
It is common for applications to output data, including checkpoints during training and a saved model when training is complete. You can output other data as needed by your application. It is easiest to save your output files to a Cloud Storage bucket in the same GCP project as your training job.
Building training jobs that are resilient to VM restarts
GCP VMs may be restarted occasionally. You should ensure that your training job is resilient to these restarts, by saving model checkpoints regularly, and by configuring your job to restore the most recent checkpoint.
You usually save model checkpoints in the Cloud Storage path that you specify with the --job-dir argument in the gcloud ml-engine jobs submit training command.
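The save-and-restore pattern itself is independent of TensorFlow. The sketch below shows it with JSON files on a local filesystem: on startup the job restores the newest checkpoint, if any, and continues from that step. All names are hypothetical, the "training step" is a stand-in, and writing to a Cloud Storage path would require a file API that understands such paths, which this sketch does not include.

```python
import json
import os
import re

_CKPT_RE = re.compile(r"ckpt-(\d+)\.json")


def save_checkpoint(ckpt_dir, step, state):
    """Write the checkpoint to a temp file, then rename it into place."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, "ckpt-{}.json".format(step))
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    # A restart during the write leaves only a .tmp file, which is ignored.
    os.replace(tmp_path, final_path)


def restore_latest(ckpt_dir):
    """Return (step, state) from the newest checkpoint, or (0, None)."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    matches = (_CKPT_RE.fullmatch(name) for name in os.listdir(ckpt_dir))
    steps = [int(m.group(1)) for m in matches if m]
    if not steps:
        return 0, None
    path = os.path.join(ckpt_dir, "ckpt-{}.json".format(max(steps)))
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(ckpt_dir, total_steps, save_every=10):
    """Resume from the latest checkpoint and run until total_steps."""
    step, state = restore_latest(ckpt_dir)
    state = state or {"weight": 0.0}
    while step < total_steps:
        state["weight"] += 1.0  # stand-in for one real training step
        step += 1
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

If the process is restarted, calling `train` again with the same directory repeats at most `save_every - 1` steps instead of starting over.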
The TensorFlow Estimator API implements checkpoint functionality for you. If your model is wrapped in an Estimator, you do not need to worry about restart events on your VMs.
If it is not feasible for you to wrap your model in a TensorFlow Estimator and you want your training jobs to be resilient to restart events, you must write the checkpoint saving and restoration functionality into your model yourself. TensorFlow provides useful resources for this in the tf.train module.
Training with GPUs
You can run your training jobs on Cloud ML Engine with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.
The Cloud ML Engine training service doesn't provide any special interface for working with GPUs. You can specify GPU-enabled machines to run your job, and the service allocates them for you. You assign TensorFlow Ops to GPUs in your code. When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically (as always): the service runs a single replica of your code on each machine.
Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.
See how to use GPUs for your training job.
Training with TPUs (Beta)
You can run your training jobs on Cloud ML Engine with Cloud TPU.
See how to use TPUs for your training job.