Use AI Platform to run your TensorFlow, scikit-learn, and XGBoost training applications in the cloud. AI Platform provides the dependencies required to train machine learning models using these hosted frameworks in its runtime versions. Additionally, you can use custom containers to run training jobs with other machine learning frameworks. This page describes the key concepts of AI Platform Training. If you'd rather get right into the training process, see how to start a training job.
How training works
AI Platform runs your training job on computing resources in the cloud. You can train a built-in algorithm (beta) against your dataset without writing a training application. If built-in algorithms do not fit your use case, you can create a training application to run on AI Platform.
Here's an overview of the process for using your training application:
- You create a Python application that trains your model, and you build it as you would to run locally in your development environment.
- You get your training and verification data into a source that AI Platform can access. This usually means putting it in Cloud Storage, Cloud Bigtable, or another Google Cloud storage service associated with the same Google Cloud project that you're using for AI Platform.
- When your application is ready to run, you must package it and transfer it to a Cloud Storage bucket that your project can access. This is automated when you use the gcloud command-line tool to run a training job.
- The AI Platform training service sets up resources for your job. It allocates one or more virtual machines (called training instances) based on your job configuration. Each training instance is set up by:
  - Applying the standard machine image for the version of AI Platform your job uses.
  - Loading your application package and installing it with pip.
  - Installing any additional packages that you specify as dependencies.
- The training service runs your application, passing through any command-line arguments you specify when you create the training job.
- You can get information about your running job in the following ways:
  - On Stackdriver Logging.
  - By requesting job details or running log streaming with the gcloud command-line tool.
  - By programmatically making status requests to the training service.
- When your training job succeeds or encounters an unrecoverable error, AI Platform halts all job processes and cleans up the resources.
A typical machine learning application
The AI Platform training service is designed to have as little impact on your application as possible. This means you can focus on your model code.
Most machine learning applications:
- Provide a way to get training data and evaluation data.
- Process data instances.
- Use evaluation data to test the accuracy of the model (how often it predicts the right value).
- (For TensorFlow training applications) Provide a way to output checkpoints at intervals in the process to get a snapshot of the model's progress.
- Provide a way to export the trained model when the application finishes.
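The structure above can be sketched as a minimal, framework-free Python application. The dataset, model, and file paths here are stand-ins for illustration; a real application would read its data from Cloud Storage and train an actual model:

```python
import os
import tempfile

def get_data():
    # Toy in-memory dataset: y = 2 * x. A real application would read
    # training and evaluation data from Cloud Storage or another
    # accessible source.
    train = [(x, 2.0 * x) for x in range(80)]
    evaluation = [(x, 2.0 * x) for x in range(80, 100)]
    return train, evaluation

def train_model(data, epochs=50, lr=1e-4):
    w = 0.0  # single learnable parameter
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= lr * grad
    return w

def evaluate(w, data):
    # Mean squared error on held-out data: how close the model's
    # predictions are to the right values.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def export_model(w, path):
    # Stand-in for exporting the trained model when training finishes.
    with open(path, "w") as f:
        f.write(str(w))

train_data, eval_data = get_data()
weight = train_model(train_data)
print("eval mse:", evaluate(weight, eval_data))
export_model(weight, os.path.join(tempfile.gettempdir(), "model.txt"))
```

A TensorFlow application would additionally write checkpoints at intervals during the training loop.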
Distributed training structure
If you run a distributed TensorFlow job with AI Platform, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types that you specify. Your running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:
Master: Exactly one replica is designated as the master worker. This task manages the others and reports status for the job as a whole. The training service runs until your job succeeds or encounters an unrecoverable error. In distributed training, the status of the master replica signals the overall job status.
If you are running a single-process job, the sole replica is the master for the job.
Workers: One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
Parameter servers: One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
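AI Platform describes the cluster to each replica through the TF_CONFIG environment variable, which a replica can inspect to learn its own role. A sketch of how a replica might do this; the value set below is a hand-written sample for illustration, since in a real job the training service sets it for you:

```python
import json
import os

# Hand-written sample of the kind of value AI Platform sets on each
# replica: the cluster layout plus this replica's own task.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
task_type = tf_config["task"]["type"]    # "master", "worker", or "ps"
task_index = tf_config["task"]["index"]  # this replica's position in its group
print(f"Running as {task_type} {task_index} of "
      f"{len(tf_config['cluster']['worker'])} workers")
```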
Distributed training strategies
There are three basic strategies to train a model with multiple nodes:
- Data-parallel training with synchronous updates.
- Data-parallel training with asynchronous updates.
- Model-parallel training.
Because you can use the data-parallel strategy regardless of the model structure, it is a good starting point for applying distributed training to your custom model. In data-parallel training, the whole model is shared with all worker nodes. Each node calculates gradient vectors independently from some part of the training dataset, in the same manner as mini-batch processing. The calculated gradient vectors are collected by the parameter server node, and model parameters are updated with the total summation of the gradient vectors. If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches.
Data-parallel training can be done with either synchronous or asynchronous updates. When using asynchronous updates, the parameter server applies each gradient vector independently, right after receiving it from one of the worker nodes, rather than waiting for all workers to finish the step.
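As a toy illustration of the synchronous variant, here is one update step for a hypothetical single-parameter model: each worker computes a gradient on its own data shard, then a parameter-server step sums the gradients and applies a single update.

```python
def gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x,
    # computed on one worker's shard of the data.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def sync_update(w, shards, lr):
    # Each worker computes its gradient independently...
    grads = [gradient(w, shard) for shard in shards]
    # ...then the parameter server collects them, sums, and updates once.
    return w - lr * sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
shards = [data[0:4], data[4:8]]             # split across two workers
w = 0.0
for _ in range(200):
    w = sync_update(w, shards, lr=0.01)
print(round(w, 3))
```

In the asynchronous variant, the parameter server would instead apply each worker's gradient as soon as it arrives, without the averaging barrier.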
To learn how to perform data-parallel distributed training, read about distributed training in TensorFlow. Then learn how to configure distributed training in AI Platform Training.
To learn more about model-parallel training, read about Mesh TensorFlow.
Packaging your application
Before you can run your training application on AI Platform Training, you must package your application and its dependencies. Then you must upload this package to a Cloud Storage bucket that your Google Cloud project can access.
The gcloud command-line tool automates much of the process. Specifically, you can use gcloud ai-platform jobs submit training to upload your application package and submit your training job.
See the detailed instructions on packaging a training application.
Submitting your training job
AI Platform provides model training as an asynchronous (batch) service.
You can submit a training job by running gcloud ai-platform jobs submit training from the command line or by sending a request to the API.
See the detailed instructions on starting a training job.
You must give your training job a name that obeys these rules:
- It must be unique within your Google Cloud project.
- It may only contain mixed-case letters, digits, and underscores.
- It must start with a letter.
- It must be no more than 128 characters long.
You can use whatever job naming convention you want. If you don't run many jobs, the name you choose may not be very important. If you run a lot of jobs, you may need to find your job ID in large lists. It's a good idea to make your job IDs easy to distinguish from one another.
A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name, because all jobs for a model are then grouped together in ascending order.
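The naming rules and this convention can be sketched in a few lines of Python; the base name and regular expression here are illustrative:

```python
import datetime
import re

# Job names: start with a letter; letters, digits, and underscores only;
# at most 128 characters.
JOB_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,127}$")

def make_job_id(base_name):
    """Append a date/time string so jobs for one model sort together."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    job_id = f"{base_name}_{stamp}"
    if not JOB_NAME_RE.match(job_id):
        raise ValueError(f"invalid job ID: {job_id!r}")
    return job_id

print(make_job_id("census_training"))  # e.g. census_training_20240101_120000
```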
When you run a training job on AI Platform, you must specify the number and types of machines you need. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. Alternatively, you can choose a custom tier and specify the machine types yourself.
See the detailed definitions of scale tiers and machine types.
If you want to use hyperparameter tuning, you must include configuration details when you create your training job. See a conceptual guide to hyperparameter tuning and how to use hyperparameter tuning.
Regions and zones
Google Cloud uses regions, subdivided into zones, to define the geographic location of physical computing resources. When you run a training job on AI Platform, you specify the region that you want it to run in.
If you store your training dataset on Cloud Storage, you should run your training job in the same region as the Cloud Storage bucket you're using for the training data. If you must run your job in a different region from your data bucket, your job may take longer.
To see the available regions for AI Platform services, including model training and online/batch prediction, read the guide to regions.
Using job-dir as a common output directory
You can specify the output directory for your job by setting a job directory when you configure the job. When you submit the job, AI Platform does the following:
- Validates the directory so that you can fix any problems before the job runs.
- Passes the path to your application as a command-line argument.

You need to account for the --job-dir argument in your application. Capture the argument value when you parse your other parameters and use it when saving your application's output. See the guide to starting a training job.
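A minimal sketch of capturing --job-dir alongside other parameters, using argparse; the bucket path and extra --epochs flag are hypothetical:

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the arguments AI Platform passes to your application."""
    parser = argparse.ArgumentParser()
    # When you set a job directory at submission time, the training
    # service passes it through to your application as --job-dir.
    parser.add_argument(
        "--job-dir",
        required=True,
        help="Cloud Storage path for checkpoints and exported models")
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)

# As if the job had been submitted with --job-dir gs://my-bucket/my-job:
args = parse_args(["--job-dir", "gs://my-bucket/my-job"])
export_path = os.path.join(args.job_dir, "model")
print(export_path)  # gs://my-bucket/my-job/model
```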
Runtime versions
To train with one of AI Platform's hosted machine learning frameworks, specify a supported AI Platform runtime version to use for your training job. The runtime version dictates the versions of TensorFlow, scikit-learn, XGBoost, and other Python packages that are installed on your allocated training instances. Specify a version that gives you the functionality you need. If you run the training job locally as well as in the cloud, make sure the local and cloud jobs use the same runtime version.
Input data
The data that you can use in your training job must obey the following rules to run on AI Platform:
- The data must be in a format that you can read and feed to your training code.
- The data must be in a location that your code can access. This typically means that it should be stored with one of the Google Cloud storage or big data services.
Output data
It is common for applications to output data, including checkpoints during training and a saved model when training is complete. You can output other data as needed by your application. It is easiest to save your output files to a Cloud Storage bucket in the same Google Cloud project as your training job.
Building training jobs that are resilient to VM restarts
Google Cloud VMs restart occasionally. To ensure that your training job is resilient to these restarts, save model checkpoints regularly and configure your job to restore the most recent checkpoint.
You usually save model checkpoints in the Cloud Storage path that you specify with the --job-dir argument in the gcloud ai-platform jobs submit training command.
The TensorFlow Estimator API implements checkpoint functionality for you. If your model is wrapped in an Estimator, you do not need to worry about restart events on your VMs.
If you can't wrap your model in a TensorFlow Estimator, write functionality to save and restore checkpoints into your training code. TensorFlow provides useful resources for saving and restoring checkpoints in the tf.train module.
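The save-and-restore pattern itself is framework-agnostic. A minimal sketch using pickle and a local directory; in a real job, the checkpoint directory would be the Cloud Storage path from --job-dir, and the state would be your model's weights:

```python
import glob
import os
import pickle
import tempfile

def save_checkpoint(state, checkpoint_dir):
    # Name files so that lexicographic order matches step order.
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"ckpt-{state['step']:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore_latest(checkpoint_dir):
    paths = sorted(glob.glob(os.path.join(checkpoint_dir, "ckpt-*.pkl")))
    if not paths:
        return {"step": 0, "weights": None}  # fresh start
    with open(paths[-1], "rb") as f:
        return pickle.load(f)

# Simulate a job that saves two checkpoints and is then restarted.
checkpoint_dir = tempfile.mkdtemp()
state = restore_latest(checkpoint_dir)            # fresh start: step 0
for step in range(state["step"] + 1, 3):
    save_checkpoint({"step": step, "weights": [0.1 * step]}, checkpoint_dir)

resumed = restore_latest(checkpoint_dir)          # after the restart
print("resuming from step", resumed["step"])
```

Restoring at startup, as shown, is what makes the job resilient: after a VM restart, training continues from the newest checkpoint instead of from scratch.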
Training with GPUs
You can run your training jobs on AI Platform with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.
The AI Platform training service doesn't provide any special interface for working with GPUs. You can specify GPU-enabled machines to run your job, and the service allocates them for you. For example, in a TensorFlow training job you can assign TensorFlow Ops to GPUs in your code. When you specify a machine type with GPU access for a task type, each instance assigned to that task type is configured identically: the service runs a single replica of your code on each machine.
If you are training with a different machine learning framework using a custom container, that framework may provide a different interface for working with GPUs.
Some models don't benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, you should test the benefit of GPU support by running a small sample of your data through training.
See how to use GPUs for your training job.
Training with TPUs
You can run your training jobs on AI Platform with Cloud TPU.
See how to use TPUs for your training job.