Containers on AI Platform is a feature that allows you to run your application within a Docker image. You can build your own custom container to run jobs on AI Platform, using ML frameworks and versions as well as non-ML dependencies, libraries and binaries that are not otherwise supported on AI Platform.
How training with containers works
Your training application, implemented in the ML framework of your choice, is the core of the training process.
- Create an application that trains your model, using the ML framework of your choice.
- Decide whether to use a custom container. There could be a runtime version that already supports your dependencies. Otherwise, you'll need to build a custom container for your training job. In your custom container, you pre-install your training application and all its dependencies onto an image that you'll use to run your training job.
- Store your training and verification data in a source that AI Platform can access. This usually means putting it in Cloud Storage, Cloud Bigtable, or another Google Cloud Platform storage service associated with the same GCP project that you're using for AI Platform.
- When your application is ready to run, you must build your Docker image and push it to Container Registry, making sure that the AI Platform service can access your registry.
- Submit your job using gcloud beta ml-engine jobs submit training, specifying your arguments in a config.yaml file or the corresponding
- The AI Platform training service sets up resources for your job. It allocates one or more virtual machines (called training instances) based on your job configuration. You set up a training instance by using the custom container you specify as part of the TrainingInput object when you submit your training job.
- The training service runs your Docker image, passing through any command-line arguments you specify when you create the training job.
- You can get information about your running job in the following ways:
  - In Stackdriver Logging. You can find a link to your job logs on the AI Platform Jobs detail page in GCP Console.
  - By requesting job details or running log streaming with the gcloud command-line tool (specifically, gcloud ml-engine jobs stream-logs).
  - By programmatically making status requests to the training service with the projects.jobs.get method. See more details about how to monitor training jobs.
- When your training job succeeds or encounters an unrecoverable error, AI Platform halts all job processes and cleans up the resources.
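As a rough illustration of the steps above, the following Python sketch builds the kind of request body a custom-container training job might use and checks whether a returned job has finished. The field names (jobId, trainingInput, scaleTier, masterConfig.imageUri, region, args) and the state names are taken from the public AI Platform Training REST reference as best understood here, so verify them against the API version you use. The job ID, image URI, and helper function names are hypothetical.

```python
import json

# Job states assumed to be final, based on the public Job state enum.
# Verify against the API reference you target.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def build_training_job(job_id, image_uri, region, args=None, scale_tier="BASIC"):
    """Return a request body for a custom-container training job (sketch)."""
    training_input = {
        "scaleTier": scale_tier,
        "region": region,
        # The custom container is referenced by its registry URI.
        "masterConfig": {"imageUri": image_uri},
    }
    if args:
        # Command-line arguments passed through to the container.
        training_input["args"] = list(args)
    return {"jobId": job_id, "trainingInput": training_input}


def is_finished(job):
    """Return True if a job resource reports a final state."""
    return job.get("state") in TERMINAL_STATES


# Hypothetical values, for illustration only.
job = build_training_job(
    job_id="custom_container_job_001",
    image_uri="gcr.io/my-project/my-trainer:latest",
    region="us-central1",
    args=["--epochs", "5"],
)
print(json.dumps(job, indent=2))
```

A body like this could be sent by an API client, and is_finished could be applied to each response from projects.jobs.get while polling. Neither call is shown here because both require credentials and a real project.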
Advantages of custom containers
Custom containers allow you to specify and pre-install all the dependencies needed for your application.
- Faster start-up time. If you use a custom container with your dependencies pre-installed, you can save the time that your training application would otherwise take to install dependencies when starting up.
- Use the ML framework of your choice. If you can't find an AI Platform runtime version that supports the ML framework you want to use, then you can build a custom container that installs your chosen framework and use it to run jobs on AI Platform. For example, you can train with PyTorch.
- Extended support for distributed training. With custom containers, you can do distributed training using any ML framework.
- Use the newest version. You can also use the latest build or minor version of an ML framework. For example, you can build a custom container to train with tf-nightly or preview TensorFlow 2.0.
Hyperparameter tuning with custom containers
To do hyperparameter tuning on AI Platform, you specify a goal metric, along with whether to minimize or maximize it. For example, you might want to maximize your model accuracy, or minimize your model loss. You also list the hyperparameters you'd like to adjust, along with the range of values to try for each hyperparameter. AI Platform runs multiple trials of your training application, tracking and adjusting the hyperparameters after each trial. When the hyperparameter tuning job is complete, AI Platform reports values for the most effective configuration of your hyperparameters, as well as a summary for each trial.
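To make the goal metric and hyperparameter list concrete, here is a minimal Python sketch of a tuning specification. The field names (goal, hyperparameterMetricTag, maxTrials, params, parameterName, type, minValue, maxValue, scaleType, discreteValues) follow the public HyperparameterSpec and ParameterSpec reference as best understood here. The metric tag, parameter names, and helper function are hypothetical.

```python
def build_hyperparameter_spec(metric_tag, goal, params, max_trials=10):
    """Return a hyperparameter tuning spec as a plain dict (sketch)."""
    if goal not in ("MAXIMIZE", "MINIMIZE"):
        raise ValueError("goal must be 'MAXIMIZE' or 'MINIMIZE'")
    return {
        "goal": goal,
        # Name of the metric the training code reports for each trial.
        "hyperparameterMetricTag": metric_tag,
        "maxTrials": max_trials,
        "params": list(params),
    }


# Hypothetical search space: one continuous and one discrete hyperparameter.
spec = build_hyperparameter_spec(
    metric_tag="accuracy",
    goal="MAXIMIZE",
    params=[
        {
            "parameterName": "learning_rate",
            "type": "DOUBLE",
            "minValue": 0.0001,
            "maxValue": 0.1,
            "scaleType": "UNIT_LOG_SCALE",
        },
        {
            "parameterName": "batch_size",
            "type": "DISCRETE",
            "discreteValues": [16, 32, 64],
        },
    ],
)
print(spec["goal"], len(spec["params"]))
```

Validating the goal up front keeps a misspelled value from reaching the service, which would otherwise reject the job at submission time.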
In order to do hyperparameter tuning with custom containers, you need to make the following adjustments:
- In your Dockerfile: install
- In your training code:
- In your job request: add a
See an example of training with custom containers using hyperparameter tuning or learn more about how hyperparameter tuning works on AI Platform.
Using GPUs with custom containers
For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.
- Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the nvidia/cuda image as your base image is the recommended way to handle this. It has the matching versions of CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.
- Install your training application, along with your required ML framework and other dependencies, in your Docker image.
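One way to sanity-check the environment variables mentioned above is a small startup check inside the training application. This is a sketch under stated assumptions: the variable names listed (CUDA_VERSION, NVIDIA_VISIBLE_DEVICES, LD_LIBRARY_PATH) are ones that nvidia/cuda base images have commonly set, but the exact set can differ between image tags, and the helper name is hypothetical.

```python
import os

# Assumed names of variables commonly set by nvidia/cuda base images.
# Adjust this tuple to match the image tag you actually build from.
EXPECTED_CUDA_ENV = ("CUDA_VERSION", "NVIDIA_VISIBLE_DEVICES", "LD_LIBRARY_PATH")


def missing_cuda_env(env=None):
    """Return the expected CUDA-related variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_CUDA_ENV if not env.get(name)]


missing = missing_cuda_env()
if missing:
    print("Missing CUDA-related environment variables:", ", ".join(missing))
else:
    print("All expected CUDA-related environment variables are set.")
```

A check like this only confirms that the image was configured as expected. It does not prove that a GPU is attached or that your ML framework can use it.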