Containers on AI Platform Training is a feature that allows you to run your application within a Docker image. You can build your own custom container to run jobs on AI Platform Training, using ML frameworks and versions as well as non-ML dependencies, libraries and binaries that are not otherwise supported on AI Platform Training.
How training with containers works
Your training application, implemented in the ML framework of your choice, is the core of the training process.
- Create an application that trains your model, using the ML framework of your choice.
- Decide whether to use a custom container. There could be a runtime version that already supports your dependencies. Otherwise, you'll need to build a custom container for your training job. In your custom container, you pre-install your training application and all its dependencies onto an image that you'll use to run your training job.
- Store your training and verification data in a source that AI Platform Training can access. This usually means putting it in Cloud Storage, Cloud Bigtable, or another Google Cloud storage service associated with the same Google Cloud project that you're using for AI Platform Training.
- When your application is ready to run, you must build your Docker image and push it to Container Registry, making sure that the AI Platform Training service can access your registry.
- Submit your job using gcloud ai-platform jobs submit training, specifying your arguments in a config.yaml file or the corresponding
- The AI Platform Training service sets up resources for your job. It allocates one or more virtual machines (called training instances) based on your job configuration. Each training instance is set up from the custom container that you specify as part of the TrainingInput object when you submit your training job.
- The training service runs your Docker image, passing through any command-line arguments you specify when you create the training job.
- You can get information about your running job in the following ways:
  - In Cloud Logging. You can find a link to your job logs on the AI Platform Training Jobs detail page in Cloud Console.
  - By requesting job details or streaming logs with the gcloud command-line tool (specifically, gcloud ai-platform jobs stream-logs).
  - By programmatically making status requests to the training service, using the projects.jobs.get method. See more details about how to monitor training jobs.
- When your training job succeeds or encounters an unrecoverable error, AI Platform Training halts all job processes and cleans up the resources.
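
The submit-and-monitor steps above can be sketched in Python. This is a minimal illustration, not the service's client library: the helper names build_training_job and wait_for_job are hypothetical, and the job ID, image URI, and region passed in are placeholders. The request fields (jobId, trainingInput, masterConfig.imageUri, args) and the terminal job states are assumed to follow the AI Platform Training Job resource.

```python
import time
from typing import Callable, Dict, List, Optional

# Assumed terminal values of the Job resource's `state` field.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def build_training_job(job_id: str, image_uri: str, region: str,
                       args: Optional[List[str]] = None,
                       scale_tier: str = "BASIC") -> Dict:
    """Build a request body for a training job that runs a custom container.

    Hypothetical helper: it only assembles a dict and does not call the service.
    """
    return {
        "jobId": job_id,
        "trainingInput": {
            "region": region,
            "scaleTier": scale_tier,
            # The custom container image that each training instance runs.
            "masterConfig": {"imageUri": image_uri},
            # Command-line arguments passed through to the container.
            "args": list(args or []),
        },
    }


def wait_for_job(get_job: Callable[[], Dict], poll_seconds: float = 30.0,
                 max_polls: int = 100) -> Dict:
    """Call get_job() repeatedly until the job reports a terminal state."""
    for _ in range(max_polls):
        job = get_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")
```

In practice, get_job would wrap a status request such as projects.jobs.get. Injecting it as a callable keeps the polling logic independent of any particular client library.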
Advantages of custom containers
Custom containers allow you to specify and pre-install all the dependencies needed for your application.
- Faster start-up time. If you use a custom container with your dependencies pre-installed, you can save the time that your training application would otherwise take to install dependencies when starting up.
- Use the ML framework of your choice. If you can't find an AI Platform Training runtime version that supports the ML framework you want to use, then you can build a custom container that installs your chosen framework and use it to run jobs on AI Platform Training. For example, you can train with PyTorch.
- Extended support for distributed training. With custom containers, you can do distributed training using any ML framework.
- Use the newest version. You can also use the latest build or minor version of an ML framework. For example, you can build a custom container to train with tf-nightly or preview TensorFlow 2.0.
Hyperparameter tuning with custom containers
To do hyperparameter tuning on AI Platform Training, you specify a goal metric, along with whether to minimize or maximize it. For example, you might want to maximize your model accuracy, or minimize your model loss. You also list the hyperparameters you'd like to adjust, along with the range or set of values to try for each hyperparameter. AI Platform Training runs multiple trials of your training application, tracking the goal metric and adjusting the hyperparameters after each trial. When the hyperparameter tuning job is complete, AI Platform Training reports values for the most effective configuration of your hyperparameters, as well as a summary for each trial.
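
As a rough illustration of what such a tuning configuration contains, the sketch below assembles a goal, a metric tag, and a list of hyperparameters to search. The helper names double_param and build_hyperparameter_spec are hypothetical, as are the metric tag and parameter names in the test; the field names are assumed to match the HyperparameterSpec and ParameterSpec objects of the AI Platform Training API.

```python
from typing import Dict, List


def double_param(name: str, min_value: float, max_value: float,
                 scale_type: str = "UNIT_LINEAR_SCALE") -> Dict:
    """Describe one floating-point hyperparameter and the range to search."""
    if min_value >= max_value:
        raise ValueError("min_value must be smaller than max_value")
    return {
        "parameterName": name,
        "type": "DOUBLE",
        "minValue": min_value,
        "maxValue": max_value,
        "scaleType": scale_type,
    }


def build_hyperparameter_spec(metric_tag: str, goal: str, params: List[Dict],
                              max_trials: int = 10,
                              max_parallel_trials: int = 1) -> Dict:
    """Assemble a tuning spec: the goal metric, its direction, and the parameters."""
    if goal not in ("MAXIMIZE", "MINIMIZE"):
        raise ValueError("goal must be 'MAXIMIZE' or 'MINIMIZE'")
    return {
        "goal": goal,
        "hyperparameterMetricTag": metric_tag,
        "maxTrials": max_trials,
        "maxParallelTrials": max_parallel_trials,
        "params": list(params),
    }
```

A spec built this way would sit alongside the other training inputs in the job request. The trial counts shown are arbitrary defaults for the sketch, not recommendations.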
In order to do hyperparameter tuning with custom containers, you need to make the following adjustments:
- In your Dockerfile: install
- In your training code:
- In your job request: add a
Using GPUs with custom containers
For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs.
- Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the nvidia/cuda image as your base image is the recommended way to handle this. It has the matching versions of CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly.
- Install your training application, along with your required ML framework and other dependencies, in your Docker image.
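
One way to sanity-check a GPU image before submitting a job is to inspect its environment from inside the container. The sketch below is an assumption-laden illustration: gpu_container_report is a hypothetical helper, and it assumes the base image sets the CUDA_VERSION and NVIDIA_VISIBLE_DEVICES environment variables and that nvidia-smi is on PATH when a GPU driver is exposed to the container. Adjust the variable names if your image differs.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def gpu_container_report(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Summarize GPU-related settings visible inside the container.

    The environment variable names are assumptions about the base image,
    not guarantees made by AI Platform Training.
    """
    env = os.environ if env is None else env
    return {
        # CUDA toolkit version, if the base image advertises it.
        "cuda_version": env.get("CUDA_VERSION"),
        # Which GPUs the NVIDIA container runtime is asked to expose.
        "visible_devices": env.get("NVIDIA_VISIBLE_DEVICES"),
        # Whether the nvidia-smi tool can be found on the given PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi", path=env.get("PATH", "")) is not None,
    }


if __name__ == "__main__":
    for key, value in gpu_container_report().items():
        print(f"{key}: {value}")
```

A missing value here does not prove the image is broken. It only signals that the container's environment differs from what this sketch assumes.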