Choosing a custom training method

If you're writing your own training code instead of using AutoML, there are several custom training options to consider. This topic provides a brief overview and comparison of the different ways you can run custom training on Vertex AI.

Custom training resources on Vertex AI

You can create three types of Vertex AI resources to train custom models:

  • Custom jobs
  • Hyperparameter tuning jobs
  • Training pipelines

When you create a custom job, you specify settings that Vertex AI needs to run your training code, including:

  • One worker pool for single-node training, or multiple worker pools for distributed training
  • The training code to run: a Python training application or a custom container

Within the worker pool(s), you can specify the following settings:

  • The machine type and, optionally, accelerators
  • The replica count
  • The configuration of the training code the worker pool runs: a Python training application or a custom container
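In the Vertex AI Python SDK, these worker pool settings are expressed as a list of worker pool specs. The following is a minimal sketch of a single-pool configuration; the machine type, accelerator, and image URI values are illustrative placeholders, not recommendations.

```python
# Illustrative worker pool configuration for a Vertex AI custom job.
# The machine type, accelerator, and image URI are placeholders.
worker_pool_specs = [
    {
        # Compute for each replica in this pool
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 1,
        },
        # Number of replicas in this pool
        "replica_count": 1,
        # Training code: here, a custom container image
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest",
        },
    }
]

# You would typically pass this spec to a CustomJob, for example:
# from google.cloud import aiplatform
# job = aiplatform.CustomJob(
#     display_name="my-custom-job",
#     worker_pool_specs=worker_pool_specs,
# )
# job.run()
```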

Hyperparameter tuning jobs have additional settings to configure, such as the metric to optimize, the hyperparameters to tune and their value ranges, and the number of trials to run. Learn more about hyperparameter tuning.
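As a sketch of what those additional settings look like, the following dict follows the general shape of the Vertex AI API's study spec. The metric ID, parameter names, and value ranges are illustrative placeholders; your training code must report the metric you name here.

```python
# Illustrative hyperparameter tuning configuration, shaped like the
# Vertex AI API's StudySpec. All names and ranges are placeholders.
study_spec = {
    # The metric your training code reports, and whether to maximize or minimize it
    "metrics": [{"metricId": "accuracy", "goal": "MAXIMIZE"}],
    # The hyperparameters to search over and their value ranges
    "parameters": [
        {
            "parameterId": "learning_rate",
            "doubleValueSpec": {"minValue": 1e-4, "maxValue": 1e-1},
            "scaleType": "UNIT_LOG_SCALE",
        },
        {
            "parameterId": "batch_size",
            "discreteValueSpec": {"values": [16, 32, 64]},
        },
    ],
}
```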

A training pipeline orchestrates a custom job or hyperparameter tuning job with additional steps, such as loading a dataset or uploading the trained model to Vertex AI after the training job completes successfully.

Viewing custom training resources in the Cloud Console

To view existing training pipelines in your project, go to the Training Pipelines page in the Vertex AI section of the Google Cloud Console.

Go to Training pipelines

To view existing custom jobs in your project, go to the Custom jobs page.

Go to Custom jobs

To view existing hyperparameter tuning jobs in your project, go to the Hyperparameter tuning page.

Go to Hyperparameter tuning

Choosing pre-built or custom containers

Before you submit a custom job, hyperparameter tuning job, or training pipeline to Vertex AI, you need to create a Python training application or a custom container to define the training code and dependencies you want to run on Vertex AI. If you create a Python training application using TensorFlow, scikit-learn, or XGBoost, you can use our pre-built containers to run your code. If you're not sure which of these options to choose, refer to the training code requirements to learn more.
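The choice shows up in how a worker pool spec packages your training code. The following sketch contrasts the two options; all image URIs, bucket paths, and module names are hypothetical placeholders.

```python
# Two illustrative ways to supply training code in a worker pool spec.
# All URIs, paths, and module names are placeholders.

# Option 1: a pre-built container running your Python training application.
python_package_pool = {
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 1,
    "python_package_spec": {
        # Pre-built training container for your framework
        "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
        # Your packaged training code, staged in Cloud Storage
        "package_uris": ["gs://my-bucket/trainer-0.1.tar.gz"],
        # Entry-point module inside the package
        "python_module": "trainer.task",
    },
}

# Option 2: a custom container with your code and dependencies baked in.
custom_container_pool = {
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 1,
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest",
    },
}
```

A pre-built container keeps you from maintaining a Docker image but restricts you to the supported frameworks; a custom container trades that convenience for full control over dependencies.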

Configuring distributed training

You can configure a custom job, hyperparameter tuning job, or training pipeline for distributed training by specifying multiple worker pools:

  • Use your first worker pool to configure your primary replica, and set the replica count to 1.
  • Add more worker pools to configure worker replicas, parameter server replicas, or evaluator replicas, if your machine learning framework supports these additional cluster tasks for distributed training.

Learn more about using distributed training.
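The two steps above can be sketched as a multi-pool configuration. This is illustrative only; the machine types, replica counts, and image URI are placeholders, and pool order matters because the first pool is treated as the primary replica.

```python
# Illustrative worker pool configuration for distributed training.
# The first pool is the primary replica and must have replica_count 1.
_image = "us-docker.pkg.dev/my-project/my-repo/trainer:latest"  # placeholder

worker_pool_specs = [
    {
        # First worker pool: the primary replica
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {"image_uri": _image},
    },
    {
        # Second worker pool: worker replicas
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 3,
        "container_spec": {"image_uri": _image},
    },
]
```

If your framework uses parameter servers or evaluators, you would add them as further pools in the same list.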

What's next