Overview of custom training options in Vertex AI

Vertex AI Training lets you run your custom machine learning (ML) training applications on a fully managed platform, freeing you from the complexities of infrastructure management. It's the ideal solution for data scientists and ML engineers who require complete control over their code, frameworks, and hardware configurations, while offloading the operational burden of server provisioning, maintenance, and scaling to Google Cloud.

Key concepts

The custom training workflow is designed for maximum flexibility. You bring your code, and Vertex AI runs it on the hardware you specify.

  1. Package your training application: Write your training code using any ML framework and package it, along with all its dependencies, as a Docker container image. This approach ensures a consistent and reproducible training environment. For simpler applications, you can also use a prebuilt Vertex AI container and supply only your Python training script.

  2. Configure your training job: Using the Vertex AI API or SDK, you define the specifications for your training run. This is where you connect your application to the Google Cloud infrastructure. Your configuration includes:

    • The container image to use
    • The command or script to execute
    • The machine type (vCPU and memory configuration)
    • The type and number of hardware accelerators (NVIDIA GPUs or Google TPUs)
  3. Submit and monitor: Once you submit your job, Vertex AI's managed service handles the rest. It automatically provisions the requested compute resources, executes your training application, and tears down the infrastructure when the job completes. During the run, the service streams all logs and metrics to the Google Cloud console, giving you full visibility into your job's progress.
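Step 1 above can be as simple as a Python script that accepts hyperparameters as command-line arguments. The skeleton below is an illustrative sketch, not a complete trainer: the argument names are examples, and the `AIP_MODEL_DIR` environment variable is the path Vertex AI's prebuilt containers set for the model output location.

```python
# Minimal training-script skeleton suitable for packaging in a container
# or running on a prebuilt Vertex AI container. The training logic itself
# is a placeholder; plug in any ML framework.
import argparse
import json
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy trainer")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # Vertex AI prebuilt containers export AIP_MODEL_DIR as the
    # Cloud Storage location where the trained model should be written.
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("AIP_MODEL_DIR", "/tmp/model"),
    )
    return parser.parse_args(argv)


def train(args):
    # Placeholder for real framework code (TensorFlow, PyTorch, ...).
    metrics = {"epochs": args.epochs, "final_loss": 1.0 / args.epochs}
    return metrics


if __name__ == "__main__":
    args = parse_args()
    print(json.dumps(train(args)))
```

Because the script reads its output path from the environment with a local fallback, the same container image runs unchanged on Vertex AI and on a workstation during development.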
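Steps 2 and 3 come together in the job specification. The sketch below expresses the configuration as a Vertex AI CustomJob worker pool spec; the field names follow the Vertex AI API, while the project, image URI, machine type, and accelerator values are placeholders to adapt to your environment.

```python
# Worker pool spec covering the items from step 2: container image,
# command/args, machine type, and accelerators.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",        # vCPU and memory
            "accelerator_type": "NVIDIA_TESLA_T4",  # hardware accelerator
            "accelerator_count": 1,
        },
        "replica_count": 1,  # single-machine training
        "container_spec": {
            # Your packaged training application from step 1.
            "image_uri": "us-docker.pkg.dev/my-project/my-repo/trainer:latest",
            "args": ["--epochs", "20", "--learning-rate", "0.01"],
        },
    }
]

# Submitting the job with the Vertex AI SDK then looks roughly like:
#
#   from google.cloud import aiplatform
#   aiplatform.init(project="my-project", location="us-central1")
#   job = aiplatform.CustomJob(
#       display_name="my-training-job",
#       worker_pool_specs=worker_pool_specs,
#   )
#   job.run()  # provisions resources, streams logs, tears down on completion
```

Increasing `replica_count` (or adding further entries to `worker_pool_specs`) is how the same job scales out to distributed training without any cluster management on your side.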

Why use custom training?

Vertex AI Training is the best choice when your project requires maximum flexibility and control. Its key capabilities include:

  • Framework freedom: Bring your own code using any ML framework, including TensorFlow, PyTorch, Scikit-learn, and XGBoost.
  • Hardware choice: Train on a wide selection of CPU machine types, NVIDIA GPUs, and Google TPUs.
  • Managed distributed training: Natively scale your training job across multiple machines to train large models on massive datasets, without managing the underlying cluster orchestration.
  • Hyperparameter tuning: Automate the process of finding the optimal hyperparameters for your model to maximize its predictive accuracy.
  • Experiment tracking: Automatically track and compare the parameters, metrics, and artifacts from all your training runs in a centralized location with Vertex AI Experiments.
  • Integrated workflow: Custom training is fully integrated with the Vertex AI MLOps ecosystem, including the Model Registry and Vertex AI Inference, enabling a seamless path from a trained model to production deployment.
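To make the hyperparameter tuning capability concrete, the sketch below describes a search space as plain Python; the commented call mirrors the Vertex AI HyperparameterTuningJob API, with placeholder display names and trial counts.

```python
# Illustrative hyperparameter search space: Vertex AI samples values from
# these ranges and passes them to your training script as arguments
# (e.g. --learning-rate), so the script needs no tuning-specific code.
search_space = {
    "learning_rate": {"type": "double", "min": 1e-4, "max": 1e-1, "scale": "log"},
    "epochs": {"type": "integer", "min": 5, "max": 50, "scale": "linear"},
}

# With the Vertex AI SDK this looks roughly like:
#
#   from google.cloud import aiplatform
#   from google.cloud.aiplatform import hyperparameter_tuning as hpt
#
#   tuning_job = aiplatform.HyperparameterTuningJob(
#       display_name="my-tuning-job",
#       custom_job=custom_job,            # the CustomJob to tune
#       metric_spec={"loss": "minimize"}, # metric your script reports
#       parameter_spec={
#           "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
#           "epochs": hpt.IntegerParameterSpec(min=5, max=50, scale="linear"),
#       },
#       max_trial_count=20,
#       parallel_trial_count=4,
#   )
#   tuning_job.run()
```

A log scale for the learning rate is the usual choice, since useful values span several orders of magnitude.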