Use Spot VMs with training

Overview

You can reduce the cost of running your custom training jobs by using Spot VMs. Spot VMs are virtual machine (VM) instances that are excess Compute Engine capacity. Spot VMs have significant discounts, but Compute Engine might stop or delete (preempt) Spot VMs to reclaim the capacity at any time.

To learn more, see Spot VMs.

Limitations and requirements

Consider the following limitations and requirements when using Spot VMs with Vertex AI:

  • All Spot VM limitations apply when using Spot VMs with Vertex AI.
  • Using Spot VMs with Vertex AI is supported only for custom training and prediction.
  • Using Spot VMs with TPU Pods isn't supported.
  • Vertex AI training can only use Spot VMs with the following machine series:

    • A2
    • A3
  • Submitting your job through the Google Cloud console isn't supported.

Billing

If your workloads are fault-tolerant and can withstand possible VM preemption, Spot VMs can reduce your compute costs significantly. If some of your VMs stop during processing, the job slows but does not completely stop. Spot VMs complete your batch processing tasks without placing additional load on your existing VMs and without requiring you to pay full price for additional standard VMs. See Preemption handling.

When you use Spot VMs, you're billed by job duration and machine type. You don't pay for the time that the job is in a queue or preempted.

Preemption handling

Spot VMs can be reclaimed by Compute Engine at any time. Therefore, your custom training job must be fault tolerant to get the most benefit from Spot VMs. When Spot VMs are preempted, the custom training job fails with a STOCKOUT error and Compute Engine tries to restart the job up to six times. To learn how to get the most out of your Spot VMs, see Spot VM best practices.

The following are some of the methods that you can use to make your custom training job fault tolerant:

  • Create checkpoints to save progress. By periodically storing the progress of your model, you can ensure that a terminated custom training job can resume from the last stored checkpoint, instead of starting over from the beginning.
  • Use Elastic Horovod. Elastic training enables Horovod to scale your compute resources without requiring a restart or resuming from checkpoints. To learn more, see Elastic Horovod.
  • Use a shutdown script. When Compute Engine preempts a Spot VM, you can use a shutdown script that tries to perform cleanup actions before the VM is preempted. To learn more, see Handle preemption with a shutdown script.
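To illustrate the checkpointing approach, the following sketch periodically saves training state so that a restarted job can resume from the last completed epoch instead of starting over. The file name, state fields, and epoch loop are placeholders for illustration, not part of the Vertex AI API; in a real job you would write checkpoints to Cloud Storage rather than local disk.

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # placeholder; in practice, use Cloud Storage

def load_checkpoint():
    """Resume from the last saved state, or start from scratch."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}

def save_checkpoint(state):
    """Write to a temp file first so a preemption mid-write can't corrupt the checkpoint."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_PATH)

state = load_checkpoint()
for epoch in range(state["epoch"], 5):
    # ... one epoch of training would run here ...
    state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
    save_checkpoint(state)  # a preempted-and-restarted job resumes from here
```

Because the loop starts at the saved epoch, rerunning the script after a preemption skips the work that already completed.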

Before you begin

Prepare your custom training application.

Configure your training job to use Spot VMs

You can configure your custom training job to use Spot VMs by specifying a SPOT strategy in your scheduling configuration.

REST

Before using any of the request data, make the following replacements:

  • LOCATION: The region where the container or Python package will be run.
  • PROJECT_ID: Your project ID.
  • JOB_NAME: Required. A display name for the CustomJob.
  • Define the custom training job:
    • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
    • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
    • If your training application runs in a custom container, specify the following:
      • CUSTOM_CONTAINER_IMAGE_URI: the URI of a Docker container image with your training code. Learn how to create a custom container image.
      • CUSTOM_CONTAINER_COMMAND: Optional. The command to be invoked when the container is started. This command overrides the container's default entrypoint.
      • CUSTOM_CONTAINER_ARGS: Optional. The arguments to be passed when starting the container.
    • If your training application is a Python package that runs in a prebuilt container, specify the following:
      • EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided code. Refer to the available prebuilt containers for training.
      • PYTHON_PACKAGE_URIS: Comma-separated list of Cloud Storage URIs specifying the Python package files which are the training program and its dependent packages. The maximum number of package URIs is 100.
      • PYTHON_MODULE: The Python module name to run after installing the packages.
      • PYTHON_PACKAGE_ARGS: Optional. Command-line arguments to be passed to the Python module.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "MACHINE_TYPE"
        },
        "replicaCount": REPLICA_COUNT,

        // Union field task can be only one of the following:
        "containerSpec": {
          "imageUri": CUSTOM_CONTAINER_IMAGE_URI,
          "command": [
            CUSTOM_CONTAINER_COMMAND
          ],
          "args": [
            CUSTOM_CONTAINER_ARGS
          ]
        },
        "pythonPackageSpec": {
          "executorImageUri": EXECUTOR_IMAGE_URI,
          "packageUris": [
            PYTHON_PACKAGE_URIS
          ],
          "pythonModule": PYTHON_MODULE,
          "args": [
            PYTHON_PACKAGE_ARGS
          ]
        }
        // End of list of possible types for union field task.
      }
      // Specify one workerPoolSpec for single replica training, or multiple workerPoolSpecs
      // for distributed training.
    ],
    "scheduling": {
      "strategy": "SPOT"
    }
  }
}
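If you prefer to assemble the request body programmatically, the following sketch builds the same structure in Python and writes it to request.json for use with the curl command. The machine type, image URI, and display name here are placeholder values for illustration, not defaults.

```python
import json

# Placeholder values for illustration; substitute your own.
request_body = {
    "displayName": "spot-training-job",
    "jobSpec": {
        "workerPoolSpecs": [
            {
                "machineSpec": {"machineType": "a2-highgpu-1g"},
                "replicaCount": 1,
                "containerSpec": {
                    "imageUri": "gcr.io/my-project/trainer:latest",
                },
            }
        ],
        # The SPOT strategy is what opts the job into Spot VMs.
        "scheduling": {"strategy": "SPOT"},
    },
}

# Serialize the body to request.json for the curl command.
with open("request.json", "w") as f:
    json.dump(request_body, f, indent=2)
```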

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs" | Select-Object -Expand Content

The response contains information about specifications as well as the JOB_ID.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

from google.cloud import aiplatform

# TEST_CASE_NAME, worker_pool_spec, and OUTPUT_DIRECTORY are defined elsewhere
# in your application.
customJob = aiplatform.CustomJob(
    display_name=TEST_CASE_NAME,
    worker_pool_specs=worker_pool_spec,
    staging_bucket=OUTPUT_DIRECTORY,
)
# The SPOT scheduling strategy opts the job into Spot VMs.
customJob.run(
    scheduling_strategy=aiplatform.compat.types.custom_job.Scheduling.Strategy.SPOT
)
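The worker_pool_spec passed to CustomJob must be defined before the job is constructed. A minimal sketch follows, assuming a single replica on an A2 machine; the image URI is a hypothetical placeholder, not a real container.

```python
# A minimal single-replica worker pool; the machine type must be one of the
# series supported for Spot VMs (A2 or A3), and the image URI is a placeholder.
worker_pool_spec = [
    {
        "machine_spec": {
            "machine_type": "a2-highgpu-1g",
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "gcr.io/my-project/trainer:latest",
        },
    }
]
```

For distributed training, append additional dictionaries to the list, one per worker pool.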

What's next