Creating custom training jobs

This page shows you how to create custom training jobs to run your custom training applications on Vertex AI.

Before you submit a job

Before you submit a custom training job to Vertex AI, you need to create a Python training application or a custom container to define the training code and dependencies you want to run on Vertex AI. If you create a Python training application, you can use our pre-built containers to run your code. If you're not sure which of these options to choose, refer to the training code requirements to learn more.

What a custom job includes

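When you create a custom job, you specify the settings that Vertex AI needs to run your training code, including one or more worker pools and optional job-level settings such as scheduling.

Within each worker pool, you specify the machine type and any accelerators, the replica count, the boot disk, and either the custom container or the Python package that runs your training code.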

Configuring distributed training

You can configure a custom training job for distributed training by specifying multiple worker pools.

Most examples on this page show single-replica training jobs with one worker pool. To modify them for distributed training:

  • Use your first worker pool to configure your primary replica, and set the replica count to 1.
  • Add more worker pools to configure worker replicas, parameter server replicas, or evaluator replicas, if your machine learning framework supports these additional cluster tasks for distributed training.

Learn more about using distributed training.
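
As a rough illustration of the two steps above, the following sketch builds a two-pool configuration as plain dictionaries, using the snake_case field names from the Python sample later on this page. The image URI, machine types, and the make_pool helper are hypothetical placeholders, not values this page prescribes.

```python
# Sketch only: the image URI and machine types are illustrative placeholders.
IMAGE_URI = "gcr.io/my-project/my-trainer:latest"  # hypothetical image


def make_pool(machine_type: str, replica_count: int, image_uri: str) -> dict:
    """Return one worker pool spec as a plain dictionary."""
    return {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri},
    }


worker_pool_specs = [
    make_pool("n1-standard-4", 1, IMAGE_URI),  # pool 0: the primary replica
    make_pool("n1-standard-4", 2, IMAGE_URI),  # pool 1: worker replicas
]

# The first pool configures the primary replica, so its count stays at 1.
assert worker_pool_specs[0]["replica_count"] == 1
print(len(worker_pool_specs), "worker pools configured")
```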

Creating a custom job

To create a custom job:

Console

In the Google Cloud Console, you cannot create a CustomJob resource directly. However, you can create a TrainingPipeline resource that creates a CustomJob.

The following instructions describe how to create a TrainingPipeline that creates a CustomJob and doesn't do anything else. If you want to use additional TrainingPipeline features, like training with a managed dataset or creating a Model resource at the end of training, read Creating training pipelines.

  1. In the Cloud Console, in the Vertex AI section, go to the Training pipelines page.

  2. Click Create to open the Train new model pane.

  3. On the Choose training method step, specify the following settings:

    1. In the Dataset drop-down list, select No managed dataset.

    2. Select Custom training (advanced).

    Click Continue.

  4. On the Define your model step, enter a name of your choice, MODEL_NAME, for your model. Click Continue.

  5. On the Training container step, specify the following settings:

    1. Select whether to use a Pre-built container or a Custom container for training.

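    2. Depending on your choice, provide either the pre-built container together with the Cloud Storage location of your Python package and the Python module to run, or the URI of your custom container image.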

    3. In the Model output directory field, you may specify the Cloud Storage URI of a directory in a bucket that you have access to. The directory does not need to exist yet.

      This value gets passed to Vertex AI in the baseOutputDirectory API field, which sets several environment variables that your training application can access when it runs.

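    4. In the Arguments field, you may optionally specify arguments for Vertex AI to use when it starts running your training code. How these arguments are used depends on the type of container: for a custom container, they are passed to the container when it starts; for a pre-built container, they are passed as command-line arguments to your Python module.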

    Click Continue.

  6. On the Hyperparameter tuning step, make sure that the Enable hyperparameter tuning checkbox is not selected. Click Continue.

  7. On the Compute and pricing step, specify the following settings:

    1. In the Region drop-down list, select a region that supports custom training.

    2. In the Worker pool 0 section, specify compute resources to use for training.

      If you specify accelerators, make sure the type of accelerator that you choose is available in your selected region.

      If you want to perform distributed training, then click Add more worker pools and specify an additional set of compute resources for each additional worker pool that you want.

    Click Continue.

  8. On the Prediction container step, select No prediction container.

  9. Click Start training to start the custom training pipeline.
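
The Model output directory setting in step 5 reaches your training code as environment variables. The following is a minimal sketch of reading them, assuming the variable names AIP_MODEL_DIR, AIP_CHECKPOINT_DIR, and AIP_TENSORBOARD_LOG_DIR, which this page does not spell out. The function name output_locations is hypothetical.

```python
import os
from typing import Mapping, Optional


def output_locations(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the output locations that Vertex AI is assumed to expose.

    Each value is None when the variable is unset, for example when the
    code runs locally or no output directory was configured.
    """
    names = {
        "model": "AIP_MODEL_DIR",
        "checkpoints": "AIP_CHECKPOINT_DIR",
        "tensorboard_logs": "AIP_TENSORBOARD_LOG_DIR",
    }
    result: dict = {}
    for key, variable in names.items():
        value: Optional[str] = env.get(variable)
        result[key] = value
    return result


print(output_locations())
```

Passing the environment as a parameter keeps the function easy to exercise outside Vertex AI, where these variables are normally absent.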

gcloud

The following examples use the gcloud beta ai custom-jobs create command.

Depending on whether you have created a Python training application to use with a pre-built container or are using a custom container, run one of the following commands:

Pre-built container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --python-package-uris=PYTHON_PACKAGE_URIS \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE

Replace the following:

  • LOCATION: The region where the container or Python package will be run.
  • JOB_NAME: Required. A display name for the CustomJob.
  • PYTHON_PACKAGE_URIS: Comma-separated list of Cloud Storage URIs of the Python package files that contain the training program and its dependencies. The maximum number of package URIs is 100.
  • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
  • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
  • PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to the available pre-built containers for training.
  • PYTHON_MODULE: The Python module name to run after installing the packages.

Custom container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI

Replace the following:

  • LOCATION: The region where the container or Python package will be run.
  • JOB_NAME: Required. A display name for the CustomJob.
  • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
  • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
  • CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry, Container Registry, or Docker Hub that is to be run on each worker replica.

Custom container based on local code

If you have training code on your local computer, you can use a single command to do the following:

  • Build a custom Docker image based on your code.
  • Push the image to Container Registry.
  • Start a CustomJob based on the image.

The result is similar to creating a CustomJob using any other custom container; you can use this version of the command if it is convenient for your workflow.

Before you begin

Since this version of the command builds and pushes a Docker image, you must perform the following configuration on your local computer:

  1. Install Docker Engine.

  2. If you are using Linux, configure Docker so you can run it without sudo.

  3. Enable the Container Registry API.

  4. Configure authentication for Docker, so that you can push Docker images to Container Registry:

    gcloud auth configure-docker
    

Building and pushing the Docker image, and creating a CustomJob

The following command builds a Docker image based on a pre-built training container image and your local Python code, pushes the image to Container Registry, and creates a CustomJob.

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,local-package-path=WORKING_DIRECTORY,script=SCRIPT_PATH

Replace the following:

  • LOCATION: The region where the container or Python package will be run.

  • JOB_NAME: Required. A display name for the CustomJob.

  • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.

  • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.

  • PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to the available pre-built containers for training.

    This image acts as the base image for the new Docker image that you are building with this command.

  • WORKING_DIRECTORY: A directory in your local file system containing the entry point script that runs your training code (see the following list item).

    You can use the parent directory of the script, or a higher-level directory. You might want to use a higher-level directory in order to specify a fully-qualified Python module name (see the following list item). You might also want to use a higher-level directory if it contains a requirements.txt file or a setup.py file, in order to install additional dependencies in the container; this functionality is similar to the corresponding functionality in the gcloud beta ai custom-jobs local-run command. However, even if you specify a higher-level directory, this command only copies the parent directory of your entry point script to the Docker image.

  • SCRIPT_PATH: The path, relative to WORKING_DIRECTORY on your local file system, to the script that is the entry point for your training code. This can be a Python script (ending in .py) or a Bash script.

    For example, if you want to run /hello-world/trainer/task.py and WORKING_DIRECTORY is /hello-world, then use trainer/task.py for this value.

    Using python-module instead of script

    You can optionally replace script=SCRIPT_PATH with python-module=PYTHON_MODULE to specify the name of a Python module in WORKING_DIRECTORY to run as the entry point for training. For example, instead of script=trainer/task.py, you might specify python-module=trainer.task.

    In this case, the resulting Docker container loads your code as a module rather than as a script. You likely want to use this option if your entry point script imports other Python modules in WORKING_DIRECTORY.
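
The relationship between the two forms can be sketched in a few lines of Python. The helper script_to_module is hypothetical and only illustrates the naming convention from the example above (trainer/task.py becomes trainer.task). It is not part of the gcloud CLI.

```python
from pathlib import PurePosixPath


def script_to_module(script_path: str) -> str:
    """Convert a script path relative to WORKING_DIRECTORY into a module name."""
    path = PurePosixPath(script_path)
    if path.is_absolute():
        raise ValueError("the path must be relative to WORKING_DIRECTORY")
    if path.suffix != ".py":
        raise ValueError("only Python scripts can be run as modules")
    # Drop the .py suffix and join the remaining path parts with dots.
    return ".".join(path.with_suffix("").parts)


print(script_to_module("trainer/task.py"))
```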

Distributed training configuration

If you want to perform distributed training, then you can specify the --worker-pool-spec flag multiple times, once for each worker pool. For example, the following examples adapt the preceding examples to use a second worker pool:

Pre-built container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --python-package-uris=PYTHON_PACKAGE_URIS \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE \
  --worker-pool-spec=machine-type=SECOND_POOL_MACHINE_TYPE,replica-count=SECOND_POOL_REPLICA_COUNT,executor-image-uri=SECOND_POOL_PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=SECOND_POOL_PYTHON_MODULE

Custom container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI \
  --worker-pool-spec=machine-type=SECOND_POOL_MACHINE_TYPE,replica-count=SECOND_POOL_REPLICA_COUNT,container-image-uri=SECOND_POOL_CUSTOM_CONTAINER_IMAGE_URI

Custom container based on local code

When you use local-package-path=WORKING_DIRECTORY in the command, you can configure only a single worker pool. Distributed training is not available.
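
If you generate these commands from a script, the repeated --worker-pool-spec flags can be assembled programmatically. The following sketch only formats the command line and does not run it. The helper worker_pool_flag, the region, the job name, and the image URI are hypothetical placeholders.

```python
import shlex


def worker_pool_flag(**fields) -> str:
    """Format keyword arguments as one --worker-pool-spec flag.

    Underscores in keyword names become hyphens, so machine_type=...
    is rendered as machine-type=...
    """
    pairs = ",".join(
        f"{name.replace('_', '-')}={value}" for name, value in fields.items()
    )
    return f"--worker-pool-spec={pairs}"


image = "gcr.io/my-project/my-trainer:latest"  # hypothetical image URI
command = [
    "gcloud", "beta", "ai", "custom-jobs", "create",
    "--region=us-central1",
    "--display-name=my-job",
    # One flag per worker pool; the first pool holds the primary replica.
    worker_pool_flag(machine_type="n1-standard-4", replica_count=1,
                     container_image_uri=image),
    worker_pool_flag(machine_type="n1-standard-4", replica_count=2,
                     container_image_uri=image),
]

print(shlex.join(command))
```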

Advanced configuration

If you want to specify configuration options that are not available in the preceding examples, you can use the --config flag to specify the path to a config.yaml file in your local environment that contains the fields of CustomJobSpec. For example:

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --config=config.yaml

See an example of a config.yaml file.
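
As a rough idea of what such a file can hold, the sketch below serializes a job specification whose field names mirror the REST request body shown later on this page. It emits JSON, which is also valid YAML flow syntax. Whether your gcloud version accepts that form for --config is an assumption to verify, and every value shown is a hypothetical placeholder.

```python
import json

# Field names follow the REST request body; all values are placeholders.
job_spec = {
    "workerPoolSpecs": [
        {
            "machineSpec": {"machineType": "n1-standard-4"},
            "replicaCount": 1,
            "containerSpec": {
                "imageUri": "gcr.io/my-project/my-trainer:latest",  # hypothetical
                "args": ["--epochs=5"],  # hypothetical training argument
            },
        }
    ],
    "scheduling": {"timeout": "3600s"},  # maximum running time of one hour
}

# JSON is a subset of YAML, so this text can be saved as config.yaml.
config_text = json.dumps(job_spec, indent=2)
print(config_text)
```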

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • LOCATION: The region where the container or Python package will be run.
  • PROJECT_ID: Your project ID or project number.
  • JOB_NAME: Required. A display name for the CustomJob.
  • Define the custom training job:
    • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
    • ACCELERATOR_TYPE: (Optional.) The type of accelerator to attach to the job.
    • ACCELERATOR_COUNT: (Optional.) The number of accelerators to attach to the job.
    • DISK_TYPE: (Optional.) The type of the boot disk to use for the job, either pd-standard (default) or pd-ssd. Learn more about disk types.
    • DISK_SIZE: (Optional.) The size in GB of the boot disk to use for the job. The default value is 100.
    • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
    • If your training application runs in a custom container, specify the following:
      • CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry, Container Registry, or Docker Hub that is to be run on each worker replica.
      • CUSTOM_CONTAINER_COMMAND: (Optional.) The command to be invoked when the container is started. This command overrides the container's default entrypoint.
      • CUSTOM_CONTAINER_ARGS: (Optional.) The arguments to be passed when starting the container.
    • If your training application is a Python package that runs in a pre-built container, specify the following:
      • PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to the available pre-built containers for training.
      • PYTHON_PACKAGE_URIS: Comma-separated list of Cloud Storage URIs of the Python package files that contain the training program and its dependencies. The maximum number of package URIs is 100.
      • PYTHON_MODULE: The Python module name to run after installing the packages.
      • PYTHON_PACKAGE_ARGS: (Optional.) Command-line arguments to be passed to the Python module.
    • Learn about job scheduling options.
    • TIMEOUT: (Optional.) The maximum running time for the job.
  • Specify the LABEL_NAME and LABEL_VALUE for any labels that you want to apply to this custom job.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": MACHINE_TYPE,
          "acceleratorType": ACCELERATOR_TYPE,
          "acceleratorCount": ACCELERATOR_COUNT
        },
        "replicaCount": REPLICA_COUNT,
        "diskSpec": {
          "bootDiskType": DISK_TYPE,
          "bootDiskSizeGb": DISK_SIZE
        },

        // Union field task can be only one of the following:
        "containerSpec": {
          "imageUri": CUSTOM_CONTAINER_IMAGE_URI,
          "command": [
            CUSTOM_CONTAINER_COMMAND
          ],
          "args": [
            CUSTOM_CONTAINER_ARGS
          ]
        },
        "pythonPackageSpec": {
          "executorImageUri": PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,
          "packageUris": [
            PYTHON_PACKAGE_URIS
          ],
          "pythonModule": PYTHON_MODULE,
          "args": [
            PYTHON_PACKAGE_ARGS
          ]
        }
        // End of list of possible types for union field task.
      }
      // Specify one workerPoolSpec for single replica training, or multiple workerPoolSpecs
      // for distributed training.
    ],
    "scheduling": {
      "timeout": TIMEOUT
    }
  },
  "labels": {
    "LABEL_NAME_1": LABEL_VALUE_1,
    "LABEL_NAME_2": LABEL_VALUE_2
  }
}
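
Because containerSpec and pythonPackageSpec belong to the same union field, a request body carries only one of them per worker pool. The following sketch builds a minimal single-pool body and enforces that rule. The function build_request_body and its parameters are hypothetical, and optional fields such as diskSpec and labels are left out for brevity.

```python
import json
from typing import Optional


def build_request_body(
    display_name: str,
    machine_type: str,
    replica_count: int = 1,
    *,
    container_image_uri: Optional[str] = None,
    python_package: Optional[dict] = None,
) -> dict:
    """Build a CustomJob request body with exactly one task type."""
    if (container_image_uri is None) == (python_package is None):
        raise ValueError(
            "specify exactly one of container_image_uri or python_package"
        )
    pool = {
        "machineSpec": {"machineType": machine_type},
        "replicaCount": replica_count,
    }
    if container_image_uri is not None:
        pool["containerSpec"] = {"imageUri": container_image_uri}
    else:
        # Expected keys: executorImageUri, packageUris, pythonModule.
        pool["pythonPackageSpec"] = dict(python_package)
    return {"displayName": display_name, "jobSpec": {"workerPoolSpecs": [pool]}}


body = build_request_body(
    "my-job",
    "n1-standard-4",
    container_image_uri="gcr.io/my-project/my-trainer:latest",  # hypothetical
)
print(json.dumps(body, indent=2))
```

The printed text can be saved as request.json for the commands that follow.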

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs" | Select-Object -Expand Content

The response contains information about the job's specification as well as the ID of the new CustomJob.
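
The job ID can be read from the resource name in the response. The following is a minimal sketch, assuming the response's name field has the form projects/PROJECT/locations/LOCATION/customJobs/ID. The helper custom_job_id and the sample name are hypothetical.

```python
def custom_job_id(resource_name: str) -> str:
    """Return the trailing ID from a CustomJob resource name."""
    parts = resource_name.split("/")
    expected = ("projects", "locations", "customJobs")
    # A well-formed name alternates collection names and identifiers.
    if len(parts) != 6 or tuple(parts[0::2]) != expected or not parts[5]:
        raise ValueError(f"unexpected resource name: {resource_name!r}")
    return parts[5]


sample_name = "projects/123456/locations/us-central1/customJobs/987654"
print(custom_job_id(sample_name))
```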

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the regional API endpoint; its region should match `location`
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_K80',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response');
  console.log(`${JSON.stringify(response)}`);
}
createCustomJob();

Python

This example uses the Vertex SDK for Python. Before you run the following code sample, you must set up authentication.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)
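
The sample above takes location and api_endpoint as separate parameters, so the two can drift apart. One way to keep them consistent is to derive the endpoint from the location, following the LOCATION-aiplatform.googleapis.com pattern used in the REST URL on this page. The helper names below are hypothetical.

```python
def regional_endpoint(location: str) -> str:
    """Return the regional API endpoint host for a location."""
    return f"{location}-aiplatform.googleapis.com"


def client_settings(project: str, location: str) -> dict:
    """Bundle the parent path and client options for one region."""
    return {
        "parent": f"projects/{project}/locations/{location}",
        "client_options": {"api_endpoint": regional_endpoint(location)},
    }


settings = client_settings("my-project", "europe-west4")  # placeholder values
print(settings["parent"])
print(settings["client_options"]["api_endpoint"])
```

The client_options value can then be passed to JobServiceClient in place of the hard-coded default.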

What's next