Creating custom training jobs

This page shows you how to create custom training jobs to run your custom training applications on AI Platform.

Before you submit a job

Before you submit a custom training job to AI Platform, you need to create a Python training application or a custom container to define the training code and dependencies you want to run on AI Platform. If you create a Python training application, you can use our pre-built containers to run your code. If you're not sure which of these options to choose, refer to the training code requirements to learn more.

What a custom job includes

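When you create a custom job, you specify settings that AI Platform needs to run your training code, including:

  • One worker pool for single-replica training, or multiple worker pools for distributed training.
  • Optional settings, such as job scheduling (for example, a timeout) and labels.

Within the worker pool(s), you can specify the following settings:

  • The machine type, accelerators, boot disk, and replica count.
  • The training code that the worker pool runs: either a Python training application in a pre-built container or a custom container.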

Configuring distributed training

You can configure a custom training job for distributed training by specifying multiple worker pools.

Most examples on this page show single-replica training jobs with one worker pool. To modify them for distributed training:

  • Use your first worker pool to configure your primary replica, and set the replica count to 1.
  • Add more worker pools to configure worker replicas, parameter server replicas, or evaluator replicas, if your machine learning framework supports these additional cluster tasks for distributed training.
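
As a rough illustration of the two steps above, the following Python sketch assembles a list of worker pool specifications with a single-replica primary pool followed by any additional pools. The helper name build_worker_pool_specs, the machine types, and the image URI are hypothetical placeholders; the field names mirror the Python sample later on this page. The sketch only builds dictionaries and does not call AI Platform.

```python
from typing import Dict, List


def build_worker_pool_specs(
    image_uri: str,
    primary_machine_type: str,
    extra_pools: List[Dict[str, object]],
) -> List[Dict[str, object]]:
    """Return worker pool specs: one primary replica, then any extra pools.

    Hypothetical helper for illustration only.
    """

    def pool(machine_type: str, replica_count: int) -> Dict[str, object]:
        return {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": replica_count,
            "container_spec": {"image_uri": image_uri},
        }

    # The first worker pool configures the primary replica, so its count is 1.
    specs = [pool(primary_machine_type, 1)]
    # Each additional pool configures workers, parameter servers, or evaluators.
    for extra in extra_pools:
        specs.append(pool(str(extra["machine_type"]), int(extra["replica_count"])))
    return specs


specs = build_worker_pool_specs(
    image_uri="gcr.io/my-project/my-trainer:latest",  # placeholder image URI
    primary_machine_type="n1-standard-4",
    extra_pools=[{"machine_type": "n1-standard-4", "replica_count": 2}],
)
print([s["replica_count"] for s in specs])
```

The resulting list could be used as the worker_pool_specs value of a job specification like the one in the Python sample on this page.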

Learn more about using distributed training.

Creating a custom job

To create a custom job:

Console

In the Google Cloud Console, you cannot create a CustomJob resource directly. However, you can create a TrainingPipeline resource that creates a CustomJob.

The following instructions describe how to create a TrainingPipeline that creates a CustomJob and doesn't do anything else. If you want to use additional TrainingPipeline features, like training with a managed dataset or creating a Model resource at the end of training, read Creating training pipelines.

  1. In the Cloud Console, in the AI Platform section, go to the Training pipelines page.

  2. Click Create to open the Train new model pane.

  3. On the Choose training method step, specify the following settings:

    1. In the Dataset drop-down list, select No managed dataset.

    2. Select Custom training (advanced).

    Click Continue.

  4. On the Define your model step, enter a name of your choice, MODEL_NAME, for your model. Click Continue.

  5. On the Training container step, specify the following settings:

    1. Select whether to use a Pre-built container or a Custom container for training.

    2. Depending on your choice, do one of the following:

    3. In the Model output directory field, you may specify the Cloud Storage URI of a directory in a bucket that you have access to. The directory does not need to exist yet.

      This value gets passed to AI Platform in the baseOutputDirectory API field, which sets several environment variables that your training application can access when it runs.

    4. In the Arguments field, you may optionally specify arguments for AI Platform to use when it starts running your training code. The behavior of these arguments differs depending on what type of container you are using:

    Click Continue.

  6. On the Hyperparameter tuning step, make sure that the Enable hyperparameter tuning checkbox is not selected. Click Continue.

  7. On the Compute and pricing step, specify the following settings:

    1. In the Region drop-down list, select a region that supports custom training.

    2. In the Worker pool 0 section, specify compute resources to use for training.

      If you specify accelerators, make sure the type of accelerator that you choose is available in your selected region.

      If you want to perform distributed training, then click Add more worker pools and specify an additional set of compute resources for each additional worker pool that you want.

    Click Continue.

  8. On the Prediction container step, select No prediction container.

  9. Click Start training to start the custom training pipeline.

gcloud

The following examples use the gcloud beta ai custom-jobs create command.

Depending on whether you have created a Python training application to use with a pre-built container or are using a custom container, run one of the following commands:

Pre-built container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --python-package-uris=PYTHON_PACKAGE_URIS \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE

Replace the following:

  • LOCATION: The region where the container or Python package will be run.
  • JOB_NAME: Required. A display name for the CustomJob.
  • PYTHON_PACKAGE_URIS: Comma-separated list of Cloud Storage URIs of the Python package files that make up the training program and its dependencies. The maximum number of package URIs is 100.
  • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
  • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
  • PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to the available pre-built containers for training.
  • PYTHON_MODULE: The Python module name to run after installing the packages.

Custom container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI

Replace the following:

  • LOCATION: The region where the container or Python package will be run.
  • JOB_NAME: Required. A display name for the CustomJob.
  • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
  • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
  • CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry, Container Registry, or Docker Hub that is to be run on each worker replica.

Distributed training configuration

If you want to perform distributed training, specify the --worker-pool-spec flag multiple times, once for each worker pool. The following commands adapt the preceding examples to use a second worker pool:

Pre-built container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --python-package-uris=PYTHON_PACKAGE_URIS \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,executor-image-uri=PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=PYTHON_MODULE \
  --worker-pool-spec=machine-type=SECOND_POOL_MACHINE_TYPE,replica-count=SECOND_POOL_REPLICA_COUNT,executor-image-uri=SECOND_POOL_PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,python-module=SECOND_POOL_PYTHON_MODULE

Custom container

gcloud beta ai custom-jobs create \
  --region=LOCATION \
  --display-name=JOB_NAME \
  --worker-pool-spec=machine-type=MACHINE_TYPE,replica-count=REPLICA_COUNT,container-image-uri=CUSTOM_CONTAINER_IMAGE_URI \
  --worker-pool-spec=machine-type=SECOND_POOL_MACHINE_TYPE,replica-count=SECOND_POOL_REPLICA_COUNT,container-image-uri=SECOND_POOL_CUSTOM_CONTAINER_IMAGE_URI
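
If you generate these commands from a script, a small helper can keep the repeated --worker-pool-spec flags consistent. The following Python sketch builds the argument list for the custom container variant using only the flags shown above. The function names, region, display name, and image URI are hypothetical; the sketch does not run gcloud itself.

```python
import shlex
from typing import Dict, List


def worker_pool_flag(spec: Dict[str, str]) -> str:
    """Format one --worker-pool-spec flag from key/value pairs."""
    return "--worker-pool-spec=" + ",".join(f"{k}={v}" for k, v in spec.items())


def custom_job_command(
    region: str, display_name: str, pools: List[Dict[str, str]]
) -> List[str]:
    """Build the argument list for `gcloud beta ai custom-jobs create`."""
    args = [
        "gcloud", "beta", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
    ]
    # One flag per worker pool; the first pool configures the primary replica.
    args += [worker_pool_flag(p) for p in pools]
    return args


command = custom_job_command(
    region="us-central1",
    display_name="my-job",
    pools=[
        {"machine-type": "n1-standard-4", "replica-count": "1",
         "container-image-uri": "gcr.io/my-project/my-trainer:latest"},
        {"machine-type": "n1-standard-4", "replica-count": "2",
         "container-image-uri": "gcr.io/my-project/my-trainer:latest"},
    ],
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

If the gcloud CLI is installed and authenticated, the list can be passed to subprocess.run(command, check=True).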

Advanced configuration

If you want to specify configuration options that are not available in the preceding examples, you can use the --config flag to specify the path to a config.yaml file in your local environment that contains the fields of CustomJobSpec. For example:

gcloud beta ai custom-jobs create \
    --region=LOCATION \
    --display-name=JOB_NAME \
    --config=config.yaml

See an example of a config.yaml file.
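
For orientation, the following Python sketch renders the text of a minimal config.yaml for a single worker pool that runs a custom container. It assumes that the file uses the same camelCase field names as the REST request body on this page (workerPoolSpecs, machineSpec, replicaCount, containerSpec); the function name, machine type, and image URI are hypothetical placeholders. Check the linked example before relying on this layout.

```python
CONFIG_TEMPLATE = """\
workerPoolSpecs:
  - machineSpec:
      machineType: {machine_type}
    replicaCount: {replica_count}
    containerSpec:
      imageUri: {image_uri}
"""


def render_config(machine_type: str, image_uri: str, replica_count: int = 1) -> str:
    """Return config.yaml text for one worker pool (illustrative layout only)."""
    return CONFIG_TEMPLATE.format(
        machine_type=machine_type,
        replica_count=replica_count,
        image_uri=image_uri,
    )


config_text = render_config(
    machine_type="n1-standard-4",
    image_uri="gcr.io/my-project/my-trainer:latest",  # placeholder image URI
)
print(config_text)
```

Save the printed text as config.yaml and pass its path to the --config flag.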

REST & CMD LINE

Before using any of the request data below, make the following replacements:

  • LOCATION: The region where the container or Python package will be run.
  • PROJECT_ID: Your project ID or project number.
  • JOB_NAME: Required. A display name for the CustomJob.
  • Define the custom training job:
    • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
    • ACCELERATOR_TYPE: (Optional.) The type of accelerator to attach to the job.
    • ACCELERATOR_COUNT: (Optional.) The number of accelerators to attach to the job.
    • DISK_TYPE: (Optional.) The type of the boot disk to use for the job, either pd-standard (default) or pd-ssd. Learn more about disk types.
    • DISK_SIZE: (Optional.) The size in GB of the boot disk to use for the job. The default value is 100.
    • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
    • If your training application runs in a custom container, specify the following:
      • CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry, Container Registry, or Docker Hub that is to be run on each worker replica.
      • CUSTOM_CONTAINER_COMMAND: (Optional.) The command to be invoked when the container is started. This command overrides the container's default entrypoint.
      • CUSTOM_CONTAINER_ARGS: (Optional.) The arguments to be passed when starting the container.
    • If your training application is a Python package that runs in a pre-built container, specify the following:
      • PYTHON_PACKAGE_EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided Python package. Refer to the available pre-built containers for training.
      • PYTHON_PACKAGE_URIS: Comma-separated list of Cloud Storage URIs of the Python package files that make up the training program and its dependencies. The maximum number of package URIs is 100.
      • PYTHON_MODULE: The Python module name to run after installing the packages.
      • PYTHON_PACKAGE_ARGS: (Optional.) Command-line arguments to be passed to the Python module.
    • Learn about job scheduling options.
    • TIMEOUT: (Optional.) The maximum running time for the job.
  • Specify the LABEL_NAME and LABEL_VALUE for any labels that you want to apply to this custom job.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": MACHINE_TYPE,
          "acceleratorType": ACCELERATOR_TYPE,
          "acceleratorCount": ACCELERATOR_COUNT
        },
        "replicaCount": REPLICA_COUNT,
        "diskSpec": {
          "bootDiskType": DISK_TYPE,
          "bootDiskSizeGb": DISK_SIZE
        },

        // Union field task can be only one of the following:
        "containerSpec": {
          "imageUri": CUSTOM_CONTAINER_IMAGE_URI,
          "command": [
            CUSTOM_CONTAINER_COMMAND
          ],
          "args": [
            CUSTOM_CONTAINER_ARGS
          ]
        },
        "pythonPackageSpec": {
          "executorImageUri": PYTHON_PACKAGE_EXECUTOR_IMAGE_URI,
          "packageUris": [
            PYTHON_PACKAGE_URIS
          ],
          "pythonModule": PYTHON_MODULE,
          "args": [
            PYTHON_PACKAGE_ARGS
          ]
        }
        // End of list of possible types for union field task.
      }
      // Specify one workerPoolSpec for single replica training, or multiple workerPoolSpecs
      // for distributed training.
    ],
    "scheduling": {
      "timeout": TIMEOUT
    }
  },
  "labels": {
    "LABEL_NAME_1": LABEL_VALUE_1,
    "LABEL_NAME_2": LABEL_VALUE_2
  }
  }
}
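
Because the request template above contains comments and a union field, it can help to generate the final JSON programmatically. The following Python sketch builds a request body with one worker pool and checks that exactly one of containerSpec or pythonPackageSpec is set. The function name and example values are hypothetical, and the "3600s" timeout format is an assumption (a duration string in seconds).

```python
import json
from typing import Dict, Optional


def build_custom_job_body(
    display_name: str,
    machine_type: str,
    replica_count: int = 1,
    container_spec: Optional[Dict[str, object]] = None,
    python_package_spec: Optional[Dict[str, object]] = None,
    timeout_seconds: Optional[int] = None,
    labels: Optional[Dict[str, str]] = None,
) -> Dict[str, object]:
    """Build a request body for creating a custom job (illustrative only)."""
    # The task is a union field: exactly one of the two specs must be set.
    if (container_spec is None) == (python_package_spec is None):
        raise ValueError("Set exactly one of container_spec or python_package_spec")

    pool: Dict[str, object] = {
        "machineSpec": {"machineType": machine_type},
        "replicaCount": replica_count,
    }
    if container_spec is not None:
        pool["containerSpec"] = container_spec
    else:
        pool["pythonPackageSpec"] = python_package_spec

    job_spec: Dict[str, object] = {"workerPoolSpecs": [pool]}
    if timeout_seconds is not None:
        # Assumption: the timeout is a duration string in seconds, such as "3600s".
        job_spec["scheduling"] = {"timeout": f"{timeout_seconds}s"}

    body: Dict[str, object] = {"displayName": display_name, "jobSpec": job_spec}
    if labels:
        body["labels"] = labels
    return body


body = build_custom_job_body(
    display_name="my-job",
    machine_type="n1-standard-4",
    container_spec={"imageUri": "gcr.io/my-project/my-trainer:latest"},
    timeout_seconds=3600,
    labels={"team": "research"},
)
print(json.dumps(body, indent=2))
```

The printed JSON can be saved as request.json.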

To send your request, choose one of these options:

curl

Save the request body in a file called request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs

PowerShell

Save the request body in a file called request.json, and execute the following command:

$cred = gcloud auth application-default print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/customJobs" | Select-Object -Expand Content

The response contains information about the job's specifications as well as the ID of the new custom job.
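
The same POST request can also be prepared from Python's standard library. The following sketch constructs the request object with the URL and headers used by the curl command above, without sending it. The function name and example values are hypothetical, and ACCESS_TOKEN stands in for the output of gcloud auth application-default print-access-token.

```python
import json
import urllib.request
from typing import Dict


def build_create_request(
    location: str, project_id: str, body: Dict[str, object], access_token: str
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates a custom job."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/customJobs"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_create_request(
    location="us-central1",
    project_id="my-project",  # placeholder project ID
    body={"displayName": "my-job"},  # replace with the full request body
    access_token="ACCESS_TOKEN",  # placeholder token
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and parse the JSON response.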

Node.js

/**
 * TODO(developer): Uncomment these variables before running the sample.
 * (Not necessary if passing values as arguments)
 */

// const customJobDisplayName = 'YOUR_CUSTOM_JOB_DISPLAY_NAME';
// const containerImageUri = 'YOUR_CONTAINER_IMAGE_URI';
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';

// Imports the Google Cloud Job Service Client library
const {JobServiceClient} = require('@google-cloud/aiplatform');

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};

// Instantiates a client
const jobServiceClient = new JobServiceClient(clientOptions);

async function createCustomJob() {
  // Configure the parent resource
  const parent = `projects/${project}/locations/${location}`;
  const customJob = {
    displayName: customJobDisplayName,
    jobSpec: {
      workerPoolSpecs: [
        {
          machineSpec: {
            machineType: 'n1-standard-4',
            acceleratorType: 'NVIDIA_TESLA_K80',
            acceleratorCount: 1,
          },
          replicaCount: 1,
          containerSpec: {
            imageUri: containerImageUri,
            command: [],
            args: [],
          },
        },
      ],
    },
  };
  const request = {parent, customJob};

  // Create custom job request
  const [response] = await jobServiceClient.createCustomJob(request);

  console.log('Create custom job response');
  console.log(`${JSON.stringify(response)}`);
}
createCustomJob();

Python

This example uses the AI Platform (Unified) Client Library for Python. Before you run the following code sample, you must set up authentication.

from google.cloud import aiplatform


def create_custom_job_sample(
    project: str,
    display_name: str,
    container_image_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    custom_job = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": container_image_uri,
                        "command": [],
                        "args": [],
                    },
                }
            ]
        },
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_custom_job(parent=parent, custom_job=custom_job)
    print("response:", response)
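
The sample above runs a custom container. For a Python training application that runs in a pre-built container, the worker pool would carry a Python package specification instead. The following sketch builds such a custom_job dictionary without calling the API. It assumes that the snake_case field names mirror the REST fields executorImageUri, packageUris, pythonModule, and args; the function name, module name, and URIs are hypothetical placeholders.

```python
from typing import Dict, Sequence


def build_python_package_job(
    display_name: str,
    executor_image_uri: str,
    package_uris: Sequence[str],
    python_module: str,
    args: Sequence[str] = (),
) -> Dict[str, object]:
    """Build a custom_job dict whose worker pool runs a Python package."""
    return {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {"machine_type": "n1-standard-4"},
                    "replica_count": 1,
                    # Used instead of container_spec; set only one of the two.
                    "python_package_spec": {
                        "executor_image_uri": executor_image_uri,
                        "package_uris": list(package_uris),
                        "python_module": python_module,
                        "args": list(args),
                    },
                }
            ]
        },
    }


custom_job = build_python_package_job(
    display_name="my-python-package-job",
    executor_image_uri="PYTHON_PACKAGE_EXECUTOR_IMAGE_URI",  # placeholder
    package_uris=["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder URI
    python_module="trainer.task",  # hypothetical module name
    args=["--epochs=5"],  # hypothetical argument
)
print(custom_job["job_spec"]["worker_pool_specs"][0]["python_package_spec"])
```

The resulting dictionary could be passed as the custom_job argument of client.create_custom_job in the sample above.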