Starting a Training Job

Cloud Machine Learning Engine provides model training as an asynchronous service. This page describes how to configure and submit a training job. For more detail about how training works and its features, see the training concepts page.

After you have done all of your preparatory configuration work, submitting a training job to Cloud ML Engine is a matter of setting up a single request to projects.jobs.create. The job creation service is also exposed through gcloud ml-engine jobs submit training from the command line. This page steps you through the process.

Before you begin

Running a training job is one step in the larger model training process. You should have completed these steps before you submit training jobs:

  1. Configure your development environment.

  2. Develop your trainer application with TensorFlow.

  3. Package your application and upload it, along with any additional dependencies, to a Google Cloud Storage bucket (this step is included in job creation when you use the gcloud command-line tool).

Configuring the Job

Before you can start a job, you need to assemble its configuration details. These details correspond to the required fields of the Job resource, including the fields of the TrainingInput resource. You can read more about the configuration information that you need to provide along with the other training concepts.

Gathering the job configuration data

Cloud ML Engine provides many options for training jobs in the cloud. The following parameters are used to define your job. You can find more detail about these items in the training overview.

Training application package
A packaged training application that is staged in a Google Cloud Storage location. If you are using the gcloud command-line tool, this step is largely automated. You'll find all the details on the trainer packaging how-to page.
Job name
A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter).
Cluster configuration
A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
Module name
The name of the main module in your trainer package. The main module is the Python file you call to start the application.
Job directory
The path to a Google Cloud Storage location to use for job output.
Region
The Compute Engine region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data.
Runtime version
The version of Cloud ML Engine to use for the job. If you don't specify a runtime version, the training service uses the latest stable version.
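
As an illustration of the job name rule above, here is a small Python sketch (the helper name `make_job_name` is hypothetical, not part of Cloud ML Engine) that builds a unique, valid job name by appending a timestamp to a model name, matching the shell example later on this page:

```python
import re
from datetime import datetime

# Job names must start with a letter and contain only mixed-case
# letters, numbers, and underscores.
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z][a-zA-Z0-9_]*$')

def make_job_name(model_name):
    """Append a timestamp so repeated submissions get unique names."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    name = '{}_{}'.format(model_name, timestamp)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError('Invalid job name: {}'.format(name))
    return name
```

Because job names must be unique within a project, appending a timestamp like this is a simple way to avoid collisions across repeated submissions.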

Defining environment variables

When you use the gcloud command-line tool to submit your training job, it can be helpful to define your configuration details as environment variables. Here are some to use with the commands on this page:

TRAINER_PACKAGE_PATH="/path/to/your/application/sources"
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="census_$now"
MAIN_TRAINER_MODULE="trainer.task"
JOB_DIR="gs://your/chosen/job/output/path"
PACKAGE_STAGING_LOCATION="gs://your/chosen/staging/path"
REGION="us-east1"
RUNTIME_VERSION="1.2"

In addition to defining environment variables, the preceding example builds a unique job name by appending the current date and time to the model name.

Formatting your configuration parameters

How you specify your configuration details depends on how you are starting your training job:

gcloud

You provide job configuration details to the gcloud ml-engine jobs submit training command in a combination of two ways:

  • With command-line flags.
  • In a YAML file representing the Job resource.

    You can name this file whatever you want, but by convention it's called config.yaml. You pass it to the tool using the --config flag.

The following example shows the contents of the configuration file for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3

The partial configuration shown in the file is possible because the other required details are available as command-line flags.

Python

When you use the Google API client library to submit a training job from a Python script, set your configuration in a dictionary with the same structure as the Job resource: a dictionary with two keys, jobId and trainingInput, where jobId is the name for the job and trainingInput is a second dictionary with keys for the fields of the TrainingInput resource.

The following example shows how to build a Job representation for a job with a custom processing cluster.

training_inputs = {'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerType': 'complex_model_m',
    'parameterServerType': 'large_model',
    'workerCount': 9,
    'parameterServerCount': 3,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '1.2'}

job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown (to match the names in the Job and TrainingInput resources).
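
Because the keys must match the resource field names exactly, it can help to check them before you submit. The following is a minimal sketch (the helper name is hypothetical, and the set of field names is taken from the example above, not the complete TrainingInput schema):

```python
# Field names from the example above; the real TrainingInput resource
# has more optional fields than are listed here.
KNOWN_TRAINING_INPUT_KEYS = {
    'scaleTier', 'masterType', 'workerType', 'parameterServerType',
    'workerCount', 'parameterServerCount', 'packageUris',
    'pythonModule', 'args', 'region', 'jobDir', 'runtimeVersion',
}

def check_training_inputs(training_inputs):
    """Return a sorted list of keys that don't match the known field names."""
    return sorted(set(training_inputs) - KNOWN_TRAINING_INPUT_KEYS)
```

A check like this catches, for example, a snake_case key such as scale_tier where the API expects camelCase scaleTier.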

Submitting the job

With your job arguments configured, create your job:

gcloud

Submit a training job request using the gcloud ml-engine jobs submit training command:

gcloud ml-engine jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml \
        -- \
        --trainer_arg_1 value_1 \
        ...
        --trainer_arg_n value_n

Python

  1. Save your project ID in the format the APIs need ('projects/project_name'):

    project_name = 'my_project_name'
    project_id = 'projects/{}'.format(project_name)
    
  2. Get your Cloud Platform credentials:

    credentials = GoogleCredentials.get_application_default()
    
  3. Get a Python representation of the Cloud ML Engine services:

    cloudml = discovery.build('ml', 'v1', credentials=credentials)
    
  4. Form your request and send it:

    request = cloudml.projects().jobs().create(body=job_spec,
                  parent=project_id)
    response = request.execute()
    
  5. Catch any HTTP errors. The simplest way is to put the previous command in a try block:

    try:
        response = request.execute()
        # You can put your code for handling success (if any) here.
    
    except errors.HttpError as err:
        # Do whatever error response is appropriate for your application.
        # For this example, just send some text to the logs.
        # You need to import logging for this to work.
        logging.error('There was an error creating the training job.'
                      ' Check the details:')
        logging.error(err._get_reason())
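
Put together, the steps above can be sketched as a pair of helpers. This is a sketch, not a definitive implementation: the function names are hypothetical, and the google-api-python-client and oauth2client packages are assumed to be installed, as in the steps above.

```python
import logging

def build_job_spec(job_name, training_inputs):
    """Assemble the request body in the shape of the Job resource."""
    return {'jobId': job_name, 'trainingInput': training_inputs}

def submit_training_job(project_name, job_spec):
    """Create the job via projects.jobs.create and return the response."""
    # Imported here so build_job_spec stays usable without the client
    # libraries installed.
    from googleapiclient import discovery, errors
    from oauth2client.client import GoogleCredentials

    # Save your project ID in the format the APIs need.
    project_id = 'projects/{}'.format(project_name)
    # Get your Cloud Platform credentials and the ML Engine services.
    credentials = GoogleCredentials.get_application_default()
    cloudml = discovery.build('ml', 'v1', credentials=credentials)
    # Form the request and send it, catching any HTTP errors.
    request = cloudml.projects().jobs().create(body=job_spec,
                                               parent=project_id)
    try:
        return request.execute()
    except errors.HttpError as err:
        logging.error('There was an error creating the training job.'
                      ' Check the details:')
        logging.error(err._get_reason())
        raise
```

You would call build_job_spec with the training_inputs dictionary from the earlier example, then pass the result to submit_training_job along with your project name.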
    
