Running a Training Job

Cloud Machine Learning Engine provides model training as an asynchronous (batch) service. This page describes how to configure and submit a training job by running gcloud ml-engine jobs submit training from the command line or by sending a request to the API at projects.jobs.create.

Before you begin

Before you can submit a training job, you must complete the following steps:

  1. Configure your development environment, as described in the getting-started guide.

  2. Develop your training application with TensorFlow.

  3. Package your application and upload it and any unusual dependencies to a Cloud Storage bucket. Note: If you use the gcloud command-line tool to submit your job, you can package the application and submit the job in the same step.

Configuring the job

You pass your parameters to the training service by setting the members of the Job resource, which includes the items in the TrainingInput resource.

If you use the gcloud command-line tool to submit your training jobs, you can:

  • Specify the most common training parameters as flags of the gcloud ml-engine jobs submit training command.
  • Pass the remaining parameters in a YAML configuration file, named config.yaml by convention. The configuration file mirrors the structure of the JSON representation of the Job resource. Pass the path to the file in the --config flag of the gcloud ml-engine jobs submit training command. For example, if the path to your configuration file is config.yaml, set --config=config.yaml.

Gathering the job configuration data

The following properties are used to define your job.

Job name (jobId)
A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter).
Cluster configuration (scaleTier)
A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
Training application package (packageUris)
A packaged training application that is staged in a Cloud Storage location. If you are using the gcloud command-line tool, the application packaging step is largely automated. See the details in the guide to packaging your application.
Module name (pythonModule)
The name of the main module in your package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name flag. See the guide to packaging your application.
Region (region)
The Compute Engine region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. See the available regions for Cloud ML Engine services.
Job directory (jobDir)
The path to a Cloud Storage location to use for job output.
Runtime version (runtimeVersion)
The Cloud ML Engine version to use for the job. If you don't specify a runtime version, the training service uses the default Cloud ML Engine runtime version 1.0.
Python version (pythonVersion)
The Python version to use for the job. Python 3.5 is available with Cloud ML Engine runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
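
The job name rule above can be checked locally before you submit. The following is a minimal sketch: the helper name is_valid_job_id is hypothetical, and the pattern encodes only the rule stated on this page. The service may enforce further limits, such as a maximum length, that are not covered here.

```python
import re

# Letters, digits, and underscores only, starting with a letter.
# \Z anchors at the very end of the string, so a trailing newline is rejected.
_JOB_ID_PATTERN = re.compile(r'[A-Za-z][A-Za-z0-9_]*\Z')


def is_valid_job_id(job_id):
    """Return True if job_id follows the naming rule described above."""
    return bool(_JOB_ID_PATTERN.match(job_id))


print(is_valid_job_id('census_20180601_120000'))  # True
print(is_valid_job_id('2018_census'))             # False: starts with a digit
print(is_valid_job_id('census-training'))         # False: contains a hyphen
```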

Defining environment variables

It's useful to define your configuration details as environment variables when you use the gcloud command-line tool to submit your training job. Here are some examples to use with the commands on this page:

TRAINER_PACKAGE_PATH="/path/to/your/application/sources"
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="your_name_$now"
MAIN_TRAINER_MODULE="trainer.task"
JOB_DIR="gs://your/chosen/job/output/path"
PACKAGE_STAGING_PATH="gs://your/chosen/staging/path"
REGION="us-east1"
RUNTIME_VERSION="1.8"
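
If you submit the job from a Python script rather than from the shell, you can build the same kind of timestamped job name with the standard library. This is a sketch only: make_job_name is a hypothetical helper, and the prefix is a placeholder.

```python
from datetime import datetime


def make_job_name(prefix, now=None):
    """Build a job name such as 'your_name_20180601_120000'."""
    # Same timestamp layout as the shell example: date +"%Y%m%d_%H%M%S".
    now = now or datetime.now()
    return '{}_{}'.format(prefix, now.strftime('%Y%m%d_%H%M%S'))


my_job_name = make_job_name('your_name')
print(my_job_name)
```

Passing an explicit datetime as the second argument makes the result reproducible, which is convenient in tests.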

Formatting your configuration parameters

How you specify your configuration details depends on how you are starting your training job:

gcloud

You can provide job configuration details to the gcloud ml-engine jobs submit training command in two ways, which you can also combine:

  • With command-line flags.
  • In a YAML file representing the Job resource.

    You can name this file whatever you want. By convention it's named config.yaml. Pass it to the gcloud tool using the --config flag.

The following example shows the contents of the configuration file for a job with a custom processing cluster.

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3

The file can contain a partial configuration like this one because you supply the other required details as command-line flags.

You can use Python 3.5 by specifying a Python version in your configuration file. See how to submit a training job using Python 3.5.

Python

When you use the Google API client library to submit a training job from a Python script, set your configuration in a dictionary with the same structure as the Job resource. The dictionary has two keys: jobId, whose value is the name of the job, and trainingInput, whose value is a second dictionary with keys for the fields of the TrainingInput resource.

The following example shows how to build a Job representation for a job with a custom processing cluster.

training_inputs = {'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'workerType': 'complex_model_m',
    'parameterServerType': 'large_model',
    'workerCount': 9,
    'parameterServerCount': 3,
    'packageUris': ['gs://my/trainer/path/package-0.0.0.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--arg1', 'value1', '--arg2', 'value2'],
    'region': 'us-central1',
    'jobDir': 'gs://my/training/job/directory',
    'runtimeVersion': '1.8',
    'pythonVersion': '3.5'}

# my_job_name is a string that you define, holding the name for the job.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Note that training_inputs and job_spec are arbitrary identifiers: you can name these dictionaries whatever you want. However, the dictionary keys must be named exactly as shown, to match the names in the Job and TrainingInput resources.
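
Because a misspelled key does not match the resource, it can help to check the dictionary before sending it. The sketch below is illustrative: unrecognized_keys is a hypothetical helper, and the set of names contains only the TrainingInput fields that appear on this page, not the full resource definition.

```python
# Keys taken from the TrainingInput fields named on this page. This is not
# the full resource definition; extend the set for other fields you use.
KNOWN_TRAINING_INPUT_KEYS = {
    'scaleTier', 'masterType', 'workerType', 'parameterServerType',
    'workerCount', 'parameterServerCount', 'packageUris', 'pythonModule',
    'args', 'region', 'jobDir', 'runtimeVersion', 'pythonVersion',
}


def unrecognized_keys(training_inputs):
    """Return the keys in training_inputs that are not in the known set."""
    return sorted(set(training_inputs) - KNOWN_TRAINING_INPUT_KEYS)


# 'master_type' is a deliberate misspelling of 'masterType'.
example = {'scaleTier': 'CUSTOM', 'master_type': 'complex_model_m'}
print(unrecognized_keys(example))  # ['master_type']
```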

Python 3.5 is available when you use Cloud ML Engine runtime version 1.4 or greater.

Submitting the job

When submitting a training job, you specify two sets of flags:

  • Job configuration parameters. Cloud ML Engine needs these values to set up resources in the cloud and deploy your application on each node in the processing cluster.
  • User arguments, or application parameters. Cloud ML Engine passes the value of these flags through to your application.

Create your job:

gcloud

Submit a training job request using the gcloud ml-engine jobs submit training command:

gcloud ml-engine jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml \
        -- \
        --user_first_arg=first_arg_value \
        --user_second_arg=second_arg_value

Notes:

  • The empty -- flag marks the end of the gcloud-specific flags and the start of the USER_ARGS that you want to pass to your application.
  • Flags specific to Cloud ML Engine, such as --module-name, --runtime-version, and --job-dir, must come before the empty -- flag. The Cloud ML Engine service interprets these flags.
  • The --job-dir flag, if specified, must come before the empty -- flag, because Cloud ML Engine uses the --job-dir to validate the path.
  • Your application must handle the --job-dir flag too, if specified. Even though the flag comes before the empty --, the --job-dir is also passed to your application as a command-line flag.
  • You can define as many USER_ARGS as you need. Cloud ML Engine passes --user_first_arg, --user_second_arg, and so on, through to your application.
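
On the application side, the notes above mean that your main module receives --job-dir together with your own arguments. One way to handle this, sketched here with the standard argparse module, is shown below. The argument names --user_first_arg and --user_second_arg are the placeholders from the example command, and parse_args is a hypothetical helper, not part of any Cloud ML Engine library.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that are passed through to the application."""
    parser = argparse.ArgumentParser()
    # Passed to the application when --job-dir is set on the job.
    parser.add_argument('--job-dir', default=None,
                        help='Cloud Storage path for job output.')
    # Placeholder user arguments, matching the example command above.
    parser.add_argument('--user_first_arg', default=None)
    parser.add_argument('--user_second_arg', default=None)
    # parse_known_args tolerates flags that this parser does not declare.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_args(['--job-dir', 'gs://my/training/job/directory',
                            '--user_first_arg=first_arg_value'])
print(args.job_dir)         # gs://my/training/job/directory
print(args.user_first_arg)  # first_arg_value
```

In a real application, call parse_args() with no argument so that argparse reads the flags from sys.argv.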

Python

  1. Save your project ID in the format the APIs need ('projects/project_name'):

    project_name = 'my_project_name'
    project_id = 'projects/{}'.format(project_name)
    
  2. Get a Python representation of the Cloud ML Engine services:

    from googleapiclient import discovery

    cloudml = discovery.build('ml', 'v1')
    
  3. Form your request and send it:

    request = cloudml.projects().jobs().create(body=job_spec,
                  parent=project_id)
    response = request.execute()
    
  4. Catch any HTTP errors. The simplest way is to wrap the request.execute() call from the previous step in a try block:

    import logging

    from googleapiclient import errors

    try:
        response = request.execute()
        # You can put your code for handling success (if any) here.

    except errors.HttpError as err:
        # Do whatever error response is appropriate for your application.
        # For this example, just send some text to the logs.
        logging.error('There was an error creating the training job.'
                      ' Check the details:')
        logging.error(err._get_reason())
    
