gcloud alpha ml-engine jobs submit training

NAME
gcloud alpha ml-engine jobs submit training - submit an AI Platform training job
SYNOPSIS
gcloud alpha ml-engine jobs submit training JOB [--config=CONFIG] [--enable-web-access] [--job-dir=JOB_DIR] [--labels=[KEY=VALUE,…]] [--master-accelerator=[count=COUNT],[type=TYPE]] [--master-image-uri=MASTER_IMAGE_URI] [--master-machine-type=MASTER_MACHINE_TYPE] [--module-name=MODULE_NAME] [--network=NETWORK] [--package-path=PACKAGE_PATH] [--packages=[PACKAGE,…]] [--parameter-server-accelerator=[count=COUNT],[type=TYPE]] [--parameter-server-image-uri=PARAMETER_SERVER_IMAGE_URI] [--python-version=PYTHON_VERSION] [--region=REGION] [--runtime-version=RUNTIME_VERSION] [--scale-tier=SCALE_TIER] [--service-account=SERVICE_ACCOUNT] [--staging-bucket=STAGING_BUCKET] [--tpu-tf-version=TPU_TF_VERSION] [--use-chief-in-tf-config=USE_CHIEF_IN_TF_CONFIG] [--worker-accelerator=[count=COUNT],[type=TYPE]] [--worker-image-uri=WORKER_IMAGE_URI] [--async     | --stream-logs] [--kms-key=KMS_KEY : --kms-keyring=KMS_KEYRING --kms-location=KMS_LOCATION --kms-project=KMS_PROJECT] [--parameter-server-count=PARAMETER_SERVER_COUNT --parameter-server-machine-type=PARAMETER_SERVER_MACHINE_TYPE] [--worker-count=WORKER_COUNT --worker-machine-type=WORKER_MACHINE_TYPE] [GCLOUD_WIDE_FLAG] [-- USER_ARGS …]
DESCRIPTION
(ALPHA) Submit an AI Platform training job.

This creates temporary files and executes Python code staged by a user on Cloud Storage. Model code can either be specified with a path, e.g.:

gcloud alpha ml-engine jobs submit training my_job --module-name trainer.task --staging-bucket gs://my-bucket --package-path /my/code/path/trainer --packages additional-dep1.tar.gz,dep2.whl

Or by specifying an already built package:

gcloud alpha ml-engine jobs submit training my_job --module-name trainer.task --staging-bucket gs://my-bucket --packages trainer-0.0.1.tar.gz,additional-dep1.tar.gz,dep2.whl

If --package-path=/my/code/path/trainer is specified and there is a setup.py file at /my/code/path/setup.py, the setup file will be invoked with sdist and the generated tar files will be uploaded to Cloud Storage. Otherwise, a temporary setup.py file will be generated for the build.

By default, this command runs asynchronously; it exits once the job is successfully submitted.

To follow the progress of your job, pass the --stream-logs flag (note that even with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID).

For more information, see: https://cloud.google.com/ai-platform/training/docs/overview

POSITIONAL ARGUMENTS
JOB
Name of the job.
[-- USER_ARGS …]
Additional user arguments to be forwarded to user code

The '--' argument must be specified between gcloud specific args on the left and USER_ARGS on the right.

FLAGS
--config=CONFIG
Path to the job configuration file. This file should be a YAML document (JSON also accepted) containing a Job resource as defined in the API (all fields are optional): https://cloud.google.com/ml/reference/rest/v1/projects.jobs

EXAMPLES:

JSON:

{
  "jobId": "my_job",
  "labels": {
    "type": "prod",
    "owner": "alice"
  },
  "trainingInput": {
    "scaleTier": "BASIC",
    "packageUris": [
      "gs://my/package/path"
    ],
    "region": "us-east1"
  }
}

YAML:

jobId: my_job
labels:
  type: prod
  owner: alice
trainingInput:
  scaleTier: BASIC
  packageUris:
  - gs://my/package/path
  region: us-east1
If an option is specified both in the configuration file **and** via command line arguments, the command line arguments override the configuration file.
--enable-web-access
Whether you want AI Platform Training to enable [interactive shell access] (https://cloud.google.com/ai-platform/training/docs/monitor-debug-interactive-shell) to training containers. If set to true, you can access interactive shells at the URIs given by TrainingOutput.web_access_uris or HyperparameterOutput.web_access_uris (within TrainingOutput.trials).
--job-dir=JOB_DIR
Cloud Storage path in which to store training outputs and other data needed for training.

This path will be passed to your TensorFlow program as the --job-dir command-line arg. The benefit of specifying this field is that AI Platform will validate the path for use in training. However, note that your training program will need to parse the provided --job-dir argument.

If packages must be uploaded and --staging-bucket is not provided, this path will be used instead.

--labels=[KEY=VALUE,…]
List of label KEY=VALUE pairs to add.

Keys must start with a lowercase character and contain only hyphens (-), underscores (_), lowercase characters, and numbers. Values must contain only hyphens (-), underscores (_), lowercase characters, and numbers.

--master-accelerator=[count=COUNT],[type=TYPE]
Hardware accelerator config for the master worker. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type
Type of the accelerator. Choices are nvidia-tesla-a100,nvidia-tesla-k80,nvidia-tesla-p100,nvidia-tesla-p4,nvidia-tesla-t4,nvidia-tesla-v100,tpu-v2,tpu-v2-pod,tpu-v3,tpu-v3-pod,tpu-v4-pod
count
Number of accelerators to attach to each machine running the job. Must be greater than 0.
--master-image-uri=MASTER_IMAGE_URI
Docker image to run on each master worker. This image must be in Container Registry. Only one of --master-image-uri and --runtime-version must be specified.
--master-machine-type=MASTER_MACHINE_TYPE
Specifies the type of virtual machine to use for training job's master worker.

You must set this value when --scale-tier is set to CUSTOM.

--module-name=MODULE_NAME
Name of the module to run.
--network=NETWORK
Full name of the Google Compute Engine network (https://cloud.google.com/vpc/docs) to which the Job is peered with. For example, projects/12345/global/networks/myVPC. The format is of the form projects/{project}/global/networks/{network}, where {project} is a project number, as in '12345', and {network} is network name. Private services access must already have been configured (https://cloud.google.com/vpc/docs/configure-private-services-access) for the network. If unspecified, the Job is not peered with any network.
--package-path=PACKAGE_PATH
Path to a Python package to build. This should point to a local directory containing the Python source for the job. It will be built using setuptools (which must be installed) using its parent directory as context. If the parent directory contains a setup.py file, the build will use that; otherwise, it will use a simple built-in one.
--packages=[PACKAGE,…]
Path to Python archives used for training. These can be local paths (absolute or relative), in which case they will be uploaded to the Cloud Storage bucket given by --staging-bucket, or Cloud Storage URLs ('gs://bucket-name/path/to/package.tar.gz').
--parameter-server-accelerator=[count=COUNT],[type=TYPE]
Hardware accelerator config for the parameter servers. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type
Type of the accelerator. Choices are nvidia-tesla-a100,nvidia-tesla-k80,nvidia-tesla-p100,nvidia-tesla-p4,nvidia-tesla-t4,nvidia-tesla-v100,tpu-v2,tpu-v2-pod,tpu-v3,tpu-v3-pod,tpu-v4-pod
count
Number of accelerators to attach to each machine running the job. Must be greater than 0.
--parameter-server-image-uri=PARAMETER_SERVER_IMAGE_URI
Docker image to run on each parameter server. This image must be in Container Registry. If not specified, the value of --master-image-uri is used.
--python-version=PYTHON_VERSION
Version of Python used during training. Choices are 3.7, 3.5, and 2.7. However, this value must be compatible with the chosen runtime version for the job.

Must be used with a compatible runtime version:

  • 3.7 is compatible with runtime versions 1.15 and later.
  • 3.5 is compatible with runtime versions 1.4 through 1.14.
  • 2.7 is compatible with runtime versions 1.15 and earlier.
--region=REGION
Region of the machine learning training job to submit. If not specified, you might be prompted to select a region (interactive mode only).

To avoid prompting when this flag is omitted, you can set the compute/region property:

gcloud config set compute/region REGION

A list of regions can be fetched by running:

gcloud compute regions list

To unset the property, run:

gcloud config unset compute/region

Alternatively, the region can be stored in the environment variable CLOUDSDK_COMPUTE_REGION.

--runtime-version=RUNTIME_VERSION
AI Platform runtime version for this job. Must be specified unless --master-image-uri is specified instead. It is defined in documentation along with the list of supported versions: https://cloud.google.com/ai-platform/prediction/docs/runtime-version-list
--scale-tier=SCALE_TIER
Specify the machine types, the number of replicas for workers, and parameter servers. SCALE_TIER must be one of:
basic
Single worker instance. This tier is suitable for learning how to use AI Platform, and for experimenting with new models using small datasets.
basic-gpu
Single worker instance with a GPU.
basic-tpu
Single worker instance with a Cloud TPU.
custom
CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines (using the --config flag):
  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting.
  • You may set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers. Note that all of your workers must use the same machine type, which can be different from your parameter server type and master type. Your parameter servers must likewise use the same machine type, which can be different from your worker type and master type.
premium-1
Large number of workers with many parameter servers.
standard-1
Many workers and a few parameter servers.
--service-account=SERVICE_ACCOUNT
The email address of a service account to use when running the training appplication. You must have the iam.serviceAccounts.actAs permission for the specified service account. In addition, the AI Platform Training Google-managed service account must have the roles/iam.serviceAccountAdmin role for the specified service account. Learn more about configuring a service account. If not specified, the AI Platform Training Google-managed service account is used by default.
--staging-bucket=STAGING_BUCKET
Bucket in which to stage training archives.

Required only if a file upload is necessary (that is, other flags include local paths) and no other flags implicitly specify an upload path.

--tpu-tf-version=TPU_TF_VERSION
Runtime version of TensorFlow used by the container. This field must be specified if a custom container on the TPU worker is being used.
--use-chief-in-tf-config=USE_CHIEF_IN_TF_CONFIG
Use "chief" role in the cluster instead of "master". This is required for TensorFlow 2.0 and newer versions. Unlike "master" node, "chief" node does not run evaluation.
--worker-accelerator=[count=COUNT],[type=TYPE]
Hardware accelerator config for the worker nodes. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).
type
Type of the accelerator. Choices are nvidia-tesla-a100,nvidia-tesla-k80,nvidia-tesla-p100,nvidia-tesla-p4,nvidia-tesla-t4,nvidia-tesla-v100,tpu-v2,tpu-v2-pod,tpu-v3,tpu-v3-pod,tpu-v4-pod
count
Number of accelerators to attach to each machine running the job. Must be greater than 0.
--worker-image-uri=WORKER_IMAGE_URI
Docker image to run on each worker node. This image must be in Container Registry. If not specified, the value of --master-image-uri is used.
At most one of these can be specified:
--async
(DEPRECATED) Display information about the operation in progress without waiting for the operation to complete. Enabled by default and can be omitted; use --stream-logs to run synchronously.
--stream-logs
Block until job completion and stream the logs while the job runs.

Note that even if command execution is halted, the job will still run until cancelled with

gcloud ai-platform jobs cancel JOB_ID
Key resource - The Cloud KMS (Key Management Service) cryptokey that will be used to protect the job. The 'AI Platform Service Agent' service account must hold permission 'Cloud KMS CryptoKey Encrypter/Decrypter'. The arguments in this group can be used to specify the attributes of this resource.
--kms-key=KMS_KEY
ID of the key or fully qualified identifier for the key.

To set the kms-key attribute:

  • provide the argument --kms-key on the command line.

This flag argument must be specified if any of the other arguments in this group are specified.

--kms-keyring=KMS_KEYRING
The KMS keyring of the key.

To set the kms-keyring attribute:

  • provide the argument --kms-key on the command line with a fully specified name;
  • provide the argument --kms-keyring on the command line.
--kms-location=KMS_LOCATION
The Google Cloud location for the key.

To set the kms-location attribute:

  • provide the argument --kms-key on the command line with a fully specified name;
  • provide the argument --kms-location on the command line.
--kms-project=KMS_PROJECT
The Google Cloud project for the key.

To set the kms-project attribute:

  • provide the argument --kms-key on the command line with a fully specified name;
  • provide the argument --kms-project on the command line;
  • set the property core/project.
Configure parameter server machine type settings.
--parameter-server-count=PARAMETER_SERVER_COUNT
Number of parameter servers to use for the training job.

This flag argument must be specified if any of the other arguments in this group are specified.

--parameter-server-machine-type=PARAMETER_SERVER_MACHINE_TYPE
Type of virtual machine to use for training job's parameter servers. This flag must be specified if any of the other arguments in this group are specified machine to use for training job's parameter servers.

This flag argument must be specified if any of the other arguments in this group are specified.

Configure worker node machine type settings.
--worker-count=WORKER_COUNT
Number of worker nodes to use for the training job.

This flag argument must be specified if any of the other arguments in this group are specified.

--worker-machine-type=WORKER_MACHINE_TYPE
Type of virtual machine to use for training job's worker nodes.

This flag argument must be specified if any of the other arguments in this group are specified.

GCLOUD WIDE FLAGS
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

Run $ gcloud help for details.

NOTES
This command is currently in alpha and might change without notice. If this command fails with API permission errors despite specifying the correct project, you might be trying to access an API with an invitation-only early access allowlist. These variants are also available:
gcloud ml-engine jobs submit training
gcloud beta ml-engine jobs submit training