Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in image object detection algorithm works, and how to use it.
Overview
The built-in image object detection algorithm uses your training and validation datasets to train models continuously, and then it outputs the most accurate SavedModel generated during the course of the training job. You can also use hyperparameter tuning to achieve the best model accuracy. The exported SavedModel can be used directly for prediction, either locally or deployed to AI Platform Prediction for production serving.
Limitations
The built-in image algorithms support training with a single CPU, GPU, or TPU. The resulting SavedModel is compatible with serving on CPUs and GPUs.
The following features are not supported for training with the built-in image object detection algorithm:
- Distributed training. To run a TensorFlow distributed training job on AI Platform Training, you must create a training application.
- Multi-GPU training. Built-in algorithms use only one GPU at a time. To take full advantage of training with multiple GPUs on one machine, you must create a training application. For more information, see the supported machine types below.
Supported machine types
The following AI Platform Training scale tiers and machine types are supported:

- BASIC scale tier
- BASIC_TPU scale tier
- CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training
- CUSTOM scale tier with any of the following legacy machine types:
  - standard
  - large_model
  - complex_model_s
  - complex_model_m
  - complex_model_l
  - standard_gpu
  - standard_p100
  - standard_v100
  - large_model_v100
  - complex_model_m_gpu
  - complex_model_l_gpu
  - complex_model_m_p100
  - complex_model_m_v100
  - complex_model_l_v100
  - TPU_V2 (8 cores)
Authorize your Cloud TPU to access your project
Follow these steps to authorize the Cloud TPU service account name associated with your Google Cloud project:
1. Get your Cloud TPU service account name by calling `projects.getConfig`. Example:

   ```sh
   PROJECT_ID=PROJECT_ID
   curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
       https://ml.googleapis.com/v1/projects/$PROJECT_ID:getConfig
   ```

2. Save the value of the `serviceAccountProject` and `tpuServiceAccount` fields returned by the API.

3. Initialize the Cloud TPU service account:

   ```sh
   curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
       -H "Content-Type: application/json" -d '{}' \
       https://serviceusage.googleapis.com/v1beta1/projects/<serviceAccountProject>/services/tpu.googleapis.com:generateServiceIdentity
   ```
Now add the Cloud TPU service account as a member in your project, with the role Cloud ML Service Agent. Complete the following steps in the Google Cloud console or using the `gcloud` command:
Console
- Log in to the Google Cloud console and choose the project in which you're using the TPU.
- Choose IAM & Admin > IAM.
- Click the Add button to add a member to the project.
- Enter the TPU service account in the Members text box.
- Click the Roles dropdown list.
- Enable the Cloud ML Service Agent role (Service Agents > Cloud ML Service Agent).
gcloud
Set environment variables containing your project ID and the Cloud TPU service account:

```sh
PROJECT_ID=PROJECT_ID
SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com
```

Grant the `ml.serviceAgent` role to the Cloud TPU service account:

```sh
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
```
For more details about granting roles to service accounts, see the IAM documentation.
Format input data for training
The built-in image object detection algorithm requires your input data to be formatted as `tf.Example`s, saved in TFRecord file(s). The `tf.Example` data structure and TFRecord file format are both designed for efficient data reading with TensorFlow.

The TFRecord format is a simple format for storing a sequence of binary records. In this case, all the records contain binary representations of images. Each image, along with its bounding boxes and class labels, is represented as a `tf.Example`. You can save many `tf.Example`s to a single TFRecord file. You can also shard a large dataset among multiple TFRecord files.

Learn more about TFRecord and `tf.Example`.
Convert your images to TFRecords
To convert your images to the format required for training, follow the TensorFlow Model Garden's guide to preparing inputs for object detection.
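Whichever conversion path you take, it can help to spot-check the generated files before you submit a training job. The following Python sketch (the file name is a placeholder) decodes the first record of a TFRecord file and prints a few of the fields the algorithm requires, as listed under Required input format below:

```python
# Sketch: spot-check a TFRecord file by decoding its first record and
# printing some of the required fields. The file name is a placeholder.
import tensorflow as tf

dataset = tf.data.TFRecordDataset('train-00000-of-00001.tfrecord')
for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    features = example.features.feature
    print(features['image/object/class/label'].int64_list.value)
    print(features['image/object/bbox/xmin'].float_list.value)
    print(features['image/object/bbox/ymin'].float_list.value)
```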
Check Cloud Storage bucket permissions
To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
Required input format
To train with the built-in image object detection algorithm, your image data must be structured as `tf.Example`s that include the following fields:

- `image/encoded` is the raw image encoded as a string.
- `image/object/class/label` is a list of integer labels for the corresponding image (one label per box).

  The set of integer labels used for your dataset must be a consecutive sequence starting at `1`. For example, if your dataset has five classes, then each label must be an integer in the interval `[1, 5]`.
- `image/object/bbox/xmin` is a list of normalized left x coordinates for the corresponding image (one coordinate per box). Each coordinate must be in the interval `[0, 1]`.
- `image/object/bbox/xmax` is a list of normalized right x coordinates for the corresponding image (one coordinate per box). Each coordinate must be in the interval `[0, 1]`.
- `image/object/bbox/ymin` is a list of normalized top y coordinates for the corresponding image (one coordinate per box). Each coordinate must be in the interval `[0, 1]`.
- `image/object/bbox/ymax` is a list of normalized bottom y coordinates for the corresponding image (one coordinate per box). Each coordinate must be in the interval `[0, 1]`.
The following example shows the structure of a `tf.Example` for an image containing two bounding boxes. The first box has the label `1`, its top-left corner is at the normalized coordinates `(0.1, 0.4)`, and its bottom-right corner is at the normalized coordinates `(0.5, 0.8)`. The second box has the label `2`, its top-left corner is at the normalized coordinates `(0.3, 0.5)`, and its bottom-right corner is at the normalized coordinates `(0.4, 0.7)`.

```
{
    'image/encoded': '<encoded image data>',
    'image/object/class/label': [1, 2],
    'image/object/bbox/xmin': [0.1, 0.3],
    'image/object/bbox/xmax': [0.5, 0.4],
    'image/object/bbox/ymin': [0.4, 0.5],
    'image/object/bbox/ymax': [0.8, 0.7]
}
```
This `tf.Example` format follows the same one used in the TFRecord object detection script.
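For illustration only (this is not the Model Garden conversion script), the following Python sketch builds the `tf.Example` shown above and writes it to a TFRecord file. The image path, labels, coordinates, and output file name are placeholders:

```python
# Sketch: build the two-box tf.Example from the example above and write
# it to a TFRecord file. Paths, labels, and coordinates are placeholders.
import tensorflow as tf

def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# Raw (encoded) image bytes, for example a JPEG file read from disk.
with tf.io.gfile.GFile('path/to/image.jpg', 'rb') as f:
    encoded_image = f.read()

example = tf.train.Example(features=tf.train.Features(feature={
    'image/encoded': bytes_feature([encoded_image]),
    'image/object/class/label': int64_feature([1, 2]),
    'image/object/bbox/xmin': float_feature([0.1, 0.3]),
    'image/object/bbox/xmax': float_feature([0.5, 0.4]),
    'image/object/bbox/ymin': float_feature([0.4, 0.5]),
    'image/object/bbox/ymax': float_feature([0.8, 0.7]),
}))

with tf.io.TFRecordWriter('train-00000-of-00001.tfrecord') as writer:
    writer.write(example.SerializeToString())
```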
Getting the best SavedModel as output
When the training job completes, AI Platform Training writes a TensorFlow SavedModel to the Cloud Storage bucket you specified as `jobDir` when you submitted the job. The SavedModel is written to `jobDir/model`. For example, if you submit the job to `gs://your-bucket-name/your-job-dir`, then AI Platform Training writes the SavedModel to `gs://your-bucket-name/your-job-dir/model`.
If you enabled hyperparameter tuning, AI Platform Training returns the TensorFlow SavedModel with the highest accuracy achieved during the training process. For example, if you submitted a training job with 2,500 training steps, and the accuracy was highest at 2,000 steps, you get a TensorFlow SavedModel saved from that particular point.
Each trial of AI Platform Training writes the TensorFlow SavedModel with the highest accuracy to its own directory within your Cloud Storage bucket. For example, `gs://your-bucket-name/your-job-dir/model/trial_{trial_id}`.
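To compare trials yourself, you can enumerate the per-trial model directories. A minimal sketch, assuming the `jobDir` layout described above and a TensorFlow build with Cloud Storage support (the bucket path is a placeholder):

```python
# Sketch: list the per-trial SavedModel directories under the job's
# output path. The bucket path is a placeholder.
import tensorflow as tf

job_dir = 'gs://your-bucket-name/your-job-dir'
for trial_dir in sorted(tf.io.gfile.glob(job_dir + '/model/trial_*')):
    print(trial_dir)
```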
The signature of the output SavedModel is:
```
signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['encoded_image'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: encoded_image_string_tensor:0
    inputs['key'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: key:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['detection_boxes'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 100, 4)
        name: detection_boxes:0
    outputs['detection_classes'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 100)
        name: detection_classes:0
    outputs['detection_scores'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 100)
        name: detection_scores:0
    outputs['key'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Identity:0
    outputs['num_detections'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: num_detections:0
  Method name is: tensorflow/serving/predict
```
Inputs:

- `encoded_image`: The raw (not decoded) image bytes. This is the same as `image/encoded` stored in the `tf.Example`.
- `key`: The string value identifier of the prediction input. This value is passed through to the output `key`. In batch prediction, this helps to map the prediction output to the input.

Outputs:

- `num_detections`: The number of detected bounding boxes.
- `detection_boxes`: A list of relative (values in `[0, 1]`) coordinates (`[ymin, xmin, ymax, xmax]`) of the detection bounding boxes.
- `detection_classes`: A list of predicted class (integer) labels for each detection box in `detection_boxes`.
- `detection_scores`: A list of scores for each detection box in `detection_boxes`.
- `key`: The output key.
The following is an example of prediction outputs:
```
{u'detection_classes': [1.0, 3.0, 3.0, ...],
 u'key': u'test_key',
 u'num_detections': 100.0,
 u'detection_scores': [0.24401935935020447, 0.19375669956207275, 0.18359294533729553, ...]}
```
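To sanity-check the exported model before deploying it, you can load the SavedModel locally and call its serving signature directly. The following sketch uses the TF 2.x loading API and assumes the signature shown above; the model directory and image path are placeholders:

```python
# Sketch: run one image through the exported SavedModel locally.
# The model directory and image path are placeholders.
import tensorflow as tf

model = tf.saved_model.load('gs://your-bucket-name/your-job-dir/model')
predict_fn = model.signatures['serving_default']

# The signature takes raw (encoded) image bytes plus a passthrough key.
with tf.io.gfile.GFile('path/to/image.jpg', 'rb') as f:
    image_bytes = f.read()

outputs = predict_fn(
    encoded_image=tf.constant([image_bytes]),
    key=tf.constant(['test_key']))

# Keep only the valid detections for the first (and only) image.
n = int(outputs['num_detections'][0])
boxes = outputs['detection_boxes'][0][:n]    # [ymin, xmin, ymax, xmax], normalized
classes = outputs['detection_classes'][0][:n]
scores = outputs['detection_scores'][0][:n]
print(classes.numpy(), scores.numpy())
```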
Example configurations
If you submit a job using `gcloud`, you need to create a `config.yaml` file for your machine type and hyperparameter tuning specifications. If you use the Google Cloud console, you don't need to create this file. Learn how to submit a training job.

The following example `config.yaml` file shows how to allocate TPU resources for your training job:
```sh
cat << EOF > config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_object_detection:latest
    acceleratorConfig:
      type: NVIDIA_TESLA_P100
      count: 1
  workerType: cloud_tpu
  workerConfig:
    imageUri: gcr.io/cloud-ml-algos/image_object_detection:latest
    tpuTfVersion: 1.14
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
EOF
```
Next, use your `config.yaml` file to submit a training job.
Hyperparameter tuning configuration
To use hyperparameter tuning, include your hyperparameter tuning configuration in the same `config.yaml` file as your machine configuration.

You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in image object detection algorithm.

The following example `config.yaml` file shows how to allocate TPU resources for your training job, and includes hyperparameter tuning configuration:
```sh
cat << EOF > config.yaml
trainingInput:
  # Use one master VM with a GPU, plus a single Cloud TPU worker.
  scaleTier: CUSTOM
  masterType: n1-standard-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_object_detection:latest
    acceleratorConfig:
      type: NVIDIA_TESLA_P100
      count: 1
  workerType: cloud_tpu
  workerConfig:
    imageUri: gcr.io/cloud-ml-algos/image_object_detection:latest
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
  # The following are hyperparameter configs.
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: "AP"
    maxTrials: 6
    maxParallelTrials: 3
    enableTrialEarlyStopping: True
    params:
    - parameterName: initial_learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
EOF
```
Submit an image object detection training job
This section explains how to submit a training job using the built-in image object detection algorithm.
Console
Select your algorithm
Go to the AI Platform Training Jobs page in the Google Cloud console:
Click the New training job button. From the options that display below, click Built-in algorithm training.
On the Create a new training job page, select image object detection and click Next.
Select your training and validation data
In the drop-down box under Training data, specify whether you are using a single file or multiple files:
- For a single file, leave "Use single file in a GCS bucket" selected.
- For multiple files, select "Use multiple files stored in one Cloud Storage directory".
For Directory path, click Browse. In the right panel, click the name of the bucket where you uploaded the training data, and navigate to your file.
If you're selecting multiple files, place your wildcard characters in Wildcard name. The "Complete GCS path" displays below to help you confirm that the path is correct.
In the drop-down box under Validation data, specify whether you are using a single file or multiple files:
- For a single file, leave "Use single file in a GCS bucket" selected.
- For multiple files, select "Use multiple files stored in one Cloud Storage directory".
For Directory path, click Browse. In the right panel, click the name of the bucket where you uploaded the validation data, and navigate to your file.
If you're selecting multiple files, place your wildcard characters in Wildcard name. The "Complete GCS path" displays below to help you confirm that the path is correct.
In Output directory, enter the path to the Cloud Storage bucket where you want AI Platform Training to store the outputs from your training job. You can fill in your Cloud Storage bucket path directly, or click the Browse button to select it.
To keep things organized, create a new directory within your Cloud Storage bucket for this training job. You can do this within the Browse pane.
Click Next.
Set the algorithm arguments
Each algorithm-specific argument displays a default value for training jobs without hyperparameter tuning. If you enable hyperparameter tuning on an algorithm argument, you must specify its minimum and maximum value.
To learn more about all the algorithm arguments, follow the links in the Google Cloud console and refer to the built-in image object detection reference for more details.
Submit the job
On the Job settings tab:
- Enter a unique Job ID.
- Enter an available region (such as "us-central1").
- To select machine types, select "CUSTOM" for the scale tier. A section to provide your Custom cluster specification displays.
- Select an available machine type for Master type.
- If you want to use TPUs, set the Worker type to cloud_tpu. The worker count defaults to 1.
Click Done to submit the training job.
gcloud
Set environment variables for your job:
PROJECT_ID="YOUR_PROJECT_ID" BUCKET_NAME="YOUR_BUCKET_NAME" # Specify the same region where your data is stored REGION="YOUR_REGION" gcloud config set project $PROJECT_ID gcloud config set compute/region $REGION # Set Cloud Storage paths to your training and validation data # Include a wildcard if you select multiple files. TRAINING_DATA_PATH="gs://${BUCKET_NAME}/YOUR_DATA_DIRECTORY/train-*.tfrecord" VALIDATION_DATA_PATH="gs://${BUCKET_NAME}/YOUR_DATA_DIRECTORY/eval-*.tfrecord" # Specify the Docker container for your built-in algorithm selection IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest" # Variables for constructing descriptive names for JOB_ID and JOB_DIR DATASET_NAME="coco" ALGORITHM="object_detection" MODEL_NAME="${DATASET_NAME}_${ALGORITHM}" DATE="$(date '+%Y%m%d_%H%M%S')" # Specify an ID for this job JOB_ID="${MODEL_NAME}_${DATE}" # Specify the directory where you want your training outputs to be stored JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${JOB_ID}"
Submit the job:
```sh
gcloud ai-platform jobs submit training $JOB_ID \
  --region=$REGION \
  --config=config.yaml \
  --job-dir=$JOB_DIR \
  -- \
  --training_data_path=$TRAINING_DATA_PATH \
  --validation_data_path=$VALIDATION_DATA_PATH \
  --train_batch_size=64 \
  --num_eval_images=500 \
  --train_steps_per_eval=2000 \
  --max_steps=22500 \
  --num_classes=90 \
  --warmup_steps=500 \
  --initial_learning_rate=0.08 \
  --fpn_type="nasfpn" \
  --aug_scale_min=0.8 \
  --aug_scale_max=1.2
```
After the job is submitted successfully, you can view the logs using the following `gcloud` commands:

```sh
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
```
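If you'd rather monitor the job programmatically, the same job state is available through the AI Platform Training REST API. The following is a sketch using the google-api-python-client library; the project and job IDs are placeholders:

```python
# Sketch: poll a training job's state through the AI Platform REST API.
# Requires google-api-python-client and application default credentials.
import time

from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
job_name = 'projects/YOUR_PROJECT_ID/jobs/YOUR_JOB_ID'

while True:
    job = ml.projects().jobs().get(name=job_name).execute()
    print(job['state'])
    if job['state'] in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(60)
```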
What's next
- Learn how to use your own dataset with the TFRecord object detection script.
- Refer to the built-in image object detection reference to learn about all the different parameters.