Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in image classification algorithm works, and how to use it.
Overview
The built-in image classification algorithm uses your training and validation datasets to train models continuously, and then it outputs the most accurate SavedModel generated during the course of the training job. You can also use hyperparameter tuning to achieve the best model accuracy. The exported SavedModel can be used directly for prediction, either locally or deployed to AI Platform Prediction for production service.
Limitations
Image built-in algorithms support training with single CPUs, GPUs or TPUs. The resulting SavedModel is compatible with serving on CPUs and GPUs.
The following features are not supported for training with the built-in image classification algorithm:
- Distributed training. To run a TensorFlow distributed training job on AI Platform Training, you must create a training application.
- Multi-GPU training. Built-in algorithms use only one GPU at a time. To take full advantage of training with multiple GPUs on one machine, you must create a training application. Find more information about machine types.
Supported machine types
The following AI Platform Training scale tiers and machine types are supported:
- BASIC scale tier
- BASIC_TPU scale tier
- CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training.
- CUSTOM scale tier with any of the following legacy machine types:
  - standard
  - large_model
  - complex_model_s
  - complex_model_m
  - complex_model_l
  - standard_gpu
  - standard_p100
  - standard_v100
  - large_model_v100
  - complex_model_m_gpu
  - complex_model_l_gpu
  - complex_model_m_p100
  - complex_model_m_v100
  - complex_model_l_v100
  - TPU_V2 (8 cores)
Authorize your Cloud TPU to access your project
Format input data for training
The built-in image classification algorithm requires your input data to be formatted as tf.Examples, saved in TFRecord file(s). The tf.Example data structure and TFRecord file format are both designed for efficient data reading with TensorFlow.
The TFRecord format is a simple format for storing a sequence of binary records. In this case, all the records contain binary representations of images. Each image, along with its class label(s), is represented as a tf.Example. You can save many tf.Examples to a single TFRecord file. You can also shard a large dataset among multiple TFRecord files.
Learn more about TFRecord and tf.Example.
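To make "a sequence of binary records" concrete, the following is a minimal sketch of the TFRecord framing in plain Python. It assumes the record layout described in the TensorFlow documentation (a 64-bit length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload). The function names are illustrative only; in practice you would write records with tf.io.TFRecordWriter rather than with code like this.

```python
import struct

# Reflected form of the CRC-32C (Castagnoli) polynomial.
_CRC32C_POLY = 0x82F63B78
_MASK_DELTA = 0xA282EAD8


def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC-32C of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant, modulo 2**32."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload (for example, a serialized tf.Example) as a record."""
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc32c(length))
        + payload
        + struct.pack("<I", masked_crc32c(payload))
    )


def read_records(blob: bytes):
    """Yield each payload from a byte string of concatenated records."""
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<Q", blob, offset)
        offset += 12  # 8-byte length plus 4-byte length checksum
        payload = blob[offset:offset + length]
        offset += length
        (checksum,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        if checksum != masked_crc32c(payload):
            raise ValueError("corrupt record payload")
        yield payload


# Two placeholder payloads stored back to back, as in a single TFRecord file.
blob = frame_record(b"first example") + frame_record(b"second example")
payloads = list(read_records(blob))
```

Because each record carries its own length, a reader can stream through a file without loading it whole, which is also why a dataset can be split across several files at any record boundary.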
Convert your images to TFRecords
TensorFlow provides a script you can use to convert your images from JPEG to TFRecord format.
You can use the script if:
- You store the images in Cloud Storage.
- You have CSV files with the paths to the images in Cloud Storage, and their corresponding labels. For example:
  gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
  gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
- You store these CSV files in Cloud Storage.
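If you do not have such CSV files yet, a short script can produce them. The sketch below assumes, as in the flowers example, that each image sits in a directory named after its class; the helper names are hypothetical, and you would still need to upload the resulting file to Cloud Storage yourself.

```python
import csv
import io
import posixpath


def label_from_path(gcs_path: str) -> str:
    """Assumption: the parent directory names the class (.../daisy/x.jpg)."""
    return posixpath.basename(posixpath.dirname(gcs_path))


def build_csv(gcs_paths) -> str:
    """Return CSV text with one 'path,label' row per image."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for path in gcs_paths:
        writer.writerow([path, label_from_path(path)])
    return buffer.getvalue()


image_paths = [
    "gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg",
    "gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg",
]
csv_text = build_csv(image_paths)
print(csv_text, end="")
```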
The following example shows how to run the script:
Download the script:
curl https://raw.githubusercontent.com/tensorflow/tpu/master/tools/datasets/jpeg_to_tf_record.py > ./jpeg_to_tf_record.py
Set variables for your project ID and bucket name, if you have not already done so:
PROJECT_ID="YOUR_PROJECT_ID"
BUCKET_NAME="YOUR_BUCKET_NAME"
Create a list of all the possible labels for your dataset in a temporary file:
cat << EOF > /tmp/labels.txt
daisy
dandelion
roses
sunflowers
tulips
EOF
Run the script using flowers data from the public cloud-ml-data bucket and your list of labels:
python jpeg_to_tf_record.py \
  --train_csv gs://cloud-ml-data/img/flower_photos/train_set.csv \
  --validation_csv gs://cloud-ml-data/img/flower_photos/eval_set.csv \
  --labels_file /tmp/labels.txt \
  --project_id $PROJECT_ID \
  --output_dir gs://$BUCKET_NAME/flowers_as_tf_record
Check Cloud Storage bucket permissions
To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.
Required input format
To train with the built-in image classification algorithm, your image data must be structured as tf.Examples that include the following fields:
- image/encoded is the raw image string.
- image/class/label is a single integer label for the corresponding image. Multiple labels per instance are not supported.
  The set of integer labels used for your dataset must be a consecutive sequence starting at 1. For example, if your dataset has five classes, then each label must be an integer in the interval [1, 5].
For example:
{
'image/encoded': '<encoded image data>',
'image/class/label': 2
}
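The label requirement above is easy to check before you convert your data. The following sketch uses two hypothetical helpers: one assigns integer labels starting at 1 to a list of class names, and the other verifies that a set of labels forms a consecutive sequence from 1 to the number of classes. Sorting the class names is an arbitrary choice for the illustration, not a rule of the algorithm.

```python
def build_label_map(class_names):
    """Assign consecutive integer labels, starting at 1, in sorted name order."""
    return {name: index for index, name in enumerate(sorted(class_names), start=1)}


def validate_labels(labels, num_classes):
    """Raise ValueError unless the labels used are exactly 1..num_classes."""
    seen = set()
    for label in labels:
        if isinstance(label, bool) or not isinstance(label, int):
            raise ValueError(f"label {label!r} is not an integer")
        if not 1 <= label <= num_classes:
            raise ValueError(f"label {label} is outside [1, {num_classes}]")
        seen.add(label)
    missing = sorted(set(range(1, num_classes + 1)) - seen)
    if missing:
        raise ValueError(f"labels never used: {missing}")


label_map = build_label_map(["daisy", "dandelion", "roses", "sunflowers", "tulips"])
validate_labels(label_map.values(), num_classes=5)
```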
Getting the best SavedModel as output
When the training job completes, AI Platform Training writes a TensorFlow SavedModel to the Cloud Storage bucket you specified as jobDir when you submitted the job. The SavedModel is written to jobDir/model. For example, if you submit the job to gs://your-bucket-name/your-job-dir, then AI Platform Training writes the SavedModel to gs://your-bucket-name/your-job-dir/model.
If you enabled hyperparameter tuning, AI Platform Training returns the TensorFlow SavedModel with the highest accuracy achieved during the training process. For example, if you submitted a training job with 2,500 training steps, and the accuracy was highest at 2,000 steps, you get a TensorFlow SavedModel saved from that particular point.
Each trial of AI Platform Training writes the TensorFlow SavedModel with the highest accuracy to its own directory within your Cloud Storage bucket. For example, gs://your-bucket-name/your-job-dir/model/trial_{trial_id}.
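The output locations described above follow a simple pattern, which a small helper can capture. This is only a sketch of the path convention stated on this page; the function name is hypothetical.

```python
def saved_model_dir(job_dir, trial_id=None):
    """Return where the SavedModel is expected, per the convention above."""
    base = job_dir.rstrip("/") + "/model"
    if trial_id is None:
        return base
    # With hyperparameter tuning, each trial gets its own subdirectory.
    return f"{base}/trial_{trial_id}"


single_run = saved_model_dir("gs://your-bucket-name/your-job-dir")
third_trial = saved_model_dir("gs://your-bucket-name/your-job-dir/", trial_id=3)
```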
The output SavedModel signature is:
signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['image_bytes'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder:0
    inputs['key'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: key:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['key'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Identity:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict
Inputs:
- image_bytes: The raw (not decoded) image bytes. This is the same as image/encoded stored in tf.Example.
- key: The string value identifier of prediction input. This value is passed through to the output key. In batch prediction, this helps to map the prediction output to the input.
Outputs:
- classes: The predicted class (integer) label, which is the one with the highest probability.
- key: The output key.
- probabilities: The probability (between 0 and 1) for each class (ranging from 0 to num_classes).
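The relationship between the classes and probabilities outputs can be illustrated in a few lines. The sketch assumes, based on the ArgMax tensor name in the signature, that the predicted class is the index of the largest probability; the helper name is hypothetical and the probability values are sample numbers.

```python
def predicted_class(probabilities):
    """Return the index of the largest probability in one prediction row."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


row = [0.1, 0.3, 0.6]
best = predicted_class(row)
confidence = row[best]
```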
The following is an example of prediction inputs and outputs:
prediction_input: {
  'image_bytes': 'some_raw_image_bytes',
  'key': ['test_key']
}
prediction_output: {
  'probabilities': [[0.1, 0.3, 0.6]],
  'classes': [2],
  'key': ['test_key'],
}
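When you send a request as JSON, raw image bytes cannot be embedded directly. The sketch below assumes the AI Platform Prediction JSON convention of wrapping base64-encoded binary data in a {"b64": ...} object; the helper name and the placeholder bytes are hypothetical, and you should confirm the request format against the prediction documentation for your deployment.

```python
import base64
import json


def build_instance(image_bytes: bytes, key: str) -> dict:
    """Build one prediction instance with the signature's two inputs."""
    return {
        "image_bytes": {"b64": base64.b64encode(image_bytes).decode("ascii")},
        "key": key,
    }


# Placeholder bytes; in practice, read them from a JPEG file opened in "rb" mode.
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"
request_line = json.dumps(build_instance(fake_image, "test_key"))
```

Each call produces one JSON line, and the key you choose comes back unchanged in the output so that you can match predictions to inputs.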
Example configurations
If you submit a job using gcloud, you need to create a config.yaml file for your machine type and hyperparameter tuning specifications. If you use the Google Cloud console, you don't need to create this file. Learn how to submit a training job.
The following example config.yaml file shows how to allocate TPU resources for your training job:
cat << EOF > config.yaml
trainingInput:
  # Use a custom cluster with one master and one Cloud TPU worker.
  scaleTier: CUSTOM
  masterType: n1-highmem-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
  workerType: cloud_tpu
  workerConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
EOF
Next, use your config.yaml file to submit a training job.
Hyperparameter tuning configuration
To use hyperparameter tuning, include your hyperparameter tuning configuration in the same config.yaml file as your machine configuration.
You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in image classification algorithm.
The following example config.yaml file shows how to allocate TPU resources for your training job, and includes hyperparameter tuning configuration:
cat << EOF > config.yaml
trainingInput:
  # Use a custom cluster with one master and one Cloud TPU worker.
  scaleTier: CUSTOM
  masterType: n1-highmem-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
  workerType: cloud_tpu
  workerConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
    tpuTfVersion: 1.14
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
  # The following are hyperparameter configs.
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: top_1_accuracy
    maxTrials: 6
    maxParallelTrials: 3
    enableTrialEarlyStopping: True
    params:
    - parameterName: initial_learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.2
      scaleType: UNIT_LOG_SCALE
EOF
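The UNIT_LOG_SCALE setting asks the tuning service to treat the learning rate range logarithmically, so that small values such as 0.001 and 0.01 are explored as thoroughly as values near 0.2. The sketch below illustrates logarithmic scaling in general, using the minValue and maxValue from the example; it is not the service's actual implementation, and the function names are hypothetical.

```python
import math


def to_unit_log_scale(value, min_value, max_value):
    """Map a strictly positive value in [min_value, max_value] onto [0, 1]."""
    span = math.log(max_value) - math.log(min_value)
    return (math.log(value) - math.log(min_value)) / span


def from_unit_log_scale(unit, min_value, max_value):
    """Inverse mapping: a point in [0, 1] back to the original range."""
    return min_value * (max_value / min_value) ** unit


MIN_LR, MAX_LR = 0.001, 0.2

# The midpoint of the unit interval is the geometric mean of the bounds,
# roughly 0.014, rather than the arithmetic mean of roughly 0.1.
midpoint_lr = from_unit_log_scale(0.5, MIN_LR, MAX_LR)
```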
Submit an image classification training job
This section explains how to submit a training job using the built-in image classification algorithm.
Console
Select your algorithm
Go to the AI Platform Training Jobs page in the Google Cloud console.
Click the New training job button. From the options that display below, click Built-in algorithm training.
On the Create a new training job page, select image classification and click Next.
Select your training and validation data
In the drop-down box under Training data, specify whether you are using a single file or multiple files:
- For a single file, leave "Use single file in a GCS bucket" selected.
- For multiple files, select "Use multiple files stored in one Cloud Storage directory".
For Directory path, click Browse. In the right panel, click the name of the bucket where you uploaded the training data, and navigate to your file.
If you're selecting multiple files, place your wildcard characters in Wildcard name. The "Complete GCS path" displays below to help you confirm that the path is correct.
In the drop-down box under Validation data, specify whether you are using a single file or multiple files:
- For a single file, leave "Use single file in a GCS bucket" selected.
- For multiple files, select "Use multiple files stored in one Cloud Storage directory".
For Directory path, click Browse. In the right panel, click the name of the bucket where you uploaded the validation data, and navigate to your file.
If you're selecting multiple files, place your wildcard characters in Wildcard name. The "Complete GCS path" displays below to help you confirm that the path is correct.
In Output directory, enter the path to the Cloud Storage bucket where you want AI Platform Training to store the outputs from your training job. You can fill in your Cloud Storage bucket path directly, or click the Browse button to select it.
To keep things organized, create a new directory within your Cloud Storage bucket for this training job. You can do this within the Browse pane.
Click Next.
Set the algorithm arguments
Each algorithm-specific argument displays a default value for training jobs without hyperparameter tuning. If you enable hyperparameter tuning on an algorithm argument, you must specify its minimum and maximum value.
To learn more about all the algorithm arguments, follow the links in the Google Cloud console and refer to the built-in image classification reference for more details.
Submit the job
On the Job settings tab:
- Enter a unique Job ID.
- Enter an available region (such as "us-central1").
- To select machine types, select "CUSTOM" for the scale tier. A section to provide your Custom cluster specification displays.
  - Select an available machine type for Master type.
  - If you want to use TPUs, set the Worker type to cloud_tpu. The worker count defaults to 1.
Click Done to submit the training job.
gcloud
Set environment variables for your job:
PROJECT_ID="YOUR_PROJECT_ID"
BUCKET_NAME="YOUR_BUCKET_NAME"

# Specify the same region where your data is stored
REGION="YOUR_REGION"

gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION

# Set Cloud Storage paths to your training and validation data
# Include a wildcard if you select multiple files.
TRAINING_DATA_PATH="gs://${BUCKET_NAME}/YOUR_DATA_DIRECTORY/train-*.tfrecord"
VALIDATION_DATA_PATH="gs://${BUCKET_NAME}/YOUR_DATA_DIRECTORY/eval-*.tfrecord"

# Specify the Docker container for your built-in algorithm selection
IMAGE_URI="gcr.io/cloud-ml-algos/image_classification:latest"

# Variables for constructing descriptive names for JOB_ID and JOB_DIR
DATASET_NAME="flowers"
ALGORITHM="image_classification"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}"
DATE="$(date '+%Y%m%d_%H%M%S')"

# Specify an ID for this job
JOB_ID="${MODEL_NAME}_${DATE}"

# Specify the directory where you want your training outputs to be stored
JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${JOB_ID}"
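If you script job submission from Python rather than from a shell, the same naming scheme can be reproduced as follows. This is a sketch that mirrors the shell variables above; the function name and its parameters are hypothetical.

```python
from datetime import datetime


def build_job_names(bucket_name, dataset_name="flowers",
                    algorithm="image_classification", now=None):
    """Return (job_id, job_dir) using the same pattern as the shell variables."""
    now = now or datetime.now()
    model_name = f"{dataset_name}_{algorithm}"
    # The timestamp suffix keeps each job ID unique across submissions.
    job_id = f"{model_name}_{now.strftime('%Y%m%d_%H%M%S')}"
    job_dir = f"gs://{bucket_name}/algorithm_training/{job_id}"
    return job_id, job_dir


job_id, job_dir = build_job_names("YOUR_BUCKET_NAME", now=datetime(2020, 1, 2, 3, 4, 5))
```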
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
  --region=$REGION \
  --config=config.yaml \
  --master-image-uri=$IMAGE_URI \
  -- \
  --training_data_path=$TRAINING_DATA_PATH \
  --validation_data_path=$VALIDATION_DATA_PATH \
  --job-dir=$JOB_DIR \
  --max_steps=30000 \
  --train_batch_size=128 \
  --num_classes=5 \
  --num_eval_images=100 \
  --initial_learning_rate=0.128 \
  --warmup_steps=1000 \
  --model_type='efficientnet-b4'
After the job is submitted successfully, you can view the logs using the following gcloud commands:
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
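The submit command separates gcloud flags from algorithm arguments with a bare "--". The following sketch assembles the same command as an argument list in Python, using only the flags shown above. The function name is hypothetical, and the sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_submit_command(job_id, region, image_uri, job_dir,
                         training_data_path, validation_data_path,
                         algorithm_args):
    """Return the gcloud submit command as a list of arguments."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_id,
        f"--region={region}",
        "--config=config.yaml",
        f"--master-image-uri={image_uri}",
        "--",  # everything after this is passed to the algorithm
        f"--training_data_path={training_data_path}",
        f"--validation_data_path={validation_data_path}",
        f"--job-dir={job_dir}",
    ]
    command += [f"--{name}={value}" for name, value in algorithm_args.items()]
    return command


command = build_submit_command(
    job_id="flowers_image_classification_20200102_030405",
    region="us-central1",
    image_uri="gcr.io/cloud-ml-algos/image_classification:latest",
    job_dir="gs://YOUR_BUCKET_NAME/algorithm_training/flowers_job",
    training_data_path="gs://YOUR_BUCKET_NAME/YOUR_DATA_DIRECTORY/train-*.tfrecord",
    validation_data_path="gs://YOUR_BUCKET_NAME/YOUR_DATA_DIRECTORY/eval-*.tfrecord",
    algorithm_args={
        "max_steps": 30000,
        "train_batch_size": 128,
        "num_classes": 5,
        "num_eval_images": 100,
        "initial_learning_rate": 0.128,
        "warmup_steps": 1000,
        "model_type": "efficientnet-b4",
    },
)
# shlex.join quotes the wildcard paths so that a shell would not expand them.
print(shlex.join(command))
```

Passing the list to a process launcher without a shell avoids quoting problems with the wildcard characters in the data paths.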
What's next
- Refer to the built-in image classification reference to learn about all the different parameters.