Training using the built-in image classification algorithm

Training with built-in algorithms on AI Platform Training allows you to submit your dataset and train a model without writing any training code. This page explains how the built-in image classification algorithm works, and how to use it.

Overview

The built-in image classification algorithm uses your training and validation datasets to train models continuously, and then it outputs the most accurate SavedModel generated during the course of the training job. You can also use hyperparameter tuning to achieve the best model accuracy. The exported SavedModel can be used directly for prediction, either locally or deployed to AI Platform Prediction for production service.

Limitations

Image built-in algorithms support training with single CPUs, GPUs or TPUs. The resulting SavedModel is compatible with serving on CPUs and GPUs.

The following features are not supported for training with the built-in image classification algorithm:

Supported machine types

The following AI Platform Training scale tiers and machine types are supported:

  • BASIC scale tier
  • BASIC_TPU scale tier
  • CUSTOM scale tier with any of the Compute Engine machine types supported by AI Platform Training.
  • CUSTOM scale tier with any of the following legacy machine types:
    • standard
    • large_model
    • complex_model_s
    • complex_model_m
    • complex_model_l
    • standard_gpu
    • standard_p100
    • standard_v100
    • large_model_v100
    • complex_model_m_gpu
    • complex_model_l_gpu
    • complex_model_m_p100
    • complex_model_m_v100
    • complex_model_l_v100
    • TPU_V2 (8 cores)

Authorize your Cloud TPU to access your project

Format input data for training

The built-in image classification algorithm requires your input data to be formatted as tf.Examples, saved in TFRecord file(s). The tf.Example data structure and TFRecord file format are both designed for efficient data reading with TensorFlow.

The TFRecord format is a simple format for storing a sequence of binary records. In this case, all the records contain binary representations of images. Each image, along with its class label(s), is represented as a tf.Example. You can save many tf.Examples to a single TFRecord file. You can also shard a large dataset among multiple TFRecord files.

Learn more about TFRecord and tf.Example.
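
To make "a sequence of binary records" concrete, the following is a minimal sketch of how a TFRecord file frames each record, based on the layout described in the TensorFlow documentation: a little-endian 64-bit length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload. The helper names (`write_record`, `read_records`) are hypothetical, and the payloads here are placeholder bytes rather than real serialized tf.Examples. In practice you would use TensorFlow's own TFRecord writer instead of framing records by hand.

```python
import io
import struct

_MASK_DELTA = 0xA282EAD8  # constant used by the masked CRC in TFRecord files


def _build_crc32c_table():
    """Precompute the byte-wise lookup table for CRC-32C (Castagnoli)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data: bytes) -> int:
    """Compute the CRC-32C checksum of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add the mask constant."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def write_record(stream, payload: bytes) -> None:
    """Append one framed record to a binary stream."""
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc32c(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc32c(payload)))


def read_records(stream):
    """Yield each payload from a binary stream, verifying both checksums."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        (header_crc,) = struct.unpack("<I", stream.read(4))
        if header_crc != masked_crc32c(header):
            raise ValueError("corrupt record length")
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if payload_crc != masked_crc32c(payload):
            raise ValueError("corrupt record payload")
        yield payload


# Placeholder payloads stand in for serialized tf.Examples.
buffer = io.BytesIO()
for fake_example in (b"serialized-example-1", b"serialized-example-2"):
    write_record(buffer, fake_example)
buffer.seek(0)
records = list(read_records(buffer))
```

Each record adds 16 bytes of framing to its payload, which is why many small images can be packed into one file and read back sequentially.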

Convert your images to TFRecords

TensorFlow provides a script you can use to convert your images from JPEG to TFRecord format.

You can use the script if:

  • You store the images in Cloud Storage.
  • You have CSV files with the paths to the images in Cloud Storage, and their corresponding labels. For example:

    gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
    gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
    
  • You store these CSV files in Cloud Storage.
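
If you need to generate such a CSV file yourself, a minimal sketch might look like the following. It assumes one `path,label` pair per line with no header row, as in the example above. The function name `write_image_label_csv` is hypothetical, and the sketch writes to an in-memory buffer rather than to Cloud Storage.

```python
import csv
import io


def write_image_label_csv(stream, rows):
    """Write (image_uri, label) pairs, one per line, with no header row."""
    writer = csv.writer(stream, lineterminator="\n")
    for image_uri, label in rows:
        # The conversion script expects paths to images in Cloud Storage.
        if not image_uri.startswith("gs://"):
            raise ValueError(f"expected a Cloud Storage URI, got {image_uri!r}")
        writer.writerow([image_uri, label])


rows = [
    ("gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg", "daisy"),
    ("gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg", "dandelion"),
]
out = io.StringIO()
write_image_label_csv(out, rows)
csv_text = out.getvalue()
print(csv_text, end="")
```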

The following example shows how to run the script:

  1. Download the script:

    curl https://raw.githubusercontent.com/tensorflow/tpu/master/tools/datasets/jpeg_to_tf_record.py > ./jpeg_to_tf_record.py
    
  2. Set variables for your project ID and bucket name, if you have not already done so:

    PROJECT_ID="YOUR_PROJECT_ID"
    BUCKET_NAME="YOUR_BUCKET_NAME"
    
  3. Create a list of all the possible labels for your dataset in a temporary file:

    cat << EOF > /tmp/labels.txt
    daisy
    dandelion
    roses
    sunflowers
    tulips
    EOF
    
  4. Run the script using flowers data from the public cloud-ml-data bucket and your list of labels:

    python jpeg_to_tf_record.py \
           --train_csv gs://cloud-ml-data/img/flower_photos/train_set.csv \
           --validation_csv gs://cloud-ml-data/img/flower_photos/eval_set.csv \
           --labels_file /tmp/labels.txt \
           --project_id $PROJECT_ID \
           --output_dir gs://$BUCKET_NAME/flowers_as_tf_record
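
Before you run the conversion, it can help to confirm that every label in your CSV files also appears in your labels file. The following is a minimal sketch of such a check. The helper names (`parse_labels`, `check_labels`) are hypothetical, and the sample inputs are in-memory strings standing in for your own files.

```python
import csv
import io
from collections import Counter


def parse_labels(labels_text):
    """Return the non-empty lines of a labels file, one label per line."""
    return [line.strip() for line in labels_text.splitlines() if line.strip()]


def check_labels(csv_stream, allowed_labels):
    """Count labels in a 'path,label' CSV; raise if a label is not allowed."""
    allowed = set(allowed_labels)
    counts = Counter()
    for line_number, row in enumerate(csv.reader(csv_stream), start=1):
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 'path,label', got {row!r}")
        _, label = row
        if label not in allowed:
            raise ValueError(f"line {line_number}: unknown label {label!r}")
        counts[label] += 1
    return counts


labels = parse_labels("daisy\ndandelion\nroses\nsunflowers\ntulips\n")
sample_csv = io.StringIO(
    "gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy\n"
    "gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion\n"
)
label_counts = check_labels(sample_csv, labels)
```

The returned counts also give a quick view of how balanced the classes are.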
    

Check Cloud Storage bucket permissions

To store your data, use a Cloud Storage bucket in the same Google Cloud project you're using to run AI Platform Training jobs. Otherwise, grant AI Platform Training access to the Cloud Storage bucket where your data is stored.

Required input format

To train with the built-in image classification algorithm, your image data must be structured as tf.Examples that include the following fields:

  • image/encoded is the raw image string, that is, the encoded (not decoded) image bytes.

  • image/class/label is a single integer label for the corresponding image. Multiple labels per instance are not supported.

    The set of integer labels used for your dataset must be a consecutive sequence starting at 1. For example, if your dataset has five classes, then each label must be an integer in the interval [1, 5].

For example:

{
    'image/encoded': '<encoded image data>',
    'image/class/label': 2
}
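
The labeling rule can be sketched as follows. This example assigns integer IDs starting at 1 and checks that a set of labels falls within [1, num_classes]. The function names are hypothetical, and the alphabetical ordering is an arbitrary choice for illustration. The conversion script may assign IDs in a different order.

```python
def build_label_map(class_names):
    """Map each class name to an integer ID in [1, number of classes]."""
    unique = sorted(set(class_names))
    return {name: index for index, name in enumerate(unique, start=1)}


def validate_label_ids(label_ids, num_classes):
    """Raise ValueError unless every label is an integer in [1, num_classes]."""
    for label in label_ids:
        if not isinstance(label, int) or not 1 <= label <= num_classes:
            raise ValueError(f"label {label!r} is outside [1, {num_classes}]")


label_map = build_label_map(["daisy", "dandelion", "roses", "sunflowers", "tulips"])
validate_label_ids(label_map.values(), num_classes=5)
```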

Getting the best SavedModel as output

When the training job completes, AI Platform Training writes a TensorFlow SavedModel to the Cloud Storage bucket you specified as jobDir when you submitted the job. The SavedModel is written to jobDir/model. For example, if you submit the job to gs://your-bucket-name/your-job-dir, then AI Platform Training writes the SavedModel to gs://your-bucket-name/your-job-dir/model.

If you enabled hyperparameter tuning, AI Platform Training returns the TensorFlow SavedModel with the highest accuracy achieved during the training process. For example, if you submitted a training job with 2,500 training steps, and the accuracy was highest at 2,000 steps, you get a TensorFlow SavedModel saved from that particular point.

Each trial of AI Platform Training writes the TensorFlow SavedModel with the highest accuracy to its own directory within your Cloud Storage bucket. For example, gs://your-bucket-name/your-job-dir/model/trial_{trial_id}.
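
As a small illustration of these paths, the following sketch derives the SavedModel location from a job directory and an optional trial ID. The function name `saved_model_dir` is hypothetical. It only builds strings and does not access Cloud Storage.

```python
def saved_model_dir(job_dir, trial_id=None):
    """Return the SavedModel directory for a job, or for one tuning trial."""
    base = job_dir.rstrip("/") + "/model"
    if trial_id is None:
        return base
    return f"{base}/trial_{trial_id}"


print(saved_model_dir("gs://your-bucket-name/your-job-dir"))
print(saved_model_dir("gs://your-bucket-name/your-job-dir", trial_id=3))
```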

The output SavedModel signature is:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['image_bytes'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder:0
    inputs['key'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: key:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (-1)
        name: ArgMax:0
    outputs['key'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Identity:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict

Inputs:

  • image_bytes: The raw (not decoded) image bytes. This is the same as image/encoded stored in tf.Example.
  • key: The string value identifier of prediction input. This value is passed through to the output key. In batch prediction, this helps to map the prediction output to the input.

Outputs:

  • classes: The predicted class (integer) label, which is the one with the highest probability.
  • key: The output key.
  • probabilities: The probability (between 0 and 1) for each class (ranging from 0 to num_classes).

The following is an example of prediction inputs and outputs:

prediction_input: {
  'image_bytes': 'some_raw_image_bytes',
  'key': ['test_key']
}

prediction_output: {
  'probabilities': [[0.1, 0.3, 0.6]],
  'classes': [2],
  'key': ['test_key'],
}
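
The following sketch shows one way to prepare a single prediction instance and to read the result for one image. It assumes that binary inputs are sent as base64 text wrapped in a `{"b64": ...}` object, which is how JSON prediction requests typically carry bytes. Verify this against the AI Platform Prediction documentation before relying on it. The helper names (`build_instance`, `top_class`) are hypothetical.

```python
import base64
import json


def build_instance(image_bytes: bytes, key: str) -> dict:
    """Build one JSON-serializable instance from raw (not decoded) image bytes."""
    return {
        # Assumption: binary data is base64-encoded under a "b64" field.
        "image_bytes": {"b64": base64.b64encode(image_bytes).decode("ascii")},
        "key": key,
    }


def top_class(prediction: dict):
    """Return (class index, probability) for the most likely class."""
    probabilities = prediction["probabilities"]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities[best]


instance = build_instance(b"\xff\xd8 placeholder image bytes", "test_key")
request_body = json.dumps({"instances": [instance]})

# A single-image prediction shaped like the example output above.
predicted, confidence = top_class({"probabilities": [0.1, 0.3, 0.6]})
```

The index returned by `top_class` should match the `classes` value in the prediction output, because `classes` is the label with the highest probability.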

Example configurations

If you submit a job using gcloud, you need to create a config.yaml file for your machine type and hyperparameter tuning specifications. If you use Google Cloud console, you don't need to create this file. Learn how to submit a training job.

The following example config.yaml file shows how to allocate TPU resources for your training job:

cat << EOF > config.yaml
trainingInput:
  # Use a single master and one Cloud TPU worker (TPU v2 with 8 cores).
  scaleTier: CUSTOM
  masterType: n1-highmem-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
  workerType: cloud_tpu
  workerConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
EOF

Next, use your config.yaml file to submit a training job.

Hyperparameter tuning configuration

To use hyperparameter tuning, include your hyperparameter tuning configuration in the same config.yaml file as your machine configuration.

You can find brief explanations of each hyperparameter within the Google Cloud console, and a more comprehensive explanation in the reference for the built-in image classification algorithm.

The following example config.yaml file shows how to allocate TPU resources for your training job, and includes hyperparameter tuning configuration:

cat << EOF > config.yaml
trainingInput:
  # Use a single master and one Cloud TPU worker (TPU v2 with 8 cores).
  scaleTier: CUSTOM
  masterType: n1-highmem-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
  workerType: cloud_tpu
  workerConfig:
    imageUri: gcr.io/cloud-ml-algos/image_classification:latest
    tpuTfVersion: 1.14
    acceleratorConfig:
      type: TPU_V2
      count: 8
  workerCount: 1
  # The following are hyperparameter configs.
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: top_1_accuracy
    maxTrials: 6
    maxParallelTrials: 3
    enableTrialEarlyStopping: True
    params:
    - parameterName: initial_learning_rate
      type: DOUBLE
      minValue: 0.001
      maxValue: 0.2
      scaleType: UNIT_LOG_SCALE
EOF
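
To build intuition for `scaleType: UNIT_LOG_SCALE`, the following sketch maps the learning-rate range from the example onto the unit interval logarithmically and back. It only illustrates what logarithmic scaling means. It is not the tuning service's search algorithm, and the function names are hypothetical.

```python
import math


def to_unit_log_scale(value, min_value, max_value):
    """Map value in [min_value, max_value] to [0, 1] on a log scale."""
    span = math.log(max_value) - math.log(min_value)
    return (math.log(value) - math.log(min_value)) / span


def from_unit_log_scale(unit, min_value, max_value):
    """Map a point in [0, 1] back to [min_value, max_value] on a log scale."""
    span = math.log(max_value) - math.log(min_value)
    return math.exp(math.log(min_value) + unit * span)


# The midpoint of the scaled range is the geometric mean of the bounds,
# so small learning rates get as much of the search space as large ones.
midpoint = from_unit_log_scale(0.5, 0.001, 0.2)
```

On a linear scale the midpoint of this range would be about 0.1. On a log scale it is about 0.014, so values near the lower bound are explored more evenly.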

Submit an image classification training job

This section explains how to submit a training job using the built-in image classification algorithm.

Console

Select your algorithm

  1. Go to the AI Platform Training Jobs page in the Google Cloud console:

    AI Platform Training Jobs page

  2. Click the New training job button. From the options that display below, click Built-in algorithm training.

  3. On the Create a new training job page, select image classification and click Next.

Select your training and validation data

  1. In the drop-down box under Training data, specify whether you are using a single file or multiple files:

    • For a single file, leave "Use single file in a GCS bucket" selected.
    • For multiple files, select "Use multiple files stored in one Cloud Storage directory".
  2. For Directory path, click Browse. In the right panel, click the name of the bucket where you uploaded the training data, and navigate to your file.

    If you're selecting multiple files, place your wildcard characters in Wildcard name. The "Complete GCS path" displays below to help you confirm that the path is correct.

  3. In the drop-down box under Validation data, specify whether you are using a single file or multiple files:

    • For a single file, leave "Use single file in a GCS bucket" selected.
    • For multiple files, select "Use multiple files stored in one Cloud Storage directory".
  4. For Directory path, click Browse. In the right panel, click the name of the bucket where you uploaded the validation data, and navigate to your file.

    If you're selecting multiple files, place your wildcard characters in Wildcard name. The "Complete GCS path" displays below to help you confirm that the path is correct.

  5. In Output directory, enter the path to the Cloud Storage bucket where you want AI Platform Training to store the outputs from your training job. You can fill in your Cloud Storage bucket path directly, or click the Browse button to select it.

    To keep things organized, create a new directory within your Cloud Storage bucket for this training job. You can do this within the Browse pane.

    Click Next.

Set the algorithm arguments

Each algorithm-specific argument displays a default value for training jobs without hyperparameter tuning. If you enable hyperparameter tuning on an algorithm argument, you must specify its minimum and maximum value.

To learn more about all the algorithm arguments, follow the links in the Google Cloud console and refer to the built-in image classification reference for more details.

Submit the job

On the Job settings tab:

  1. Enter a unique Job ID.
  2. Enter an available region (such as "us-central1").
  3. To select machine types, select "CUSTOM" for the scale tier. A section to provide your Custom cluster specification displays.
    1. Select an available machine type for Master type.
    2. If you want to use TPUs, set the Worker type to cloud_tpu. The worker count defaults to 1.

Click Done to submit the training job.

gcloud

  1. Set environment variables for your job:

    PROJECT_ID="YOUR_PROJECT_ID"
    BUCKET_NAME="YOUR_BUCKET_NAME"
    
    # Specify the same region where your data is stored
    REGION="YOUR_REGION"
    
    gcloud config set project $PROJECT_ID
    gcloud config set compute/region $REGION
    
    # Set Cloud Storage paths to your training and validation data
    # Include a wildcard if you select multiple files.
    TRAINING_DATA_PATH="gs://${BUCKET_NAME}/YOUR_DATA_DIRECTORY/train-*.tfrecord"
    VALIDATION_DATA_PATH="gs://${BUCKET_NAME}/YOUR_DATA_DIRECTORY/eval-*.tfrecord"
    
    # Specify the Docker container for your built-in algorithm selection
    IMAGE_URI="gcr.io/cloud-ml-algos/image_classification:latest"
    
    # Variables for constructing descriptive names for JOB_ID and JOB_DIR
    DATASET_NAME="flowers"
    ALGORITHM="image_classification"
    MODEL_NAME="${DATASET_NAME}_${ALGORITHM}"
    DATE="$(date '+%Y%m%d_%H%M%S')"
    
    # Specify an ID for this job
    JOB_ID="${MODEL_NAME}_${DATE}"
    
    # Specify the directory where you want your training outputs to be stored
    JOB_DIR="gs://${BUCKET_NAME}/algorithm_training/${JOB_ID}"
    
  2. Submit the job:

    gcloud ai-platform jobs submit training $JOB_ID \
      --region=$REGION \
      --config=config.yaml \
      --master-image-uri=$IMAGE_URI \
      -- \
      --training_data_path=$TRAINING_DATA_PATH \
      --validation_data_path=$VALIDATION_DATA_PATH \
      --job-dir=$JOB_DIR \
      --max_steps=30000 \
      --train_batch_size=128 \
      --num_classes=5 \
      --num_eval_images=100 \
      --initial_learning_rate=0.128 \
      --warmup_steps=1000 \
      --model_type='efficientnet-b4'
  3. After the job is submitted successfully, you can view the logs using the following gcloud commands:

    gcloud ai-platform jobs describe $JOB_ID
    gcloud ai-platform jobs stream-logs $JOB_ID
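
If you prefer to assemble the submission from a script, the following sketch builds the same job ID and command line in Python without running anything. The function names are hypothetical, the flags are copied from the steps above, and the placeholder values (such as YOUR_BUCKET_NAME) must be replaced with your own.

```python
import shlex
from datetime import datetime


def build_job_id(dataset_name, algorithm, now=None):
    """Build a descriptive, timestamped job ID."""
    now = now or datetime.now()
    return f"{dataset_name}_{algorithm}_{now.strftime('%Y%m%d_%H%M%S')}"


def build_submit_command(job_id, region, image_uri, algorithm_args, config="config.yaml"):
    """Return the gcloud argument list; arguments after '--' go to the algorithm."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_id,
        f"--region={region}",
        f"--config={config}",
        f"--master-image-uri={image_uri}",
        "--",
    ]
    command += [f"--{name}={value}" for name, value in algorithm_args.items()]
    return command


job_id = build_job_id("flowers", "image_classification")
bucket = "YOUR_BUCKET_NAME"  # placeholder
command = build_submit_command(
    job_id,
    region="YOUR_REGION",  # placeholder
    image_uri="gcr.io/cloud-ml-algos/image_classification:latest",
    algorithm_args={
        "training_data_path": f"gs://{bucket}/YOUR_DATA_DIRECTORY/train-*.tfrecord",
        "validation_data_path": f"gs://{bucket}/YOUR_DATA_DIRECTORY/eval-*.tfrecord",
        "job-dir": f"gs://{bucket}/algorithm_training/{job_id}",
        "max_steps": 30000,
        "train_batch_size": 128,
        "num_classes": 5,
        "num_eval_images": 100,
        "initial_learning_rate": 0.128,
        "warmup_steps": 1000,
        "model_type": "efficientnet-b4",
    },
)
# Print a shell-quoted version for review before running it yourself.
print(shlex.join(command))
```

Building the command as a list keeps the wildcard paths from being expanded by a shell, and printing it first lets you review the job before you submit it.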
    

What's next