Training AmoebaNet-D on Cloud TPU

The AmoebaNet-D model is one of the image classifier architectures discovered using Evolutionary AutoML. The model is based on results from the AmoebaNet paper: Real, E., Aggarwal, A., Huang, Y. and Le, Q.V., 2018, Regularized Evolution for Image Classifier Architecture Search, arXiv preprint arXiv:1802.01548.

This model uses TPUEstimator, a high-level TensorFlow API, which is the recommended way to build and run a machine learning model on a Cloud TPU.

The API simplifies the model development process by hiding most of the low-level implementation, making it easier to switch between TPU and other platforms such as GPU or CPU.

Objectives

  • Create a Cloud Storage bucket to hold your dataset and model output.
  • Prepare a test version of the ImageNet dataset, referred to as the fake_imagenet dataset.
  • Run the training job.
  • Verify the output results.

Costs

This tutorial uses billable components of Google Cloud Platform, including:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage. New GCP users might be eligible for a free trial.

Before you begin

Before starting this tutorial, check that your Google Cloud Platform project is correctly set up.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  4. This walkthrough uses billable components of Google Cloud Platform. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.

Set up your resources

This section provides information on setting up the Cloud Storage bucket, VM, and Cloud TPU resources you need for this tutorial.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create a variable for your project's name.

    export PROJECT_NAME=project_name
    
  3. Configure the gcloud command-line tool to use the project where you want to create your Cloud TPU.

    gcloud config set project ${PROJECT_NAME}
    
  4. Create a Cloud Storage bucket using the following command:

    gsutil mb -p ${PROJECT_NAME} -c standard -l us-central1 -b on gs://bucket-name/
    

    This Cloud Storage bucket stores the data you use to train your model and the training results. The ctpu up tool used in this tutorial sets up default permissions for the Cloud TPU service account. If you want finer-grain permissions, review the access level permissions.

    The bucket location must be in the same region as your virtual machine (VM) and your TPU node. VMs and TPU nodes are located in specific zones, which are subdivisions within a region.

  5. Launch the Compute Engine and Cloud TPU resources required for this tutorial using the ctpu up command.

    ctpu up --zone=us-central1-a \
    --tpu-size=v2-8 \
    --disk-size-gb=300 \
    --machine-type=n1-standard-8 \
    --tf-version=1.14

    For more information on the CTPU utility, see CTPU Reference.

    The following configuration message appears:

    ctpu will use the following configuration:
    
    Name: [your TPU's name]
    Zone: [your project's zone]
    GCP Project: [your project's name]
    TensorFlow Version: 1.14
    VM:
      Machine Type: [your machine type]
      Disk Size: [your disk size]
      Preemptible: [true or false]
    Cloud TPU:
      Size: [your TPU size]
      Preemptible: [true or false]
    
    OK to create your Cloud TPU resources with the above configuration? [Yn]:
    
  6. Press y to create your Cloud TPU resources.

When the ctpu up command has finished executing, verify that your shell prompt has changed from username@project to username@tpuname. This change shows that you are now logged into your Compute Engine VM.
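If you want to confirm that the VM and Cloud TPU were created before continuing, you can run ctpu status from a separate Cloud Shell session. This is an optional check; pass the same --zone value you used with ctpu up:

$ ctpu status --zone=us-central1-a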

From this point on, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance.

Prepare the data

Set up the following environment variables, replacing bucket-name with the name of your Cloud Storage bucket:

(vm)$ export STORAGE_BUCKET=gs://bucket-name
(vm)$ export MODEL_BUCKET=${STORAGE_BUCKET}/amoebanet

The training application expects your training data to be accessible in Cloud Storage. The training application also uses your Cloud Storage bucket to store checkpoints during training.
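As an optional sanity check, you can verify that the VM can reach your bucket by listing it. The listing is empty at this point because training has not written anything yet:

(vm)$ gsutil ls ${STORAGE_BUCKET}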

Run the AmoebaNet-D model with fake_imagenet

ImageNet is an image database. The images in the database are organized into a hierarchy, with each node of the hierarchy depicted by hundreds and thousands of images.

This tutorial uses a demonstration version of the full ImageNet dataset, referred to as fake_imagenet. This demonstration version allows you to test the tutorial, while reducing the storage and time requirements typically associated with running a model against the full ImageNet database.

The fake_imagenet dataset is at this location on Cloud Storage:

gs://cloud-tpu-test-datasets/fake_imagenet

The fake_imagenet dataset is only useful for understanding how to use a Cloud TPU and validating end-to-end performance. The accuracy numbers and saved model will not be meaningful.

For information on how to download and process the full ImageNet dataset, see Downloading, preprocessing, and uploading the ImageNet dataset.
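If you want to inspect what the training job will read, you can list the demonstration dataset from your VM. This is an optional check; the dataset is provided for these tutorials and requires no additional setup:

(vm)$ gsutil ls gs://cloud-tpu-test-datasets/fake_imagenet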

In the following steps, a prefix of (vm)$ means you should run the command on your Compute Engine VM.

  1. Add the top-level /models folder to the Python path with the following command:

    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    

    The AmoebaNet-D model is pre-installed on your Compute Engine VM.

  2. Navigate to the directory:

    (vm)$ cd /usr/share/tpu/models/official/amoeba_net/
    
  3. Run the training script.

    (vm)$ python amoeba_net.py \
       --tpu=$TPU_NAME \
       --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
       --model_dir=${MODEL_BUCKET}
    
    Parameter Description
    tpu Specifies the name of the Cloud TPU. Note that ctpu passes this name to the Compute Engine VM as an environment variable (TPU_NAME).
    data_dir Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
     model_dir Specifies the directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the model_dir must be a Cloud Storage path (`gs://...`). You can reuse an existing folder to load current checkpoint data and to store additional checkpoints, as long as the previous checkpoints were created using a TPU of the same size and TensorFlow version.

For a single Cloud TPU device, the procedure trains the AmoebaNet-D model for 90 epochs and evaluates it at a fixed interval of steps. With the specified flags, the model should train in about 10 hours.

Since the training and evaluation was done on the fake_imagenet dataset, the output results do not reflect actual output that would appear if the training and evaluation was performed on a real dataset.
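To verify the output results, you can list the model directory and confirm that checkpoint and event files were written to your Cloud Storage bucket:

(vm)$ gsutil ls ${MODEL_BUCKET}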

Evaluate the trained AmoebaNet-D model

In this step, you use Cloud TPU to evaluate the model you trained above against the fake_imagenet validation data.

  1. Exit the Compute Engine instance you used to train the model, if you have not already done so.

    (vm)$ exit
    
  2. Allocate and start a new Compute Engine VM and v2-8 Cloud TPU by running ctpu up:

    ctpu up --zone=us-central1-a \
    --tpu-size=v2-8 \
    --disk-size-gb=300 \
    --machine-type=n1-standard-8 \
    --tf-version=1.14
  3. Set up the variables needed to re-run the model and navigate to the model directory:

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    (vm)$ export MODEL_BUCKET=${STORAGE_BUCKET}/amoebanet
    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    (vm)$ cd /usr/share/tpu/models/official/amoeba_net/
    
  4. Run the model evaluation with the following flags:

    (vm)$ python amoeba_net.py \
    --tpu=$TPU_NAME \
    --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
    --model_dir=$MODEL_BUCKET \
    --num_cells=6 \
    --image_size=224 \
    --train_batch_size=4096 \
    --eval_batch_size=1000 \
    --mode=eval \
    --iterations_per_loop=1000
    

This generates output similar to the following:

Evaluation results: {'loss': 6.908725, 'top_1_accuracy': 0.001, 'global_step': 10955, 'top_5_accuracy': 0.005}

Since the training and evaluation was done on the fake_imagenet dataset, the output results do not reflect actual output that would appear if the training and evaluation was performed on a real dataset.

At this point, you can either conclude this tutorial and clean up your GCP resources, or you can further explore running the model on Cloud TPU Pods.

Scaling your model with Cloud TPU Pods

You can get results faster by scaling your model with Cloud TPU Pods. The fully supported AmoebaNet-D model can work with the following Pod slices:

  • v2-32
  • v3-32
  1. Run the ctpu up command, using the --tpu-size parameter to specify the Pod slice you want to use. For example, the following command uses a v2-32 Pod slice.

    ctpu up --tpu-size=v2-32
  2. Run the model on your Compute Engine VM with the following flags:

    (vm)$ python amoeba_net.py \
       --tpu=$TPU_NAME \
       --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
       --model_dir=$MODEL_BUCKET \
       --num_cells=6 \
       --image_size=224 \
       --num_epochs=35 \
       --train_batch_size=4096 \
       --eval_batch_size=1000 \
       --lr=10.24 \
       --lr_decay_value=0.88 \
       --num_shards=32 \
       --lr_warmup_epochs=0.35 \
       --mode=train \
       --iterations_per_loop=1000
    
    Parameter Description
    tpu Specifies the name of the Cloud TPU. Note that ctpu passes this name to the Compute Engine VM as an environment variable (TPU_NAME).
    data_dir Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
     model_dir Specifies the directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the `model_dir` must be a Cloud Storage path (`gs://...`). You can reuse an existing folder to load current checkpoint data and to store additional checkpoints, as long as the previous checkpoints were created using a TPU of the same size and TensorFlow version.

The procedure trains the AmoebaNet-D model on the fake_imagenet dataset for 35 epochs. This training takes approximately 90 minutes on a v3-128 Cloud TPU.

You can also try the model using larger Cloud TPU Pod slices. The following table shows the recommended values for these slices.

Parameter         Description                                                128 cores  256 cores
num_cells         Total number of cells                                      6          6
image_size        Size of image, assuming equal image height and width      224        224
num_epochs        Number of epochs to use for training                       60         90
train_batch_size  Global (not per-shard) batch size for training             8192       16384
lr                Learning rate                                              20.48      40.96
lr_decay_value    Exponential decay rate used in learning rate adjustment    0.91       0.94
num_shards        Number of shards (TPU cores)                               128        256
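For example, a 128-core training run would combine the command from step 2 with the values in the table. The following is a sketch only, not a verified configuration: check the flags against the version of amoeba_net.py on your VM, and note that the table does not specify warmup or evaluation settings for these slices.

(vm)$ python amoeba_net.py \
   --tpu=$TPU_NAME \
   --data_dir=gs://cloud-tpu-test-datasets/fake_imagenet \
   --model_dir=$MODEL_BUCKET \
   --num_cells=6 \
   --image_size=224 \
   --num_epochs=60 \
   --train_batch_size=8192 \
   --lr=20.48 \
   --lr_decay_value=0.91 \
   --num_shards=128 \
   --mode=train \
   --iterations_per_loop=1000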

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be user@projectname, showing you are in the Cloud Shell.

  2. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete [optional: --zone]
    
  3. Run ctpu status to make sure you have no instances allocated, so you avoid unnecessary charges for TPU usage. The deletion might take several minutes. A response like the one below indicates that there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "us-central1-b"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  4. Run gsutil as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://bucket-name
    

What's next
