The EfficientNet models are a family of image classification models that achieve state-of-the-art accuracy while also being smaller and faster than other models. The EfficientNet-EdgeTPU models are customized to run efficiently on Google EdgeTPU devices.
The model in this tutorial is based on EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, in which researchers developed a new technique to improve model performance: carefully balancing network depth, width, and resolution using a simple yet highly effective compound coefficient.
The family of models from efficientnet-b0 to efficientnet-b7 can achieve decent image classification accuracy on the resource-constrained Google EdgeTPU devices. efficientnet-b0, the model used in this tutorial, corresponds to the smallest base model, whereas efficientnet-b7 corresponds to the most powerful but most computationally expensive model. The tutorial demonstrates training the model using TPUEstimator.
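For intuition, the compound-scaling rule from the paper fits in a few lines of Python. The sketch below is illustrative only, not code from this tutorial: the coefficients alpha=1.2, beta=1.1, gamma=1.15 are the base values reported in the EfficientNet paper (found by grid search under the constraint alpha * beta^2 * gamma^2 ≈ 2), and the released b1-b7 models use per-variant tuned settings rather than this exact formula.

# Illustrative sketch of EfficientNet compound scaling; not part of main.py.
# alpha, beta, gamma are the base coefficients reported in the paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi        # scales the number of layers
    width = BETA ** phi         # scales the number of channels
    resolution = GAMMA ** phi   # scales the input image size
    return depth, width, resolution

for phi in range(8):  # roughly the b0..b7 range of the family
    d, w, r = compound_scale(phi)
    print("phi=%d: depth x%.2f, width x%.2f, resolution x%.2f" % (phi, d, w, r))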
Objectives
- Create a Cloud Storage bucket to hold your dataset and model output.
- Prepare a test version of the ImageNet dataset, referred to as the fake_imagenet dataset.
- Run the training job.
- Verify the output results.
Costs
This tutorial uses billable components of Google Cloud, including:
- Compute Engine
- Cloud TPU
- Cloud Storage
Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.
Before you begin
Before starting this tutorial, check that your Google Cloud project is correctly set up.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.
Set up your resources
This section provides information on setting up Cloud Storage, VM, and Cloud TPU resources for tutorials.
Open a Cloud Shell window.
Create a variable for your project's ID.
export PROJECT_ID=project-id
Configure the gcloud command-line tool to use the project where you want to create your Cloud TPU.
gcloud config set project ${PROJECT_ID}
The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.
Create a Service Account for the Cloud TPU project.
gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
The command returns a Cloud TPU Service Account with the following format:
service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
Create a Cloud Storage bucket using the following command:
gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 -b on gs://bucket-name
This Cloud Storage bucket stores the data you use to train your model and the training results. The gcloud compute tpus execution-groups command used in this tutorial sets up default permissions for the Cloud TPU Service Account. If you want finer-grained permissions, review the access level permissions.
The bucket location must be in the same region as your virtual machine (VM) and your TPU node. VMs and TPU nodes are located in specific zones, which are subdivisions within a region.
Launch the Compute Engine and Cloud TPU resources required for this tutorial using the gcloud compute tpus execution-groups command.
gcloud compute tpus execution-groups create \
  --vm-only \
  --name=efficientnet-tutorial \
  --zone=europe-west4-a \
  --disk-size=300 \
  --machine-type=n1-standard-8 \
  --tf-version=1.15.5
Command flag descriptions
vm-only
- Create the Compute Engine VM only; do not create a Cloud TPU.
name
- The name of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
disk-size
- The size of the hard disk in GB of the VM created by the gcloud command.
machine-type
- The machine type of the Compute Engine VM to create.
tf-version
- The version of TensorFlow gcloud installs on the VM.
For more information on the gcloud command, see the gcloud Reference.
When prompted, press y to create your Cloud TPU resources.
When the gcloud compute tpus execution-groups command has finished executing, verify that your shell prompt has changed from username@projectname to username@vm-name. This change shows that you are now logged into your Compute Engine VM.
gcloud compute ssh efficientnet-tutorial --zone=europe-west4-a
From this point on, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance.
Prepare the data
Create an environment variable for your bucket name, replacing bucket-name with the name of your Cloud Storage bucket.
(vm)$ export STORAGE_BUCKET=gs://bucket-name
Create some additional environment variables.
(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/efficientnet
(vm)$ export DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
(vm)$ export TPU_NAME=efficientnet-tutorial
(vm)$ export PYTHONPATH=$PYTHONPATH:/usr/share/tpu/models
The training application expects your training data to be accessible in Cloud Storage. The training application also uses your Cloud Storage bucket to store checkpoints during training.
Train and evaluate the EfficientNet model with fake_imagenet
ImageNet is an image database. The images in the database are organized into a hierarchy, with each node of the hierarchy depicted by hundreds or thousands of images.
This tutorial uses a demonstration version of the full ImageNet dataset, referred to as fake_imagenet. This demonstration version allows you to test the tutorial, while reducing the storage and time requirements typically associated with running a model against the full ImageNet database.
The fake_imagenet dataset is at this location on Cloud Storage:
gs://cloud-tpu-test-datasets/fake_imagenet
The fake_imagenet dataset is only useful for understanding how to use a Cloud TPU and validating end-to-end performance. The accuracy numbers and saved model will not be meaningful.
For information on how to download and process the full ImageNet dataset, see Downloading, preprocessing, and uploading the ImageNet dataset.
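If you want to sanity-check the dataset before training, you can list the shards and decode a single record from the VM, where TensorFlow 1.15 is installed. The sketch below is an assumption-laden illustration: the train-* shard pattern and the feature keys it prints follow the TFRecord layout used by the TPU reference ImageNet models, which may differ for other datasets.

# Hypothetical sanity check of the fake_imagenet TFRecords (TF 1.15, run on the VM).
import tensorflow as tf

DATA_DIR = "gs://cloud-tpu-test-datasets/fake_imagenet"

# List the training shards stored in Cloud Storage.
shards = tf.io.gfile.glob(DATA_DIR + "/train-*")
print("found %d training shards" % len(shards))

# Read one serialized record from the first shard (TF 1.x graph mode).
dataset = tf.data.TFRecordDataset(shards[:1]).take(1)
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
next_record = iterator.get_next()
with tf.compat.v1.Session() as sess:
    serialized = sess.run(next_record)

# Decode the record and print its feature keys (e.g. image/encoded, image/class/label).
example = tf.train.Example()
example.ParseFromString(serialized)
print(sorted(example.features.feature.keys()))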
Launch a Cloud TPU resource.
(vm)$ gcloud compute tpus execution-groups create \
  --tpu-only \
  --name=efficientnet-tutorial \
  --zone=europe-west4-a \
  --disk-size=300 \
  --machine-type=n1-standard-8 \
  --tf-version=1.15.5
Command flag descriptions
tpu-only
- Create the Cloud TPU only; do not create a Compute Engine VM.
name
- The name of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
disk-size
- The size of the hard disk in GB of the VM created by the gcloud command.
machine-type
- The machine type of the Compute Engine VM to create.
tf-version
- The version of TensorFlow gcloud installs on the VM.
Navigate to the model directory:
(vm)$ cd /usr/share/tpu/models/official/efficientnet/
Run the training script.
(vm)$ python3 main.py \
  --tpu=${TPU_NAME} \
  --data_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --model_name='efficientnet-b0' \
  --skip_host_call=true \
  --train_batch_size=2048 \
  --train_steps=1000
Command flag descriptions
tpu
- Uses the name specified in the TPU_NAME variable.
data_dir
- Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
model_dir
- The Cloud Storage path where checkpoints and summaries are stored during model training. You can reuse an existing folder to load previously generated checkpoints and to store additional checkpoints as long as the previous checkpoints were created using a Cloud TPU of the same size and TensorFlow version.
model_name
- The name of the model to train. For example, efficientnet-b0.
skip_host_call
- Set to true to instruct the script to skip the host_call, which is executed every training step. The host_call is generally used for generating training summaries (training loss, learning rate, and so on). When skip_host_call=false, there can be a performance drop if the host_call function is slow and cannot keep up with the TPU-side computation.
train_batch_size
- The training batch size.
train_steps
- The number of steps to use for training. The default value is 218,949 steps, which is approximately 350 epochs at a batch size of 2048. Adjust this flag according to the train_batch_size value (see the arithmetic sketch after this list).
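The steps-to-epochs conversion behind these defaults is plain arithmetic. The sketch below reproduces it, assuming the standard ImageNet training-set size of 1,281,167 images:

# How the train_steps values in this tutorial are derived.
NUM_TRAIN_IMAGES = 1281167  # standard ImageNet training-set size

def steps_for(epochs, batch_size):
    """Training steps needed for `epochs` full passes over the data."""
    return int(epochs * NUM_TRAIN_IMAGES / batch_size)

print(steps_for(350, 2048))  # 218949: ~350 epochs at batch size 2048
print(steps_for(350, 4096))  # 109474: the value used later for the Pod run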
This trains the EfficientNet model (efficientnet-b0 variant) for only 1000 steps because it is using the fake ImageNet dataset. When training with the full ImageNet dataset, you can train to convergence by using the following command:
(vm)$ python3 main.py \
--tpu=${TPU_NAME} \
--data_dir=${DATA_DIR} \
--model_dir=${MODEL_DIR} \
--model_name='efficientnet-b0' \
--skip_host_call=true \
--train_batch_size=2048 \
--train_steps=218948
This trains the EfficientNet model for 350 epochs and evaluates after processing one batch of data. Using the specified flags, the model should train in about 23 hours. These settings should obtain approximately 76.5% top-1 accuracy on the ImageNet validation dataset. The best model checkpoint and the corresponding evaluation result are saved in the archive folder in the model directory: ${STORAGE_BUCKET}/efficientnet/archive.
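To verify the output results, you can list the archive folder with TensorFlow's Cloud Storage-aware file API. A minimal sketch, run on the VM and assuming the MODEL_DIR environment variable exported earlier:

# List the archived training output in Cloud Storage (TF 1.15, run on the VM).
import os
import tensorflow as tf

model_dir = os.environ["MODEL_DIR"]   # e.g. gs://bucket-name/efficientnet
archive_dir = model_dir + "/archive"

# Everything the training run archived (best checkpoint, evaluation result).
for name in tf.io.gfile.listdir(archive_dir):
    print(name)

# The newest checkpoint prefix in the main model directory.
print(tf.train.latest_checkpoint(model_dir))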
Scaling your model with Cloud TPU Pods
You can get results faster by scaling your model with Cloud TPU Pods. The fully supported model can work with the following Pod slices:
- v2-32
- v3-32
When working with Cloud TPU Pods, you first train the model using a Pod, then use a single Cloud TPU device to evaluate the model.
Training with Cloud TPU Pods
Delete the Cloud TPU resource you created for training the model on a single device.
(vm)$ gcloud compute tpus execution-groups delete efficientnet-tutorial \
  --zone=europe-west4-a \
  --tpu-only
Run the gcloud compute tpus execution-groups command, using the accelerator-type parameter to specify the Pod slice you want to use. For example, the following command uses a v3-32 Pod slice.
(vm)$ gcloud compute tpus execution-groups create --tpu-only \
  --name=efficientnet-tutorial \
  --zone=europe-west4-a \
  --accelerator-type=v3-32 \
  --tf-version=1.15.5
Command flag descriptions
tpu-only
- Create a Cloud TPU only. By default the gcloud command creates a VM and a Cloud TPU.
name
- The name of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
accelerator-type
- The type of the Cloud TPU to create.
tf-version
- The version of TensorFlow the gcloud compute tpus execution-groups command installs on the VM.
Create an environment variable for your TPU name.
(vm)$ export TPU_NAME=efficientnet-tutorial
Update the MODEL_DIR directory to store the training data.
(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/efficientnet-tutorial
Train the model.
(vm)$ python3 main.py \
  --tpu=${TPU_NAME} \
  --data_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --model_name='efficientnet-b3' \
  --skip_host_call=true \
  --mode=train \
  --train_steps=1000 \
  --train_batch_size=4096 \
  --iterations_per_loop=100
Command flag descriptions
tpu
- The name of the Cloud TPU.
data_dir
- The Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
model_dir
- The Cloud Storage path where checkpoints and summaries are stored during model training. You can reuse an existing folder to load previously generated checkpoints and to store additional checkpoints as long as the previous checkpoints were created using a Cloud TPU of the same size and TensorFlow version.
model_name
- The name of the model to train.
skip_host_call
- Set to true to instruct the script to skip the host_call, which is executed every training step. The host_call is generally used for generating training summaries (training loss, learning rate, and so on). When skip_host_call=false, there can be a performance drop if the host_call function is slow and cannot keep up with the TPU-side computation.
mode
- One of train_and_eval, train, or eval. train_and_eval trains and evaluates the model. train trains the model. eval evaluates the model.
train_steps
- Specifies the number of training steps.
train_batch_size
- The training batch size.
iterations_per_loop
- The number of training steps to run on the TPU before sending metrics to the CPU (see the TPUEstimator sketch after this list).
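These flags map directly onto the TF 1.15 TPUEstimator API that this tutorial's training script is built on. The skeleton below shows where each flag lands; it trains a toy regression model, and names such as toy_model_fn and toy_input_fn are illustrative placeholders, not the EfficientNet code in main.py.

# Skeleton mapping the tutorial's flags onto the TF 1.15 TPUEstimator API.
import tensorflow as tf

def toy_input_fn(params):
    # TPUEstimator passes the per-core batch size in params["batch_size"].
    batch_size = params["batch_size"]
    xs = tf.random.uniform([1024, 8])
    ys = tf.reduce_sum(xs, axis=1, keepdims=True)
    ds = tf.data.Dataset.from_tensor_slices((xs, ys))
    # TPUs need fixed shapes, hence drop_remainder=True.
    return ds.repeat().batch(batch_size, drop_remainder=True)

def toy_model_fn(features, labels, mode, params):
    # Placeholder model; main.py builds EfficientNet here instead.
    predictions = tf.compat.v1.layers.dense(features, 1)
    loss = tf.compat.v1.losses.mean_squared_error(labels, predictions)
    optimizer = tf.compat.v1.tpu.CrossShardOptimizer(
        tf.compat.v1.train.GradientDescentOptimizer(0.01))
    train_op = optimizer.minimize(
        loss, global_step=tf.compat.v1.train.get_global_step())
    return tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="efficientnet-tutorial")                         # --tpu

run_config = tf.estimator.tpu.RunConfig(
    cluster=resolver,
    model_dir="gs://bucket-name/efficientnet-tutorial",  # --model_dir
    tpu_config=tf.estimator.tpu.TPUConfig(
        iterations_per_loop=100))                        # --iterations_per_loop

estimator = tf.estimator.tpu.TPUEstimator(
    model_fn=toy_model_fn,
    config=run_config,
    train_batch_size=4096)                               # --train_batch_size

estimator.train(input_fn=toy_input_fn, max_steps=1000)   # --train_steps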
This command trains the EfficientNet model (efficientnet-b3 variant) for only 1000 steps because it is using the fake ImageNet dataset. When training with the full ImageNet dataset, you can train to convergence by using the following command (doubling the batch size to 4096 halves the number of steps needed for 350 epochs: 218,949 / 2 ≈ 109,474):
(vm)$ python3 main.py \
--tpu=${TPU_NAME} \
--data_dir=${DATA_DIR} \
--model_dir=${MODEL_DIR} \
--model_name='efficientnet-b3' \
--skip_host_call=true \
--mode=train \
--train_steps=109474 \
--train_batch_size=4096 \
--iterations_per_loop=100
This command trains the EfficientNet model (efficientnet-b3 variant) for 350 epochs. The model should reach 81.1% accuracy on the ImageNet dev set and finish in approximately 20 hours. The best model checkpoint and the corresponding evaluation result are saved in the archive folder in the model directory: ${STORAGE_BUCKET}/efficientnet-tutorial/archive.
Evaluating the model
In this set of steps, you use Cloud TPU to evaluate the model trained above against the fake_imagenet validation data.
Delete the Cloud TPU resource you created to train the model on a Pod.
(vm)$ gcloud compute tpus execution-groups delete efficientnet-tutorial \
  --tpu-only \
  --zone=europe-west4-a
Start a v2-8 Cloud TPU to run the evaluation. Use the same name that you used for the Compute Engine VM, which should still be running.
(vm)$ gcloud compute tpus execution-groups create --tpu-only \
  --name=efficientnet-tutorial \
  --accelerator-type=v2-8 \
  --zone=europe-west4-a \
  --tf-version=1.15.5
Command flag descriptions
tpu-only
- Create a Cloud TPU only. By default the gcloud command creates a VM and a Cloud TPU.
name
- The name of the Cloud TPU to create.
accelerator-type
- The type of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
tf-version
- The version of TensorFlow gcloud installs on the VM.
Create an environment variable for your TPU name.
(vm)$ export TPU_NAME=efficientnet-tutorial
Run the model evaluation. This time, add the mode flag and set it to eval.
(vm)$ python3 main.py \
  --tpu=${TPU_NAME} \
  --data_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --model_name='efficientnet-b3' \
  --skip_host_call=true \
  --mode=eval
Command flag descriptions
tpu
- Uses the name specified in the TPU_NAME variable.
data_dir
- Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
model_dir
- The Cloud Storage path where checkpoints and summaries are stored during model training. You can reuse an existing folder to load previously generated checkpoints and to store additional checkpoints as long as the previous checkpoints were created using a Cloud TPU of the same size and TensorFlow version.
model_name
- The name of the model to train. For example, efficientnet-b3.
skip_host_call
- Set to true to instruct the script to skip the host_call, which is executed every training step. The host_call is generally used for generating training summaries (training loss, learning rate, and so on). When skip_host_call=false, there can be a performance drop if the host_call function is slow and cannot keep up with the TPU-side computation.
mode
- One of train_and_eval, train, eval, or export_only. eval, used here, evaluates the model. train_and_eval trains and evaluates the model, and export_only exports a saved model.
This generates output similar to the following:
Eval results: {'loss': 7.532023, 'top_1_accuracy': 0.0010172526, 'global_step': 100, 'top_5_accuracy': 0.005065918}
Elapsed seconds: 88
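As stated earlier, accuracy on fake_imagenet is not meaningful: the top_1_accuracy printed above is roughly 1/1000, which is chance level for ImageNet's 1,000 classes. A quick sanity check of that reading:

# fake_imagenet labels carry no signal, so a 1000-class classifier scores
# near chance. Values copied from the sample output above.
NUM_CLASSES = 1000
top_1, top_5 = 0.0010172526, 0.005065918

print(abs(top_1 - 1.0 / NUM_CLASSES) < 1e-3)  # True: ~chance top-1 (0.001)
print(abs(top_5 - 5.0 / NUM_CLASSES) < 1e-3)  # True: ~chance top-5 (0.005)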
Cleaning up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Disconnect from the Compute Engine instance, if you have not already done so:
(vm)$ exit
Your prompt should now be username@projectname, showing you are in the Cloud Shell.
In your Cloud Shell, use the following command to delete your Compute Engine VM and Cloud TPU:
$ gcloud compute tpus execution-groups delete efficientnet-tutorial \
  --zone=europe-west4-a
Verify the resources have been deleted by running gcloud compute tpus execution-groups list. The deletion might take several minutes. A response like the one below indicates your instances have been successfully deleted:
$ gcloud compute tpus execution-groups list \
  --zone=europe-west4-a
You should see an empty list of TPUs like the following:
NAME STATUS
Delete your Cloud Storage bucket using gsutil as shown below. Replace bucket-name with the name of your Cloud Storage bucket.
$ gsutil rm -r gs://bucket-name
What's next
In this tutorial you have trained the EfficientNet model using a sample dataset. The results of this training are (in most cases) not usable for inference. To use a model for inference, you can train the model on a publicly available dataset or on your own dataset. Models trained on Cloud TPUs require datasets to be in TFRecord format.
You can use the dataset conversion tool sample to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to TFRecord format yourself. For more information, see TFRecord and tf.Example.
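If you do convert a dataset yourself, the core of the format is small: one serialized tf.train.Example per image. The sketch below is a minimal illustration, assuming JPEG files and integer labels; the feature keys mirror the common ImageNet layout used by the TPU reference models, so check your model's input pipeline for the exact keys it expects.

# Minimal sketch of writing an image-classification TFRecord file.
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_tfrecord(image_paths, labels, output_path):
    """Write one tf.train.Example per (JPEG image, integer label) pair."""
    with tf.io.TFRecordWriter(output_path) as writer:
        for path, label in zip(image_paths, labels):
            with tf.io.gfile.GFile(path, "rb") as f:
                encoded_jpeg = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                "image/encoded": _bytes_feature(encoded_jpeg),
                "image/class/label": _int64_feature(label),
            }))
            writer.write(example.SerializeToString())

# Hypothetical usage:
# write_tfrecord(["img0.jpg", "img1.jpg"], [0, 1], "train-00000-of-00001")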
Hyperparameter tuning
To improve the model's performance with your dataset, you can tune the model's hyperparameters. You can find information about hyperparameters common to all TPU-supported models on GitHub. Information about model-specific hyperparameters can be found in the source code for each model. For more information on hyperparameter tuning, see Overview of hyperparameter tuning, Using the Hyperparameter tuning service, and Tune hyperparameters.
Inference
Once you have trained your model, you can use it for inference (also called prediction). AI Platform is a cloud-based solution for developing, training, and deploying machine learning models. Once a model is deployed, you can use the AI Platform Prediction service.
- Explore the TPU tools in TensorBoard.