This tutorial shows you how to train the TensorFlow MnasNet model using a Cloud TPU device or Cloud TPU Pod slice (multiple TPU devices). You can apply the same pattern to other TPU-optimized image classification models that use TensorFlow and the ImageNet dataset.
Model description
The model in this tutorial is based on MnasNet: Platform-Aware Neural Architecture Search for Mobile, the paper that first introduced the AutoML mobile neural network (MnasNet) architecture. The tutorial uses the state-of-the-art variant, 'mnasnet-a1', and demonstrates training the model using TPUEstimator.
Special considerations when training on a Pod slice (v2-32/v3-32 and above)
If you plan to train on a TPU Pod slice, make sure you read this document, which explains the special considerations for training on a Pod slice.
Objectives
- Create a Cloud Storage bucket to hold your dataset and model output.
- Run the training job.
- Verify the output results.
Costs
This tutorial uses billable components of Google Cloud, including:
- Compute Engine
- Cloud TPU
- Cloud Storage
Use the pricing calculator to generate a cost estimate based on your projected usage.
New Google Cloud users might be eligible for a free trial.
Before you begin
Before starting this tutorial, check that your Google Cloud project is correctly set up.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.
Set up your resources
This section provides information on setting up the Cloud Storage bucket, VM, and Cloud TPU resources needed for this tutorial.
Open a Cloud Shell window.
Create a variable for your project's ID.
export PROJECT_ID=project-id
Configure the gcloud command-line tool to use the project where you want to create your Cloud TPU.

gcloud config set project ${PROJECT_ID}

The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make GCP API calls with your credentials.

Create a Service Account for the Cloud TPU project.
gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
The command returns a Cloud TPU Service Account in the following format:
service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
Create a Cloud Storage bucket using the following command. Replace bucket-name with a name for your bucket.
gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 -b on gs://bucket-name
This Cloud Storage bucket stores the data you use to train your model and the training results. The gcloud compute tpus execution-groups tool used in this tutorial sets up default permissions for the Cloud TPU Service Account. If you want finer-grain permissions, review the access level permissions.

The bucket location must be in the same region as your virtual machine (VM) and your TPU node. VMs and TPU nodes are located in specific zones, which are subdivisions within a region.
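Optionally, you can confirm that the bucket was created in the region you expect by inspecting its metadata:

gsutil ls -L -b gs://bucket-name

The output should include a Location constraint line, which reads EUROPE-WEST4 for the bucket created above.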
Launch the Compute Engine and Cloud TPU resources required for this tutorial using the gcloud compute tpus execution-groups command.

gcloud compute tpus execution-groups create \
 --vm-only \
 --name=mnasnet-tutorial \
 --zone=europe-west4-a \
 --disk-size=300 \
 --machine-type=n1-standard-8 \
 --tf-version=1.15.5
Command flag descriptions

vm-only
- Create the Compute Engine VM only; do not create a Cloud TPU.
name
- The name of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
disk-size
- The size of the hard disk in GB of the VM created by the gcloud command.
machine-type
- The machine type of the Compute Engine VM to create.
tf-version
- The version of TensorFlow gcloud installs on the VM.
For more information on the gcloud command, see the gcloud Reference.

When prompted, press y to create your Cloud TPU resources.
When the gcloud compute tpus execution-groups command has finished executing, verify that your shell prompt has changed from username@projectname to username@vm-name. This change shows that you are now logged into your Compute Engine VM. If you are not connected to the Compute Engine instance, you can do so by running the following command:

gcloud compute ssh mnasnet-tutorial --zone=europe-west4-a
From this point on, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance.

Create an environment variable for the storage bucket. Replace bucket-name with the name of your Cloud Storage bucket.
(vm)$ export STORAGE_BUCKET=gs://bucket-name
Create an environment variable for the model directory.
(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/mnasnet
Add the top-level /models folder to the Python path with the command:

(vm)$ export PYTHONPATH=${PYTHONPATH}:/usr/share/tpu/models:/usr/share/tpu/models/official/efficientnet
Locate the data
ImageNet is an image database. The images in the database are organized into a hierarchy, and each node of the hierarchy contains hundreds or thousands of images.
This tutorial uses a demonstration version of the full ImageNet dataset, referred to as fake_imagenet. This demonstration version allows you to test the tutorial, while reducing the storage and time requirements typically associated with running a model against the full ImageNet database.
The fake_imagenet dataset is at this location on Cloud Storage:
gs://cloud-tpu-test-datasets/fake_imagenet
Create an environment variable for the data directory.
(vm)$ export DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
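As a quick optional check, you can list a few of the dataset files to confirm the bucket is readable from your VM (the exact file names may vary):

(vm)$ gsutil ls ${DATA_DIR} | head -n 5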
The fake_imagenet dataset is only useful for understanding how to use a Cloud TPU and validating end-to-end performance. The accuracy numbers and saved model will not be meaningful.
For information on how to download and process the full ImageNet dataset, see Downloading, preprocessing, and uploading the ImageNet dataset.
Train the MnasNet model with fake_imagenet
The MnasNet TPU model is pre-installed on your Compute Engine VM in the following directory:
/usr/share/tpu/models/official/mnasnet/
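Optionally, you can list this directory to confirm the model code is present; the exact contents depend on the VM image version, but the directory should include the mnasnet_main.py script used below:

(vm)$ ls /usr/share/tpu/models/official/mnasnet/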
Run the following command to create your Cloud TPU.
(vm)$ gcloud compute tpus execution-groups create \
 --tpu-only \
 --accelerator-type=v3-8 \
 --name=mnasnet-tutorial \
 --zone=europe-west4-a \
 --tf-version=1.15.5
Command flag descriptions

tpu-only
- Create a Cloud TPU only. By default the gcloud command creates a VM and a Cloud TPU.
accelerator-type
- The type of the Cloud TPU to create.
name
- The name of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
tf-version
- The version of TensorFlow gcloud installs on the VM.
Create an environment variable for the Cloud TPU name.
(vm)$ export TPU_NAME=mnasnet-tutorial
Navigate to the model directory.
(vm)$ cd /usr/share/tpu/models/official/mnasnet/
Run the training script.
(vm)$ python3 mnasnet_main.py \
 --tpu=${TPU_NAME} \
 --data_dir=${DATA_DIR} \
 --model_dir=${MODEL_DIR} \
 --model_name="mnasnet-a1" \
 --skip_host_call=true \
 --train_steps=109474 \
 --train_batch_size=4096
Command flag descriptions

tpu
- The name of the Cloud TPU to run training or evaluation.
data_dir
- Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
model_dir
- The Cloud Storage path where checkpoints and summaries are stored during model training. You can reuse an existing folder to load previously generated checkpoints and to store additional checkpoints, as long as the previous checkpoints were created using a Cloud TPU of the same size and TensorFlow version.
model_name
- The name of the model to train. For example, mnasnet-a1.
skip_host_call
- Set to true to instruct the script to skip the host_call, which is executed every training step. The host_call is generally used for generating training summaries (training loss, learning rate, and so on). When skip_host_call=false, there can be a performance drop if the host_call function is slow and cannot keep up with the TPU-side computation.
train_steps
- The number of steps to use for training. The default value is 218,949, which is approximately 350 epochs at a batch size of 2048. This flag should be adjusted according to the train_batch_size value.
train_batch_size
- The training batch size.
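The train_steps value follows from simple arithmetic: steps = (number of training images × epochs) / batch size. As a sketch, assuming the standard ImageNet training split of 1,281,167 images (which fake_imagenet mimics), you can derive the value used in this tutorial:

# steps = images * epochs / batch_size
# 1,281,167 is the standard ImageNet training-split size (an assumption;
# adjust for your own dataset).
(vm)$ echo $(( 1281167 * 350 / 4096 ))
109474

If you change train_batch_size, scale train_steps accordingly to keep the number of epochs constant.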
For a single Cloud TPU device, this procedure trains the MnasNet model ('mnasnet-a1' variant) for 350 epochs and evaluates it after a fixed number of steps. Using the specified flags, the model should train in about 23 hours. With the real ImageNet data, these settings reproduce the state-of-the-art research result, and you can adjust them to tune the training speed.
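Because the job runs for many hours, you may want to check its progress periodically. One simple way is to list the checkpoints written to your model directory (run this from a second SSH session or from Cloud Shell, since training occupies your current shell):

(vm)$ gsutil ls ${MODEL_DIR}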
Scaling your model with Cloud TPU Pods
You can get results faster by scaling your model with Cloud TPU Pods. The fully supported MnasNet model can work with the following Pod slices:
- v2-32
- v2-128
- v2-256
- v2-512
- v3-32
- v3-128
- v3-256
- v3-512
- v3-1024
- v3-2048
When working with Cloud TPU Pods, you first train the model using a Pod, then use a single Cloud TPU device to evaluate the model.
Training with Cloud TPU Pods
Delete the Cloud TPU resource you created for training the model on a single device.
(vm)$ gcloud compute tpus execution-groups delete mnasnet-tutorial \
 --tpu-only \
 --zone=europe-west4-a
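You can optionally verify that the TPU was deleted and only the VM remains by listing your execution groups (this also works from Cloud Shell):

(vm)$ gcloud compute tpus execution-groups list --zone=europe-west4-a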
Create a new Cloud TPU Pod.

(vm)$ gcloud compute tpus execution-groups create --tpu-only \
 --accelerator-type=v2-32 \
 --zone=europe-west4-a \
 --name=mnasnet-tutorial \
 --tf-version=1.15.5
Command flag descriptions

tpu-only
- Create a Cloud TPU only. By default the gcloud command creates a VM and a Cloud TPU.
accelerator-type
- The type of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
name
- The name of the Cloud TPU to create.
tf-version
- The version of TensorFlow gcloud installs on the VM.
Update the TPU_NAME environment variable.

(vm)$ export TPU_NAME=mnasnet-tutorial
Update the MODEL_DIR environment variable so the Pod training run stores its output in a new directory.

(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/mnasnet-tutorial
Train the model by running the following script.
The script trains the model on the fake_imagenet dataset for 350 epochs. This takes approximately 90 minutes to run on a v3-128 Cloud TPU.
(vm)$ python3 mnasnet_main.py \
 --tpu=${TPU_NAME} \
 --data_dir=${DATA_DIR} \
 --model_dir=${MODEL_DIR} \
 --model_name="mnasnet-a1" \
 --skip_host_call=true \
 --train_steps=109474 \
 --train_batch_size=4096
Command flag descriptions

tpu
- The name of the Cloud TPU.
data_dir
- Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
model_dir
- The Cloud Storage path where checkpoints and summaries are stored during model training. You can reuse an existing folder to load previously generated checkpoints and to store additional checkpoints, as long as the previous checkpoints were created using a Cloud TPU of the same size and TensorFlow version.
model_name
- The name of the model to train. For example, mnasnet-a1.
skip_host_call
- Set to true to instruct the script to skip the host_call, which is executed every training step. The host_call is generally used for generating training summaries (training loss, learning rate, and so on). When skip_host_call=false, there can be a performance drop if the host_call function is slow and cannot keep up with the TPU-side computation.
train_steps
- The number of steps to use for training. The default value is 218,949 steps, which is approximately 350 epochs at a batch size of 2048. This flag should be adjusted according to the train_batch_size flag value.
train_batch_size
- The training batch size.
The procedure trains the MnasNet model ('mnasnet-a1' variant) on the fake_imagenet dataset to 350 epochs. It should finish in around 5 hours.
Evaluating the model
In this set of steps, you use Cloud TPU to evaluate the model trained above against the fake_imagenet validation data.
Delete the Cloud TPU resource you created to train the model on a Pod.
(vm)$ gcloud compute tpus execution-groups delete mnasnet-tutorial \
 --tpu-only \
 --zone=europe-west4-a
Start a v3-8 Cloud TPU. Use the same name that you used for the Compute Engine VM, which should still be running.
(vm)$ gcloud compute tpus execution-groups create --tpu-only \
 --accelerator-type=v3-8 \
 --zone=europe-west4-a \
 --name=mnasnet-tutorial \
 --tf-version=1.15.5
Command flag descriptions

tpu-only
- Create a Cloud TPU only. By default the gcloud command creates a VM and a Cloud TPU.
accelerator-type
- The type of the Cloud TPU to create.
zone
- The zone where you plan to create your Cloud TPU.
name
- The name of the Cloud TPU to create.
tf-version
- The version of TensorFlow gcloud installs on the VM.
Create an environment variable for your accelerator type.
(vm)$ export ACCELERATOR_TYPE=v3-8
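The evaluation command below selects a per-accelerator configuration file based on this variable. If you want to see which configurations are available, you can list them (this assumes you are still in the /usr/share/tpu/models/official/mnasnet/ directory):

(vm)$ ls configs/cloud/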
Run the model evaluation. This time, add the mode flag and set it to eval.

(vm)$ python3 mnasnet_main.py \
 --tpu=${TPU_NAME} \
 --data_dir=${DATA_DIR} \
 --model_dir=${MODEL_DIR} \
 --mode=eval \
 --config_file=configs/cloud/${ACCELERATOR_TYPE}.yaml
Command flag descriptions

tpu
- The name of the Cloud TPU.
data_dir
- Specifies the Cloud Storage path for training input. It is set to the fake_imagenet dataset in this example.
model_dir
- The Cloud Storage path where checkpoints and summaries are stored during model training. You can reuse an existing folder to load previously generated checkpoints and to store additional checkpoints, as long as the previous checkpoints were created using a Cloud TPU of the same size and TensorFlow version.
mode
- One of train_and_eval, train, or eval. train_and_eval trains and evaluates the model. train trains the model. eval evaluates the model.
config_file
- The configuration file used by the training/evaluation script.
This generates output similar to the following:
Eval results: {'loss': 7.532023, 'top_1_accuracy': 0.0010172526, 'global_step': 100, 'top_5_accuracy': 0.005065918}
Elapsed seconds: 88
Cleaning up
To avoid incurring charges to your GCP account for the resources used in this topic:
Disconnect from the Compute Engine VM:
(vm)$ exit
Your prompt should now be username@projectname, showing you are in the Cloud Shell.

Delete your Cloud TPU and Compute Engine resources.
$ gcloud compute tpus execution-groups delete mnasnet-tutorial \
 --zone=europe-west4-a
Verify the resources have been deleted by running gcloud compute tpus execution-groups list. The deletion might take several minutes. A response like the one below indicates your instances have been successfully deleted.

$ gcloud compute tpus execution-groups list \
 --zone=europe-west4-a
NAME STATUS
Delete your Cloud Storage bucket using gsutil as shown below. Replace bucket-name with the name of your Cloud Storage bucket.

$ gsutil rm -r gs://bucket-name
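As a final check, you can list the buckets remaining in your project; the deleted bucket should no longer appear. Replace ${PROJECT_ID} with your project ID if the variable is not set in this shell.

$ gsutil ls -p ${PROJECT_ID}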
What's next
In this tutorial you have trained the MnasNet model using a sample dataset. The results of this training are (in most cases) not usable for inference. To use a model for inference, you can train it on a publicly available dataset or your own dataset. Models trained on Cloud TPUs require datasets to be in TFRecord format.
You can use the dataset conversion tool sample to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to TFRecord format yourself. For more information, see TFRecord and tf.Example.
Hyperparameter tuning
To improve the model's performance with your dataset, you can tune the model's hyperparameters. You can find information about hyperparameters common to all TPU-supported models on GitHub. Information about model-specific hyperparameters can be found in the source code for each model. For more information on hyperparameter tuning, see Overview of hyperparameter tuning, Using the Hyperparameter tuning service, and Tune hyperparameters.
Inference
Once you have trained your model, you can use it for inference (also called prediction). AI Platform is a cloud-based solution for developing, training, and deploying machine learning models. Once a model is deployed, you can use the AI Platform Prediction service.
- Explore the TPU tools in TensorBoard.