Downloading, preprocessing, and uploading the COCO dataset

COCO is a large-scale object detection, segmentation, and captioning dataset. Machine learning models that use the COCO dataset include:

  • Mask RCNN
  • RetinaNet
  • ShapeMask

Before you can train a model on a Cloud TPU, you must prepare the training data. Because Cloud TPU charges begin as soon as the TPU is set up, the best practice is to set up the Compute Engine VM first, prepare the dataset, and only then set up the Cloud TPU.

This topic describes how to prepare the COCO dataset for models that run on Cloud TPU. The COCO dataset can only be prepared after you have created a Compute Engine VM. The script used to prepare the data, download_and_preprocess_coco.sh, is installed on the VM and must be run on the VM.

After preparing the data by running the download_and_preprocess_coco.sh script, you can bring up the Cloud TPU and run the training.
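
For example, if you created a Compute Engine VM named vm-name in zone us-central1-b (both are placeholder values; substitute your own), you can connect to it and confirm that the preprocessing script is installed with commands similar to the following:

$ gcloud compute ssh vm-name --zone=us-central1-b
(vm)$ ls /usr/share/tpu/tools/datasets/download_and_preprocess_coco.sh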

Prepare the COCO dataset

The COCO dataset is stored in your Cloud Storage bucket, so set two environment variables: one for the name of the bucket you created and one for the directory within the bucket that holds the dataset:

(vm)$ export STORAGE_BUCKET=gs://bucket-name
(vm)$ export DATA_DIR=${STORAGE_BUCKET}/coco
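
If you have not yet created the bucket, you can create one with gsutil before continuing. The location below is only an example; choose the same region where you plan to create your Cloud TPU:

(vm)$ gsutil mb -l us-central1 ${STORAGE_BUCKET}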

Run the download_and_preprocess_coco.sh script to convert the COCO dataset into a set of TFRecords (*.tfrecord) that the training application expects.

(vm)$ sudo bash /usr/share/tpu/tools/datasets/download_and_preprocess_coco.sh ./data/dir/coco

This installs the required libraries and then runs the preprocessing script. It outputs a number of *.tfrecord files in the local data directory you passed to the script (./data/dir/coco in this example). The COCO download and conversion script takes approximately one hour to complete.
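
Before uploading, you can spot-check the output. The exact file names and shard count depend on the version of the conversion script, so treat the following as a rough sanity check rather than an exact expected result:

(vm)$ ls ./data/dir/coco/*.tfrecord | wc -l
(vm)$ ls ./data/dir/coco/raw-data/annotations/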

Copy the data to your Cloud Storage bucket

After you convert the data into TFRecords, copy the TFRecord files from local storage to your Cloud Storage bucket using the gsutil command. You must also copy the annotation files, which are used to evaluate the model's performance.

(vm)$ gsutil -m cp ./data/dir/coco/*.tfrecord ${DATA_DIR}
(vm)$ gsutil cp ./data/dir/coco/raw-data/annotations/*.json ${DATA_DIR}
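
When the copy completes, you can verify that the files are in the bucket. Your listing will differ depending on the number of TFRecord shards produced:

(vm)$ gsutil ls ${DATA_DIR}
(vm)$ gsutil ls ${DATA_DIR}/*.json

The ${DATA_DIR} path is typically the value you point the training application at when you start training on the Cloud TPU.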