Cloud Bigtable for Streaming Data

Cloud Bigtable is a low-latency, scalable, wide-column storage system that can store and serve training data for your machine learning model. With Cloud Bigtable, you can stream your training data at very high speed, making efficient use of Cloud TPU.

This tutorial shows you how to train the TensorFlow ResNet-50 model on Cloud TPU using Cloud Bigtable to host your training data. The process uses TensorFlow's integration with Cloud Bigtable.
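
To give a sense of what that integration looks like, here is a minimal sketch, assuming the tf.contrib.cloud API that ships with the TensorFlow 1.x releases used in this tutorial. The identifiers are placeholders, and you do not need to write this code yourself; the ResNet-50 training script used later in the tutorial builds an equivalent pipeline from command-line flags.

    import tensorflow as tf  # TensorFlow 1.x; the Bigtable integration lives in tf.contrib

    # Placeholder identifiers: substitute your own project, instance, table, and
    # column names.
    client = tf.contrib.cloud.BigtableClient("YOUR-PROJECT-ID", "YOUR-INSTANCE-ID")
    table = client.table("imagenet-data")

    # Scan every row whose key starts with "train_", reading the cell stored in
    # column family "tfexample", qualifier "example". Each dataset element is a
    # (row_key, cell_value) pair; here the cell value is a serialized
    # tf.train.Example.
    dataset = table.parallel_scan_prefix(
        "train_", columns=[("tfexample", "example")])
    dataset = dataset.map(lambda row_key, value: value)

    # From this point the serialized examples can be parsed, shuffled, and
    # batched exactly as if they had been read from TFRecord files.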

Disclaimer

This tutorial uses a third-party dataset. Google provides no representation, warranty, or other guarantees about the validity or any other aspects of this dataset.

Requirements and limitations

Note the following when defining your configuration:

  • You must use TensorFlow 1.11 or above for Cloud Bigtable support.
  • Cloud Bigtable is recommended for high performance (pod-scale) training jobs on massive amounts of data, processing hundreds of gigabytes (GB) to hundreds of terabytes (TB) at tens-to-hundreds of gigabits per second (Gbps). We recommend that you use Cloud Storage for workloads that do not fit this description. Note: We also recommend Cloud Bigtable for use in reinforcement learning (RL) workloads where training data is generated on the fly.

About the model and the data

The model in this tutorial is based on Deep Residual Learning for Image Recognition, which introduced the residual network (ResNet) architecture. This tutorial uses the 50-layer variant known as ResNet-50.

The training dataset is ImageNet, which is a popular choice for training image recognition systems.

The tutorial uses TPUEstimator to train the model. TPUEstimator is based on tf.estimator, a high-level TensorFlow API, and is the recommended way to build and run a machine learning model on Cloud TPU. The API simplifies the model development process by hiding most of the low-level implementation, making it easier to switch between Cloud TPU and other platforms such as GPU or CPU.
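
As a rough illustration of the TPUEstimator pattern, the following is a simplified sketch under the TensorFlow 1.x contrib API. It is not the tutorial's ResNet-50 code: my_model_fn is a deliberately tiny linear model on random data, and the bucket path is a placeholder.

    import os
    import tensorflow as tf

    def my_model_fn(features, labels, mode, params):
      # A tiny linear model, just to show the TPUEstimatorSpec contract
      # (training only; the real ResNet-50 model_fn also handles evaluation).
      predictions = tf.layers.dense(features, 1)
      loss = tf.losses.mean_squared_error(labels, predictions)
      optimizer = tf.train.GradientDescentOptimizer(0.01)
      # CrossShardOptimizer aggregates gradients across the TPU cores.
      optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
      train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
      return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    def my_input_fn(params):
      # TPUEstimator supplies the per-call batch size in params.
      batch_size = params["batch_size"]
      features = tf.random_normal([1024, 8])
      labels = tf.random_normal([1024, 1])
      dataset = tf.data.Dataset.from_tensor_slices((features, labels))
      # TPUs need fixed shapes, so drop any short final batch.
      return dataset.repeat().batch(batch_size, drop_remainder=True)

    # ctpu passes the Cloud TPU name to the VM as the TPU_NAME environment variable.
    resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=os.environ["TPU_NAME"])

    run_config = tf.contrib.tpu.RunConfig(
        cluster=resolver,
        model_dir="gs://YOUR-BUCKET-NAME/tpuestimator-sketch",  # must be a gs:// path
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=my_model_fn,
        config=run_config,
        use_tpu=True,
        train_batch_size=1024)

    estimator.train(input_fn=my_input_fn, max_steps=1000)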

Before you begin

Before starting this tutorial, check that your Google Cloud Platform project is correctly set up.

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Google Cloud Platform project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your Google Cloud Platform project.

    Learn how to enable billing

  4. This walkthrough uses billable components of GCP. Check the Cloud TPU pricing page to estimate your costs, and follow the instructions to clean up resources when you've finished with them.

  5. Install the Cloud SDK including the gcloud command-line tool.

    Install the Cloud SDK

Create a Cloud Storage bucket

You need a Cloud Storage bucket to store the checkpoints and learned weights from training your model.

  1. Go to the Cloud Storage page on the GCP Console.

    Go to the Cloud Storage page

  2. Create a new bucket, specifying the following options:

    • A unique name of your choosing
    • Default storage class: Regional
    • Location: us-central1
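
If you prefer to script this step instead of using the Console, the following is a rough sketch using the google-cloud-storage Python client library. It is an illustration only, not part of the tutorial; it assumes a reasonably recent version of the library is installed and that your gcloud credentials are set up, and the names shown are placeholders.

    from google.cloud import storage

    # Placeholders: substitute your project ID and a globally unique bucket name.
    client = storage.Client(project="YOUR-PROJECT-ID")
    bucket = storage.Bucket(client, name="YOUR-BUCKET-NAME")
    bucket.storage_class = "REGIONAL"  # the tutorial's Regional class; newer API versions use "STANDARD"

    # Create the bucket in us-central1 so it is co-located with your Cloud TPU.
    client.create_bucket(bucket, location="us-central1")
    print("Created bucket:", bucket.name)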

Create a Cloud Bigtable instance

Create a Cloud Bigtable instance for streaming your training data:

  1. Go to the Cloud Bigtable page on the GCP Console.

    Go to the Cloud Bigtable page

  2. Create an instance, specifying the following options:

    • Instance name: A name to help you identify the instance.
    • Instance ID: A permanent identifier for the instance.
    • Instance type: Select Production for best performance.
    • Storage type: Select HDD.
    • Cluster ID: A permanent identifier for the cluster.
    • Region: Select a region. For example, us-central1. See the guide to available regions and zones for Cloud TPU.
    • Zone: Select a zone. For example, us-central1-b. See the guide to available regions and zones for Cloud TPU.

    You can leave the other values at their default settings. For details, see the guide to creating a Cloud Bigtable instance.
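
After the instance is created, you can optionally confirm that it is visible to your credentials. This is a quick sketch, not a tutorial step, using the google-cloud-bigtable Python client library; it assumes the library is installed (pip install google-cloud-bigtable), and the IDs are placeholders.

    from google.cloud import bigtable

    # Placeholders: substitute your project ID and the instance ID you just created.
    client = bigtable.Client(project="YOUR-PROJECT-ID", admin=True)
    instance = client.instance("YOUR-INSTANCE-ID")

    print("Instance exists:", instance.exists())

    # List every Cloud Bigtable instance visible in the project.
    instances, failed_locations = client.list_instances()
    for existing in instances:
        print(existing.instance_id)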

Download and set up the ctpu tool

This guide uses the Cloud TPU Provisioning Utility (ctpu) as a simple tool for setting up and managing your Cloud TPU. The guide assumes that you run ctpu and the Cloud SDK locally rather than in Cloud Shell, because the Cloud Shell environment is not suited to lengthy procedures such as downloading the ImageNet data; Cloud Shell times out after a period of inactivity.

  1. Follow the ctpu guide to download, install, and configure ctpu.

  2. Set the project, zone, and region for the gcloud tool's environment, replacing YOUR-PROJECT-ID with your GCP project ID:

    $ gcloud config set project YOUR-PROJECT-ID
    $ gcloud config set compute/region us-central1
    $ gcloud config set compute/zone us-central1-b
    
  3. Run the following command to check your ctpu configuration:

    $ ctpu print-config
    

    You should see a message like this:

    2018/04/29 05:23:03 WARNING: Setting zone to "us-central1-b"
    ctpu configuration:
            name: [your TPU's name]
            project: [your-project-id]
            zone: us-central1-b
    If you would like to change the configuration for a single command invocation, please use the command line flags.
    If you would like to change the configuration generally, see `gcloud config configurations`.
    

    In the output message, the name is the name of your Cloud TPU (defaults to your username) and zone is the default geographic zone (defaults to us-central1-b) for your Compute Engine resources. See the list of available regions and zones for Cloud TPU. You can change the name, zone, and other properties when you run ctpu up as described below.

Set access permissions for Cloud TPU

Set the following access permissions for the Cloud TPU service account:

  1. Give the Cloud TPU access to Cloud Bigtable for your GCP project. (Note that ctpu applies some access permissions by default, but not this one.)

    $ ctpu auth add-bigtable
    

    You can check the Cloud TPU permissions by running this command:

    $ ctpu auth list
    
  2. (Optional) Update the Cloud Storage permissions. The ctpu up command sets default permissions for Cloud Storage. If you want finer-grained permissions, review and update the access level permissions.

Create a Compute Engine VM and a Cloud TPU

Run the following command to set up a Compute Engine virtual machine (VM) and a Cloud TPU with associated services. This combination of resources and services is called a Cloud TPU flock:

$ ctpu up [optional: --name --zone]

You should see a message that follows this pattern:

ctpu will use the following configuration:

   Name: [your TPU's name]
   Zone: [your project's zone]
   GCP Project: [your project's name]
   TensorFlow Version: 1.13
   VM:
     Machine Type: [your machine type]
     Disk Size: [your disk size]
     Preemptible: [true or false]
   Cloud TPU:
     Size: [your TPU size]
     Preemptible: [true or false]

OK to create your Cloud TPU resources with the above configuration? [Yn]:

Enter y and press Enter to create your Cloud TPU resources.

The ctpu up command performs the following tasks:

  • Enables the Compute Engine and Cloud TPU services.
  • Creates a Compute Engine VM instance with the latest stable TensorFlow version pre-installed.
  • Creates a Cloud TPU with the corresponding version of TensorFlow, and passes the name of the Cloud TPU to the Compute Engine VM instance as an environment variable (TPU_NAME).
  • Ensures your Cloud TPU has access to resources it needs from your GCP project, by granting specific IAM roles to your Cloud TPU service account.
  • Performs a number of other checks.
  • Logs you in to your new Compute Engine VM instance.

You can run ctpu up as often as you like. For example, if you lose the SSH connection to the Compute Engine VM instance, run ctpu up to restore the connection. (If you changed the default values for --name and --zone, you must specify them again each time you run ctpu up.) See the ctpu documentation for details.

Verify that you're logged in to your VM instance

When the ctpu up command has finished executing, check that your shell prompt has changed from username@project to username@tpuname. This change shows that you are now logged in to your Compute Engine VM instance.

Set up some handy variables

In the following steps, a prefix of (vm)$ means you should run the command on the Compute Engine VM instance in your Cloud TPU flock.

Set up some environment variables to simplify the commands in this tutorial:

  1. Set up a PROJECT variable, replacing YOUR-PROJECT-ID with your GCP project ID:

    (vm)$ export PROJECT=YOUR-PROJECT-ID
    
  2. Set up a STORAGE_BUCKET variable, replacing YOUR-BUCKET-NAME with the name of your Cloud Storage bucket:

    (vm)$ export STORAGE_BUCKET=gs://YOUR-BUCKET-NAME
    
  3. Set up a BIGTABLE_INSTANCE variable, replacing YOUR-INSTANCE-ID with the ID of the Cloud Bigtable instance that you created earlier:

    (vm)$ export BIGTABLE_INSTANCE=YOUR-INSTANCE-ID
    

Prepare the data

The training application expects your training data to be accessible in Cloud Bigtable. This section shows you how to download the ImageNet data and convert it to TFRecord files, then upload the data to Cloud Bigtable.

Note that it can take a few hours to download and process the ImageNet dataset.

Alternatively, you can use a fake dataset for a quick test of the model training process.

(Optional) Use a fake dataset for testing

For a quick test, you can use a randomly generated fake dataset instead of the full ImageNet dataset. The fake dataset is only useful for understanding how to use a Cloud TPU and validating the end-to-end procedure. The resulting accuracy numbers and saved model are not meaningful, and the training process doesn't effectively test streaming of data from Cloud Bigtable.

The fake dataset is at this location on Cloud Storage:

gs://cloud-tpu-test-datasets/fake_imagenet

To use the fake dataset:

  1. Set the following environment variable:

    (vm)$ export TRAINING_DATA_DIR=gs://cloud-tpu-test-datasets/fake_imagenet
    
  2. Go directly to the section on copying your data to Cloud Bigtable.

Check your disk space

The following sections assume you want to use the full ImageNet dataset.

You need about 500GB of space available on your local machine or VM instance to run the script that downloads and converts the ImageNet data.

If you decide to process the data on your Compute Engine VM instance, follow these steps to add disk space to the VM instance:

  • Follow the Compute Engine guide to add a disk to your VM instance.
  • Set the disk size to 500GB or more.
  • Set When deleting instance to Delete disk to ensure that the disk is removed when you remove the VM instance.
  • Make a note of the path to your new disk. For example: /mnt/disks/mnt-dir.

Download and convert the ImageNet data

The instructions below assume you're processing the data on your Compute Engine VM instance.

Download the ImageNet data and convert it to TFRecord format:

  1. Sign up for an ImageNet account. Make a note of the username and password that you use to create the account.

  2. Set up a TRAINING_DATA_DIR variable to contain the data files created by the data-download script. The variable must specify a location on your local machine or on your Compute Engine VM instance. For example, the following location assumes you've mounted extra disk space on your Compute Engine VM instance:

    (vm)$ export TRAINING_DATA_DIR=/mnt/disks/mnt-dir/imagenet-data
    

    Alternatively, set a location in your $HOME directory:

    (vm)$ export TRAINING_DATA_DIR=$HOME/imagenet-data
    
  3. Download the imagenet_to_gcs.py script from GitHub:

    (vm)$ wget https://raw.githubusercontent.com/tensorflow/tpu/master/tools/datasets/imagenet_to_gcs.py
    
  4. Install the Cloud Storage command-line tool:

    (vm)$ pip install gcloud google-cloud-storage
    
  5. Run the imagenet_to_gcs.py script as shown below. Use the --nogcs_upload option to download and convert the files locally, without uploading them to Cloud Storage.

    (vm)$ python imagenet_to_gcs.py \
      --local_scratch_dir=${TRAINING_DATA_DIR} \
      --nogcs_upload \
      --imagenet_username=YOUR-IMAGENET-USERNAME \
      --imagenet_access_key=YOUR-IMAGENET-PASSWORD
    

    Note: Downloading and preprocessing the data can take many hours, depending on your network and computer speed. Do not interrupt the script.

    The script produces directories containing the training and validation images, following this pattern:

    • Training data: ${TRAINING_DATA_DIR}/train/n03062245/n03062245_4620.JPEG
    • Validation data: ${TRAINING_DATA_DIR}/validation/ILSVRC2012_val_00000001.JPEG
    • Validation labels: ${TRAINING_DATA_DIR}/synset_labels.txt

    The script also converts the images to TFRecord files (shards named like train-00000-of-01024 and validation-00000-of-00128). These shards are the files that the next section copies to Cloud Bigtable; a short sketch for inspecting one of them follows.
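
To confirm that the conversion produced readable TFRecord shards, you can inspect one record with a short TensorFlow 1.x snippet like the sketch below. The shard path is a placeholder; point it at one of the train-* files the script actually wrote under ${TRAINING_DATA_DIR}.

    import tensorflow as tf

    # Placeholder path: point this at one of the converted TFRecord shards.
    shard = "/mnt/disks/mnt-dir/imagenet-data/train-00000-of-01024"

    # Read the first serialized tf.train.Example and list its feature keys.
    for record in tf.python_io.tf_record_iterator(shard):
        example = tf.train.Example.FromString(record)
        print(sorted(example.features.feature.keys()))
        break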

Copy the data to Cloud Bigtable

The following script copies the training data from your local drive or VM instance to Cloud Bigtable for streaming to your training application. (If you chose to use the fake dataset, the script copies the data directly from Cloud Storage to Cloud Bigtable, as set in the TRAINING_DATA_DIR environment variable.) A sketch for spot-checking the copied rows appears after the steps below.

  1. Download the tfrecords_to_bigtable script from GitHub:

    (vm)$ wget https://raw.githubusercontent.com/tensorflow/tpu/master/tools/datasets/tfrecords_to_bigtable.py
    
  2. Install the cbt tool, which is a command-line interface for Cloud Bigtable:

    (vm)$ sudo apt install google-cloud-sdk-cbt
    
  3. Create a .cbtrc file in your home directory to store your default project settings for Cloud Bigtable:

    (vm)$ echo -e "project = $PROJECT\ninstance = $BIGTABLE_INSTANCE" > ~/.cbtrc
    
  4. Create a Cloud Bigtable table and a column family for your training data. The example below sets some environment variables and creates a Cloud Bigtable table named imagenet-data and a column family named tfexample:

    (vm)$ export TABLE_NAME=imagenet-data
    (vm)$ export FAMILY_NAME=tfexample
    (vm)$ export COLUMN_QUALIFIER=example
    (vm)$ export ROW_PREFIX_TRAIN=train_
    (vm)$ export ROW_PREFIX_EVAL=validation_
    (vm)$ cbt createtable $TABLE_NAME
    (vm)$ cbt createfamily $TABLE_NAME $FAMILY_NAME
    
  5. Run the tfrecords_to_bigtable.py script to copy the training data to Cloud Bigtable:

    (vm)$ pip install --upgrade google-cloud-bigtable
    (vm)$ python tfrecords_to_bigtable.py \
      --source_glob=$TRAINING_DATA_DIR/train* \
      --bigtable_instance=$BIGTABLE_INSTANCE \
      --bigtable_table=$TABLE_NAME \
      --column_family=$FAMILY_NAME \
      --column=$COLUMN_QUALIFIER \
      --row_prefix=$ROW_PREFIX_TRAIN
    

    The script may take many minutes to run. When it's finished, you should see output similar to the following example:

    Found 1024 files (from "gs://cloud-tpu-test-datasets/fake_imagenet/train-00000-of-01024" to "gs://cloud-tpu-test-datasets/fake_imagenet/train-01023-of-01024")
    --project was not set on the command line, attempting to infer it from the metadata service...
    Dataset ops created; about to create the session.
    2018-10-12 21:07:12.585287: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    Starting transfer...
    Complete!
    
  6. Run the tfrecords_to_bigtable.py script again, this time copying the evaluation data to Cloud Bigtable:

    (vm)$ python tfrecords_to_bigtable.py \
      --source_glob=$TRAINING_DATA_DIR/validation* \
      --bigtable_instance=$BIGTABLE_INSTANCE \
      --bigtable_table=$TABLE_NAME \
      --column_family=$FAMILY_NAME \
      --column=$COLUMN_QUALIFIER \
      --row_prefix=$ROW_PREFIX_EVAL
    

    Example output:

    Found 128 files (from "gs://cloud-tpu-test-datasets/fake_imagenet/validation-00000-of-00128" to "gs://cloud-tpu-test-datasets/fake_imagenet/validation-00127-of-00128")
    --project was not set on the command line, attempting to infer it from the metadata service...
    Dataset ops created; about to create the session.
    2018-10-12 22:21:56.891413: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    Starting transfer...
    Complete!
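
The copy script stores each training example as one Cloud Bigtable row under the row prefix you supplied, with the serialized record in the column family and qualifier you created above. As an optional sanity check, not a tutorial step, the following sketch reads a few rows back using the google-cloud-bigtable client library installed in an earlier step; the placeholder values correspond to the $PROJECT, $BIGTABLE_INSTANCE, and $TABLE_NAME variables set above.

    from google.cloud import bigtable

    # Placeholders corresponding to $PROJECT, $BIGTABLE_INSTANCE, and $TABLE_NAME.
    client = bigtable.Client(project="YOUR-PROJECT-ID")
    table = client.instance("YOUR-INSTANCE-ID").table("imagenet-data")

    # Print the key, column, and cell size of the first few rows.
    for row in table.read_rows(limit=5):
        for family, columns in row.cells.items():
            for qualifier, cells in columns.items():
                print(row.row_key, family, qualifier, len(cells[0].value), "bytes")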
    

(Optional) Set up TensorBoard

TensorBoard offers a suite of tools designed to present TensorFlow data visually. When used for monitoring, TensorBoard can help identify bottlenecks in processing and suggest ways to improve performance.

If you don't need to monitor the model's output at this time, you can skip the TensorBoard setup steps.

If you want to monitor the model's output and performance, follow the guide to setting up TensorBoard.

Run the ResNet-50 model

You are now ready to train and evaluate the ResNet-50 model on your Cloud TPU, streaming the training data from Cloud Bigtable. The training application writes out the trained model and intermediate checkpoints to Cloud Storage.

Run the following commands on your Compute Engine VM instance:

  1. Add the top-level /models folder to the Python path:

    (vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
    
  2. Go to the directory on your Compute Engine VM instance where the ResNet-50 model is pre-installed:

    (vm)$ cd /usr/share/tpu/models/official/resnet/
    
  3. Run the training script:

    (vm)$ python resnet_main.py \
      --tpu=$TPU_NAME \
      --model_dir=${STORAGE_BUCKET}/resnet \
      --bigtable_project=$PROJECT \
      --bigtable_instance=$BIGTABLE_INSTANCE \
      --bigtable_table=$TABLE_NAME \
      --bigtable_column_family=$FAMILY_NAME \
      --bigtable_column_qualifier=$COLUMN_QUALIFIER \
      --bigtable_train_prefix=$ROW_PREFIX_TRAIN \
      --bigtable_eval_prefix=$ROW_PREFIX_EVAL
    
    • --tpu specifies the name of the Cloud TPU. Note that ctpu passes this name to the Compute Engine VM instance as an environment variable (TPU_NAME).
    • --model_dir specifies the directory where checkpoints and summaries are stored during model training. If the folder is missing, the program creates one. When using a Cloud TPU, the model_dir must be a Cloud Storage path (gs://...). You can reuse an existing directory to load current checkpoint data and to store additional checkpoints.
    • --bigtable_project specifies the identifier of the GCP project for the Cloud Bigtable instance that holds your training data. If you do not supply this value, the program assumes that your Cloud TPU and your Cloud Bigtable instance are in the same GCP project.
    • --bigtable_instance specifies the ID of the Cloud Bigtable instance that holds your training data.
    • --bigtable_table specifies the name of the Cloud Bigtable table that holds your training data.
    • --bigtable_column_family specifies the name of the Cloud Bigtable family.
    • --bigtable_column_qualifier specifies the name of the Cloud Bigtable column qualifier. See the Cloud Bigtable overview for a description of the storage model.
    • --bigtable_train_prefix specifies the row key prefix for the training data rows.
    • --bigtable_eval_prefix specifies the row key prefix for the evaluation data rows.

What to expect

The above procedure trains the ResNet-50 model for 90 epochs and evaluates the model every 1,251 steps. For information on the default values that the model uses, and the flags you can use to change the defaults, see the code and README for the TensorFlow ResNet-50 model. With the default flags, the model should train to above 76% accuracy.
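
The 1,251-step evaluation interval corresponds to roughly one epoch under the model's defaults, assuming the standard ImageNet training set of about 1,281,167 images and the ResNet-50 default train batch size of 1,024 (both values are assumptions about the defaults; check the model's README):

    # Rough epoch arithmetic, assuming ~1,281,167 ImageNet training images and
    # the model's default train batch size of 1,024.
    num_train_images = 1281167
    train_batch_size = 1024

    steps_per_epoch = num_train_images // train_batch_size   # ~1,251 steps (the evaluation interval)
    total_steps = 90 * steps_per_epoch                        # ~112,590 steps for 90 epochs

    print(steps_per_epoch, total_steps)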

In our tests, Cloud Bigtable provides very high throughput performance for workloads such as ImageNet training, supporting scan throughput at hundreds of megabytes per second per node.

Clean up

To avoid incurring charges to your GCP account for the resources used in this topic:

  1. Disconnect from the Compute Engine VM:

    (vm)$ exit
    

    Your prompt should now be username@project, showing that you are no longer connected to the Compute Engine VM.

  2. Run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:

    $ ctpu delete [optional: --zone]
    
  3. Run ctpu status to make sure you have no instances allocated to avoid unnecessary charges for TPU usage. The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:

    2018/04/28 16:16:23 WARNING: Setting zone to "us-central1-b"
    No instances currently exist.
            Compute Engine VM:     --
            Cloud TPU:             --
    
  4. Run gsutil as shown, replacing YOUR-BUCKET-NAME with the name of the Cloud Storage bucket you created for this tutorial:

    $ gsutil rm -r gs://YOUR-BUCKET-NAME
    
  5. When you no longer need your training data in Cloud Bigtable, run the following command to delete your Cloud Bigtable instance. (The BIGTABLE_INSTANCE variable should represent the Cloud Bigtable instance ID that you used earlier.)

    $ cbt deleteinstance $BIGTABLE_INSTANCE
    

What's next
