Running Distributed TensorFlow on Compute Engine

This tutorial shows how to use a distributed configuration of TensorFlow on multiple Google Compute Engine instances to train a convolutional neural network model using the MNIST dataset. The MNIST dataset is a collection of labeled images of handwritten digits, and it is widely used in machine learning as a training set for image recognition.

TensorFlow is Google's open source library for machine learning, developed by researchers and engineers in Google's Machine Intelligence organization, which is part of Research at Google. TensorFlow is designed to run on multiple computers to distribute the training workloads, and Cloud Machine Learning Engine provides a managed service where you can run TensorFlow code in a distributed manner by using service APIs.

The following diagram describes the architecture for running a distributed configuration of TensorFlow on Compute Engine, and using Cloud ML Engine with Google Cloud Datalab to execute predictions with your trained model.

Architecture diagram: a distributed TensorFlow training cluster on Compute Engine, with the trained model deployed to Cloud ML Engine and queried from Cloud Datalab.

This tutorial shows you how to set up and use this architecture, and explains some of the concepts along the way.

Objectives

  • Set up Compute Engine to create a cluster of virtual machines (VMs) to run TensorFlow.
  • Learn how to run the distributed TensorFlow sample code on your Compute Engine cluster to train a model. The example code uses the latest TensorFlow libraries and patterns, so you can use it as a reference when designing your own training code.
  • Deploy the trained model to Cloud ML Engine to create a custom API for predictions and then execute predictions using a Cloud Datalab notebook.

Costs

The estimated price to run this tutorial, assuming you use every resource for an entire day, is approximately $1.20, based on the pricing calculator.

Before you begin

  1. Sign in to your Google account.

    If you don't already have one, sign up for a new account.

  2. Select or create a Cloud Platform project.

    Go to the Manage resources page

  3. Enable billing for your project.

    Enable billing

  4. Enable the Compute Engine and Cloud Machine Learning APIs.

    Enable the APIs

Creating the template instance

This tutorial uses Cloud Shell, a fully functioning Linux shell in the Google Cloud Platform Console.

  1. Go to Cloud Shell.

    Open Cloud Shell

  2. Set your default Compute Engine zone and your default project:

    gcloud config set compute/zone us-east1-c
    gcloud config set project [YOUR_PROJECT_ID]

  3. Clone the GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example
    cd cloudml-dist-mnist-example

  4. Create the initial VM instance from an Ubuntu 16.04 LTS (Xenial) image:

    gcloud compute instances create template-instance \
    --image-project ubuntu-os-cloud \
    --image-family ubuntu-1604-lts \
    --boot-disk-size 10GB \
    --machine-type n1-standard-1

  5. Use SSH to connect to the VM:

    gcloud compute ssh template-instance

  6. Install pip:

    sudo apt-get update
    sudo apt-get -y upgrade \
    && sudo apt-get install -y python-pip python-dev

  7. Install TensorFlow:

    sudo pip install tensorflow
    
  8. (Optional) Follow the steps to verify your installation.
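
    As a quick alternative check, you can run the classic TensorFlow 1.x smoke test in a Python interpreter on the VM (a minimal sketch; the linked verification steps are authoritative):

    # Quick smoke test for the TensorFlow installation (TensorFlow 1.x API).
    import tensorflow as tf

    hello = tf.constant('Hello, TensorFlow!')
    with tf.Session() as sess:
        print(sess.run(hello))  # Expected output: Hello, TensorFlow!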

  9. Type exit to return to Cloud Shell.

  10. Check the version of TensorFlow running in your Cloud Shell instance:

    python -c 'import tensorflow as tf; print(tf.__version__)'

  11. If the version is lower than 1.2.1, use pip to upgrade it:

    sudo pip install --upgrade tensorflow

Creating a Cloud Storage bucket

Next, create a Google Cloud Storage bucket to store your MNIST files. Follow these steps:

  1. Create a regional Cloud Storage bucket to hold the MNIST data files that are to be shared among the worker instances:

    BUCKET="mnist-$RANDOM-$RANDOM"
    gsutil mb -c regional -l us-east1 gs://${BUCKET}

  2. Use the following script to download the MNIST data files and copy them to the bucket:

    sudo ./scripts/create_records.py
    gsutil cp /tmp/data/train.tfrecords gs://${BUCKET}/data/
    gsutil cp /tmp/data/test.tfrecords gs://${BUCKET}/data/
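
For reference, the conversion that create_records.py performs looks roughly like the following simplified sketch (the script in the repository is authoritative). It reads the raw MNIST data and serializes each image and label into a tf.train.Example record:

# Simplified sketch of the MNIST-to-TFRecords conversion performed by
# scripts/create_records.py; see the repository for the real script.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Download MNIST as raw uint8 images with integer labels.
mnist = input_data.read_data_sets('/tmp/data', dtype=tf.uint8, reshape=False)

with tf.python_io.TFRecordWriter('/tmp/data/train.tfrecords') as writer:
    for image, label in zip(mnist.train.images, mnist.train.labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            'image_raw': _bytes_feature(image.tostring()),
            'label': _int64_feature(int(label)),
        }))
        writer.write(example.SerializeToString())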

Creating the template image and training instances

To create the master, worker, and parameter-server instances, convert the template instance's disk into an image, and then use that image to create each new instance.

  1. Turn off auto-delete for the template-instance VM so that its disk is preserved when the VM is deleted:

    gcloud compute instances set-disk-auto-delete template-instance \
    --disk template-instance --no-auto-delete

  2. Delete template-instance:

    gcloud compute instances delete template-instance

  3. Create the image template-image from the template-instance disk:

    gcloud compute images create template-image \
    --source-disk template-instance

  4. Create the additional instances. For this tutorial, create four instances named master-0, worker-0, worker-1, and ps-0. The storage-rw scope allows the instances to access your Cloud Storage bucket. Be sure to delimit the instance names with spaces, as follows:

    gcloud compute instances create \
    master-0 worker-0 worker-1 ps-0 \
    --image template-image \
    --machine-type n1-standard-4 \
    --scopes=default,storage-rw

The cluster is ready to run distributed TensorFlow.

Running the distributed TensorFlow code

In this section, you run a script to instruct all of your VM instances to run TensorFlow code to train the model.

In Cloud Shell, run the following command from the cloudml-dist-mnist-example directory:

./scripts/start-training.sh gs://${BUCKET}

The start-training.sh script pushes the code to each VM and passes the parameters that each machine needs to start its TensorFlow process, forming the distributed cluster. The output stream in Cloud Shell shows the loss and accuracy values for the test dataset.

Accuracy values in terminal
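
Under the hood, each VM joins the cluster through TensorFlow's distributed runtime. The start-training.sh script arranges this configuration for you; the underlying TensorFlow 1.x pattern looks roughly like the following sketch, in which the host names and ports are illustrative:

# Illustrative sketch of TensorFlow 1.x between-graph replication;
# start-training.sh arranges the equivalent setup on each VM.
import tensorflow as tf

# Every process shares the same cluster definition; the ports are
# arbitrary but must match across machines.
cluster = tf.train.ClusterSpec({
    'master': ['master-0:2222'],
    'worker': ['worker-0:2222', 'worker-1:2222'],
    'ps':     ['ps-0:2222'],
})

# Each VM starts one server identified by its job name and task index.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# A parameter-server process would only host variables and block with:
#     server.join()

# Worker processes place variables on the ps job automatically and run
# the training ops locally.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]), name='weights')
    # ... model, loss, and training ops go here ...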

When the training is done, the script prints out the location of the newly generated model files:

Trained model is stored in gs://${BUCKET}/job_[TIMESTAMP]/export/Servo/[JOB_ID]/

Copy the location of your bucket path for use in later steps.

Publishing your model for predictions

You've successfully generated a new model that you can use to make predictions. Training more sophisticated models requires more complex TensorFlow code, but the configuration of the compute and storage resources is similar.

Training the model is only half of the story. You need to plug your model into your application, or wrap an API service around it with authentication, and eventually make it all scale. A significant amount of engineering work remains to make your model useful.

Cloud ML Engine can help with some of this work. It is a fully managed service for running TensorFlow on Cloud Platform, which gives you all of the powerful features of TensorFlow without requiring you to set up additional infrastructure or install software. You can automatically scale your distributed training to use thousands of CPUs or GPUs, and you pay for only what you use.

Because Cloud ML Engine runs TensorFlow behind the scenes, all of your work is portable, and you aren't locked into a proprietary tool.

Try the tutorial Using Distributed TensorFlow with Cloud Datalab to use the same example code to train your model by using Cloud ML Engine.

You can also configure Cloud ML Engine to host your model for predictions. Hosting your model lets you quickly test and apply it at scale, with all the security and reliability you would expect from a Google-managed service. Use the following steps to publish your model to Cloud ML Engine.

The following steps use the model bucket path that was output earlier by the script named start-training.sh.

  1. Recall the output path to your Cloud Storage bucket with generated models. It is in the following format, where [JOB_ID] is the job ID. You will use this path in the next step:

    MODEL_BUCKET: gs://${BUCKET}/job_[TIMESTAMP]/export/Servo/[JOB_ID]

  2. Define a new v1 version of your model by using the gcloud command-line tool, and point it to the model files in your bucket. The following command could take several minutes to complete. Replace [YOUR_BUCKET_PATH] with the output path from the previous step. The path starts with gs://.

    MODEL="MNIST"
    MODEL_BUCKET=[YOUR_BUCKET_PATH]
    gcloud ml-engine models create ${MODEL} --regions us-east1
    gcloud ml-engine versions create \
     --origin=${MODEL_BUCKET} --model=${MODEL} v1

  3. Set the default version of your model to "v1":

    gcloud ml-engine versions set-default --model=${MODEL} v1

The model is now running on Cloud ML Engine and ready to process predictions. In the next section, you use Cloud Datalab to make and visualize predictions.
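
If you want to call the prediction service directly rather than through a notebook, a minimal sketch using the Google API Python client looks like the following. The project ID and instance encoding here are placeholders; the exact input format depends on the serving signature your training run exported:

# Minimal sketch of an online prediction request to Cloud ML Engine.
# Assumes the google-api-python-client library is installed and that
# application default credentials are available.
from googleapiclient import discovery

PROJECT = 'your-project-id'  # Placeholder: replace with your project ID.
service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT, 'MNIST')

# Placeholder instance: one flattened 28x28 image of floats. The key and
# shape must match the serving signature exported by the training code.
body = {'instances': [{'image': [0.0] * 784}]}

response = service.projects().predict(name=name, body=body).execute()
print(response['predictions'])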

Executing predictions with Cloud Datalab

To test your predictions, create a Cloud Datalab instance that uses interactive Jupyter Notebooks to execute code.

  1. In Cloud Shell, enter the following command to create a Cloud Datalab instance:

    datalab create mnist-datalab
    
  2. From Cloud Shell, launch the Cloud Datalab notebook listing page by clicking Web preview (the square icon at the top right of the Cloud Shell window).

  3. Select Change port, and then select Port 8081 to open Cloud Datalab in a new browser tab.

  4. In the Cloud Datalab application, create a new notebook by clicking on the +Notebook icon in the upper right.

  5. Paste the following text into the first cell of the new notebook:

    %%bash
    wget https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-dist-mnist-example/master/notebooks/Online%20prediction%20example.ipynb
    cat Online\ prediction\ example.ipynb > Untitled\ Notebook.ipynb

  6. Click Run at the top of the page to download the Online prediction example.ipynb notebook. The script copies the remote notebook's contents into the current notebook.

  7. Refresh the browser page to load the new notebook content. Then select the first cell containing the JavaScript code and click Run to execute it.

  8. Scroll down the page until you see the number drawing panel, and draw a number with your cursor:

    The number 3 drawn with a cursor.

  9. Click in the next cell to activate it and then click on the down arrow next to the Run button at the top and select Run from this Cell.

  10. The output of the prediction is a length-10 array in which each index, 0-9, contains a number corresponding to that digit. The closer the number is to 1, the more likely that index matches the digit you drew. You can see that the number 3 slot, highlighted in the list, is very close to 1 and therefore has a high probability of matching the digit.

    PROBABILITIES
    [4.181503356903704e-07,
    7.12400151314796e-07,
    0.00017898145597428083,
    0.9955494403839111,
    5.323939553103507e-11,
    0.004269002005457878,
    7.927398321116996e-11,
    1.2688398953741853e-07,
    1.0825967819982907e-06,
    2.2037748692582682e-07]

The last cell in the notebook displays a bar chart of the probabilities, clearly showing that the model predicted your number, 3 in this case.

Bar chart shows number 3 was selected.
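
To read the prediction off programmatically rather than from the chart, take the index of the largest probability. For example, with a probability vector like the one above:

# The predicted digit is the index of the highest probability.
probabilities = [4.2e-07, 7.1e-07, 1.8e-04, 0.9955, 5.3e-11,
                 4.3e-03, 7.9e-11, 1.3e-07, 1.1e-06, 2.2e-07]
predicted_digit = max(range(len(probabilities)), key=lambda i: probabilities[i])
print(predicted_digit)  # Prints: 3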

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, follow these steps:

  1. Delete the version of your model:

    gcloud ml-engine versions delete v1 --model=MNIST

  2. Delete the model:

    gcloud ml-engine models delete MNIST

  3. Delete your Cloud Storage bucket:

    gsutil rm -r gs://${BUCKET}

  4. Delete your virtual machines, including the Cloud Datalab instance:

    gcloud compute instances delete master-0 worker-0 worker-1 ps-0 mnist-datalab

  5. Delete your VM template image:

    gcloud compute images delete template-image

  6. Delete your Cloud Datalab persistent disk:

    gcloud compute disks delete mnist-datalab-pd --zone us-east1-c
