Running Distributed TensorFlow on Compute Engine

This tutorial shows how to use a distributed configuration of TensorFlow on multiple Compute Engine instances to train a convolutional neural network model on the MNIST dataset. The MNIST dataset contains labeled images of handwritten digits and is widely used in machine learning as a benchmark for image recognition.

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state of the art in ML and developers build and deploy ML-powered applications. TensorFlow is designed to run on multiple computers to distribute training workloads. In this tutorial, you run TensorFlow on multiple Compute Engine virtual machine (VM) instances to train the model. You can use AI Platform instead, which manages resource allocation tasks for you and can host your trained models. We recommend that you use AI Platform unless you have a specific reason not to. You can learn more in the version of this tutorial that uses AI Platform and Datalab.

The following diagram describes the architecture for running a distributed configuration of TensorFlow on Compute Engine, and using AI Platform with Datalab to execute predictions with your trained model.

Diagram of running TensorFlow on Compute Engine

This tutorial shows you how to set up and use this architecture, and explains some of the concepts along the way.

Objectives

  • Set up Compute Engine to create a cluster of VMs to run TensorFlow.
  • Learn how to run the distributed TensorFlow sample code on your Compute Engine cluster to train a model. The example code uses the latest TensorFlow libraries and patterns, so you can use it as a reference when designing your own training code.
  • Deploy the trained model to AI Platform to create a custom API for predictions and then execute predictions using a Datalab notebook.


The estimated price to run this tutorial, assuming you use every resource for an entire day, is approximately $1.20 based on this pricing calculator.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Compute Engine API, AI Platform Training and Prediction API, and Cloud Source Repositories APIs.

    Enable the APIs

Creating the template instance

This tutorial uses Cloud Shell, a fully functioning Linux shell in the Google Cloud Console.

  1. Go to Cloud Shell.

    Open Cloud Shell

  2. Set your default Compute Engine zone and your default project. Replace [YOUR_PROJECT_ID] with your Google Cloud project.

    gcloud config set compute/zone us-east1-c
    gcloud config set project [YOUR_PROJECT_ID]
  3. Clone the GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/cloudml-dist-mnist-example.git
    cd cloudml-dist-mnist-example
  4. Create the initial VM instance from an Ubuntu 16.04 LTS image:

    gcloud compute instances create template-instance \
    --image-project ubuntu-os-cloud \
    --image-family ubuntu-1604-lts \
    --boot-disk-size 10GB \
    --machine-type n1-standard-1
  5. Use ssh to connect to the VM:

    gcloud compute ssh template-instance
  6. Install pip:

    sudo apt-get update
    sudo apt-get -y upgrade
    sudo apt-get install -y python3-pip
  7. Install TensorFlow:

    sudo apt-get remove -y python3-setuptools
    sudo pip3 install setuptools==50.3.2
    sudo pip3 install tensorflow==1.14.0 \
      numpy==1.18.5 futures==2.2.0 h5py==2.10.0 gast==0.2.2
  8. Type exit to return to Cloud Shell.

Creating a Cloud Storage bucket

Next, create a Cloud Storage bucket to store your MNIST files. Follow these steps:

  1. Create a regional Cloud Storage bucket to hold the MNIST data files that are shared among the worker instances. Bucket names must be globally unique, so the following commands derive a name from your project ID:

    PROJECT_ID=$(gcloud config get-value project)
    MNIST_BUCKET="${PROJECT_ID}-ml"
    gsutil mb -c regional -l us-east1 gs://${MNIST_BUCKET}
  2. Use the following script to download the MNIST data files and copy them to the bucket:

    sudo ./scripts_p3/
    gsutil cp /tmp/data/train.tfrecords gs://${MNIST_BUCKET}/data/
    gsutil cp /tmp/data/test.tfrecords gs://${MNIST_BUCKET}/data/

Creating the template image and training instances

To create the worker, master, and parameter server instances, convert the template instance into an image, and then use the image to create each new instance.

  1. Turn off auto-delete for the template-instance VM, which preserves the disk when the VM is deleted:

    gcloud compute instances set-disk-auto-delete template-instance \
    --device-name persistent-disk-0 --no-auto-delete
  2. Delete template-instance:

    gcloud compute instances delete template-instance
  3. Create the image template-image from the template-instance disk:

    gcloud compute images create template-image \
    --source-disk template-instance
  4. Create the additional instances. For this tutorial, create four instances named master-0, worker-0, worker-1, and ps-0. The storage-rw scope allows the instances to access your Cloud Storage bucket. Be sure to delimit the instance names with spaces, as follows:

    gcloud compute instances create \
    master-0 worker-0 worker-1 ps-0 \
    --image template-image \
    --machine-type n1-standard-4 \
    --scopes storage-rw

The cluster is ready to run distributed TensorFlow.

Running the distributed TensorFlow code

In this section, you run a script to instruct all of your VM instances to run TensorFlow code to train the model.

  1. In Cloud Shell, run the following command from the cloudml-dist-mnist-example directory:

    ./scripts_p3/ gs://${MNIST_BUCKET}

    The script pushes the code to each VM and sends the parameters needed to start the TensorFlow process on each machine, creating the distributed cluster. The output stream in Cloud Shell shows the loss and accuracy values for the test dataset.

    Accuracy values in terminal

    When the training is done, the script prints out the location of the newly generated model files:

    Trained model is stored in gs://${MNIST_BUCKET}/job_[TIMESTAMP]/export/Servo/[JOB_ID]/
  2. Copy the location of your bucket path for use in later steps.
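Behind the scenes, the script assigns each instance a role in a TensorFlow cluster specification, typically serialized into the TF_CONFIG environment variable that TensorFlow reads at startup. The following sketch shows what such a specification might look like for the four VMs created earlier; the hostnames and port are placeholders, not the values the script actually generates:

```python
import json

# Hypothetical cluster spec mirroring the four VMs created earlier.
# The real script fills in each instance's address and port.
cluster = {
    "master": ["master-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"],
    "ps": ["ps-0:2222"],
}

def tf_config_for(task_type, task_index):
    """Build the TF_CONFIG value for one instance in the cluster."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# Each VM would export its own TF_CONFIG before starting training, e.g.:
print(tf_config_for("worker", 1))
```

Every instance receives the same cluster map but a different task entry, which is how each TensorFlow process knows whether it is the master, a worker, or a parameter server.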

Publishing your model for predictions

You've successfully generated a new model that you can use to make predictions. Training more-sophisticated models requires more-complex TensorFlow code, but the configuration of the compute and storage resources is similar.

Training the model is only half of the story. You need to plug your model into your application, or wrap an API service around it with authentication, and eventually make it all scale. There is a relatively significant amount of engineering work left to make your model useful.

AI Platform can help with some of this work. It provides a fully managed version of TensorFlow running on Google Cloud, giving you TensorFlow's powerful features without having to set up additional infrastructure or install software. You can automatically scale your distributed training to use as many CPUs or GPUs as you need, and you pay only for what you use.

Because AI Platform runs TensorFlow behind the scenes, all of your work is portable, and you aren't locked into a proprietary tool.

Try the tutorial Using distributed TensorFlow with Datalab to use the same example code to train your model by using AI Platform.

You can also deploy your model to AI Platform for predictions. Use the following steps to deploy your model to AI Platform. Deploying your model helps give you the ability to quickly test and apply your model at scale, with all the security and reliability features you would expect from a Google-managed service.

The following steps use the model bucket path that the training script output earlier.

  1. Recall the output path to your Cloud Storage bucket with generated models. It is in the following format, where [JOB_ID] is the job ID. You use this path in the next step:

    MODEL_BUCKET: gs://${MNIST_BUCKET}/job_[TIMESTAMP]/export/Servo/[JOB_ID]
  2. Define a new v1 version of your model by using the gcloud command-line tool, and point it to the model files in your bucket. The following commands can take several minutes to complete. Replace [YOUR_BUCKET_PATH] with the output path from the previous step. The path starts with gs://.

    MODEL=MNIST
    MODEL_BUCKET=[YOUR_BUCKET_PATH]
    gcloud ai-platform models create ${MODEL} --regions us-east1
    gcloud ai-platform versions create \
      --origin=${MODEL_BUCKET} --model=${MODEL} \
      --region=global --runtime-version=1.14 v1
  3. Set the default version of your model to v1:

    gcloud ai-platform versions set-default \
      --model=${MODEL} --region=global v1

The model is now running with AI Platform and able to process predictions. In the next section, you use Datalab to make and visualize predictions.
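An online prediction request to AI Platform is a JSON body containing a list of instances. The sketch below builds one for a 28×28 MNIST image; the input key name ("image") is an assumption for illustration and must match the serving signature of your exported model:

```python
import json

# A blank 28x28 grayscale image, flattened to 784 pixel values in [0, 1].
# A real request would carry the pixels of the digit you drew.
pixels = [0.0] * (28 * 28)

# "image" is a placeholder input name; check your model's serving signature.
request_body = json.dumps({"instances": [{"image": pixels}]})

print(len(json.loads(request_body)["instances"]))  # one instance per request
```

You can batch several images in one request by appending more entries to the instances list.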

Executing predictions with Datalab

To test your predictions, create a Datalab instance that uses interactive Jupyter Notebooks to execute code.

  1. In Cloud Shell, enter the following command to create a Datalab instance:

    datalab create mnist-datalab
  2. From Cloud Shell, launch the Datalab notebook listing page by clicking Cloud Shell Web preview (the square icon in the top right).

  3. Select Change port and select Port 8081 to launch a new tab in your browser.

  4. In the Datalab application, create a new notebook by clicking +Notebook in the upper right.

  5. Paste the following text into the first cell of the new notebook:

    cat Online\ prediction\ example.ipynb > Untitled\ Notebook.ipynb
  6. Click Run at the top of the page. The command copies the contents of the Online prediction example.ipynb notebook into the current notebook.

  7. Reload the browser page to load the new notebook content. Then select the first cell containing the JavaScript code and click Run to execute it.

  8. Scroll down the page until you see the number drawing panel, and draw a number with your cursor:

    The number 3 drawn with a cursor.

  9. Click in the next cell to activate it and then click on the down arrow next to the Run button at the top and select Run from this Cell.

    The output of the prediction is a length-10 array in which each index, 0-9, contains a number corresponding to that digit. The closer the number is to 1, the more likely that index matches the digit you entered. You can see that the number 3 slot highlighted in the list is very close to 1, and therefore has a high probability of matching the digit.


The last cell in the notebook displays a bar chart that clearly shows that it predicted your number, which was 3 in this case.

Bar chart shows number 3 was selected.
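The way the notebook reads a digit off the model's output can be sketched in a few lines. The score vector below is made up for illustration; a real response comes from the deployed model:

```python
# Hypothetical length-10 output: one score per digit, roughly summing to 1.
scores = [0.001, 0.002, 0.01, 0.95, 0.005, 0.012, 0.003, 0.008, 0.006, 0.003]

# The predicted digit is the index of the highest score.
predicted_digit = max(range(10), key=lambda i: scores[i])
print(predicted_digit)  # the slot closest to 1 wins; here, 3
```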

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. Delete the version of your model:

    gcloud ai-platform versions delete v1 --model=MNIST
  2. Delete the model:

    gcloud ai-platform models delete MNIST
  3. Delete your Cloud Storage bucket:

    gsutil rm -r gs://${MNIST_BUCKET}
  4. Delete your VMs, including the Datalab instance:

    gcloud compute instances delete master-0 worker-0 worker-1 ps-0 mnist-datalab
  5. Delete your VM template image:

    gcloud compute images delete template-image
  6. Delete your Datalab persistent disk:

    gcloud compute disks delete mnist-datalab-pd --zone us-east1-c

What's next