Cloud TPUs with Shared VPC

Author(s): @bernieongewe, Published: 2020-10-30

Bernie Ongewe | Technical Solutions Engineer | Google

Contributed by Google employees.

This tutorial is an adaptation of the Cloud TPU quickstart.

As with the original quickstart, this tutorial introduces you to using Cloud TPUs to train a model on MNIST, a canonical dataset of handwritten digits that is often used to test new machine-learning approaches.

This adaptation is intended for users deploying under common networking constraints. This tutorial demonstrates the workflow for using AI Platform Notebooks to train models using Cloud TPUs in a Shared VPC environment with VPC Service Controls.

For a more detailed exploration of Cloud TPUs, work through the colabs and tutorials.

Differences from the original Cloud TPU quickstart

  • The ctpu up command used in the original Cloud TPU quickstart doesn't allow you to create a TPU in a subnet shared from a host project. This tutorial discusses some common production network considerations and uses the gcloud beta compute tpus create command to prepare the TPU. For more information, see Connecting TPUs with Shared VPC Networks.

  • This tutorial uses AI Platform Notebooks to launch the training job.

  • This tutorial discusses the VPC Service Controls (VPC-SC) changes needed to access the gs://tfds-data/ storage bucket from within a service perimeter.

Before you begin

Before starting this tutorial, check that your Google Cloud project is set up according to the instructions in Set up an account and a Cloud TPU project.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

Use the pricing calculator to generate a cost estimate based on your projected usage.

Networking requirements

Shared VPC requirements

Note: The setup in this section might already have been completed by your organization's Google Cloud administrator who provided you with access to an assigned Shared network.

  1. Become familiar with Shared VPC concepts.

  2. Enable a host project that contains one or more Shared VPC networks.

    This must be done by a Shared VPC administrator.

  3. Attach one or more service projects to your Shared VPC network.

    This must be done by a Shared VPC administrator.

  4. Create a VPC network in the host project.

    This is the network in which the TPU will be created.

VPC Service Controls requirements

The training instance requires access to the gs://tfds-data/ public Cloud Storage bucket. If the test network is within a service perimeter, your organization's administrator should implement rules that allow access to this public bucket.

An error such as the following during training indicates that this egress requirement has not been met:

tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 403 with body '{
  "error": {
    "code": 403,
    "message": "Request is prohibited by organization's policy. vpcServiceControlsUniqueIdentifier: 8e042d53afb67532",
    "errors": [
      {
        "message": "Request is prohibited by organization's policy. vpcServiceControlsUniqueIdentifier: 8e042d53afb67532",
        "domain": "global",
        "reason": "vpcServiceControls"
      }
    ]    
  }
}
'
         when reading metadata of gs://tfds-data/dataset_info/mnist/3.0.1
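
Each denial carries a vpcServiceControlsUniqueIdentifier, which an administrator can typically use to locate the matching audit log entry. The following is a minimal sketch that pulls those identifiers out of captured training output; `vpcsc_ids` and the `train.log` file name are hypothetical names introduced here, not part of the original tutorial.

```shell
# Print each distinct VPC Service Controls identifier found on stdin.
# Assumes the error text has the shape shown above.
vpcsc_ids() {
  grep -o 'vpcServiceControlsUniqueIdentifier: [A-Za-z0-9_-]*' \
    | awk '{print $2}' \
    | sort -u
}

# Example usage (train.log is a hypothetical capture of the training output):
#   python3 mnist_main.py ... 2>&1 | tee train.log
#   vpcsc_ids < train.log
```

The identifiers are deduplicated because the same ID appears in both the top-level message and the nested errors list.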

Set up the project and Cloud Storage bucket

This section configures the gcloud command-line tool for your project and creates the Cloud Storage bucket that the training job uses.

Important: Set up your Compute Engine VM, your Cloud TPU node, and your Cloud Storage bucket in the same region and zone to reduce network latency and network costs.

  1. Open Cloud Shell.

  2. Create a variable for your project ID:

    export PROJECT_ID=[YOUR_PROJECT_ID]
    
  3. Configure the gcloud command-line tool to use the project where you want to create Cloud TPUs:

    gcloud config set project $PROJECT_ID
    
  4. Create a Cloud Storage bucket:

    gsutil mb -p ${PROJECT_ID} -c standard -l us-central1 -b on gs://[YOUR_BUCKET_NAME]
    

    Replace [YOUR_BUCKET_NAME] with the name you want to assign to your bucket.

    This Cloud Storage bucket stores the data you use to train your model and the training results.
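
The bucket command above hard-codes us-central1. If your TPU zone is elsewhere, the bucket location can be derived from the zone name so that the two stay in the same region. This is a minimal sketch; `region_of_zone` and the `ZONE` variable are hypothetical names, and it assumes zone names follow the usual region-plus-letter form such as us-central1-b.

```shell
# Derive a region from a zone name by stripping the trailing "-<letter>" suffix.
region_of_zone() {
  printf '%s\n' "${1%-*}"
}

# Example usage (ZONE is an assumed variable holding your TPU zone):
#   ZONE=us-central1-b
#   gsutil mb -p "${PROJECT_ID}" -c standard -l "$(region_of_zone "$ZONE")" -b on gs://[YOUR_BUCKET_NAME]
```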

Set up VPC Peering

Review Connecting TPUs with Shared VPC networks.

Configure private services access

Private services access creates a VPC peering connection between your network and the Cloud TPU service network. Before you use TPUs with Shared VPC, you need to establish a private services access connection for the network.

  1. Get the project ID for your Shared VPC host project, and then configure gcloud to use that project:

    gcloud config set project [YOUR_NETWORK_HOST_PROJECT_ID]
    

    You can get the project ID from the Cloud Console.

  2. Enable the Service Networking API:

    gcloud services enable servicenetworking.googleapis.com
    

    You can also enable the Service Networking API with the Cloud Console.

  3. Allocate a reserved address range for use by Service Networking:

    gcloud compute addresses create SN-RANGE-1 --global \
    --addresses=10.110.0.0 \
    --prefix-length=16 \
    --purpose=VPC_PEERING \
    --network=[YOUR_HOST_NETWORK]
    

    The prefix-length must be 24 or less.

  4. Establish a private services access connection:

    gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --ranges=SN-RANGE-1 --network=[YOUR_HOST_NETWORK]
    
  5. Check whether a private services access connection has been established for the network:

    gcloud services vpc-peerings list --network=[YOUR_HOST_NETWORK]
    

    If a private services access connection has been established for the network, then you can start using TPUs with the Shared VPC.
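
The prefix-length constraint from step 3 can be checked before you reserve the range. The following is a minimal sketch; `peering_range_size` is a hypothetical helper name, and the lower bound of 1 is only a sanity check rather than a documented limit.

```shell
# Print how many IPv4 addresses a range with the given prefix length contains.
# Fails if the prefix length is not an integer or is larger than 24.
peering_range_size() {
  prs_prefix=$1
  case $prs_prefix in
    ''|*[!0-9]*)
      echo "prefix length must be a non-negative integer" >&2
      return 2
      ;;
  esac
  if [ "$prs_prefix" -lt 1 ] || [ "$prs_prefix" -gt 24 ]; then
    echo "prefix length must be 24 or less (got $prs_prefix)" >&2
    return 1
  fi
  # A /N range holds 2^(32-N) addresses.
  echo $(( 1 << (32 - prs_prefix) ))
}

# Example usage:
#   peering_range_size 16   # size of the 10.110.0.0/16 range reserved above
```

For the /16 used in this tutorial the range holds 65536 addresses; a /24 holds 256.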

Create TPUs

The TPU and notebook TensorFlow versions must be aligned. This tutorial uses version 2.3 for both.
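
One way to check the alignment is to compare only the major and minor components of the two version strings. This is a minimal sketch under that assumption; `major_minor` and `versions_aligned` are hypothetical helper names, and whether patch-level differences matter for your workload is not covered by this tutorial.

```shell
# Reduce a version string such as 2.3.1 to its major.minor part (2.3).
major_minor() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed when two version strings share the same major.minor part.
versions_aligned() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Example usage from the notebook terminal:
#   NOTEBOOK_TF=$(python3 -c 'import tensorflow as tf; print(tf.__version__)')
#   versions_aligned 2.3 "$NOTEBOOK_TF" || echo "TPU and notebook versions differ"
```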

Create a TPU:

gcloud beta compute tpus create tpu-quickstart --zone [YOUR_ZONE] --network [YOUR_HOST_NETWORK] --use-service-networking --version 2.3

The command above creates a TPU called tpu-quickstart.

The --use-service-networking flag enables the creation of TPUs that can connect to Shared VPC networks.

If you are using Shared VPC networks, the network field must include the host project ID or host project number and the network name following this pattern:

projects/my-host-project-id/global/networks/my-network
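
To avoid typing the full path by hand, it can be assembled from the host project and network name. This is a minimal sketch; `shared_vpc_network_path` is a hypothetical helper name introduced for illustration.

```shell
# Build the fully qualified network path expected by the --network flag
# when the network lives in a Shared VPC host project.
shared_vpc_network_path() {
  printf 'projects/%s/global/networks/%s\n' "$1" "$2"
}

# Example usage (placeholders as in the rest of this tutorial):
#   gcloud beta compute tpus create tpu-quickstart --zone [YOUR_ZONE] \
#     --network "$(shared_vpc_network_path [YOUR_NETWORK_HOST_PROJECT_ID] [YOUR_HOST_NETWORK])" \
#     --use-service-networking --version 2.3
```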

For information about what Google Cloud zones Cloud TPUs are available in, see TPU types and zones.

Get information about a TPU

You can get the details of a TPU node through TPU API requests:

gcloud compute tpus describe tpu-quickstart --zone [YOUR_ZONE]

The response body contains information about an instance of a TPU node, including the CIDR block.

You can list the TPUs in a zone through TPU API requests:

gcloud compute tpus list --zone [YOUR_ZONE]
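
When only one value from the describe response is needed, such as the CIDR block, the gcloud --format flag can select it. This is a minimal sketch; `tpu_field` is a hypothetical helper name, and the field names `cidrBlock` and `state` are assumptions that you should confirm against the output of the describe command above.

```shell
# Print a single field of a TPU node by wrapping the describe command.
# Arguments: TPU name, zone, field name.
tpu_field() {
  gcloud compute tpus describe "$1" --zone "$2" --format="value($3)"
}

# Example usage:
#   tpu_field tpu-quickstart [YOUR_ZONE] cidrBlock
#   tpu_field tpu-quickstart [YOUR_ZONE] state
```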

Create an AI Platform Notebooks instance with custom properties

  1. Go to the AI Platform Notebooks page in the Cloud Console.

  2. Click New instance, and then select Customize instance.

  3. On the New notebook instance page, provide the following information for your new instance:

    • Instance name: Provide a name for your new instance.
    • Region: Select the same region as the Cloud TPU previously created.
    • Zone: Select the same zone as the Cloud TPU previously created.
    • Environment: Select TensorFlow Enterprise 2.3.
    • Machine type: Select the number of CPUs and amount of RAM for your new instance. AI Platform Notebooks provides monthly cost estimates for each machine type that you select.
  4. Expand the Networking section.

  5. Select Networks shared with me.

  6. In the Network menu, select the shared network that you previously configured.

  7. In the Subnetwork menu, select the subnetwork previously configured.

  8. To grant access to a specific service account, click the Access to JupyterLab menu, select Other service account, and then fill out the Service account field.

    This service account must also be granted a Storage Object User role on the gs://[BUCKET_NAME] training bucket you previously created.

  9. Click Create.

AI Platform Notebooks creates a new instance based on your specified properties. An Open JupyterLab link becomes active when it's ready to use.

Start the training job

  1. Click the Open JupyterLab link when it becomes available, and then click the Terminal tile in the notebooks launcher dialog.

  2. From a terminal session, install the TensorFlow Model Optimization Toolkit:

    pip3 install tensorflow_model_optimization
    
  3. Change directory to the example code directory:

    cd ~/tutorials/models/official/vision/image_classification
    
  4. Update PYTHONPATH:

    export PYTHONPATH="$PYTHONPATH:/home/jupyter/tutorials/models"
    
  5. Set environment variables, replacing [BUCKET_NAME] with the name of your Cloud Storage bucket:

    export STORAGE_BUCKET=gs://[BUCKET_NAME]
    export TPU_NAME=tpu-quickstart
    export MODEL_DIR=$STORAGE_BUCKET/mnist
    export DATA_DIR=$STORAGE_BUCKET/data
    export PYTHONPATH="$PYTHONPATH:/usr/share/models"
    
  6. Train the model:

    python3 mnist_main.py \
      --tpu=$TPU_NAME \
      --model_dir=$MODEL_DIR \
      --data_dir=$DATA_DIR \
      --train_epochs=10 \
      --distribution_strategy=tpu \
      --download
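
A missing or unreplaced variable from step 5 tends to surface only after the training script has started. The following is a minimal sketch of a check to run in the same terminal before step 6; `check_training_env` is a hypothetical helper name, and the gs:// prefix test is an assumption based on the values set in this tutorial.

```shell
# Verify that the variables used by the training command are set,
# point at Cloud Storage, and no longer contain a [PLACEHOLDER].
check_training_env() {
  cte_status=0
  for cte_name in STORAGE_BUCKET MODEL_DIR DATA_DIR; do
    eval "cte_value=\${$cte_name:-}"
    case $cte_value in
      *\[*|*\]*)
        echo "$cte_name still contains a [PLACEHOLDER]: $cte_value" >&2
        cte_status=1
        ;;
      gs://?*) ;;
      *)
        echo "$cte_name is unset or is not a gs:// path" >&2
        cte_status=1
        ;;
    esac
  done
  if [ -z "${TPU_NAME:-}" ]; then
    echo "TPU_NAME is unset" >&2
    cte_status=1
  fi
  return "$cte_status"
}

# Example usage:
#   check_training_env && python3 mnist_main.py --tpu=$TPU_NAME ...
```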
    

Clean up

  1. Go to the AI Platform Notebooks instances page to delete the notebook instance.

  2. Delete the Cloud TPU:

    gcloud beta compute tpus delete tpu-quickstart --zone [YOUR_ZONE]
    

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies. Java is a registered trademark of Oracle and/or its affiliates.