Running a machine learning (ML) model requires a Compute Engine VM and Cloud TPU resources. This page describes how to manage these resources using:
- The ctpu utility, which provides a CLI specifically designed for managing Cloud TPU resources.
- The gcloud command-line tool, which provides the primary CLI to Google Cloud Platform (GCP).
- The Cloud Console, which provides an integrated management console for your GCP resources.
Prerequisites
To run these procedures, you need to have a Google Cloud Platform (GCP) project set up. If you don't have a project, see Creating and managing projects to set one up.
Setting up a Compute Engine VM
ctpu
The ctpu utility can create the Compute Engine VM and Cloud TPU resources together or separately. In this procedure, ctpu is used to create only the Compute Engine VM.
- In the Cloud Shell, run the following command to create a Compute Engine VM:
$ ctpu up --vm-only --zone=zone [optional: --name --machine-type --disk-size-gb]
Parameter | Description |
---|---|
vm-only | Create only the Compute Engine VM. |
zone | The zone where you plan to create your Cloud TPU. For example, us-central1-b. |
name | Specify a name for the Compute Engine VM. If a name is not specified, the default is your username. |
machine-type | This is the type of machine to use for the VM. See machine types for the supported machine types. The default is n1-standard-1. |
disk-size-gb | The disk size in GB for the VM. For example, 300GB. If a size is not specified, it defaults to 20GB. |
See the ctpu Reference for all of the ctpu options.
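As a rough illustration of how the flags in the table combine, the sketch below assembles a `ctpu up --vm-only` command line and prints it for review instead of running it. The function name `build_ctpu_vm_command` and the example values are hypothetical; only the flags themselves come from this page.

```shell
# Sketch: build a `ctpu up --vm-only` command line from the flags in the table.
# Only the zone is required here; empty optional arguments are skipped.
build_ctpu_vm_command() (
  zone="$1"
  name="${2:-}"
  machine_type="${3:-}"
  disk_size_gb="${4:-}"

  cmd="ctpu up --vm-only --zone=${zone}"
  if [ -n "$name" ]; then cmd="$cmd --name=${name}"; fi
  if [ -n "$machine_type" ]; then cmd="$cmd --machine-type=${machine_type}"; fi
  if [ -n "$disk_size_gb" ]; then cmd="$cmd --disk-size-gb=${disk_size_gb}"; fi
  printf '%s\n' "$cmd"
)

# Placeholder values; review the printed command before running it yourself.
build_ctpu_vm_command us-central1-b demo-vm n1-standard-1 300
```

Printing the command first makes it easy to confirm the zone and name before any resources are created.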
gcloud commands
Use gcloud commands to interact with Google Cloud in the Cloud Shell.
- If you are not using the Cloud Shell as your command interface, set up the gcloud command-line tool by installing the Cloud SDK for your operating system.
- Configure gcloud to use your project:
$ gcloud config set project project-name
- Specify the zone where you plan to create your Compute Engine VM. For example, us-central1-b:
$ gcloud config set compute/zone zone
- Launch a Compute Engine virtual machine. Since you specified the zone in the previous command, the VM instance is created in that zone.
$ gcloud compute instances create vm-name \
    --machine-type=machine-type \
    --image-project=ml-images \
    --image-family=tensorflow-vm-image \
    --boot-disk-size=boot-disk-size \
    --scopes=cloud-platform
Parameter | Description |
---|---|
vm-name | Specify a name for the Compute Engine VM. |
machine-type | This is the type of machine to use for the VM. See machine types for the supported machine types. |
image-project | The project against which all image and image family references will be resolved. Use ml-images. |
image-family | This is the TensorFlow family of the image that the boot disk will be initialized with, for example, tf-1-15 or tf-2-1. |
boot-disk-size | The boot disk size in GB for the VM. For example, 300GB. If a size is not specified, it defaults to 20GB. |
scopes | Use cloud-platform. |
This will generate output similar to the following:
NAME         ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
demo-vm-tpu  us-central1-b  n1-standard-1               10.138.0.2   35.247.15.162  RUNNING
- Remotely connect to your Compute Engine VM:
$ gcloud compute ssh vm-name --zone=zone
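The gcloud steps above can be collected into one script. The following is a minimal sketch, assuming placeholder values for the project, zone, VM name, and image family; the `run` helper and the `DRY_RUN` variable are illustrative additions, not part of gcloud. By default the script only prints each command.

```shell
# Placeholder values; replace them with your own before executing anything.
PROJECT="my-project"          # hypothetical project name
ZONE="us-central1-b"
VM_NAME="demo-vm-tpu"
MACHINE_TYPE="n1-standard-1"
IMAGE_FAMILY="tf-2-1"
BOOT_DISK_SIZE="300GB"

# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Configure the project and zone, then create the VM with the documented flags.
create_vm() {
  run gcloud config set project "$PROJECT"
  run gcloud config set compute/zone "$ZONE"
  run gcloud compute instances create "$VM_NAME" \
    --machine-type="$MACHINE_TYPE" \
    --image-project=ml-images \
    --image-family="$IMAGE_FAMILY" \
    --boot-disk-size="$BOOT_DISK_SIZE" \
    --scopes=cloud-platform
}

create_vm
```

Keeping the values in variables at the top makes it harder to create the VM in one zone and later look for it in another.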
Console
From the Google Cloud Console, create your VM and establish remote access to it.
- Select Compute Engine > VM instances from the left-hand navigation bar and click CREATE INSTANCE.
- From the top menu bar on the Create an instance page, select Create an instance and specify an instance name, the region, and a machine type.
- Click the Boot Disk > Change button. This will bring up a new window where you can choose your VM operating system and version.
- In the Operating system pull-down menu, select Deep Learning on Linux.
- Under Version, select the version of Debian GNU/Linux with the TensorFlow version you want to run, for example, TF2-1.
- At the bottom of the Boot Disk page, click Select to complete the Operating System and Version selection.
- Under Identity and API access > Access scopes, select Allow full access to all Cloud APIs.
- Go to Compute Engine > VM instances. Find the instance with your VM name, and click SSH to connect to it.
Parameter | Description |
---|---|
name | Specifies the name of the Compute Engine VM. You can specify any instance name, but use the same one for both the VM instance and the Cloud TPU. |
region | If you have set up a Cloud Storage bucket, this should match the Location of the bucket. |
machine type | Specifies the machine type to use for your Compute Engine VM. Select a machine type from the drop-down menu. |
Setting up a Cloud TPU
Set up your Compute Engine VM using the VM setup procedure before setting up your Cloud TPU. You can allocate and start your TPU resources using the ctpu utility, gcloud commands, or the Cloud Console.
ctpu
Run the following command in the Cloud Shell to create your Cloud TPU.
$ ctpu up --tpu-only --name=tpu-name --zone=zone [optional: --tpu-size]
Parameter | Description |
---|---|
name | Specifies the name of the Cloud TPU. Use the same name as you used for the Compute Engine VM. |
zone | The zone where you plan to create your Cloud TPU. This should be the same zone you used for the Compute Engine VM. For example, us-central1-b. |
tpu-size | This is the TPU type to use. The default is v2-8. See Types and zones for the supported TPU types and zones. |
gcloud commands
The Cloud SDK is a set of tools that you can use to interact with Google Cloud in the Cloud Shell.
- Install the gcloud command-line tool via the Cloud SDK.
- Use the gcloud command-line tool to specify your Google Cloud project:
$ gcloud config set project project-name
- Specify the zone where you plan to create your Cloud TPU resource. This should be the same zone you used for the Compute Engine VM. For example, us-central1-b:
$ gcloud config set compute/zone zone
Cloud TPU is available in the following zones:
US
TPU type (v2) | TPU v2 cores | Total TPU memory | Region/Zone |
---|---|---|---|
v2-8 | 8 | 64 GiB | us-central1-b, us-central1-c, us-central1-f |
v2-32 | 32 | 256 GiB | us-central1-a |
v2-128 | 128 | 1 TiB | us-central1-a |
v2-256 | 256 | 2 TiB | us-central1-a |
v2-512 | 512 | 4 TiB | us-central1-a |
TPU type (v3) | TPU v3 cores | Total TPU memory | Available zones |
---|---|---|---|
v3-8 | 8 | 128 GiB | us-central1-a, us-central1-b, us-central1-f |
Europe
TPU type (v2) | TPU v2 cores | Total TPU memory | Region/Zone |
---|---|---|---|
v2-8 | 8 | 64 GiB | europe-west4-a |
v2-32 | 32 | 256 GiB | europe-west4-a |
v2-128 | 128 | 1 TiB | europe-west4-a |
v2-256 | 256 | 2 TiB | europe-west4-a |
v2-512 | 512 | 4 TiB | europe-west4-a |
TPU type (v3) | TPU v3 cores | Total TPU memory | Available zones |
---|---|---|---|
v3-8 | 8 | 128 GiB | europe-west4-a |
v3-32 | 32 | 512 GiB | europe-west4-a |
v3-64 | 64 | 1 TiB | europe-west4-a |
v3-128 | 128 | 2 TiB | europe-west4-a |
v3-256 | 256 | 4 TiB | europe-west4-a |
v3-512 | 512 | 8 TiB | europe-west4-a |
v3-1024 | 1024 | 16 TiB | europe-west4-a |
v3-2048 | 2048 | 32 TiB | europe-west4-a |
Asia Pacific
TPU type (v2) | TPU v2 cores | Total TPU memory | Region/Zone |
---|---|---|---|
v2-8 | 8 | 64 GiB | asia-east1-c |
- Create a new Cloud TPU resource. Since you specified the zone in the previous command, the Cloud TPU is created in that zone.
$ gcloud compute tpus create tpu-name \
    --network=network-id \
    --accelerator-type=tpu-type \
    --version=2.1
Parameter | Description |
---|---|
tpu-name | Specifies the name of the Cloud TPU. Use the same name you used for the Compute Engine VM name. |
network | If you know your network ID, use that; otherwise enter default. |
accelerator-type | This is your TPU type. See TPU types for the supported TPU types for your zone. |
version | The TensorFlow version to use with your Cloud TPU. This must be the same version of TensorFlow used to write your model. |
This will generate output similar to the following:
NAME         ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT  NETWORK  RANGE          STATUS
demo-vm-tpu  us-central1-b  v2-8              10.240.1.2:8470   default  10.240.1.0/29  READY
- Remotely connect to your Compute Engine VM:
$ gcloud compute ssh vm-name
- Create an environment variable containing the name of your TPU:
$ export TPU_NAME=tpu-name
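Because TPU types are tied to zones, it can help to check a type against the zone before running the create command. The sketch below hard-codes the type and zone pairs from the availability tables on this page; the function names `tpu_type_in_zone` and `report` are hypothetical, and the data should be treated as a snapshot that may be out of date.

```shell
# Return success if the TPU type ($1) is listed for the zone ($2)
# in the availability tables on this page.
tpu_type_in_zone() {
  case "$2:$1" in
    us-central1-b:v2-8|us-central1-c:v2-8|us-central1-f:v2-8) return 0 ;;
    us-central1-a:v2-32|us-central1-a:v2-128|us-central1-a:v2-256|us-central1-a:v2-512) return 0 ;;
    us-central1-a:v3-8|us-central1-b:v3-8|us-central1-f:v3-8) return 0 ;;
    europe-west4-a:v2-8|europe-west4-a:v2-32|europe-west4-a:v2-128|europe-west4-a:v2-256|europe-west4-a:v2-512) return 0 ;;
    europe-west4-a:v3-8|europe-west4-a:v3-32|europe-west4-a:v3-64|europe-west4-a:v3-128) return 0 ;;
    europe-west4-a:v3-256|europe-west4-a:v3-512|europe-west4-a:v3-1024|europe-west4-a:v3-2048) return 0 ;;
    asia-east1-c:v2-8) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a one-line verdict for a type and zone pair.
report() {
  if tpu_type_in_zone "$1" "$2"; then
    echo "$1 is listed for $2"
  else
    echo "$1 is not listed for $2"
  fi
}

report v3-8 us-central1-b
report v2-32 us-central1-b
```

With the tables as given, the first call reports that v3-8 is listed for us-central1-b and the second that v2-32 is not, since v2-32 appears only under us-central1-a and europe-west4-a.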
Console
Create, start, and connect to your Cloud TPU.
- Go to Compute Engine > TPUs on the left-hand navigation bar and click CREATE TPU NODE.
- On the Create a Cloud TPU page, use the pull-down menus to specify the TPU name, the zone, TPU type, TPU software version, and network.
- Go to Compute Engine > VM instances. Find the instance with your VM name, and click SSH to connect to it.
- From your VM, create an environment variable containing the name of your TPU:
$ export TPU_NAME=tpu-name
Parameter | Description |
---|---|
name | Specifies the name of the Cloud TPU. Use the same name you used for the Compute Engine VM name. |
zone | The zone where you plan to create your Cloud TPU. For example, us-central1-b. |
TPU type | This is your TPU type. See TPU types for the supported TPU types for your zone. |
TPU software version | The TensorFlow or PyTorch version to use with your Cloud TPU. This must be the same version of TensorFlow or PyTorch used to create your model. |
network-ID or default | If you know your network ID, use that; otherwise enter default. |
Setting up a Compute Engine VM and Cloud TPU resources
You can allocate and start your VM and TPU resources using the ctpu utility, gcloud commands, or the Cloud Console.
ctpu
Run the following command in the Cloud Shell. The ctpu utility creates the Compute Engine VM and Cloud TPU resources together and gives them the same name.
$ ctpu up [optional: --name --zone --tpu-size --machine-type --disk-size-gb]
Parameter | Description |
---|---|
name | Specifies the name for both the Compute Engine VM and the Cloud TPU. |
zone | The zone where you plan to create your Cloud TPU. For example, us-central1-b. |
tpu-size | This is the TPU type to use. The default is v2-8. See Types and zones for the supported TPU types and zones. |
machine-type | This is the type of machine to use for the VM. See machine types for the supported machine types. |
disk-size-gb | The disk size in GB for the VM. For example, 300GB. If a size is not specified, it defaults to 20GB. |
gcloud commands
The Cloud SDK is a set of tools that you can use to interact with Google Cloud in the Cloud Shell.
- Install the gcloud command-line tool via the Cloud SDK.
- Use the gcloud command-line tool to specify your Google Cloud project:
$ gcloud config set project project-name
- Specify the zone where you plan to create your Compute Engine VM and Cloud TPU resource. For this example, use the us-central1-b zone:
$ gcloud config set compute/zone zone
Cloud TPU is available in the following zones:
US
TPU type (v2) | TPU v2 cores | Total TPU memory | Region/Zone |
---|---|---|---|
v2-8 | 8 | 64 GiB | us-central1-b, us-central1-c, us-central1-f |
v2-32 | 32 | 256 GiB | us-central1-a |
v2-128 | 128 | 1 TiB | us-central1-a |
v2-256 | 256 | 2 TiB | us-central1-a |
v2-512 | 512 | 4 TiB | us-central1-a |
TPU type (v3) | TPU v3 cores | Total TPU memory | Available zones |
---|---|---|---|
v3-8 | 8 | 128 GiB | us-central1-a, us-central1-b, us-central1-f |
Europe
TPU type (v2) | TPU v2 cores | Total TPU memory | Region/Zone |
---|---|---|---|
v2-8 | 8 | 64 GiB | europe-west4-a |
v2-32 | 32 | 256 GiB | europe-west4-a |
v2-128 | 128 | 1 TiB | europe-west4-a |
v2-256 | 256 | 2 TiB | europe-west4-a |
v2-512 | 512 | 4 TiB | europe-west4-a |
TPU type (v3) | TPU v3 cores | Total TPU memory | Available zones |
---|---|---|---|
v3-8 | 8 | 128 GiB | europe-west4-a |
v3-32 | 32 | 512 GiB | europe-west4-a |
v3-64 | 64 | 1 TiB | europe-west4-a |
v3-128 | 128 | 2 TiB | europe-west4-a |
v3-256 | 256 | 4 TiB | europe-west4-a |
v3-512 | 512 | 8 TiB | europe-west4-a |
v3-1024 | 1024 | 16 TiB | europe-west4-a |
v3-2048 | 2048 | 32 TiB | europe-west4-a |
Asia Pacific
TPU type (v2) | TPU v2 cores | Total TPU memory | Region/Zone |
---|---|---|---|
v2-8 | 8 | 64 GiB | asia-east1-c |
- Create a Compute Engine VM to interact with your Cloud TPU. Since you specified the zone in the previous command, the VM instance is created in that zone.
$ gcloud compute instances create vm-name \
    --machine-type=n1-standard-1 \
    --image-project=ml-images \
    --image-family=tensorflow-vm-image \
    --boot-disk-size=boot-disk-size \
    --scopes=cloud-platform
Parameter | Description |
---|---|
vm-name | Specify a name for the Compute Engine VM. |
machine-type | This is the type of machine to use for the VM. See machine types for the supported machine types. |
image-project | The project against which all image and image family references will be resolved. Use ml-images. |
image-family | This is the TensorFlow family of the image that the boot disk will be initialized with, for example, tf-1-15 or tf-2-1. |
boot-disk-size | The disk size in GB for the VM. For example, 300GB. If a size is not specified, it defaults to 20GB. |
scopes | Use cloud-platform. |
This will generate output similar to the following:
NAME         ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
demo-vm-tpu  us-central1-b  n1-standard-1               10.138.0.2   35.247.15.162  RUNNING
- Create a new Cloud TPU resource.
$ gcloud compute tpus create tpu-name \
    --zone=zone \
    --network=network-id \
    --accelerator-type=tpu-type \
    --version=2.1
Parameter | Description |
---|---|
tpu-name | Specifies the name of the Cloud TPU. Use the same name you used for the Compute Engine VM name. |
zone | The zone where you plan to create your Cloud TPU. For example, us-central1-b. |
network | If you know your network ID, use that; otherwise enter default. |
accelerator-type | This is your TPU type. See TPU types for the supported TPU types for your zone. |
version | The TensorFlow version to use with your Cloud TPU. This must be the same version of TensorFlow used to create your model. |
This will generate output similar to the following:
NAME         ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT  NETWORK  RANGE          STATUS
demo-vm-tpu  us-central1-b  v2-8              10.240.1.2:8470   default  10.240.1.0/29  READY
- Remotely connect to your Compute Engine VM:
$ gcloud compute ssh vm-name --zone=zone
- Create an environment variable containing the name of your TPU:
$ export TPU_NAME=tpu-name
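The order of the combined procedure can be summarized as a printed plan. This is a sketch only: `print_setup_plan` is a hypothetical helper, the machine type, image family, disk size, and network values are placeholders taken from the examples on this page, and nothing is executed. It assumes the VM and the Cloud TPU share one name, as recommended above.

```shell
# Print the gcloud commands for the combined setup, in order, for review.
# Arguments: shared VM/TPU name, zone, TPU type.
print_setup_plan() (
  name="$1"
  zone="$2"
  tpu_type="$3"
  cat <<EOF
gcloud config set compute/zone $zone
gcloud compute instances create $name --machine-type=n1-standard-1 --image-project=ml-images --image-family=tf-2-1 --boot-disk-size=300GB --scopes=cloud-platform
gcloud compute tpus create $name --zone=$zone --network=default --accelerator-type=$tpu_type --version=2.1
gcloud compute ssh $name --zone=$zone
EOF
)

print_setup_plan demo-vm-tpu us-central1-b v2-8
```

After connecting over SSH, the remaining step is to export `TPU_NAME` on the VM with the same shared name.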
Console
- Create and start your VM.
  - Go to Compute Engine > VM instances on the left-hand navigation bar and click CREATE INSTANCE.
  - On the Create an instance page, specify an instance name, the region, and the machine type.
Parameter | Description |
---|---|
name | Specifies the name of the Compute Engine VM. You can specify any instance name, but use the same one for both the VM instance and the Cloud TPU. |
region | If you have set up a Cloud Storage bucket, this should match the Location of the bucket. |
machine type | Specifies the machine type to use for your Compute Engine VM. Select a machine type from the drop-down menu. |
- Create, start, and connect to your Cloud TPU.
  - Go to Compute Engine > TPUs on the left-hand navigation bar and click CREATE TPU NODE.
  - On the Create a Cloud TPU page, use the pull-down menus to specify the TPU name, the zone, TPU type, TPU software version, network, and an internal IP address for the Cloud TPU to use.
  - Go to Compute Engine > VM instances. Find the instance with your VM name, and click SSH to connect to it.
Parameter | Description |
---|---|
name | Specifies the name of the Cloud TPU. Use the same name you used for the Compute Engine VM name. |
zone | The zone where you plan to create your Cloud TPU. For example, us-central1-b. |
TPU type | This is your TPU type. See TPU types for the supported TPU types for your zone. |
TPU software version | The TensorFlow or PyTorch version to use with your Cloud TPU. This must be the same version of TensorFlow or PyTorch used to create your model. |
network-ID or default | If you know your network ID, use that; otherwise enter default. |
Stopping your Cloud TPU resources
Charges for Cloud TPU resources begin when the Cloud TPU starts, even if it is not yet actively training a model. To avoid being charged while the Cloud TPU is inactive, you can stop it and restart it when you are ready to train a model.
This section shows how to stop the Cloud TPU using the ctpu utility, gcloud commands, or the Cloud Console.
ctpu
- Run the ctpu status command, specifying the zone where your Cloud TPU is set up.
$ ctpu status --zone=zone
This will display the status of Compute Engine VM and Cloud TPU resources within the zone.
Your cluster is running!
Compute Engine VM: RUNNING
Cloud TPU: RUNNING
- If the Cloud TPU resource is running, use the following command to stop it.
$ ctpu pause --zone=zone
This stops the Compute Engine VM and deletes the Cloud TPU resources in the specified zone. To restart only the Compute Engine VM, run the following command, including any other flags you need for the VM:
$ ctpu up --vm-only --zone=zone
To restart the Cloud TPU, run ctpu up with the flags you set when you first started the Cloud TPU.
gcloud
- Run the following command in the Cloud Shell to list the available Cloud TPU resources in your zone.
$ gcloud compute tpus list --zone=zone
The Cloud TPU resource in your zone is displayed:
NAME      ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT  NETWORK  RANGE          STATUS
demo-tpu  us-central1-b  v2-8              10.240.1.2:8470   default  10.240.1.0/29  READY
- Run the following command to stop the Cloud TPU:
$ gcloud compute tpus stop tpu-name --zone=zone
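The list and stop steps can be chained. The sketch below assumes the tabular output format shown above, with NAME as the first column and STATUS as the last; the helper name `ready_tpus` is hypothetical. It is demonstrated on sample text rather than on live gcloud output.

```shell
# Read `gcloud compute tpus list` output on stdin and print the NAME of every
# row whose last column (STATUS) is READY. The header row is skipped.
ready_tpus() {
  awk 'NR > 1 && $NF == "READY" { print $1 }'
}

# Demonstration on sample output copied from this page.
printf '%s\n' \
  'NAME      ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT  NETWORK  RANGE          STATUS' \
  'demo-tpu  us-central1-b  v2-8              10.240.1.2:8470   default  10.240.1.0/29  READY' \
  | ready_tpus

# Possible use against a real zone (not executed here):
#   gcloud compute tpus list --zone="$ZONE" | ready_tpus | while read -r tpu; do
#     gcloud compute tpus stop "$tpu" --zone="$ZONE"
#   done
```

Parsing human-readable output is fragile, so treat this as a convenience for interactive use and confirm the result in the list output afterwards.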
Console
Select Compute Engine > TPUs from the left-hand navigation bar. Click STOP from the menu bar at the top of the page.
To restart the Cloud TPU, click START.
Viewing your Compute Engine VM and Cloud TPU resources
This section shows how you can view your currently active VM and TPU resources using the ctpu utility, gcloud commands, or the Cloud Console.
ctpu
Run the ctpu status command and specify the zone where your Compute Engine VM and Cloud TPU resources are set up.
$ ctpu status --zone=zone
This will display the status of Compute Engine VM and Cloud TPU resources within the zone.
Your cluster is running!
Compute Engine VM: RUNNING
Cloud TPU: RUNNING
If no resources are currently set up, the output will just show dashes for the VM and TPU. If one resource is active and the other is not, you will see a message saying the status is unhealthy. You need to start or restart whichever resource is not running.
gcloud
- Run the following command in the Cloud Shell to list the available Compute Engine VM resources in specific zones. In this example, VM resources in us-central1-b and europe-west4-a will be displayed:
$ gcloud compute instances list --filter="zone:( us-central1-b europe-west4-a )"
The above command prints the details of the VM resources you've created. For example:
NAME      ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
demo-tpu  us-central1-b  n1-standard-1               10.128.0.33  35.232.214.205  RUNNING
- Run the following command from the Cloud Shell to list the available Cloud TPU resources in your zone. In this example, the selected zone is us-central1-b.
$ gcloud compute tpus list --zone=us-central1-b
The Cloud TPU resource in us-central1-b is displayed:
NAME      ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINT  NETWORK  RANGE          STATUS
demo-tpu  us-central1-b  v2-8              10.240.1.2:8470   default  10.240.1.0/29  READY
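The multi-zone filter used for the VM listing follows a simple pattern, so it can be generated from a list of zones. A minimal sketch, where `zone_filter` is a hypothetical helper that reproduces the filter expression shown in the example above:

```shell
# Build a filter expression of the form `zone:( zone-a zone-b ... )`
# from the zone names passed as arguments.
zone_filter() {
  printf 'zone:( %s )' "$*"
}

# Print the resulting command for review rather than running it.
echo "gcloud compute instances list --filter=\"$(zone_filter us-central1-b europe-west4-a)\""
```

For the two zones in the example, this prints the same `gcloud compute instances list` command as the one shown in the first step.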
Console
- From the left navigation menu, select Compute Engine > TPUs. A list of all active Cloud TPU resources appears.
- From the left navigation menu, select Compute Engine > VM Instances. A list of all active Compute Engine resources appears.
Deleting your Compute Engine VM and Cloud TPU resources
You can delete your VM and TPU resources using the ctpu utility, gcloud commands, or the Cloud Console.
ctpu
Run the following command from the Cloud Shell. The ctpu utility deletes the Compute Engine VM and Cloud TPU resources together.
$ ctpu delete [optional: --zone]
gcloud
Run the following commands from the Cloud Shell to delete your Cloud TPU and Compute Engine VM resources. You must include the resource name, the zone, and the project ID on the command line. Delete the Cloud TPU first since you need a running Compute Engine VM to delete the Cloud TPU resource.
- Delete your Cloud TPU resource:
$ gcloud compute tpus delete tpu-name --project=project-name --zone=zone
- Delete your Compute Engine instance:
$ gcloud compute instances delete vm-name --project=project-name --zone=zone
- Delete the VPC network that Google automatically created as part of the Cloud TPU setup. Go to the VPC Networking page on the Cloud Console.
- Select the VPC network. The network entry starts with cp-to-tp-peering in the ID. At the top of the page, click Delete to delete the selected VPC network.
- When you've finished deleting the resources, use the gsutil command to delete any Cloud Storage buckets you created. Replace bucket-name with the name of your Cloud Storage bucket:
$ gsutil rm -r gs://bucket-name
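The deletion order matters, so it can be useful to print the commands in sequence before running them. The following sketch assumes the VM and the Cloud TPU share one name; `print_teardown_plan` is a hypothetical helper and all argument values are placeholders. The VPC network step is done in the Cloud Console and is not covered here.

```shell
# Print the teardown commands in the documented order: Cloud TPU first,
# then the VM, then (optionally) the Cloud Storage bucket.
# Arguments: shared VM/TPU name, project, zone, optional bucket name.
print_teardown_plan() (
  name="$1"
  project="$2"
  zone="$3"
  bucket="${4:-}"
  printf '%s\n' "gcloud compute tpus delete $name --project=$project --zone=$zone"
  printf '%s\n' "gcloud compute instances delete $name --project=$project --zone=$zone"
  if [ -n "$bucket" ]; then
    printf '%s\n' "gsutil rm -r gs://$bucket"
  fi
)

print_teardown_plan demo-vm-tpu my-project us-central1-b my-bucket
```

Because `gsutil rm -r` removes the bucket contents recursively, review the printed bucket name with particular care.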
Console
- Delete your VM.
  - Go to Compute Engine > VM instances from the left-hand navigation bar.
  - Select your VM instance from the list. Click the trash can icon at the top of the page.
- Delete your Cloud TPU.
  - Go to Compute Engine > TPUs on the left-hand navigation bar.
  - Select your TPU resource from the list. Click the trash can icon at the top of the page.