Choosing between a single Cloud TPU device and a Cloud TPU Pod (alpha)


Single Cloud TPU device and Cloud TPU Pod (alpha) configurations

Single Cloud TPU device and Cloud TPU Pod (alpha) configurations are described here and in the system architecture document.

  • A Cloud TPU v2-8 is a single Cloud TPU device that has 4 TPU chips. Each TPU chip has 2 cores, so the v2-8 has a total of 8 cores.
  • A Cloud TPU v2 Pod (alpha) consists of 64 TPU devices containing 256 TPU chips (512 cores) connected together with high-speed interfaces.

Because TPU resources scale from a single Cloud TPU device to a Cloud TPU Pod (alpha), you don't have to choose between the two. Instead, you can request portions, or slices, of a Cloud TPU Pod, with each slice containing a set of TPU cores. With slices, you purchase only the processing power you need.

Scaling up from a single Cloud TPU device to a Cloud TPU Pod (alpha) often requires only adjusting hyperparameters, such as the batch size, to match the larger number of TPU cores in the Cloud TPU Pod slice. With just a few minor changes, you can take a model that trains in hours on a single Cloud TPU device and accelerate it to reach the same accuracy in minutes on a full Cloud TPU Pod.
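A common way to make this adjustment is to keep the per-core batch size fixed and scale the global batch size with the core count. The sketch below illustrates the arithmetic; the per-core value of 128 is an assumed example, not a documented recommendation:

```shell
# Scale the global batch size with the number of TPU cores.
# PER_CORE_BATCH=128 is an illustrative assumption, not a recommendation.
PER_CORE_BATCH=128
CORES=32                                  # e.g. a v2-32 slice
GLOBAL_BATCH=$((PER_CORE_BATCH * CORES))
echo "$GLOBAL_BATCH"
```

Moving the same model to a v2-128 slice would set CORES=128, quadrupling the global batch size while leaving the per-core work unchanged.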

Cloud TPU Pod (alpha) advantages

Cloud TPU Pod (alpha) brings the following benefits relative to a single Cloud TPU device:

  • Increased training speeds for fast iteration in R&D
  • Increased human productivity by providing automatically scalable machine learning (ML) compute
  • Ability to train much larger models than on a single Cloud TPU device

Cloud TPU Pod (alpha) slices

Since not every ML model needs an entire Pod for training, Google Cloud Platform offers smaller sections called slices. Slices are internal allocations consisting of different numbers of TPU chips. For example, a 4x4 slice has 16 TPU chips for a total of 32 cores (16 TPU chips * 2 cores per chip), and an 8x8 slice has 64 TPU chips for a total of 128 cores.
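The arithmetic above can be sketched in a few lines of shell (the 2-cores-per-chip figure comes from the description earlier on this page):

```shell
# Cores in an MxN slice: chips = M * N, and each TPU chip has 2 cores.
M=8
N=8
CHIPS=$((M * N))
CORES=$((CHIPS * 2))
echo "${M}x${N} slice: ${CHIPS} TPU chips, ${CORES} cores"
```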

Pod slices are available in the following sizes. Several sizes are pictured in the diagram following the table.

Name    # of cores  # of TPU chips
v2-32   32          16 (4x4 slice)
v2-128  128         64 (8x8 slice)
v2-256  256         128 (8x16 slice)
v2-512  512         256 (16x16 slice)

[Diagram: Cloud TPU Pod slice sizes]
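As the table shows, each slice name encodes its core count in the suffix, so both cores and chips can be recovered from the name alone. A small sketch (shell parameter expansion `${SLICE#v2-}` strips the `v2-` prefix):

```shell
# Derive cores and chips from a slice name such as v2-128.
SLICE=v2-128
CORES=${SLICE#v2-}     # strip the "v2-" prefix, leaving the core count
CHIPS=$((CORES / 2))   # each TPU chip has 2 cores
echo "${SLICE}: ${CORES} cores, ${CHIPS} TPU chips"
```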

Requesting slices in Cloud TPU Pod v2 (alpha)

You can set the type of slice you want to have allocated for your jobs using the names shown in the Pod slices table. The slice names correspond to the supported TPU versions. The smaller the slice, the more likely it is to be available.

You can specify the slice type in the following ways:

ctpu utility

  1. Run the Cloud TPU Provisioning Utility (ctpu).
    • When you run ctpu up, use the --tpu-size parameter to specify the slice name from the table above that corresponds to the number of cores you want to use. For example, to request 32 cores for a v2 Pod (alpha), you would specify --tpu-size=v2-32.
      $ ctpu up --tpu-size=[SLICE-NAME]
      where:

      • [SLICE-NAME] is one of the slice names from the Pod slices table, for example, v2-32.

gcloud command

  1. When creating a new Cloud TPU resource using the gcloud compute tpus create command, specify the accelerator type to be one of the slice names from the Pod slices table. Note that the slice names specify the number of cores you are requesting. For example, to request 32 cores for a v2 Pod (alpha), you would specify:

    $ gcloud compute tpus create [TPU name] \
    --zone europe-west4-a \
    --range '10.240.0.0/29' \
    --accelerator-type 'v2-32' \
    --network my-tf-network \
    --version '1.12'
    where:

    • [TPU name] is a name that identifies the TPU node you're creating.
    • --zone is the Compute Engine zone in which to create the Cloud TPU. Make sure the requested accelerator type is supported in your zone.
    • --range specifies the network address range of the created Cloud TPU resource and can be any value in 10.240.*.*/29.
    • --accelerator-type is the accelerator version associated with the number of cores you want to use, for example, v2-32 (32 cores).
    • --network specifies the name of the network that your Compute Engine VM instance uses. You must be able to connect to instances on this network over SSH. For most situations, you can use the default network that your Google Cloud Platform project created automatically. However, an error results if the default network is a legacy network.
    • --version specifies the TensorFlow version to use with the TPU.

Cloud Console

  1. From the left navigation menu, select Compute Engine > TPUs.
  2. On the TPUs screen, click Create TPU node. This brings up a configuration page for your TPU.
  3. Under TPU type select the slice name associated with the number of cores you want to use, for example, v2-32 (32 cores).
  4. Click the Create button.