Storage options for Cloud TPU data

This document describes data storage options that can be used when training models on Cloud TPU.

Introduction

Cloud TPU requires data storage for:

  • dataset downloading and preprocessing
  • host input pipeline processing
  • model training input
  • model training output

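There are four storage options for Cloud TPU application data and training datasets:

  • the boot disk for a TPU VM or TPU Node
  • a persistent disk attached to a TPU VM or TPU Node
  • Cloud Storage buckets
  • Filestore file shares

For storage cost and performance details, see Storage options.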

The boot disk for a TPU VM or TPU Node

By default, each Cloud TPU VM has a single 100 GB boot persistent disk that contains the operating system. You can also use the boot disk to store downloaded datasets for preprocessing, as well as model input and output data, provided the total does not exceed the available space on the boot disk.
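Before downloading a dataset to the boot disk, you might want to confirm that it fits in the free space. The following is a minimal sketch that uses only the Python standard library. The function name `fits_on_disk`, the `reserve_bytes` safety margin, and the 20 GB dataset size are hypothetical, and the sketch assumes the boot disk is mounted at `/`.

```python
import shutil


def fits_on_disk(required_bytes: int, path: str = "/", reserve_bytes: int = 0) -> bool:
    """Return True if required_bytes (plus a reserve) fits in the free space at path.

    Assumption: `path` is on the disk you want to check; "/" is assumed
    to be the boot disk.
    """
    free_bytes = shutil.disk_usage(path).free
    return required_bytes + reserve_bytes <= free_bytes


if __name__ == "__main__":
    dataset_bytes = 20 * 10**9   # hypothetical 20 GB dataset
    reserve_bytes = 5 * 10**9    # hypothetical headroom for logs and checkpoints
    if fits_on_disk(dataset_bytes, reserve_bytes=reserve_bytes):
        print("The dataset fits on the boot disk.")
    else:
        print("Not enough free space; consider one of the other storage options.")
```

If the check fails, one of the other storage options described in this document is likely a better fit.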

If your training application requires more storage space than the boot disk provides, you can add one or more persistent disks to your instance. The procedure differs depending on whether you are adding the persistent disk to a TPU Node (a Compute Engine VM) or to a TPU VM.

A persistent disk attached to a TPU VM or TPU Node

Persistent disks are durable network storage devices that your VM instances can access like physical disks in a desktop or a server. The data on each persistent disk is distributed across several physical disks. Compute Engine manages the physical disks and the data distribution for you to ensure redundancy and optimal performance.

Persistent disks are created independently from your virtual machine (VM) instances, so you can keep your data even after you delete your VM instances. Persistent disk performance scales automatically with size, so you can resize your existing persistent disks or add more persistent disks to an instance to meet your performance and storage space requirements.

Persistent disks have built-in redundancy to protect your data against equipment failure and to ensure data availability through datacenter maintenance events. Checksums are calculated for all persistent disk operations, so we can ensure that what you read is what you wrote.

Additionally, you can create snapshots of persistent disks to protect against data loss due to user error. Snapshots are incremental, and take only minutes to create even if you snapshot disks that are attached to running instances.

Cloud Storage buckets

Cloud Storage buckets are the most flexible, scalable, and durable storage option for your VM instances. If your training job does not require the lower latency of persistent disks, you can store your dataset in a Cloud Storage bucket.

The performance of Cloud Storage buckets depends on the storage class that you select and the location of the bucket relative to your instance.

Creating your Cloud Storage bucket in the same region as your VM instance gives performance that is comparable to persistent disks, though with higher latency and less consistent throughput.
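Because performance depends on where the bucket is located relative to your instance, a quick co-location check can be useful. The following sketch assumes a regional bucket whose location is a region name such as `us-central1`, and a VM zone name of the form `<region>-<letter>`, such as `us-central1-a`. The helper names are hypothetical, and the location strings are passed in by hand rather than looked up through an API.

```python
def zone_to_region(zone: str) -> str:
    """Derive the region from a zone name, for example 'us-central1-a' -> 'us-central1'.

    Assumption: the zone name is the region name plus a '-<suffix>' part.
    """
    region, _, suffix = zone.rpartition("-")
    if not region or not suffix:
        raise ValueError(f"unexpected zone name: {zone!r}")
    return region


def bucket_is_colocated(bucket_location: str, vm_zone: str) -> bool:
    """Return True if a regional bucket is in the same region as the VM's zone.

    The comparison ignores case. Multi-region locations such as 'US'
    never match, because they are not a single region.
    """
    return bucket_location.lower() == zone_to_region(vm_zone).lower()


if __name__ == "__main__":
    print(bucket_is_colocated("us-central1", "us-central1-a"))
    print(bucket_is_colocated("europe-west4", "us-central1-a"))
```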

All Cloud Storage buckets have built-in redundancy to protect your data against equipment failure and to ensure data availability through datacenter maintenance events. Checksums are calculated for all Cloud Storage operations to help ensure that what you read is what you wrote.

Unlike persistent disks, Cloud Storage buckets are not restricted to the zone where your instance is located. Additionally, you can read and write data to a bucket from multiple instances simultaneously. For example, you can configure instances in multiple zones to read and write data in the same bucket rather than replicate the data to persistent disks in multiple zones.

Furthermore, you can mount a Cloud Storage bucket on your instance as a file system. Mounted buckets function similarly to a persistent disk when you read or write files. However, Cloud Storage buckets cannot be used as boot disks. In addition, because a bucket can be shared, an instance that writes to a file can overwrite critical data from other instances that are writing to the same bucket at the same time.
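One way to reduce the risk of instances overwriting each other's data is to give each writer its own path in the bucket. The following sketch only builds path strings and does not call any storage API. The function name, the `worker-NNNNN` layout, and the bucket and run names are hypothetical.

```python
def writer_output_path(bucket: str, run_name: str, worker_id: int, filename: str) -> str:
    """Build a per-worker output path so that no two workers share a file.

    Assumption: every writer has a unique, non-negative integer worker_id.
    """
    if worker_id < 0:
        raise ValueError("worker_id must be non-negative")
    return f"gs://{bucket}/{run_name}/worker-{worker_id:05d}/{filename}"


if __name__ == "__main__":
    # Hypothetical bucket and run names.
    for worker_id in range(2):
        print(writer_output_path("my-training-bucket", "run-001", worker_id, "metrics.jsonl"))
```

With this layout, two workers that write a file with the same name still write to different objects.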

Filestore file share

A Filestore file share is fully managed network-attached storage (NAS) for Compute Engine. Filestore offers native compatibility with existing enterprise applications and supports any NFSv3-compatible client.

Filestore offers low latency for file operations. For latency-sensitive workloads, Filestore supports capacity of up to 100 TB, with throughput of 25 GB/s and 720K IOPS and minimal variability in performance.
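To put the throughput figure in context, you can estimate a lower bound on the time needed to read a dataset once. The following sketch is simple arithmetic that assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and a sustained transfer at the stated rate. Actual times depend on your instance, network, and input pipeline, and the function name is hypothetical.

```python
GB = 10**9   # assumption: decimal gigabyte
TB = 10**12  # assumption: decimal terabyte


def min_read_seconds(dataset_bytes: float, throughput_bytes_per_s: float) -> float:
    """Return the best-case time in seconds to read dataset_bytes at a sustained rate."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    # Reading 100 TB at 25 GB/s takes at least 4000 seconds.
    print(min_read_seconds(100 * TB, 25 * GB))  # prints 4000.0
```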

With Filestore, you can easily mount file shares on Compute Engine VMs.

What's next