Storage options for Cloud TPU data

This document describes data storage options that can be used when training models on Cloud TPU.

Introduction

Cloud TPU requires data storage for:

  • dataset downloading and preprocessing
  • host input pipeline processing
  • model training input
  • model training output

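There are five storage options for Cloud TPU application data and training datasets:

  • the boot disk of a TPU VM or TPU Node
  • a persistent disk attached to a TPU VM or TPU Node
  • Cloud Storage buckets
  • Cloud Storage FUSE
  • a Filestore file share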

For storage cost and performance details, see Storage options.

The boot disk for a TPU VM or TPU Node

By default, each Cloud TPU VM has a single 100 GB boot persistent disk that contains the operating system. You can also use the boot disk to store downloaded datasets for preprocessing, as well as model input and output data, provided the total fits in the space available on the boot disk.
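Before downloading a dataset to the boot disk, it can help to check that it fits. The following is a minimal sketch using only the Python standard library; the function name `fits_on_disk` and the 10% safety reserve are illustrative assumptions, not part of any Cloud TPU tooling.

```python
import shutil


def fits_on_disk(required_bytes: int, path: str = "/",
                 reserve_fraction: float = 0.1) -> bool:
    """Return True if required_bytes fits in the free space at path.

    reserve_fraction is an assumed safety margin: that share of the
    disk's total capacity is kept free for the OS, logs, and checkpoints.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    reserve = int(usage.total * reserve_fraction)
    return required_bytes <= usage.free - reserve


if __name__ == "__main__":
    dataset_bytes = 20 * 1024**3  # hypothetical 20 GiB dataset
    if fits_on_disk(dataset_bytes):
        print("Dataset fits on the boot disk.")
    else:
        print("Not enough space; consider an additional persistent disk.")
```

If the check fails, one of the other storage options described in this document is likely a better fit than the boot disk.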

If your training application needs more storage than the boot disk provides, you can add one or more persistent disks. The procedure differs depending on whether you add the disk to the Compute Engine VM used with a TPU Node or to a TPU VM.

A persistent disk attached to a TPU VM or TPU Node

Persistent disks are durable network storage devices that your VM instances can access like physical disks in a desktop or a server. The data on each persistent disk is distributed across several physical disks. Compute Engine manages the physical disks and the data distribution for you to ensure redundancy and optimal performance.

Persistent disks are created independently from your virtual machine (VM) instances, so you can keep your data even after you delete your VM instances. Persistent disk performance scales automatically with size, so you can resize your existing persistent disks or add more persistent disks to an instance to meet your performance and storage space requirements.

Persistent disks have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Checksums are calculated for all persistent disk operations, so we can ensure that what you read is what you wrote.

Additionally, you can create snapshots of persistent disks to protect against data loss due to user error. Snapshots are incremental, and take only minutes to create even if you snapshot disks that are attached to running instances.

For more information on using persistent disks with TPU VMs, see Add a persistent disk to a TPU VM.

Cloud Storage buckets

Cloud Storage buckets are the most flexible, scalable, and durable storage option for your VM instances. If your training job does not require the lower latency of persistent disks, you can store your dataset in a Cloud Storage bucket.

The performance of Cloud Storage buckets depends on the storage class that you select and the location of the bucket relative to your instance.

Creating your Cloud Storage bucket in the same region as your VM instance (for TPU Nodes) or your TPU VM gives performance that is comparable to persistent disks, but with higher latency and less consistent throughput.
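A bucket's location is typically a region or multi-region, while a TPU VM runs in a zone, so checking co-location means reducing the zone to its region first. The sketch below assumes the usual zone naming convention (`<region>-<letter>`, for example `us-central1-a`) and compares case-insensitively because bucket locations are commonly reported in upper case. The helper names are hypothetical.

```python
def region_of_zone(zone: str) -> str:
    """Derive the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def bucket_is_colocated(bucket_location: str, tpu_zone: str) -> bool:
    """Return True when a single-region bucket is in the TPU zone's region.

    Multi-region locations such as 'US' do not match any single region,
    so this check reports them as not co-located.
    """
    return bucket_location.lower() == region_of_zone(tpu_zone).lower()


if __name__ == "__main__":
    print(bucket_is_colocated("US-CENTRAL1", "us-central1-b"))
    print(bucket_is_colocated("EUROPE-WEST4", "us-central1-b"))
```

The first call reports a match and the second does not, because `europe-west4` is a different region from the one that contains `us-central1-b`.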

All Cloud Storage buckets have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Checksums are calculated for all Cloud Storage operations to help ensure that what you read is what you wrote.

Unlike persistent disks, Cloud Storage buckets are not restricted to the zone where your instance is located. Additionally, you can read and write data to a bucket from multiple instances simultaneously. For example, you can configure instances in multiple zones to read and write data in the same bucket rather than replicate the data to persistent disks in multiple zones.

Cloud Storage FUSE

Cloud Storage FUSE lets you mount and access Cloud Storage buckets as local file systems. This allows applications to read and write objects in your bucket using standard file system semantics.
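Because a mounted bucket looks like a local directory, ordinary file APIs work on it without a Cloud Storage client library. The following sketch lists training shards under a mount point using only `pathlib`; the mount path `/mnt/gcs`, the `*.tfrecord` pattern, and the helper names are assumptions for illustration.

```python
from pathlib import Path


def list_shards(mount_dir: str, pattern: str = "*.tfrecord") -> list[str]:
    """Return the sorted paths of files under mount_dir that match pattern."""
    return sorted(str(p) for p in Path(mount_dir).rglob(pattern) if p.is_file())


def total_bytes(paths: list[str]) -> int:
    """Sum the sizes of the given files, in bytes."""
    return sum(Path(p).stat().st_size for p in paths)


if __name__ == "__main__":
    # Hypothetical mount point; assumes the bucket is already mounted there.
    shards = list_shards("/mnt/gcs")
    print(f"{len(shards)} shards, {total_bytes(shards)} bytes")
```

The same functions work unchanged on a local directory, which makes it easy to test an input pipeline before pointing it at a mounted bucket. On a mounted bucket, each directory walk and file stat is served by Cloud Storage operations, so it is usually slower than on a local disk.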

See the Cloud Storage FUSE documentation for details about how Cloud Storage FUSE works and how its operations map to Cloud Storage operations. You can find additional information on GitHub about how to use Cloud Storage FUSE, such as how to install the Cloud Storage FUSE CLI and how to mount buckets.

Filestore file share

A Filestore file share is fully managed network-attached storage (NAS) for Compute Engine. Filestore offers compatibility with existing enterprise applications and supports any NFSv3-compatible client.

Filestore offers low latency for file operations. For latency-sensitive workloads, Filestore supports capacity of up to 100 TB, throughput of 25 GB per second, and 720K IOPS, with minimal variability in performance.

With Filestore, you can mount file shares on TPU VMs.

What's next