Connector for PyTorch

Cloud Storage's Connector for PyTorch is an open source product supported by Google that provides a direct Cloud Storage integration with PyTorch.

Overview

Connector for PyTorch provides advantages both for data loading during training and for checkpointing and model loading:

For data loading in training, Connector for PyTorch provides the following advantages:

  • Connector for PyTorch contains optimizations that make training up to three times faster than default PyTorch on datasets consisting primarily of files smaller than 1 MB.
  • Connector for PyTorch implements PyTorch's dataset primitive that can be used to help efficiently load training data from Cloud Storage buckets.
  • Support for map-style datasets for random data access patterns and iterable-style datasets for streaming data access patterns.
  • The ability to transform the downloaded raw bytes of data into the format of your choice, allowing the PyTorch DataLoader to flexibly work with NumPy arrays or PyTorch tensors.
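As an illustrative sketch of this data-loading flow (the project and bucket names are placeholders, and the constructor names follow the connector's GitHub README, so verify them against the release you install), a raw-bytes transform can be paired with a map-style dataset:

```python
import numpy as np


def bytes_to_uint8_array(raw: bytes) -> np.ndarray:
    """Example transform: interpret an object's raw bytes as uint8 values.

    The connector yields each element as raw bytes, so any decoding
    (NumPy, tensors, image formats) happens in a function like this.
    """
    return np.frombuffer(raw, dtype=np.uint8)


def make_dataset(project_name: str, bucket_name: str, prefix: str = ""):
    """Construct a map-style dataset over bucket objects under `prefix`.

    Requires the gcs-torch-dataflux package and Application Default
    Credentials; the import is local so the transform above remains
    usable without them.
    """
    from dataflux_pytorch import dataflux_mapstyle_dataset

    return dataflux_mapstyle_dataset.DataFluxMapStyleDataset(
        project_name=project_name,
        bucket_name=bucket_name,
        config=dataflux_mapstyle_dataset.Config(prefix=prefix),
    )
```

A dataset built this way can then be passed to torch.utils.data.DataLoader for random-access training.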

For checkpointing and model loading, Connector for PyTorch provides the following advantages:

  • A checkpointing interface to conveniently and directly save model checkpoints to a Cloud Storage bucket and load model checkpoints from the bucket.
  • Connector for PyTorch supports PyTorch Lightning checkpointing by using the DatafluxLightningCheckpoint implementation of PyTorch Lightning's CheckpointIO.
  • Connector for PyTorch provides StorageWriter and StorageReader implementations for use with PyTorch distributed checkpointing. The Connector for PyTorch demo library includes example code for using this in a PyTorch Lightning FSDP workload.
  • Connector checkpointing includes support for asynchronous checkpoint saves with both Lightning and base PyTorch.
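As a sketch of the checkpointing interface (the class and method names follow the connector's GitHub README and may differ between releases; the object-naming helper is our own illustration, not part of the library):

```python
def checkpoint_object_name(run_name: str, epoch: int) -> str:
    """Illustrative helper for naming checkpoint objects in the bucket."""
    return f"checkpoints/{run_name}/epoch{epoch}.ckpt"


def save_checkpoint(project_name, bucket_name, model, run_name, epoch):
    """Save a model's state dict directly to a Cloud Storage object.

    Requires torch, the gcs-torch-dataflux package, and Application
    Default Credentials, so the imports are kept local.
    """
    import torch
    from dataflux_pytorch import dataflux_checkpoint

    ckpt = dataflux_checkpoint.DatafluxCheckpoint(
        project_name=project_name, bucket_name=bucket_name)
    with ckpt.writer(checkpoint_object_name(run_name, epoch)) as writer:
        torch.save(model.state_dict(), writer)
```

Loading works the same way in reverse: open a reader for the same object name and pass it to torch.load.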

For more information, see the Connector for PyTorch GitHub landing page.

Frameworks

Connector for PyTorch is supported on the following framework versions:

  • Python 3.8 or greater
  • PyTorch Lightning 2.0 or greater
  • PyTorch 2.3.1 or greater

Getting started

To use the Connector for PyTorch, you must have the following:

  • A Cloud Storage bucket that contains the data you want to work with.
  • The following permissions for working with the data stored in the bucket:
    • storage.objects.create
    • storage.objects.list
    • storage.objects.get
    • storage.objects.delete, if you intend to use composed downloads

These permissions must be granted, by using an IAM role such as Storage Object User (roles/storage.objectUser), to the account that the Connector for PyTorch uses for authentication.
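For example, assuming a service account is the identity the connector authenticates as (the bucket name and member shown are placeholders), the role can be granted on the bucket with the Google Cloud CLI:

```shell
# Grant the Storage Object User role (roles/storage.objectUser) on the
# bucket. Replace the bucket name and member with your own values.
gcloud storage buckets add-iam-policy-binding gs://my-training-data \
    --member="serviceAccount:training-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectUser"
```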

Installation

To install the Connector for PyTorch, use the following command:

pip install gcs-torch-dataflux

Configuration

Authentication must be provided to the Connector for PyTorch by using Application Default Credentials. One way to provide these credentials is to run the following Google Cloud CLI command in your environment:

gcloud auth application-default login

Examples

A complete set of examples for working with the Connector for PyTorch can be found in the demo directory of the Connector for PyTorch GitHub repository.

Performance

Connector for PyTorch has specific optimizations designed for ML workloads that can provide significantly better performance than direct API calls to Cloud Storage:

  • To optimize listing performance, the Connector for PyTorch utilizes a fast listing algorithm developed to balance the listing workload among parallelized object listing processes.
  • To optimize the download performance of small files, the Connector for PyTorch uses the compose operation to concatenate sets of smaller objects into single, larger ones. These new composite objects are stored in the same bucket as the source objects and have the prefix dataflux-composed-objects/ in their names.
  • Multipart upload for checkpoint writes provides up to a 10x performance improvement over standard checkpoint uploads.

You can find performance data on GitHub for the following:

  • Lightning Text Based Training
  • Lightning Image Training
  • Single-node checkpointing
  • Multi-node checkpointing

Considerations

The following should be taken into consideration on a per-workload basis.

Fast listing operations

Connector for PyTorch's fast listing algorithm causes the Connector for PyTorch to use more list operations than a regular sequential listing. List operations are charged as Class A operations.

Composite object usage

Composite objects created by the Connector for PyTorch are temporary, and they are usually removed automatically at the end of your training loop. In rare cases they might not be removed, which can result in excess storage charges and early deletion charges. To make sure the temporary objects are removed from your bucket, you can run the following command:

gcloud storage rm gs://<my-bucket>/dataflux-composed-objects/ --recursive

You can disable the use of composite objects by including either disable_compose=True or max_composite_object_size=0 in the config portion of the dataset you're constructing. However, turning off this behavior can cause training loops to take significantly longer, especially when working with small files.
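For illustration, the two equivalent ways mentioned above to turn off composition can be expressed as a small helper that builds the relevant Config keyword arguments (the parameter names are taken from this page; verify them against your installed release):

```python
def compose_opts(disable: bool, by_size: bool = False) -> dict:
    """Keyword arguments for the dataset Config that control composite objects.

    disable_compose=True and max_composite_object_size=0 both turn
    composition off; leaving composition on is the default and is
    usually faster for small-file workloads.
    """
    if not disable:
        return {}
    return ({"max_composite_object_size": 0} if by_size
            else {"disable_compose": True})
```

These keyword arguments would then be spread into the dataset's Config, for example Config(prefix="images/", **compose_opts(disable=True)).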

Using composite objects causes your workload to hit Cloud Storage QPS and throughput limits at a lower scale than downloading files directly. You should disable the use of composite objects when running at high multi-node scales where you hit project QPS or throughput limits even without composite objects.

429 errors and degraded performance

While working with the Connector for PyTorch, you might receive 429 errors or slower than expected execution times. There are several common reasons this occurs:

  • Many machine learning efforts opt for a highly distributed training setup that leverages tools such as PyTorch Lightning and Ray. These setups are compatible with the Connector for PyTorch but can often trigger the rate limits of Cloud Storage.
  • 429 errors accompanied by messages such as "This workload is drawing too much egress bandwidth from Cloud Storage" or "This workload triggered the Cloud Storage Egress Bandwidth Cap" indicate that the data throughput rate of your workload is exceeding the maximum capacity of your Google Cloud project. To address these issues, reduce your workload's throughput or request an increase to your project's Cloud Storage bandwidth quota.
  • QPS limits can trigger 429 errors with a body message indicating TooManyRequests, but more commonly manifest in slower than expected execution times. QPS bottlenecks are more common when operating on high volumes of small files. Bucket QPS limits naturally scale over time, so allowing a warmup period can often lead to faster performance. To get more details on the performance of a target bucket, look at the Observability tab when viewing your bucket from the Google Cloud console.
  • If your workload is failing with a TooManyRequests error that includes the keyword dataflux-composed-objects in the error message, disabling the use of composed objects is the best first troubleshooting step. Doing so can reduce QPS load brought on by compose operations when used at scale.

Memory consumption

Checkpoint writes and loads, including final models for inference, are fully staged in memory in order to optimize upload and download performance. Each machine must have enough free RAM to stage its checkpoint in memory so it can take advantage of these performance improvements.
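As a rough way to check whether a machine has the headroom to stage its checkpoint (a sketch of ours, not part of the connector), you can sum the bytes of every tensor in the model's state dict; this is a lower bound, since serialization adds overhead on top of the raw tensor data:

```python
import torch


def state_dict_bytes(model: torch.nn.Module) -> int:
    """Lower bound on the RAM needed to stage this model's checkpoint."""
    return sum(t.numel() * t.element_size()
               for t in model.state_dict().values())
```

For example, a torch.nn.Linear(10, 10) layer holds a 10x10 float32 weight plus a 10-element float32 bias, or (100 + 10) * 4 = 440 bytes.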

Get support

You can get support, submit general questions, and request new features by using one of the official Google Cloud support channels. You can also get support by filing issues in GitHub.

PyTorch, the PyTorch logo, and any related marks are trademarks of The Linux Foundation.