Using StorReduce for Cloud-based Data Deduplication

By Vanessa Wilson and Mark Cox, StorReduce
All data provided by StorReduce and not verified by Google

This page describes how StorReduce works with Google Cloud Platform.

StorReduce is a specialized cloud deduplication solution designed for companies that use cloud storage for large volumes of data. StorReduce sits between your applications and Google Cloud Storage, transparently deduplicating data inline. StorReduce's internal testing reports speeds of up to 900 MB/s. Deduplication lowers storage and bandwidth costs and can speed up the transfer of data to the cloud, and between clouds, by as much as 30 times.

StorReduce architecture

Key Characteristics

StorReduce provides:

  • Up to 97% deduplication: Reduces data transmission and cloud storage volumes by between 80% and 97%. This level of deduplication can help you reduce costs.
  • Fast sustained throughput: StorReduce's internal testing reports write speeds of up to 900 megabytes per second, sustained 24/7, while adding under 50 ms of latency. StorReduce does not buffer data.
  • Scalable: StorReduce's testing has shown the ability to store up to 40 petabytes (40,000,000 gigabytes) of data per Google Compute Engine instance on Cloud Platform. Deploying multiple servers can help you reach nearly any capacity you need.
  • Cloud-native: Deduplicated data is immediately accessible to cloud services through StorReduce’s S3 REST API.
  • Multi-cloud: Data can be stored in Google Cloud Storage (Multi-Regional, Regional, Nearline, or Coldline), in Amazon S3 or S3IA, in Microsoft Azure Blob Storage, and in other S3-compatible private cloud object stores such as Cloudian and IBM Cleversafe.
  • Resilient: Index information required to re-hydrate the data is stored in multiple locations, both in Google Cloud Storage and on each StorReduce server. Extensive hashing and checking ensure data integrity during deduplication and re-hydration.
  • Software-only solution: No hardware is required. StorReduce is free from hardware costs and lock-in to particular hardware.
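
As a rough worked example of what the reduction percentages above mean, the following sketch converts a percentage reduction into an N:1 deduplication ratio. This is plain arithmetic rather than a StorReduce tool, and the function name is illustrative only.

```python
def dedup_ratio(reduction_percent: float) -> float:
    """Convert a percentage reduction in stored volume into an N:1 ratio."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    # Fraction of the raw data that still has to be stored or transmitted.
    remaining_fraction = 1 - reduction_percent / 100
    return 1 / remaining_fraction


for pct in (80, 97):
    print(f"{pct}% reduction is about {dedup_ratio(pct):.1f}:1")
```

An 80% reduction corresponds to roughly 5:1 and a 97% reduction to roughly 33:1, which is in line with the "as much as 30 times" transfer speed-up quoted in the introduction.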

Key features

  • Read replicas: Additional StorReduce servers can be deployed in the cloud or on-premises as read replicas. Uploaded data is stored only once but is immediately available in multiple locations. This lets migrated backup workloads be re-purposed in the cloud for development, test, QA, and disaster recovery, and be used by cloud-based services.
  • High availability: Several StorReduce servers can be combined into a high-availability cluster to provide automatic failover in the event of a server failure.
  • Data replication: StorReduce can replicate data between regions, between cloud vendors, or between public and private clouds, providing increased data resilience. Only the unique data is transferred.
  • Backup software integration and data management: Works with existing data management or backup software that is compatible with Google Cloud Storage, including Veritas NetBackup and CommVault.
  • Data encryption: Supports client-side data encryption using KMS for key management. Data can be encrypted on-premises before being sent to the cloud, or cloud encryption-at-rest services can be used. Data is always encrypted in-transit.
  • Secure user account and key management: Users or servers can be given individual user accounts within StorReduce, allowing data access to be restricted. Multiple access keys can be created and managed as needed for each user account.
  • Write speed throttling: A maximum write speed can be set to prevent StorReduce from using too much bandwidth when sharing an Internet connection with other infrastructure.
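
The source does not describe how write speed throttling is implemented. As an illustration of the general idea only, the sketch below uses a token bucket, a common technique for capping average throughput. The class name, the method name, and the one-second burst allowance are all assumptions made for this example.

```python
import time


class WriteThrottle:
    """Token bucket that caps average write speed (hypothetical helper)."""

    def __init__(self, max_bytes_per_sec: float, clock=time.monotonic):
        self.rate = max_bytes_per_sec
        self.capacity = max_bytes_per_sec  # assumption: allow one second of burst
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def delay_for(self, nbytes: int) -> float:
        """Return how many seconds the caller should wait before writing nbytes."""
        now = self.clock()
        # Refill tokens for the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Spend tokens; a negative balance is a debt that must be waited out.
        self.tokens -= nbytes
        return max(0.0, -self.tokens / self.rate)
```

A caller would invoke `time.sleep(throttle.delay_for(len(chunk)))` before each write, so that bursts beyond the configured rate are spread out over time.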

Architecture

The StorReduce server provides similar functionality to cloud storage vendors, including object storage, user accounts, access keys, access control policies and a Web-based management interface, the StorReduce Dashboard.

S3 Client Software

StorReduce works with client software that supports Amazon’s S3 REST interface for object storage. This includes clients designed to work with Amazon S3 and those designed to work with Google Cloud Storage by using the S3-compatible XML API. Client software is configured to talk to the StorReduce server instead of directly to Google Cloud Storage, using access keys provided by the StorReduce server. S3 client software includes on-premises backup software, including Veritas NetBackup, and custom software written to use the S3 REST interface. Cloud-based services designed to work with Google Cloud Storage can also be used with StorReduce, as these services also act as S3 clients.

StorReduce translates S3 client requests into whichever protocol is needed by the underlying cloud storage providers. To migrate data from Azure Blob Storage to Google Cloud Storage, StorReduce provides an S3-compatible interface onto Azure Blob Storage.

Other interfaces, particularly CIFS and NFS, can be supported through gateway software that exposes these interfaces and converts requests into calls to the Cloud Storage S3-compatible XML API. This approach works for relatively low data volumes, because many of these translation products have limited throughput. In addition, these interfaces often require buffering of data and can fail further requests after buffers become full. For this reason, the S3 API is the preferred way to interface with StorReduce.

StorReduce Server

The StorReduce server runs on its own physical or virtual machine. StorReduce recommends using local SSD storage. Each StorReduce server can handle up to 40 petabytes of raw data, depending on the deduplication ratio achieved and the amount of SSD storage available for index information. For lower data volumes, magnetic disk can be used instead of SSD.

StorReduce supports the creation of multiple storage buckets, with global deduplication performed across all buckets.
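
To illustrate what global deduplication across buckets means, here is a minimal sketch. StorReduce's actual chunking, hashing, and compression algorithms are not described in this document, so the fixed-size chunks, SHA-256 digests, zlib compression, and the `DedupStore` class are all assumptions chosen for simplicity.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only


class DedupStore:
    """Toy object store with one chunk pool shared by every bucket."""

    def __init__(self):
        self.chunks = {}   # SHA-256 hex digest -> compressed chunk bytes
        self.objects = {}  # (bucket, key) -> ordered list of chunk digests

    def put(self, bucket: str, key: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if no bucket has stored it before.
            if digest not in self.chunks:
                self.chunks[digest] = zlib.compress(chunk)
            digests.append(digest)
        self.objects[(bucket, key)] = digests

    def get(self, bucket: str, key: str) -> bytes:
        # Re-hydrate the object by concatenating its chunks in order.
        return b"".join(
            zlib.decompress(self.chunks[d]) for d in self.objects[(bucket, key)]
        )


store = DedupStore()
store.put("backups", "monday", b"abcdabcdefgh")
store.put("archive", "copy", b"abcdabcdefgh")  # adds no new chunks
print(len(store.chunks))
```

Because the chunk pool is keyed by content rather than by bucket, writing the same data into a second bucket adds only a new list of digests, not new chunk data.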

To enable quick and easy setup, StorReduce VM images with the server pre-installed are available through Google Cloud Launcher. StorReduce can also be installed on any cloud-based Linux virtual machine, using RPM packages. For cloud-to-cloud migration to Google Cloud Storage, the StorReduce server is available on AWS Marketplace and Azure Marketplace, which makes it easy to set up the deduplication source server.

For migration of on-premises data to the cloud, or for private cloud deployments, the StorReduce server can be run on-premises on a physical or virtual machine. A pre-built virtual appliance (OVA file) is available.

The architecture is designed to allow multiple StorReduce servers to be run against the same back-end Google Cloud Storage service, for redundancy, load-sharing and increased storage volume. For example, an on-premises StorReduce server might be used to deduplicate and upload backup data, with a second, cloud-based StorReduce server providing immediate access to this data for cloud services as the data is uploaded.

S3-compatible interface

The StorReduce server exposes an S3-compatible REST interface for object storage. This highly scalable interface supports most S3 interface calls including:

  • Object GET/PUT/POST/DELETE, including multiple-object delete.
  • Multipart uploads, including listing and deleting uploads.
  • Digital signature verification.
  • Bucket creation, deletion, and renaming.
  • Setting and reading bucket policies for access control.

Admin interface

A separate REST interface is exposed for use by the web-based dashboard. This admin API is served on a separate port to allow firewalls to restrict network-level access, and can optionally also be served over HTTPS on port 443. The admin API is available to other client applications as well as the StorReduce dashboard. It supports manipulation of user accounts, access policies, and index snapshots, and it provides a replica of the S3 API for use by management tools.

Local SSD or magnetic disk storage

Each StorReduce server stores index information on local storage and requires fast access to that data in order to achieve high throughput for deduplication. The amount of raw data a StorReduce server can handle depends on the amount of fast local storage available and the deduplication ratio achieved for the data.

For large data volumes, StorReduce uses local SSD storage for this index, enabling tens of petabytes of data to be managed by a single StorReduce server. Typically the amount of SSD storage required is less than 0.05% of the amount of data put through StorReduce. For more information, see the StorReduce on-premises guide.
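
As a worked example of the 0.05% figure above, the following sketch computes an upper bound on the SSD index size for a given volume of raw data. It assumes decimal units (1 PB = 10^15 bytes) and treats 0.05% as a ceiling. The function name is illustrative, and actual requirements depend on the deduplication ratio achieved.

```python
PB = 10**15
TB = 10**12

# 0.05% expressed as 5 parts per 10,000 so the arithmetic stays in integers.
INDEX_PARTS_PER_10K = 5


def index_upper_bound_bytes(raw_data_bytes: int) -> int:
    """Upper bound on SSD index size for a given volume of raw data."""
    return raw_data_bytes * INDEX_PARTS_PER_10K // 10_000


print(f"{index_upper_bound_bytes(40 * PB) // TB} TB")
```

For the 40 PB per-server maximum quoted earlier, this gives an index of at most 20 TB of SSD.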

For relatively low data volumes, you can run StorReduce using magnetic disk instead of SSD, using available RAM to cache the information stored on magnetic disk. This approach works for up to around 100TB of data before deduplication, depending on the deduplication ratio achieved.

The StorReduce server treats local SSD or magnetic storage as ephemeral. All information stored in local storage is also sent to Google Cloud Storage and can be recovered later if required, as described in an upcoming section.

Google Cloud Storage usage

The StorReduce server uses Google Cloud Storage for all persistent data. It acts as a Google Cloud Storage client, making use of the object storage API to store all its data in a single bucket.

The StorReduce server makes use of Google Cloud Storage to store the following types of data:

  • Deduplicated user data: Raw data is deduplicated using state-of-the-art algorithms and then compressed.
  • System Data: Information about buckets, users, access control policies and access keys is also stored in back-end cloud storage, making it available to all StorReduce servers in a given deployment.
  • Index snapshots: Data for rapidly reconstructing index information on local storage can also be stored in back-end cloud storage, as described in an upcoming section.

Performance

The StorReduce server is optimized for scalability, high throughput and low latency. The internal architecture and code are highly optimized for data deduplication, and to ensure that performance is maintained even when running in a public cloud environment.

Internal testing by StorReduce shows that a single StorReduce server is capable of sustained write speeds of up to 900 megabytes per second (about 7.2 Gb/s), which uses roughly 72% of a 10 Gb/s network connection.

Running an on-premises StorReduce server can significantly speed up throughput and decrease transfer bandwidth to cloud-based storage by deduplicating data prior to sending it into the cloud, and by reading deduplicated data from the cloud and reconstituting it locally.
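
The effect of deduplicating before upload can be estimated with a simple model. The sketch below is an illustration under stated assumptions, not a StorReduce sizing tool: it assumes the transfer is limited either by the network link or by the server's ingest rate, whichever is slower, and it ignores protocol overhead. The function and parameter names are hypothetical.

```python
def transfer_hours(raw_bytes, link_bits_per_sec, reduction_percent=0.0,
                   ingest_bytes_per_sec=None):
    """Estimate the hours needed to move raw_bytes to the cloud."""
    # Only the data left after deduplication crosses the network.
    sent_bytes = raw_bytes * (1 - reduction_percent / 100)
    network_seconds = sent_bytes * 8 / link_bits_per_sec
    # All raw data must still pass through the deduplicating server, if any.
    ingest_seconds = raw_bytes / ingest_bytes_per_sec if ingest_bytes_per_sec else 0.0
    return max(network_seconds, ingest_seconds) / 3600


raw = 100 * 10**12  # 100 TB of backups
link = 10**9        # 1 Gb/s uplink
print(f"no deduplication: {transfer_hours(raw, link):.0f} hours")
print(f"95% reduction:    {transfer_hours(raw, link, 95, 900 * 10**6):.0f} hours")
```

In this example the uplink stops being the bottleneck once the data is deduplicated, and the estimate is then bounded by the 900 MB/s ingest rate reported in StorReduce's internal testing.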

Latency is kept to a minimum, typically less than 50ms of additional latency even when StorReduce is running in the cloud. For most situations, this small, added latency makes no difference to end users and does not affect throughput.

Index data

StorReduce maintains an index of user data on fast local storage. Each StorReduce server keeps its own independent index.

All index data can be rebuilt from the log of transactions stored in Google Cloud Storage. For large data sets, it can take a long time to rebuild the index from scratch. To speed this up, the server periodically takes a snapshot of the index and stores this information in Google Cloud Storage.

When a StorReduce server starts up, if an index needs to be rebuilt, the server:

  • Loads the last index snapshot from Cloud Storage.
  • Replays subsequent transactions to bring the index up to date.
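
The two startup steps above can be sketched as follows. The record layout is an assumption made for this example, since the document does not describe StorReduce's snapshot or log format: a snapshot is modeled as a pair of (last applied sequence number, index dictionary), and each log entry as a tuple of (sequence number, operation, key, value).

```python
def rebuild_index(snapshot, transaction_log):
    """Rebuild an index from a snapshot plus the transactions recorded after it."""
    last_seq, saved_index = snapshot
    index = dict(saved_index)  # copy so the snapshot itself is not modified
    for seq, op, key, value in transaction_log:
        if seq <= last_seq:
            continue  # already reflected in the snapshot
        if op == "put":
            index[key] = value
        elif op == "delete":
            index.pop(key, None)
        last_seq = seq
    return last_seq, index


snapshot = (2, {"chunk-a": "loc-1", "chunk-b": "loc-2"})
log = [
    (1, "put", "chunk-a", "loc-1"),
    (2, "put", "chunk-b", "loc-2"),
    (3, "put", "chunk-c", "loc-3"),
    (4, "delete", "chunk-a", None),
]
print(rebuild_index(snapshot, log))
```

Replaying only the entries newer than the snapshot is what keeps startup fast: the cost depends on how much has changed since the last snapshot rather than on the total size of the log.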

Multiple StorReduce servers

Because StorReduce maintains a log of all transactions on Google Cloud Storage, multiple servers can watch this transaction log to keep their independent indexes up to date. New servers can be set up to talk to an existing Google Cloud Storage service and they automatically populate their local index data from Google Cloud Storage.

Only one server can currently write to any particular Google Cloud Storage bucket at a time, but any number of servers can read from the bucket in real time and serve the data. This organization enables a number of useful deployment scenarios, described in the following sections.

Read-only replicas

A StorReduce server can be deployed as a read-only replica, which allows reading and re-hydration of data but not writing. A single stored copy of the data is therefore available in multiple locations: the same content can be fetched from multiple StorReduce servers in different places, each with the same view of the content, updated in real time.

One common deployment scenario is to have a StorReduce server running on-premises, deduplicating data as it is sent to the cloud. A second StorReduce server running in the cloud as a read-only replica can provide real-time access to the data, in re-hydrated form, to cloud-based applications and services through its S3 interface. This architecture works particularly well for moving backups to the cloud.

Data replication

A StorReduce server can be set up to automatically replicate data from its own Google Cloud Storage bucket to one or more other Google Cloud Storage regions. Any changes seen by this StorReduce server will be copied to the other location(s). Because only deduplicated data is replicated, data transfer charges can be reduced.

A StorReduce replication server can be used to replicate data:

  • Across multiple regions in the same cloud.
  • Across multiple cloud providers, including Google Cloud Storage, Amazon S3, and Azure Blob Storage.
  • From private to public cloud or the reverse direction, including IBM Cleversafe, HDS HCP, HGST Active Archive, and Cloudian.
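
The idea that only deduplicated, unique data needs to cross the wire can be sketched as a set difference over chunk identifiers. This illustrates the principle only. StorReduce's replication protocol is not described here, and the function name and the dictionaries mapping chunk digest to chunk bytes are assumptions for the example.

```python
def replicate(source_chunks: dict, destination_chunks: dict) -> int:
    """Copy chunks the destination lacks and return the number of bytes sent."""
    # Chunks are identified by digest, so anything already present is skipped.
    missing = source_chunks.keys() - destination_chunks.keys()
    bytes_sent = 0
    for digest in missing:
        destination_chunks[digest] = source_chunks[digest]
        bytes_sent += len(source_chunks[digest])
    return bytes_sent


source = {"d1": b"aaaa", "d2": b"bbbb", "d3": b"cccc"}
destination = {"d1": b"aaaa"}
print(replicate(source, destination))  # only the chunks for d2 and d3 are sent
```

Running the same replication a second time sends nothing, because the destination already holds every chunk.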

High-availability clusters

StorReduce high-availability clusters use several StorReduce servers to provide automatic failover within a clustered environment. A single primary server writes the data, while secondary servers follow the writes and are ready to take over as primary server in the event of a problem.

Failover is automatic, coordinated by using etcd. A load balancer helps to ensure requests are routed to the correct servers.

For more information on deployment options for high-availability clusters, please contact StorReduce.

Use Cases

This section describes common use cases.

Reducing the cost of primary backup on-cloud

In this use case, lower backup storage costs make it more practical to lift and shift IT infrastructure from on-premises to the cloud.

Companies planning to lift and shift their entire IT infrastructure to the cloud often find that the long-term cost of using cloud storage for primary backup is prohibitive, which becomes a barrier to a full move. By reducing the volume of backup data stored, StorReduce can lower this cost and speed up the move to the cloud.

Migrating tape or disk-based backups to the cloud

Tape archives generally contain periodic full backups with multiple copies of the same data sets, which can be reduced down to a single copy with deduplication.

For tape or disk-based backup migration, StorReduce software can be installed on-premises to migrate an enterprise's large tape archives and backup appliance data to the cloud quickly and without capital expenditure. Installing StorReduce on-premises helps minimize bandwidth use during the transfer.

A StorReduce read replica can be deployed to make the data available on-cloud for development, test, quality assurance and disaster recovery.

Cloud-to-cloud data replication

For more information about how StorReduce works with Veritas NetBackup, see the Configuring NetBackup 7.7 with StorReduce guide.

Moving or replicating data between clouds

Many organizations make use of more than one public cloud or have data in a hybrid cloud system, and want to quickly and affordably replicate or move their data from one cloud to another while minimizing storage costs.

StorReduce includes automated cloud-to-cloud replication. Any customer with data in more than one cloud can install StorReduce software on each cloud and then replicate data. The data moved is accessible by any cloud service. In addition, StorReduce’s reduction of the data volume enables organizations to affordably keep data in two clouds or two cloud regions to satisfy redundancy or compliance requirements.

Data can be replicated from region to region within a single public cloud in the same way.

On-premises tape to cloud migration

Moving unstructured data

StorReduce can be used to move general unstructured data to the cloud, such as data from corporate file servers, in the same way that it migrates tape and backup data to the cloud. Most data contains duplicate information and StorReduce reduces storage requirements on such data.

Private cloud

StorReduce can be inserted into any private cloud with an object store and an S3-compatible interface, in the same manner as shown in the previous diagram for a public cloud. This can reduce the infrastructure required for storage of unstructured data by performing inline data deduplication.

What's next

See additional documentation on the StorReduce website.
