Using StorReduce for cloud-based data deduplication

By Vanessa Wilson and Mark Cox, StorReduce
All data provided by StorReduce and not verified by Google

This page describes how StorReduce works with Google Cloud Platform (GCP).

StorReduce is a specialized cloud deduplication solution, designed to meet the unique requirements of companies using cloud storage for large volumes of data. StorReduce sits between your applications and Cloud Storage, transparently deduplicating data inline. StorReduce's internal testing reports speeds of up to 900 MB/s. Deduplication lowers storage and bandwidth costs and, according to StorReduce, speeds up the transfer of data to the cloud, and between clouds, by as much as 30 times.

Figure: StorReduce architecture

Key characteristics

StorReduce provides:

  • Up to 97% deduplication: Reduces data transmission and cloud storage volumes by as much as 80 to 97%. This level of deduplication can help you reduce costs.
  • Fast sustained throughput: StorReduce's internal testing reports write speeds up to 900 MB per second, adding under 50 ms of latency with sustained throughput 24/7. StorReduce doesn't buffer data.
  • Scalable: StorReduce's testing has shown the ability to store up to 40 petabytes (40,000,000 GB) of data per Compute Engine instance on GCP. Deploying multiple servers can help you reach nearly any capacity you need.
  • Cloud-native: Deduplicated data is immediately accessible to cloud services through StorReduce's Amazon S3 REST API.
  • Multi-cloud: You can store data in Cloud Storage, including Multi-Regional, Regional, Nearline, or Coldline. You can also store data in Amazon S3 or S3IA, Microsoft Azure Blob Storage, and other Amazon S3-compatible private cloud object stores, such as Cloudian and IBM Cleversafe.
  • Resilient: Index information required to rehydrate the data is stored in multiple locations, both in Cloud Storage and on each StorReduce server. Extensive hashing and checking ensures data integrity during deduplication and rehydration.
  • Software-only solution: No hardware is required. StorReduce is free from hardware costs and lock-in to particular hardware.
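
To make the deduplication percentages above concrete, the following minimal sketch computes how much data remains to be stored or transmitted for a given reduction ratio. The function name and the 1,000 TB input are illustrative assumptions, not part of StorReduce; the actual ratio depends on your data.

```python
def stored_after_dedup(raw_tb: float, reduction: float) -> float:
    """Return the volume (in TB) left to store or transmit after deduplication.

    `reduction` is the fraction removed by deduplication, for example 0.97
    for the 97% upper figure quoted by StorReduce.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return raw_tb * (1.0 - reduction)


# Illustrative input only: 1,000 TB of raw backup data.
for reduction in (0.80, 0.97):
    remaining = stored_after_dedup(1000.0, reduction)
    print(f"{reduction:.0%} reduction: 1000 TB -> {remaining:.0f} TB")
```

With these assumed inputs, 1,000 TB of raw data shrinks to roughly 200 TB at 80% reduction and roughly 30 TB at 97%.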

Key features

  • Read replicas: You can deploy additional StorReduce servers in the cloud or on-premises as read replicas. Uploaded data is only stored once but is immediately available in multiple locations, enabling migrated backup workloads to be re-purposed in the cloud for development, test, QA, disaster recovery and to be used by cloud-based services.
  • High availability: You can combine several StorReduce servers into a high-availability cluster to provide automatic failover in the event of a server failure.
  • Data replication: StorReduce can replicate data between regions, between cloud vendors, or between public and private clouds, providing increased data resilience. Only the unique data is transferred.
  • Backup software integration and data management: Works with existing data management or backup software that is compatible with Cloud Storage, including Veritas NetBackup and CommVault.
  • Data encryption: Supports client-side data encryption using KMS for key management. You can encrypt data on-premises before sending it to the cloud, or you can use cloud encryption-at-rest services. Data is always encrypted in-transit.
  • Secure user account and key management: You can give your users or servers individual user accounts within StorReduce, letting you restrict data access. For each user account, you can create and manage multiple access keys.
  • Write speed throttling: To prevent StorReduce from using too much bandwidth when sharing an internet connection with other infrastructure, you can set a maximum write speed.

Architecture

The StorReduce server provides similar functionality to cloud storage vendors, including object storage, user accounts, access keys, access control policies and a Web-based management interface, the StorReduce Dashboard.

Amazon S3 client software

StorReduce works with client software that supports Amazon’s S3 REST interface for object storage. This includes clients designed to work with Amazon S3 and those designed to work with Cloud Storage by using the Amazon S3-compatible XML API. Client software is configured to talk to the StorReduce server instead of directly to Cloud Storage, by using access keys provided by the StorReduce server. Amazon S3 client software includes on-premises backup software, including Veritas NetBackup, and custom software written to use the Amazon S3 REST interface. You can also use cloud-based services designed to work with Cloud Storage, as these services also act as Amazon S3 clients.
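
As a rough illustration of pointing an Amazon S3 client at a StorReduce server, the sketch below uses the third-party boto3 library, which accepts a custom endpoint URL. The endpoint address, bucket name, and keys are hypothetical placeholders; use the values issued by your own StorReduce server.

```python
from typing import Any, Dict


def storreduce_client_kwargs(endpoint_url: str, access_key: str,
                             secret_key: str) -> Dict[str, Any]:
    """Build the arguments for boto3.client() so that requests go to a
    StorReduce server rather than directly to the cloud storage provider."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,          # the StorReduce server address
        "aws_access_key_id": access_key,       # access key issued by StorReduce
        "aws_secret_access_key": secret_key,   # matching secret key
    }


def make_client(kwargs: Dict[str, Any]):
    """Create the S3 client. boto3 is imported lazily because it is a
    third-party package (pip install boto3)."""
    import boto3
    return boto3.client(**kwargs)


# Hypothetical values for illustration only.
kwargs = storreduce_client_kwargs(
    "https://storreduce.example.com", "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY"
)
# client = make_client(kwargs)
# client.put_object(Bucket="backups", Key="db/full.bak", Body=b"...")
```

The only difference from a standard Amazon S3 configuration is the endpoint and the credentials, which is why existing S3 client software can usually be redirected without code changes.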

StorReduce translates Amazon S3 client requests into whichever protocol is needed by the underlying cloud storage providers. For example, to migrate data from Azure Blob Storage to Cloud Storage, StorReduce provides an Amazon S3-compatible interface onto Azure Blob Storage.

Other interfaces, particularly CIFS and NFS, are supported through gateway software that exposes these interfaces and converts requests into calls to the Cloud Storage Amazon S3-compatible XML API. This approach works for relatively low data volumes, as many of these translation products have limited throughput. In addition, these interfaces often require buffering of data and can fail further requests after buffers become full. For this reason, the Amazon S3 API is the preferred way to interface with StorReduce.

StorReduce server

The StorReduce server runs on its own physical or virtual machine. StorReduce recommends using local SSD storage. Each StorReduce server can handle up to 40 petabytes of raw data, depending on the deduplication ratio achieved and the amount of SSD storage available for index information. For lower data volumes, you can use magnetic disks instead of SSD.

StorReduce supports the creation of multiple storage buckets, with global deduplication performed across all buckets.

To simplify setup, StorReduce VM images with the server pre-installed are available through Google Cloud Platform Marketplace. You can also install StorReduce on any cloud-based Linux virtual machine by using RPM packages. For cloud-to-cloud migration to Cloud Storage, the StorReduce server is available on AWS Marketplace and Azure Marketplace.

For migration of on-premises data to the cloud, or for private cloud deployments, the StorReduce server can be run on-premises on a physical or virtual machine. A pre-built virtual appliance (OVA file) is available.

The architecture is designed to allow multiple StorReduce servers to be run against the same backend Cloud Storage service, for redundancy, load-sharing and increased storage volume. For example, an on-premises StorReduce server might be used to deduplicate and upload backup data, with a second, cloud-based StorReduce server providing immediate access to this data for cloud services as the data is uploaded.

Amazon S3-compatible interface

The StorReduce server exposes an Amazon S3-compatible REST interface for object storage. This highly scalable interface supports most Amazon S3 interface calls including:

  • Object GET/PUT/POST/DELETE, including multiple-object delete.
  • Multipart uploads, including listing and deleting uploads.
  • Digital signature verification.
  • Bucket creation, deletion, and renaming.
  • Setting and reading bucket policies for access control.

Admin interface

A separate REST interface is exposed for use by the web-based dashboard. This admin API is served on a separate port to allow firewalls to restrict network-level access, and can optionally also be served over HTTPS on port 443. The admin API is available for use by other client applications as well as the StorReduce dashboard. The admin API supports manipulation of user accounts, access policies, and index snapshots, as well as providing a replica of the Amazon S3 API for use by management tools.

Local SSD or magnetic disk storage

Each StorReduce server stores index information on local storage and requires fast access to that data in order to achieve high throughput for deduplication. The amount of raw data a StorReduce server can handle depends on the amount of fast local storage available and the deduplication ratio achieved for the data.

For large data volumes, StorReduce uses local SSD storage for this index, enabling tens of petabytes of data to be managed by a single StorReduce server. Typically the amount of SSD storage required is less than 0.05% of the amount of data put through StorReduce.
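
The sizing guidance above can be turned into a quick upper-bound estimate. This is a back-of-the-envelope sketch that takes the 0.05% figure at face value; the function name is an assumption, and real requirements depend on the deduplication ratio achieved.

```python
INDEX_FRACTION = 0.0005  # 0.05%, the upper bound quoted for index storage


def max_index_ssd_tb(raw_data_tb: float,
                     fraction: float = INDEX_FRACTION) -> float:
    """Estimate the upper bound on local SSD needed for index data, in TB."""
    if raw_data_tb < 0:
        raise ValueError("raw_data_tb must be non-negative")
    return raw_data_tb * fraction


# 40 PB expressed in TB (decimal units: 1 PB = 1,000 TB).
print(f"{max_index_ssd_tb(40_000):.1f} TB of SSD at most")
```

For the 40 PB per-server maximum mentioned earlier, this bound works out to about 20 TB of local SSD.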

For relatively low data volumes, you can run StorReduce by using magnetic disk instead of SSD, using available RAM to cache the information stored on magnetic disk. This approach works for up to around 100 TB of data before deduplication, depending on the deduplication ratio achieved.

The StorReduce server treats local SSD or magnetic storage as ephemeral. All information stored in local storage is also sent to Cloud Storage and you can recover it later if required.

Cloud Storage usage

The StorReduce server uses Cloud Storage for all persistent data. It acts as a Cloud Storage client, making use of the object storage API to store all its data in a single bucket.

The StorReduce server makes use of Cloud Storage to store the following types of data:

  • Deduplicated user data: Raw data is deduplicated using state-of-the-art algorithms and then compressed.
  • System Data: Information about buckets, users, access control policies and access keys is also stored in back-end cloud storage, making it available to all StorReduce servers in a given deployment.
  • Index snapshots: Data for rapidly reconstructing index information on local storage is also stored in back-end cloud storage, as described in the Index data section.
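
The general idea of deduplicating and then compressing data can be sketched as follows. This is a toy model, not StorReduce's algorithm: it assumes fixed-size chunks, SHA-256 digests, and zlib compression, and the class name and chunk size are made up for illustration.

```python
import hashlib
import zlib
from typing import Dict, List

CHUNK_SIZE = 4096  # illustrative chunk size; real systems choose this carefully


class ToyDedupStore:
    """Stores each unique chunk once and keeps a per-object list of digests."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}        # digest -> compressed chunk
        self.objects: Dict[str, List[str]] = {}   # object key -> ordered digests

    def put(self, key: str, data: bytes) -> None:
        recipe: List[str] = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only previously unseen chunks are compressed and stored.
                self.chunks[digest] = zlib.compress(chunk)
            recipe.append(digest)
        self.objects[key] = recipe

    def get(self, key: str) -> bytes:
        """Rehydrate an object by concatenating its decompressed chunks."""
        return b"".join(zlib.decompress(self.chunks[d])
                        for d in self.objects[key])


store = ToyDedupStore()
monday = b"a" * CHUNK_SIZE + b"b" * CHUNK_SIZE
tuesday = monday + b"c" * CHUNK_SIZE  # mostly the same data as Monday
store.put("backup/monday", monday)
store.put("backup/tuesday", tuesday)
print(len(store.chunks), "unique chunks stored for 5 chunks written")
```

In this example the second backup adds only one new chunk, because the other two are already present in the store.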

Performance

The StorReduce server is optimized for scalability, high throughput and low latency. The internal architecture and code are highly optimized for data deduplication, and to ensure that performance is maintained even when running in a public cloud environment.

Internal testing by StorReduce shows that a single StorReduce server is capable of sustained write speeds of up to 900 MB per second, which is close to saturating a 10 Gbps network connection.
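
To relate the quoted write speed to network capacity, the following sketch converts a link speed in gigabits per second to megabytes per second and computes utilization. It assumes the link in question is a 10 gigabit-per-second connection, ignores protocol overhead, and uses illustrative function names.

```python
def link_capacity_mb_per_s(gigabits_per_s: float) -> float:
    """Convert a link speed in Gbps to MB/s (decimal units, 8 bits per byte)."""
    return gigabits_per_s * 1000 / 8


def utilization(write_mb_per_s: float, gigabits_per_s: float) -> float:
    """Fraction of the link's raw capacity consumed by the given write speed."""
    return write_mb_per_s / link_capacity_mb_per_s(gigabits_per_s)


print(f"10 Gbps link: {link_capacity_mb_per_s(10):.0f} MB/s")
print(f"900 MB/s uses {utilization(900, 10):.0%} of the raw capacity")
```

A 10 Gbps link carries at most 1,250 MB/s before overhead, so 900 MB/s corresponds to about 72% of its raw capacity.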

Running an on-premises StorReduce server can significantly speed up throughput and decrease transfer bandwidth to cloud-based storage by deduplicating data prior to sending it into the cloud, and by reading deduplicated data from the cloud and reconstituting it locally.

Latency is kept to a minimum, typically less than 50 ms of additional latency even when StorReduce is running in the cloud. For most situations, this small added latency makes no difference to end users and doesn't affect throughput.

Index data

StorReduce maintains an index of user data on fast local storage. Each StorReduce server keeps its own independent index.

You can rebuild all index data from the log of transactions stored in Cloud Storage. For large data sets, it can take a long time to rebuild the index from scratch. To speed this up, the server periodically takes a snapshot of the index and stores this information in Cloud Storage.

When a StorReduce server starts up, if an index needs to be rebuilt, the server:

  • Loads the last index snapshot from Cloud Storage.
  • Replays subsequent transactions to bring the index up to date.
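
The two startup steps above follow a common snapshot-plus-replay pattern, sketched below. The data shapes are assumptions made for illustration (an index mapping chunk digests to storage locations, and a log of numbered transactions); StorReduce's actual formats are not described in this document.

```python
from typing import Dict, List, Tuple

# (sequence number, chunk digest, storage location) - hypothetical log entry
Transaction = Tuple[int, str, str]


def rebuild_index(snapshot: Dict[str, str], snapshot_seq: int,
                  log: List[Transaction]) -> Dict[str, str]:
    """Start from the last snapshot, then replay only newer transactions."""
    index = dict(snapshot)  # copy so the stored snapshot is not modified
    for seq, digest, location in log:
        if seq > snapshot_seq:
            index[digest] = location
    return index


snapshot = {"d1": "blob-0001", "d2": "blob-0001"}   # taken after transaction 2
log = [
    (1, "d1", "blob-0001"),
    (2, "d2", "blob-0001"),
    (3, "d3", "blob-0002"),   # written after the snapshot
    (4, "d4", "blob-0002"),
]
index = rebuild_index(snapshot, snapshot_seq=2, log=log)
print(sorted(index))
```

Only transactions 3 and 4 are replayed here, which is why periodic snapshots shorten the rebuild for large data sets.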

Multiple StorReduce servers

Because StorReduce maintains a log of all transactions on Cloud Storage, multiple servers can watch this transaction log to keep their independent indexes up to date. You can set up new servers to talk to an existing Cloud Storage service and they automatically populate their local index data from Cloud Storage.

Only one server can currently write to any particular Cloud Storage bucket at a time, but any number of servers can read from the bucket in real time and serve the data. This organization enables a number of useful deployment scenarios, described in the following sections.
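
The single-writer, many-reader arrangement can be modeled with a shared append-only log that followers read from their last known position. This is a simplified in-memory sketch; the class names are hypothetical, and a real deployment keeps the log in Cloud Storage.

```python
from typing import Dict, List, Optional, Tuple


class TransactionLog:
    """Append-only log that accepts writes from a single server only."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []   # (chunk digest, location)
        self.writer: Optional[str] = None

    def append(self, server_id: str, digest: str, location: str) -> None:
        if self.writer is None:
            self.writer = server_id            # first writer claims the log
        if server_id != self.writer:
            raise PermissionError(f"{server_id} is not the writer")
        self.entries.append((digest, location))


class Follower:
    """Read-only server that keeps its own index up to date from the log."""

    def __init__(self, log: TransactionLog) -> None:
        self.log = log
        self.index: Dict[str, str] = {}
        self.position = 0                       # next log entry to read

    def catch_up(self) -> int:
        new_entries = self.log.entries[self.position:]
        for digest, location in new_entries:
            self.index[digest] = location
        self.position += len(new_entries)
        return len(new_entries)


log = TransactionLog()
replica_a, replica_b = Follower(log), Follower(log)
log.append("primary", "d1", "blob-0001")
replica_a.catch_up()
log.append("primary", "d2", "blob-0001")
replica_a.catch_up()
replica_b.catch_up()   # a newly added replica reads the whole log at once
```

Each follower tracks its own position, so a new server can be added at any time and will populate its index from the start of the log.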

Read-only replicas

You can deploy a StorReduce server as a read-only replica, which allows data to be read and rehydrated but not written. The data is stored once but is available in multiple locations: the same content can be fetched from multiple StorReduce servers in different locations, each with the same view of the content, updated in real time.

One common deployment scenario is to have a StorReduce server running on-premises, deduplicating data as it is sent to the cloud. A second StorReduce server running in the cloud as a read-only replica can provide real-time access to the data, in rehydrated form, to cloud-based applications and services through its Amazon S3 interface. This architecture works particularly well for moving backups to the cloud.

Data replication

You can set up a StorReduce server to automatically replicate data from its own Cloud Storage bucket to one or more other Cloud Storage regions. Any changes seen by this StorReduce server are copied to the other locations. Because only deduplicated data is replicated, data transfer charges can be reduced.

You can use a StorReduce replication server to replicate data:

  • Across multiple regions in the same cloud.
  • Across multiple cloud providers, including Cloud Storage, Amazon S3, and Azure Blob Storage.
  • From private to public cloud or the reverse direction, including IBM Cleversafe, HDS HCP, HGST Active Archive, and Cloudian.
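
The reason replication of deduplicated data reduces transfer volume can be shown with a small sketch: only chunks that the destination does not already hold are copied. The dictionaries stand in for two object stores keyed by chunk digest, and the function names are illustrative assumptions rather than StorReduce functionality.

```python
from typing import Dict, List


def chunks_to_transfer(source: Dict[str, bytes],
                       destination: Dict[str, bytes]) -> List[str]:
    """Return digests of chunks present at the source but not the destination."""
    return sorted(d for d in source if d not in destination)


def replicate(source: Dict[str, bytes], destination: Dict[str, bytes]) -> int:
    """Copy only the missing chunks and return the number of bytes sent."""
    bytes_sent = 0
    for digest in chunks_to_transfer(source, destination):
        destination[digest] = source[digest]
        bytes_sent += len(source[digest])
    return bytes_sent


region_a = {"d1": b"x" * 100, "d2": b"y" * 100, "d3": b"z" * 100}
region_b = {"d1": b"x" * 100, "d2": b"y" * 100}   # already holds two chunks
sent = replicate(region_a, region_b)
print(f"sent {sent} bytes instead of {sum(map(len, region_a.values()))}")
```

In this example only one of the three chunks crosses the network, and running the replication again sends nothing.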

High-availability clusters

StorReduce high-availability clusters use several StorReduce servers to provide automatic failover within a clustered environment. A single primary server writes the data, while secondary servers follow the writes and are ready to take over as primary server in the event of a problem.

Failover is automatic, coordinated by using etcd. A load balancer helps to ensure requests are routed to the correct servers.
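
Coordination services such as etcd are commonly used for failover by having the primary hold a time-limited lease that it must keep renewing. The sketch below models that general pattern in memory; it does not use the etcd API and is not a description of how StorReduce implements failover. All names and timings are hypothetical.

```python
from typing import Optional


class LeaseStore:
    """In-memory stand-in for a coordination service that grants one lease."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, server_id: str, now: float, ttl: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == server_id:
            self.holder = server_id
            self.expires_at = now + ttl
            return True
        return False


lease = LeaseStore()
results = [
    lease.try_acquire("server-a", now=0.0, ttl=10.0),    # a becomes primary
    lease.try_acquire("server-b", now=5.0, ttl=10.0),    # b must wait
    lease.try_acquire("server-a", now=8.0, ttl=10.0),    # a renews until 18.0
    lease.try_acquire("server-b", now=15.0, ttl=10.0),   # lease still valid
    # server-a fails and stops renewing...
    lease.try_acquire("server-b", now=20.0, ttl=10.0),   # b takes over
]
print(results, lease.holder)
```

A secondary takes over only after the primary's lease has expired, which keeps a single writer active at any time.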

For more information on deployment options for high-availability clusters, contact StorReduce.

Use cases

This section describes common use cases.

Reducing the cost of primary backup to the cloud

This use case enables a cost reduction that encourages the lift and shift of IT infrastructure from on-premises to the cloud.

Companies looking to lift and shift their entire IT infrastructure to the cloud are calculating the cost of using the cloud for primary backup over time. Many find it cost prohibitive, which is a barrier to going all in on the cloud. StorReduce can reduce this cost and accelerate the move to the cloud.

Migrating tape or disk-based backups to the cloud

Tape archives generally contain periodic full backups with multiple copies of the same data sets, which can be reduced down to a single copy with deduplication.

For tape or disk-based backup migration, you can install StorReduce software on-premises for a capital-expenditure-free, fast migration of an enterprise’s large tape archives and backup appliance data to the cloud. Installing StorReduce on-premises helps minimize bandwidth during the transfer.

You can deploy a StorReduce read replica to make the data available on-cloud for development, test, quality assurance, and disaster recovery.

Figure: Cloud-to-cloud data replication

For more information about how StorReduce works with Veritas NetBackup, see the Configuring NetBackup 7.7 with StorReduce guide.

Moving or replicating data between clouds

Many organizations make use of more than one public cloud or have data in a hybrid cloud system, and want to quickly and affordably replicate or move their data from one cloud to another while minimizing storage costs.

StorReduce includes automated cloud-to-cloud replication. Any customer with data in more than one cloud can install StorReduce software on each cloud and then replicate data. The data moved is accessible by any cloud service. In addition, StorReduce’s reduction of the data volume enables organizations to affordably keep data in two clouds or two cloud regions to satisfy redundancy or compliance requirements.

You can replicate data from region to region within a single public cloud in the same way.

Figure: On-premises tape to cloud migration

Moving unstructured data

You can use StorReduce to move general unstructured data to the cloud, such as data from corporate file servers, in the same way that it migrates tape and backup data to the cloud. Most data contains duplicate information and StorReduce reduces storage requirements on such data.

Private cloud

You can insert StorReduce into any private cloud with an object store and an Amazon S3-compatible interface, in the same manner as shown in the previous diagram for a public cloud. This can reduce the infrastructure required for storage of unstructured data by performing inline data deduplication.

What's next

See additional documentation on the StorReduce website:
