By Kevin Day, Luke Tubinis, Justin Ross at Tervela
This article outlines Cloud FastPath as an option for users looking to create managed and optimized data streams into and out of Google Cloud Platform.
Today, datasets and data volumes continue to expand quickly. Although cloud computing capacity has kept pace, transferring data from a wide variety of sources into the cloud and between cloud services remains a challenge for organizations that want to store and process data in Cloud Platform. This bottleneck is one that every organization must address.
Data movement into and out of the cloud is usually left as an afterthought to optimizing applications, processes and clusters. However, as a system scales, issues with data movement can mean failures for both data processing and storage. These issues can reduce the efficiency and cost benefits that the cloud offers. To address and eliminate this bottleneck in cloud-based processing and storage, it is critical to create fast and flexible data streams into, between and out of cloud systems. This ensures a fully optimized application, while protecting the many advantages of cloud-based processing and storage.
There are a variety of tools available for creating upload/download streams with Cloud Platform, such as the gsutil cp command, Google Cloud Storage Transfer Service, the offline media import/export service, or the JSON API for object uploading. While these tools are extensible and feature-rich, building a complete data streaming system that supports tens or hundreds of connections is a large undertaking. In addition, optimizing transfers, maintaining these systems, and building reporting and automation tools is not feasible for many Cloud Platform users.
Tervela’s Cloud FastPath is a data transfer, sync, backup, and migration SaaS product. Cloud FastPath enables the creation of maintenance-free and fast data streams to Google Cloud Storage. With Cloud FastPath, you can easily establish high performance, flexible, and encrypted data streams that allow you to focus on creating and maintaining applications that maximize the value of your data.
The article starts by describing Cloud FastPath’s deployment procedures and architecture, and then explores the performance, flexibility and security that Cloud FastPath provides as it establishes data streams into and out of Cloud Platform. The article also provides a use case based on the movement of genomics data into Cloud Platform.
Prerequisites for deploying Cloud FastPath with Cloud Platform are:
- A project ID and a Cloud Storage bucket. See the instructions for creating a project and creating a bucket.
- Source(s) of data. See Cloud FastPath’s supported systems.
- A provisioned Cloud FastPath account. Sign up here.
Connecting systems to Cloud Platform
Cloud FastPath connects to cloud systems and accesses file lists through OAuth 2. No passwords are stored in Cloud FastPath systems. Cloud FastPath supports both user and service accounts in Cloud Platform. On-premises file servers, individual computers or on-premises storage systems require a simple, lightweight application downloaded one time. The application can be installed remotely. Administrative access is typically required for on-premises systems.
When systems are online, all data movement activities are managed centrally from
any web browser by using the cloud-based dashboard. Creating a new system such
as a Cloud Storage target takes a couple of clicks. First, you select
Google Cloud Storage from the list of systems:
Then, simply sign in to the Google account your project is associated with and enter the project ID:
The ease and method of installation are important features when creating data movement workflows:
- Connections between systems are always maintained. Cloud FastPath automatically updates source and target adapters for performance, security and new API feature support.
- Non-technical users can easily establish new data streams without managing any virtual machines, environments, libraries, or dependencies.
- Cloud-based services where gsutil cannot be installed and used are easily available to integrate with Cloud Platform.
The web application is where data streams, or jobs, can be started, stopped and paused. Feature configuration and reporting, which are discussed later in this article, are also managed within the web application. While the web application serves as a management console, the data movement systems operate independently of it.
Cloud FastPath’s data streaming architecture
Cloud FastPath’s architecture consists of points of presence (POPs). POPs are groups of virtual machines specifically designed for transferring data.
The POP systems are automatically scaled for increased parallelism. POPs are pre-installed within Google Compute Engine and managed by the Cloud FastPath team. There is also the option to use virtual machines hosted on your own Google Compute Engine account.
The underlying streaming architecture is backed by:
Cloud FastPath is designed to take full advantage of upstream network speeds. However, as this might not always be desired, Cloud FastPath has bandwidth controls to limit the share of a network used for data streams. Streams can be scheduled to use more bandwidth at desired times such as on nights or weekends.
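The scheduling idea can be illustrated with a small sketch. This is not Cloud FastPath's implementation or configuration format; the function name, the share values, and the business-hours window are all hypothetical and chosen only to show how a time-based bandwidth policy might be expressed.

```python
from datetime import datetime

def bandwidth_share(now: datetime,
                    business_share: float = 0.25,   # hypothetical daytime cap
                    off_hours_share: float = 1.0,   # hypothetical night/weekend cap
                    day_start: int = 8,
                    day_end: int = 18) -> float:
    """Return the fraction of the network a stream may use at the given time."""
    is_weekend = now.weekday() >= 5              # Saturday = 5, Sunday = 6
    is_business_hours = day_start <= now.hour < day_end
    if not is_weekend and is_business_hours:
        return business_share
    return off_hours_share

# A weekday morning is capped; the same weekday at night is not.
print(bandwidth_share(datetime(2024, 1, 3, 10, 0)))
print(bandwidth_share(datetime(2024, 1, 3, 22, 0)))
```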
Cloud FastPath’s multi-POP infrastructure encrypts, chunks and compresses data nearest to the source before sending it over the network. This decreases latency while allowing more data to be concurrently transferred.
Another set of POPs decompresses and re-chunks data near the target systems. This allows the most-intensive network operations to be executed as closely as possible to the target systems. API throttling organizes file sizes and types for simultaneous ingest streams.
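To make the chunk, compress, and reassemble flow concrete, here is a minimal sketch using Python's standard zlib module. It is an illustration under stated assumptions, not Cloud FastPath's actual pipeline: the chunk size and function names are hypothetical, and encryption is left out because the standard library has no symmetric cipher.

```python
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size (4 MiB)

def compress_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and compress each one independently."""
    for offset in range(0, len(data), chunk_size):
        yield zlib.compress(data[offset:offset + chunk_size])

def reassemble(compressed_chunks) -> bytes:
    """Decompress chunks in order and join them back into the original bytes."""
    return b"".join(zlib.decompress(chunk) for chunk in compressed_chunks)

# Repetitive data, such as raw sequence text, compresses well.
payload = b"ACGT" * 1_000_000
chunks = list(compress_chunks(payload, chunk_size=1024 * 1024))
assert reassemble(chunks) == payload
print(len(chunks), "chunks,", sum(len(c) for c in chunks), "compressed bytes")
```

Compressing each chunk independently means chunks can be sent over parallel connections and decompressed as they arrive, at the cost of a slightly lower compression ratio than compressing the whole file at once.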
Cloud FastPath uses the largest available TCP window without overwhelming the receiver window. By sending appropriate sizes of data to the receiver at the correct intervals, the connections between systems are kept busy for TCP efficiency. Cloud FastPath also has checks in place to avoid congestion collapse and network loss.
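Why window size matters can be shown with the standard bandwidth-delay product relationship: a single TCP connection can have at most one window of unacknowledged data in flight per round trip. The sketch below is general networking arithmetic, not a description of Cloud FastPath internals, and the link speed and round-trip time are example values.

```python
def bandwidth_delay_product(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a link of this speed and latency full."""
    return bandwidth_bits_per_sec * rtt_seconds / 8

def max_throughput(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on single-connection throughput, in bits per second."""
    return window_bytes * 8 / rtt_seconds

# Example values: a 1 Gbit/s link with a 100 ms round-trip time.
needed = bandwidth_delay_product(1e9, 0.100)
print(f"window needed to fill the link: {needed / 1e6:.1f} MB")

# A 64 KiB window on the same path caps throughput far below the link speed.
capped = max_throughput(64 * 1024, 0.100)
print(f"throughput with a 64 KiB window: {capped / 1e6:.2f} Mbit/s")
```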
Security in flight
Data is streamed directly between the source and target systems in memory. Data is never persisted within any Cloud FastPath system. In addition, connections are encrypted with TLS at all points during a transfer. Data streams are not reliant on the Cloud FastPath web application, and will continue to run during any application downtime.
Sync and mirroring
Cloud FastPath supports syncing multiple sources with Cloud Platform. Files that have been created or modified since the previous run are transferred when a new job runs. Jobs can be chained or scheduled to run automatically, either at a specific time or at a set interval.
Cloud FastPath lets you control overwrite and deletion mirroring behavior between sources and Cloud Platform. By default, Cloud FastPath scans sources for newly created and modified files, and then transfers or overwrites them. Also by default, Cloud FastPath does not mirror deletions between sources and Cloud Platform. Each of these behaviors can be changed while configuring a sync or transfer job in the Cloud FastPath UI. File lists and individual file IDs are used for comparison.
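The default behavior described above can be sketched as a comparison of two file listings. This is only an illustration: the function name, the dictionary-based listings, and the use of a version string as the comparison value are assumptions, not Cloud FastPath's data model.

```python
def plan_sync(source: dict, target: dict,
              overwrite: bool = True,
              mirror_deletions: bool = False) -> dict:
    """Compare listings (path -> version marker) and decide what a job would do.

    Defaults follow the text: new and modified files are transferred,
    deletions are not mirrored.
    """
    new = sorted(p for p in source if p not in target)
    modified = sorted(
        p for p in source if p in target and source[p] != target[p]
    ) if overwrite else []
    deleted = sorted(
        p for p in target if p not in source
    ) if mirror_deletions else []
    return {"transfer": new, "overwrite": modified, "delete": deleted}

source = {"run1.bam": "v2", "run2.bam": "v1"}
target = {"run1.bam": "v1", "old.bam": "v1"}
print(plan_sync(source, target))
# With deletion mirroring enabled, old.bam would also be removed from the target.
print(plan_sync(source, target, mirror_deletions=True))
```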
Error handling and transfer automation
Error handling automation helps ensure data streams can run uninterrupted. Errors are most commonly caused by disallowed file types or improper permissions. When errors do occur, Cloud FastPath takes automatic steps to resolve them. These steps include:
- Initially skipping files that fail and retrying them later, because some failed transfers are resolved through simple retries.
- Re-transmitting data due to network, target API, or other transient issues.
- Modifying file names to conform to the requirements of the destination system.
- Translating user names, groups, and permissions from source to destination.
- Only metadata is affected by file name changes and permission translation. No file data is ever altered.
- SHA-1 or MD5 checksums (depending on systems) are maintained between the sources, Cloud FastPath POPs, and Cloud Platform to validate data quality. The checksums are auditable in the Cloud FastPath reports.
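Two of the steps above, retrying after transient failures and validating with a checksum, can be sketched together. This is a generic pattern, not Cloud FastPath's code: `TransientError`, `transfer_with_retry`, the `send` callable, and the backoff values are all hypothetical names and parameters introduced for illustration.

```python
import hashlib
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (network, target API)."""

def checksum(data: bytes, method: str = "md5") -> str:
    """Hex digest of the data; method can be "md5" or "sha1"."""
    return hashlib.new(method, data).hexdigest()

def transfer_with_retry(send, data: bytes, attempts: int = 3,
                        base_delay: float = 0.5, sleep=time.sleep) -> str:
    """Call send(data), retrying transient failures with exponential backoff.

    send is assumed to return the bytes as stored at the target so that the
    checksum can be compared against the source.
    """
    expected = checksum(data)
    for attempt in range(attempts):
        try:
            received = send(data)
        except TransientError:
            if attempt == attempts - 1:
                raise                      # out of retries, surface the error
            sleep(base_delay * 2 ** attempt)
            continue
        if checksum(received) != expected:
            raise ValueError("checksum mismatch after transfer")
        return expected
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays.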
Reporting and simulation
Cloud FastPath provides thorough reports of exactly what was transferred and any failed transfers on a per-file basis. The reports include file IDs, source last modified date and time, target last modification date and time, transfer success or failure, transfer errors, checksum method, checksum, and more. The reports are downloadable in CSV format. In addition to reporting, Cloud FastPath’s alerting system notifies a user, through email or in-app notifications, when a job fails or completes with or without errors.
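Because the reports are CSV files, they can be post-processed with standard tooling. The sketch below filters a report for failed transfers. The column names (`path`, `status`, `error`) and the `failure` value are assumptions made for illustration; the actual headers in a Cloud FastPath report may differ.

```python
import csv
import io

def failed_transfers(report_csv: str,
                     status_column: str = "status",     # assumed column name
                     failure_value: str = "failure"):   # assumed status value
    """Return the rows of a CSV report whose status marks a failed transfer."""
    reader = csv.DictReader(io.StringIO(report_csv))
    return [
        row for row in reader
        if row[status_column].strip().lower() == failure_value
    ]

# Hypothetical report contents, for illustration only.
report = (
    "path,status,error\n"
    "sample_a.bam,success,\n"
    "sample_b.bam,failure,permission denied\n"
)
for row in failed_transfers(report):
    print(row["path"], "->", row["error"])
```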
The simulation feature is a report run before starting a stream. It analyzes the source systems to provide insight on the data within the source that is about to be moved, helping to identify potential errors and estimate the size and time of data transfer.
Use case: Creating data streams to feed processing workflows
For genomics and big data processing in Cloud Platform, there are normally a
variety of data sources located across the globe. These sources might include
file servers that are distributed across many locations, individual user
machines, public data sources, and machine-generated data sets. In addition,
there might be data sets that are located within other cloud services. While
gsutil is helpful in moving data onto Google’s processing and storage systems,
other tools might be needed to establish high-performance, flexible, and
encrypted data streams, without significant development efforts.
As an example, consider a company that has created genomic alignment processors and a population query engine within Cloud Platform. Raw genomic and processed data is generated by a variety of researchers scattered across the globe as well as high performance computing clusters in Texas, Tokyo, and California. The data is stored on-premises on NAS devices and individual machines, as well as in the cloud on Box and Dropbox.
Building a system to connect these disparate sources, the sequencing centers and Cloud Platform would require dedicated development resources. Maintenance would be frequent and scaling the system to handle new sources would be troublesome.
With Cloud FastPath, these data movement requirements could be fulfilled using both the web application UI and the API (which requires Python 2.7). For data residing in services such as Box or Dropbox, jobs can be configured to sync to Cloud Storage in seconds. Agents can be downloaded onto individual computers or on-premises file servers with specific one-time usage codes so that they are automatically included in certain streams, without the need for end-user configuration. Or, as new file systems or end-user machines are provisioned, a new stream can automatically be created using the Cloud FastPath API:
python cfp nodes --id 56b3c5afc1e9c67b03690b55 --update --fields source_filedir="Path_to_raw_genomic_data"
After the data moves through sequencing, a data stream to Cloud Platform can be started by initiating a simple command:
python cfp jobs --start --id 53c68eacbc1ea508fc39ef98
After processing, the speed at which results are delivered is as important as the processing itself. Cloud FastPath can quickly distribute large results, such as genomic data, imagery-intensive analysis, or bioinformatics output, to every location that needs them.
While this use case outlines a complicated data pipeline, Cloud FastPath saves setup time and increases transfer speeds for many other use cases such as:
- Managing and moving VM images to and from many locations and storage systems.
- Creating fast and dependable data pipelines into Google Cloud Dataflow.
- Linking office locations or storage systems for collaboration and long-term storage when many users are working on large video, image, or creative files. Most file sync and share platforms cannot handle files over a gigabyte or two.
- Movement of many millions of files in document-heavy industries such as legal firms or architecture and design firms.
- Mirroring and syncing file sync and share systems to Cloud Platform.
- Movement of image files for processing in Cloud Platform, such as mapping and geographical analysis, medical imagery analysis, or facial recognition.
Every Cloud FastPath license comes with standard support, which includes a dedicated support representative, remote troubleshooting, and chat and email support. Phone support and tutorials are also available. Cloud FastPath also has an extensive knowledge base for specific instruction. For more information, see the Cloud FastPath FAQs.