This page provides an overview of the data and analytics partner ecosystem on Google Cloud Platform (GCP). This article maps data and analytics products to data pipeline phases and then discusses how partner solutions can work for specific use cases. Each use-case discussion provides a reference architecture.
As it has become easier for anyone to create content, the digital world is generating data at exponential rates. As the rate of data generation increases, it becomes progressively more difficult to use traditional tools to analyze data at scale. You need better tools and technologies to bridge the gap between data generation and data analysis. Simplifying data analytics also helps to pave the way for using machine learning.
Big data defined
Before going further, it's helpful to understand what big data means. You cannot simply assign a fixed-size metric to a data set and call it big data. Big data represents datasets that exhibit properties such as high volume, high velocity, or high variety. These properties mean that you cannot use traditional technologies to process these datasets. As a result, several popular massively parallel processing (MPP) frameworks, such as Hadoop, have emerged to help process big data workloads efficiently.
Benefits of Cloud Platform
GCP provides cloud-native storage and processing services that can help you address all your key big data needs, such as event delivery, storage, parallel processing of streaming and batch data, and analytics. With these services, you can build and seamlessly scale end-to-end big data applications quickly, easily, and securely.
GCP allows you to define your processing logic and can take care of auto-scaling and optimizing resources on your behalf. GCP also provides fast access to popular open source data processing engines, including Apache Spark and Apache Hadoop. You can use this open source software to run your processing directly on the data stored in the GCP storage services.
Phases of a big data pipeline
The following diagram shows the stages that are common to most big data pipelines.
The first phase of any data lifecycle is to ingest the data from the unprocessed source, such as Internet of Things (IoT) devices, on-premises systems, application logs, or mobile apps. After the data is available in GCP, you choose how to store it appropriately, process and analyze it from raw forms into actionable information, and explore and visualize the data to generate and share insights.
For a detailed overview of the data lifecycle on GCP, see Data Lifecycle on GCP.
Cloud Platform partner ecosystem
GCP has an extensive network of data and analytics partners that can help you migrate your data, use your preferred software through SaaS offerings, provide services that can help you process and analyze your data, and consult with your team to design your big data applications. The GCP partner network can also supplement existing GCP services to help you get more out of the platform.
The following diagram maps partner offerings to big data pipeline phases.
The data integration and replication services offered by GCP partners:
- Enable you to perform extract-transform-load (ETL) processing on your data.
- Enable you to connect to a variety of different data sources.
- Help you migrate your data into GCP storage services.
For example, you could migrate your on-premises Hadoop cluster into Google Cloud Dataproc by using the services offered by the data integration and replication partners.
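A common first step in such a migration is staging files in Cloud Storage, where Cloud Dataproc can read them directly. The following sketch shows one way to do that with the official `google-cloud-storage` client; the bucket name and directory paths are placeholders, and the upload function assumes the library is installed and application default credentials are configured.

```python
import os


def destination_blob_name(local_path, base_dir):
    """Map a local file path to an object name that preserves the
    on-premises directory layout inside the migration bucket."""
    return os.path.relpath(local_path, base_dir).replace(os.sep, "/")


def upload_file(bucket_name, local_path, base_dir):
    """Upload one file to Cloud Storage, keeping its relative path.

    Requires google-cloud-storage and application default credentials.
    """
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name(local_path, base_dir))
    blob.upload_from_filename(local_path)
```

Partner integration tools automate this kind of copy at scale, along with change tracking and scheduling; the sketch only illustrates the path mapping involved.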
The partner services available for the Process & Analyze stage can help you analyze your data and build charts that reveal hidden trends in your data.
The connector partners provide interface drivers that you can use to connect to GCP storage services and query the data stored in them. You can also integrate these drivers into your applications and use them to access data without having to worry about implementing the API.
This section presents a few use cases and reference architectures that show how partner solutions can further extend and supplement GCP.
Click-stream events contain data about user behavior on websites.
Use case: Assume that you are capturing click-stream events for your ecommerce website. The purpose of the application is to record every click made by the end user and to perform traffic analytics on the data. You want to track, in real time, which pages users visit most often or stay on the longest, the shopping-cart abandonment rate, and user navigation flows.
Reference architecture: The following diagram shows a reference architecture for this use case.
You can use Google Cloud Pub/Sub to collect the large stream of click events coming from the website. You can use Cloud Dataflow to process the data stream coming from Cloud Pub/Sub. You can create separate streaming and batch subscriptions to Cloud Pub/Sub to handle the real-time and batch use cases independently. The batch pipeline ensures that you have raw data stored in Google Cloud Storage as a backup, so you can handle issues related to data recovery, logical corruption, and data reconciliation.
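On the producing side, each click can be published to Pub/Sub as a small JSON message. The following is a minimal sketch using the `google-cloud-pubsub` client; the project and topic IDs are placeholders, and the event fields shown are illustrative rather than a required schema.

```python
import json
import time


def make_click_event(user_id, page, action):
    """Serialize one click event as a UTF-8 JSON payload for Pub/Sub."""
    return json.dumps({
        "user_id": user_id,
        "page": page,
        "action": action,
        "timestamp": time.time(),
    }).encode("utf-8")


def publish_click_event(project_id, topic_id, event_bytes):
    """Publish one event. Requires google-cloud-pubsub and credentials."""
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, event_bytes)
    return future.result()  # blocks until Pub/Sub returns the message ID
```

In production you would batch publishes and handle retries; the client library does much of this for you by default.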
The real-time pipeline performs the necessary filtering, enrichment, and time-window aggregation on the data, and can store the data in Google BigQuery. You can analyze the data stored in BigQuery by using a comprehensive set of analytics tools provided by the partner ecosystem. These tools allow you to build visualizations of the data about the click-stream events. These visualizations operate on real-time aggregated data, which helps you to derive insights about user behavior soon after it happens.
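To make the time-window aggregation step concrete, the following pure-Python sketch counts page views per fixed window. A real pipeline would express the same logic with Cloud Dataflow's windowing primitives (for example, Apache Beam's fixed windows) rather than a local loop; this version only illustrates the computation.

```python
from collections import Counter


def page_views_per_window(events, window_secs=60):
    """Count page views per (window_start, page) pair.

    Mimics the fixed-window aggregation a streaming pipeline applies:
    each event is assigned to the window containing its timestamp.
    events: iterable of (timestamp_secs, page) tuples.
    """
    counts = Counter()
    for ts, page in events:
        window_start = int(ts // window_secs) * window_secs
        counts[(window_start, page)] += 1
    return dict(counts)
```

A streaming engine additionally handles late-arriving data and triggers, which this sketch ignores.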
Gaming applications can provide big data from interesting player scenarios.
Use case: Assume that you are hosting a multiplayer gaming application on GCP. The application persists gaming data, such as player state, player scores, player groups, and leaderboards, in a NoSQL datastore.
Suppose you want to run some analytics on this data to understand user behavior, so you can improve current features, introduce new features, or even build new games based on behavioral patterns. This kind of processing can be resource-intensive, because it needs to work on entire datasets to derive correlations. Running it on the online NoSQL system could degrade its performance and hinder live gaming operations.
Reference architecture: The following diagram shows a reference architecture for this use case.
The solution that you build should satisfy two main objectives:
- Build a system that can handle data analytics at scale.
- Allow the analysts to freely perform analysis on the datasets without worrying about affecting online operations.
BigQuery can help you meet both of these objectives. BigQuery is highly scalable and provides a no-operations data warehouse. BigQuery also allows you to decouple your online system from these resource-intensive analytics by taking all the analytics load. Your analysts can run all the analytics on BigQuery without worrying about affecting the online datastore.
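For example, a leaderboard analysis that would be expensive against the online datastore is a routine query in BigQuery. The sketch below builds such a query and runs it with the official `google-cloud-bigquery` client; the table name is a placeholder for your own synced dataset, and the schema (a `player_id` and `score` column) is an assumption for illustration.

```python
def top_players_query(table, limit=10):
    """Build a leaderboard query over synced gaming data.

    `table` is a fully qualified name like "project.dataset.scores"
    (hypothetical here); adjust columns to your actual schema.
    """
    return (
        f"SELECT player_id, MAX(score) AS best_score "
        f"FROM `{table}` "
        f"GROUP BY player_id "
        f"ORDER BY best_score DESC "
        f"LIMIT {limit}"
    )


def run_query(sql):
    """Execute the query. Requires google-cloud-bigquery and credentials."""
    from google.cloud import bigquery

    client = bigquery.Client()
    return list(client.query(sql).result())
```

Because the query runs entirely inside BigQuery, the online NoSQL datastore sees no load from it.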
Data synchronization is where the data-integration solutions offered by our partner ecosystem can help a lot. Google has a set of partners who offer connectors from NoSQL databases such as MongoDB, Cassandra, and Couchbase into BigQuery. You can use their solutions to create workflows that incrementally sync the data from your online store into BigQuery, and to prepare the data for ingestion.
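The core of such an incremental sync is a watermark: only records changed since the last successful run are loaded, with an append write to BigQuery. The sketch below shows the idea using the `google-cloud-bigquery` client; partner connectors automate this pattern (plus schema mapping and error handling), and the field and table names here are hypothetical.

```python
def rows_since(rows, watermark):
    """Select records changed after the last successful sync.

    Assumes each record carries an `updated_at` value that increases
    monotonically (a timestamp or change sequence number).
    """
    return [r for r in rows if r["updated_at"] > watermark]


def append_to_bigquery(rows, table_id):
    """Append the new rows to BigQuery.

    Requires google-cloud-bigquery and credentials; `table_id` is a
    fully qualified name like "project.dataset.table".
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    client.load_table_from_json(rows, table_id, job_config=job_config).result()
```

After each successful load, you would persist the highest `updated_at` seen as the watermark for the next run.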
Finally, your analysts can run their analyses on the data stored in BigQuery, using their favorite analytics tools, without worrying about affecting online gaming operations.
Moving data from on-premises storage presents unique challenges.
Use case: Assume that you have an on-premises Hadoop installation that hosts petabytes of advertising data using hundreds of servers. This Hadoop cluster is used for performing churn analysis, understanding factors that contributed to advertising revenue, understanding advertising properties that influenced conversion the most, and so on.
Imagine that this cluster has been growing at over 1 TiB per day, and you constantly have to deal with space issues and have to drop or archive old data in order to continue taking in new data. You also have to forecast months in advance for your growth needs, because it takes months to get new servers provisioned. You run daily analytical processing at midnight every day, after you receive data for the previous day. The analytics run for less than 8 hours and make reports available for the business the following morning.
In this scenario, the cluster is idle two-thirds of the time. You are still paying for computing resources, even when you are not processing the data. You want to resolve the space issues permanently, avoid having to forecast months in advance by having an auto-scaled system, and optimize costs by not paying for idle computing time.
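That two-thirds idle figure translates directly into wasted spend, which is worth making explicit. The following back-of-envelope sketch computes the daily cost of idle capacity; the hourly rate used in the usage note is purely hypothetical.

```python
def daily_idle_spend(hourly_cluster_cost, busy_hours, hours_per_day=24):
    """Daily spend on idle capacity for an always-on cluster.

    With an 8-hour nightly job, 16 of 24 hours are paid for but unused.
    """
    return hourly_cluster_cost * (hours_per_day - busy_hours)
```

For example, at a hypothetical $30/hour cluster rate with an 8-hour nightly run, two-thirds of the compute spend buys nothing. A cluster that exists only while the job runs eliminates that idle cost entirely.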
Reference architecture: GCP and Google's partners can help you achieve this migration without any downtime. The following diagram shows partner solutions that can help you migrate your on-premises Hadoop data to GCP.
This solution continuously transfers large volumes of data to your Hadoop cluster running on GCP as the data changes on-premises, with strong consistency, eliminating the need for a migration window. Cloud Dataproc is the migration target in this case.
Cloud Dataproc lets you store and process your data in Google Cloud Storage without having to store it in the local Hadoop Distributed File System (HDFS). Cloud Storage provides highly durable and virtually unlimited storage for your data, so you can immediately solve your space issues by migrating your data. Another advantage is that Cloud Dataproc lets you decouple storage and computing, so you do not have to provision large clusters to ensure you have enough space for the data.
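In practice, existing Hadoop and Spark jobs often need only a path change to use Cloud Storage: the Cloud Storage connector on Dataproc lets them read `gs://` URIs where they previously read `hdfs://` URIs. The sketch below shows that rewrite; the bucket name is a placeholder, and the commented PySpark line runs only on a cluster where `pyspark` and the connector are available.

```python
from urllib.parse import urlparse


def to_gcs_path(hdfs_path, bucket):
    """Rewrite an hdfs:// URI to the equivalent gs:// location.

    Drops the NameNode authority and keeps the directory layout, so
    migrated data keeps the same relative paths in Cloud Storage.
    """
    return f"gs://{bucket}{urlparse(hdfs_path).path}"


# On a Dataproc cluster, Spark can then read the migrated data directly:
#   df = spark.read.parquet(to_gcs_path("hdfs://nn/data/ads/2024", "my-bucket"))
```

Because storage and compute are decoupled this way, the same data can be shared by several clusters at once.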
You can shut down the cluster when you are not using it, without losing any data. This capability alone can reduce costs, because you no longer pay for resources that run full time.
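With data in Cloud Storage, the cluster itself becomes disposable: you can create one for the nightly run and delete it afterwards. The sketch below outlines that pattern with the `google-cloud-dataproc` client; project, region, cluster, and machine-type names are placeholders, and the exact request shape should be checked against the client library you install.

```python
def ephemeral_cluster_config(cluster_name, num_workers, machine_type):
    """Minimal spec for a job-scoped Dataproc cluster (illustrative)."""
    return {
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1,
                              "machine_type_uri": machine_type},
            "worker_config": {"num_instances": num_workers,
                              "machine_type_uri": machine_type},
        },
    }


def run_and_tear_down(project_id, region, cluster):
    """Create the cluster, run work, then delete it.

    Requires google-cloud-dataproc and application default credentials.
    """
    from google.cloud import dataproc_v1

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    client.create_cluster(request={
        "project_id": project_id, "region": region, "cluster": cluster,
    }).result()
    # ... submit Spark/Hadoop jobs against gs:// data here ...
    client.delete_cluster(request={
        "project_id": project_id, "region": region,
        "cluster_name": cluster["cluster_name"],
    }).result()
```

Deleting the cluster costs you nothing in data, because the inputs and outputs live in Cloud Storage, not in local HDFS.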
GCP enables you to add more servers to reduce processing time without doing any forecasting. You can address business cases that you couldn't address in your on-premises environment, such as adding more reports and computing more metrics for existing reports. And you can test and perform software upgrades by starting a cluster with the new software version and using that cluster for processing.
Partner solutions can also help you monitor the replication progress. You can test the new cluster in Cloud Dataproc for performance and make sure it meets your needs. While you test the new cluster, the partner solution can continue to replicate your data.
Next you can migrate the process for data ingestion, task management, and the scripts for automation into the GCP environment. When you are happy with the new cluster, and when it meets all your needs along with the convenience of unlimited storage and elastic infrastructure, you can do the final switchover. After the final switchover, you can shut down data ingestion into your on-premises cluster and use the Dataproc cluster as a primary cluster for all your data processing needs.