This page describes an architecture for complex event processing on Google Cloud Platform.
You can use Google Cloud Platform to build the elastic and scalable infrastructure needed to import vast amounts of data, process events, and execute business rules. Any architecture for complex event processing (CEP) must have the ability to import data from multiple, heterogeneous sources, apply complex business rules, and drive outbound actions.
The following architecture diagram shows such a system and the components involved:
This section provides details about the architecture shown in the diagram. The architecture supports streaming input, which handles a continuous flow of data, and batch input, which handles data in sizable chunks.
In scenarios that require immediate action based on real-time, user-generated or machine-generated events, you can use a streaming architecture to handle and process complex events.
User-generated events often come from interactions within applications that are hosted on-premises or in the cloud and can be in the form of clickstreams, user actions such as adding items to a shopping cart or wish list, or financial transactions. Machine-generated events often come from mobile or other devices that report presence, diagnostic, or similar types of data. There are several different platforms for hosting applications on Cloud Platform:
- Google App Engine is a platform as a service (PaaS) that allows you to deploy applications directly into a managed, auto-scaling environment.
- Google Container Engine provides finer-grained control for container-based applications, allowing you to easily orchestrate and deploy containers across a cluster.
- Google Compute Engine provides high-performance virtual machines for maximum flexibility.
Whether applications are hosted on-premises or on Cloud Platform, the events they generate must travel through a loosely coupled, highly available messaging layer. Google Cloud Pub/Sub, a globally durable message-transport service, supports native connectivity to other Cloud Platform services, making it the glue between applications and downstream processing.
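Because Cloud Pub/Sub carries only opaque message bytes, applications typically agree on an envelope format for the events they publish. The following sketch shows one way to build such an envelope; the field names (`event_type`, `source`, `ts`, `payload`) are illustrative assumptions, not a required Pub/Sub schema.

```python
import json
import time

def make_event_message(event_type, payload, source):
    """Build a JSON envelope suitable for publishing to a Pub/Sub topic.

    These field names are one possible convention, not a Pub/Sub requirement;
    Pub/Sub itself treats the message body as opaque bytes.
    """
    envelope = {
        "event_type": event_type,        # e.g. "cart_add", "device_heartbeat"
        "source": source,                # originating application or device
        "ts": int(time.time() * 1000),   # event time, epoch milliseconds
        "payload": payload,              # source-specific event details
    }
    return json.dumps(envelope).encode("utf-8")  # Pub/Sub messages are bytes

# Example: a user adds an item to a shopping cart.
msg = make_event_message("cart_add", {"sku": "A-123", "qty": 1}, "webstore")
```

A downstream subscriber can decode the same envelope to recover the event type and payload before handing the event to processing.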
In scenarios with large amounts of historical data stored in on-premises or cloud databases, you can use a batch-oriented processing approach to import and process complex events.
Data sources that contain events to be processed can be located on-premises or on Cloud Platform. For on-premises data sources, you can export data and subsequently copy it to Cloud Storage for downstream processing. You can directly access data stored in cloud-hosted databases by using downstream processing tools, and there are several different platforms you can use to store unprocessed event data on Cloud Platform:
- Google Cloud Storage provides object storage for flat files or unstructured data.
- Google Cloud SQL stores relational data.
- Google Cloud Datastore is a NoSQL database that supports dynamic schemas and full data-indexing.
Processing and storage
As event data streams in from applications through Cloud Pub/Sub or is loaded into other cloud data sources, such as Cloud Storage, Cloud SQL, or Cloud Datastore, you might need to transform, enrich, aggregate, or apply general-purpose computations across the data. Google Cloud Dataflow is built to perform these tasks on streaming or batched data by using a simple pipeline-based programming model. Cloud Dataflow executes these pipelines on a managed service that elastically scales as needed. In CEP workloads, you can use Cloud Dataflow to normalize data from multiple heterogeneous sources and format it into a single, consistent representation.
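The normalization step can be as simple as a per-element mapping function, which a Cloud Dataflow pipeline would apply inside a transform. The sketch below shows the shape of such a function in plain Python; the source names and target fields are assumptions for illustration, not a fixed schema.

```python
# A minimal sketch of per-event normalization, the kind of mapping a
# Cloud Dataflow transform might apply to each element. The "clickstream"
# and "device" formats and the target fields are hypothetical.

def normalize_event(raw, source):
    """Map a source-specific event dict into one consistent representation."""
    if source == "clickstream":
        return {"entity": raw["user_id"], "ts": raw["timestamp_ms"],
                "kind": "click", "value": raw["page"]}
    if source == "device":
        return {"entity": raw["device_id"], "ts": raw["reported_at_ms"],
                "kind": "telemetry", "value": raw["reading"]}
    raise ValueError("unknown source: %s" % source)

# Two heterogeneous inputs reduced to one time-series representation.
events = [
    normalize_event({"user_id": "u1", "timestamp_ms": 1000,
                     "page": "/home"}, "clickstream"),
    normalize_event({"device_id": "d9", "reported_at_ms": 1200,
                     "reading": 21.5}, "device"),
]
```

Keeping the mapping per-element means the same function works unchanged whether the pipeline runs in streaming or batch mode.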
In this architecture, Cloud Dataflow processes event data and normalizes it into a single, consistent, time-series representation, which is then stored in Google Cloud Bigtable, a fully managed NoSQL database service. Cloud Bigtable provides consistent, low-latency, high-throughput data access, useful for downstream processing and analytical workloads. Refer to Bigtable Schema Design for Time Series and Loading, Storing, and Archiving Time Series for more information on storing time-series data in Cloud Bigtable.
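Cloud Bigtable has no native time-series type; ordering comes entirely from the row key, which sorts lexicographically. One common pattern from the Bigtable schema-design guidance is an `entity#reversed-timestamp` key, so that a scan over an entity's rows returns the newest events first. The key layout below is one such sketch, not the only valid design.

```python
# Sketch of a time-series row-key layout for Cloud Bigtable. Row keys sort
# lexicographically, so subtracting the event timestamp from a maximum value
# ("reversing" it) makes an entity's newest rows sort first in a scan.

MAX_TS = 2**63 - 1  # largest 64-bit timestamp, used to reverse ordering

def row_key(entity_id, ts_ms):
    """Build a row key that sorts an entity's newest events first."""
    reversed_ts = MAX_TS - ts_ms
    # Zero-pad the reversed timestamp so lexicographic order matches
    # numeric order.
    return "%s#%020d" % (entity_id, reversed_ts)

# A lexicographic sort of the keys puts the most recent event (ts=3000) first.
keys = sorted(row_key("device9", ts) for ts in (1000, 2000, 3000))
```

Designing the key around the dominant read pattern (here, "latest events for an entity") is the main schema decision Bigtable asks of you.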
At the heart of CEP systems are rules that must be executed across incoming events. Rules engines can vary from simple, custom, domain-specific languages to complex business rules management systems. Regardless of the rules engine's complexity, it can be deployed in a distributed fashion using either Cloud Dataflow or Google Cloud Dataproc, a managed service for running Hadoop or Spark clusters. Using either of these systems, you can execute complex rules across large volumes of time-series events, and then use the results of these rule evaluations to drive outbound actions.
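At its simplest, a rule is a predicate over a window of events paired with an outbound action to trigger when it matches. The sketch below shows that core shape in plain Python; in this architecture the same logic would run distributed inside a Cloud Dataflow pipeline or a Spark job on Cloud Dataproc, and the rule definitions themselves are hypothetical.

```python
# A minimal rules-engine sketch: each rule is (name, predicate, action),
# evaluated over a window of normalized time-series events. The rules here
# are hypothetical examples, not part of any particular rules engine.

RULES = [
    ("high_temp", lambda evs: any(e["value"] > 30 for e in evs), "alert_ops"),
    ("idle_entity", lambda evs: len(evs) == 0, "send_reminder_email"),
]

def evaluate(rules, window):
    """Return (rule name, action) for every rule whose predicate matches."""
    return [(name, action) for name, pred, action in rules if pred(window)]

# One telemetry reading above the threshold: high_temp fires, idle_entity
# does not.
window = [{"entity": "d9", "ts": 1200, "kind": "telemetry", "value": 31.4}]
matches = evaluate(RULES, window)
```

A production rules engine adds windowing, state, and a rule-authoring language on top, but the evaluate-predicates-over-a-window core stays the same.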
After inbound events have been processed by the rules execution mechanism, results can be turned into a series of outbound actions. These actions can include push notifications for mobile devices, notifications for other applications, or sending email to users. To deliver these actions, you can send messages through Cloud Pub/Sub to applications hosted on-premises or on Google Cloud Platform, whether running on App Engine, Container Engine, or Compute Engine.
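Turning rule results into outbound actions amounts to routing each fired rule to the Pub/Sub topic that the corresponding delivery service subscribes to. The sketch below shows that routing step; the topic names and message fields are assumptions for illustration, and the actual publish call would use a Pub/Sub client against real topics.

```python
import json

# Sketch of mapping fired rules to outbound Pub/Sub messages. The project
# and topic names below are hypothetical placeholders.

TOPIC_FOR_ACTION = {
    "alert_ops": "projects/my-project/topics/ops-alerts",
    "send_reminder_email": "projects/my-project/topics/emails",
}

def to_outbound(rule_name, action, entity):
    """Build the (topic, payload) pair for one fired rule."""
    topic = TOPIC_FOR_ACTION[action]
    payload = json.dumps({"rule": rule_name, "entity": entity}).encode("utf-8")
    return topic, payload

# A fired "high_temp" rule becomes a message on the ops-alerts topic.
topic, payload = to_outbound("high_temp", "alert_ops", "d9")
```

Keeping one topic per action channel lets each delivery service (push, email, application notification) scale and fail independently.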
You can store the results of rules execution across inbound events in Google BigQuery to perform exploration and analysis for business intelligence or reporting purposes. BigQuery is a managed, petabyte-scale data warehouse that supports SQL-like queries and has connectivity to additional downstream analytical tools. For data-science-oriented analysis, Google Cloud Datalab offers powerful data exploration capabilities based on Jupyter notebooks. For reporting or dashboard use cases, Google Data Studio 360 supports creating customizable and shareable reports, backed by BigQuery's query and storage capabilities.
- Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.