This page describes an architecture for complex event processing on Google Cloud.
You can use Google Cloud to build the elastic and scalable infrastructure needed to import vast amounts of data, process events, and execute business rules. Any architecture for complex event processing (CEP) must have the ability to import data from multiple, heterogeneous sources, apply complex business rules, and drive outbound actions.
The following architecture diagram shows such a system and the components involved:
This section provides details about the architecture shown in the diagram. The architecture supports streaming input, which can handle a continuous data flow, and batch input, which handles data as sizable chunks.
In scenarios that require immediate action based on real-time user-generated or machine-generated events, you can use a streaming architecture to handle and process complex events.
User-generated events often come from interactions within applications that are hosted on-premises or in the cloud. They can take the form of clickstreams, user actions such as adding items to a shopping cart or wish list, or financial transactions. Machine-generated events can come from mobile or other devices that report presence, diagnostic, or similar types of data. There are several different platforms for hosting applications on Google Cloud:
- App Engine is a platform as a service (PaaS) that allows you to deploy applications directly into a managed, auto-scaling environment.
- Google Kubernetes Engine provides finer-grained control for container-based applications, allowing you to easily orchestrate and deploy containers across a cluster.
- Compute Engine provides high-performance virtual machines for maximum flexibility.
Whether applications are hosted on-premises or on Google Cloud, the events they generate must travel through a loosely coupled, highly available messaging layer. Pub/Sub, a globally durable message-transport service, supports native connectivity to other Google Cloud services, making it the glue between applications and downstream processing.
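As an illustration of what an application might hand to the messaging layer, the following minimal sketch packages one event as a byte payload plus string attributes, which is the general shape of a Pub/Sub message. The function name `build_message` and all field names (`event_id`, `emitted_at`, `source`, `event_type`) are assumptions made for this example and are not defined by this architecture. The actual publish call, which would go through a Pub/Sub client library, is not shown.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple


def build_message(source: str, event_type: str, payload: dict) -> Tuple[bytes, Dict[str, str]]:
    """Return (data, attributes) for one event.

    All field names here are illustrative assumptions, not a required schema.
    """
    body = {
        "event_id": str(uuid.uuid4()),  # lets consumers detect duplicates
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    # The message body is serialized to UTF-8 encoded JSON bytes.
    data = json.dumps(body, sort_keys=True).encode("utf-8")
    # Attributes are plain strings that subscribers can inspect without parsing the body.
    attributes = {"source": source, "event_type": event_type}
    return data, attributes


data, attributes = build_message("web-shop", "cart_add", {"user": "u-17", "sku": "A100"})
```

Keeping routing hints such as the event type in attributes, rather than only in the body, is one way to let downstream consumers select messages cheaply. Whether that split suits a given system depends on its subscribers.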
In scenarios with large amounts of historical data stored in on-premises or cloud databases, you can use a batch-oriented processing approach to import and process complex events.
Data sources that contain events to be processed can be located on-premises or on Google Cloud. For on-premises data sources, you can export data and subsequently copy it to Cloud Storage for downstream processing. Data stored in cloud-hosted databases can be accessed directly by downstream processing tools. There are several different platforms you can use to store unprocessed event data on Google Cloud:
- Cloud Storage provides object storage for flat files or unstructured data.
- Cloud SQL stores relational data.
- Datastore is a NoSQL database that supports dynamic schemas and full data-indexing.
Processing and storage
As event data streams in from applications through Pub/Sub or is loaded into other cloud data sources, such as Cloud Storage, Cloud SQL, or Datastore, you might need to transform, enrich, aggregate, or apply general-purpose computations across the data. Dataflow is built to perform these tasks on streaming or batch data by using a simple pipeline-based programming model. Dataflow executes these pipelines on a managed service that scales elastically as needed. In CEP workloads, you can use Dataflow to normalize data from multiple heterogeneous sources and format it into a single, consistent representation.
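The normalization step can be sketched in plain Python, independent of any pipeline framework. The example below assumes two hypothetical input formats (a clickstream record with epoch milliseconds and a device record with an ISO 8601 timestamp) and maps both onto one `NormalizedEvent` type. Every record field and function name is an assumption made for illustration. In a real pipeline, a function like `normalize` would be the kind of per-element transform applied to each incoming record.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NormalizedEvent:
    entity_id: str
    event_type: str
    timestamp: datetime  # always timezone-aware, in UTC
    value: float


def from_clickstream(record: dict) -> NormalizedEvent:
    # Hypothetical format: epoch milliseconds in "ts", user ID in "uid".
    return NormalizedEvent(
        entity_id=record["uid"],
        event_type=record["action"],
        timestamp=datetime.fromtimestamp(record["ts"] / 1000, tz=timezone.utc),
        value=float(record.get("amount", 0)),
    )


def from_device(record: dict) -> NormalizedEvent:
    # Hypothetical format: ISO 8601 string with a UTC offset in "time".
    return NormalizedEvent(
        entity_id=record["device_id"],
        event_type=record["metric"],
        timestamp=datetime.fromisoformat(record["time"]).astimezone(timezone.utc),
        value=float(record["reading"]),
    )


# One normalizer per source; adding a source means adding one entry.
NORMALIZERS = {"clickstream": from_clickstream, "device": from_device}


def normalize(source: str, record: dict) -> NormalizedEvent:
    return NORMALIZERS[source](record)


click = normalize("clickstream", {"uid": "u-17", "action": "purchase", "ts": 60000, "amount": "19.99"})
reading = normalize("device", {"device_id": "d-1", "metric": "temperature",
                               "time": "1970-01-01T01:01:00+01:00", "reading": 21.5})
```

Both sample records describe the same instant in different encodings, so after normalization their `timestamp` fields compare equal.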
In this architecture, Dataflow processes event data and normalizes it into a single, consistent, time-series representation, which is then stored in Cloud Bigtable, a fully managed NoSQL database service. Bigtable provides consistent, low-latency, high-throughput data access, which is useful for downstream processing and analytical workloads. For more information on storing time-series data in Bigtable, refer to Bigtable schema design for time series and Loading, storing, and archiving time series.
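Because Bigtable keeps rows sorted lexicographically by row key, the key layout largely determines how time-series data can be read back. The sketch below shows one common pattern, assumed here for illustration and not prescribed by this architecture: lead with an entity identifier and append a reversed, zero-padded timestamp so that the most recent event for an entity sorts first. The `#` separator, the `MAX_MILLIS` bound, and the function name `row_key` are all hypothetical choices.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Hypothetical upper bound on epoch milliseconds; keeps reversed values 13 digits wide.
MAX_MILLIS = 10**13


def row_key(entity_id: str, timestamp: datetime) -> str:
    """Build an 'entity#reversed_millis' key so newer events sort before older ones."""
    # Integer division of timedeltas avoids floating-point rounding.
    millis = (timestamp - EPOCH) // timedelta(milliseconds=1)
    reversed_millis = MAX_MILLIS - 1 - millis
    # Zero padding makes lexicographic order match numeric order.
    return f"{entity_id}#{reversed_millis:013d}"


older = row_key("d-1", datetime(2024, 1, 1, tzinfo=timezone.utc))
newer = row_key("d-1", datetime(2024, 1, 2, tzinfo=timezone.utc))
```

Starting the key with the entity identifier rather than the timestamp spreads writes across entities instead of concentrating them on the newest rows. The schema design guide referenced above covers such trade-offs in more depth.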
At the heart of CEP systems are rules that must be executed across incoming events. Rules engines can vary from simple, custom, domain-specific languages to complex business rules management systems. Regardless of the rules engine's complexity, it can be deployed in a distributed fashion using either Dataflow or Dataproc, a managed service for running Hadoop or Spark clusters. Using either of these systems, you can execute complex rules across large amounts of time-series events, and then use the results of these rule evaluations to drive outbound actions.
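To make the idea of a rule concrete, here is a deliberately small, single-process sketch of a rules engine: each rule pairs a time window with a condition over the events inside that window. This is an illustrative assumption about what a simple custom engine could look like, not the engine this architecture uses. The `Event` and `Rule` types, the rule names, and the thresholds are all hypothetical. A distributed deployment would partition events (for example, by entity) before evaluating rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    entity_id: str
    event_type: str
    timestamp: datetime
    value: float


@dataclass(frozen=True)
class Rule:
    name: str
    window: timedelta
    condition: Callable[[List[Event]], bool]


def evaluate(rules: List[Rule], events: List[Event], now: datetime) -> List[str]:
    """Return the names of rules whose condition holds over their recent window."""
    fired = []
    for rule in rules:
        # Only events inside [now - window, now] are visible to the rule.
        recent = [e for e in events if now - rule.window <= e.timestamp <= now]
        if rule.condition(recent):
            fired.append(rule.name)
    return fired


# Hypothetical rules: names and thresholds are made up for this example.
rules = [
    Rule("repeated_login_failure", timedelta(minutes=5),
         lambda evs: sum(e.event_type == "login_failed" for e in evs) >= 3),
    Rule("high_spend", timedelta(hours=1),
         lambda evs: sum(e.value for e in evs if e.event_type == "purchase") > 1000),
]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [Event("u-17", "login_failed", now - timedelta(minutes=m), 1.0) for m in (1, 2, 4, 30)]
events.append(Event("u-17", "purchase", now - timedelta(minutes=10), 250.0))

fired = evaluate(rules, events, now)
```

In this sample, three of the four failed logins fall inside the five-minute window, so only the first rule fires. The single purchase stays below the spending threshold.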
After inbound events have been processed by the rules execution mechanism, results can be turned into a series of outbound actions. These actions can include push notifications for mobile devices, notifications for other applications, or sending email to users. To deliver these actions, you can send messages through Pub/Sub to applications hosted on-premises or on Google Cloud, whether running on App Engine, GKE, or Compute Engine.
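The step from rule results to outbound actions can be sketched as a simple routing table. The example below assumes hypothetical rule names and topic names. It only builds the `(topic, payload)` pairs that a publisher would send through the messaging layer, and the actual delivery is not shown.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical mapping from a fired rule to the outbound topics it should notify.
ROUTES: Dict[str, List[str]] = {
    "repeated_login_failure": ["email-notifications", "security-alerts"],
    "high_spend": ["push-notifications"],
}


def build_actions(rule_name: str, entity_id: str) -> List[Tuple[str, bytes]]:
    """Return one (topic, payload) pair per outbound action for a fired rule."""
    body = json.dumps({"rule": rule_name, "entity_id": entity_id}, sort_keys=True).encode("utf-8")
    # Rules without a route produce no actions rather than raising an error.
    return [(topic, body) for topic in ROUTES.get(rule_name, [])]


actions = build_actions("high_spend", "u-17")
```

Keeping the routing table separate from the rules means that a new notification channel can be added without touching rule logic. This is one possible design, not a requirement of the architecture.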
You can store the results of rules execution across inbound events in BigQuery to perform exploration and analysis for business intelligence or reporting purposes. BigQuery is a managed, petabyte-scale data warehouse that supports SQL-like queries and has connectivity to additional downstream analytical tools. For data-science-oriented analysis, Datalab offers powerful data exploration capabilities based on Jupyter notebooks. For reporting or dashboard use cases, Google Data Studio supports creating customizable and shareable reports, backed by the query and storage capabilities of BigQuery.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.