This article describes an architecture for optimizing large-scale analytics ingestion on Google Cloud Platform (GCP). For the purposes of this article, 'large-scale' means greater than 100,000 events per second, or having a total aggregate event payload size of over 100 MB per second.
You can use GCP's elastic and scalable managed services to collect vast amounts of incoming log and analytics events, and then process them for entry into a data warehouse, such as BigQuery.
Any architecture for ingestion of significant quantities of analytics data should take into account which data you need to access in near real-time and which you can handle after a short delay, and split them appropriately. A segmented approach has these benefits:
- Log integrity. You can see complete logs. No logs are lost due to streaming quota limits or sampling.
- Cost reduction. Streaming inserts of events and logs are billed at a higher rate than batch loads from Cloud Storage, which are free.
- Reserved query resources. Moving lower-priority logs to batch loading keeps them from having an impact on reserved query resources.
The following architecture diagram shows such a system, and introduces the concepts of hot paths and cold paths for ingestion:
In this architecture, data originates from two possible sources:
- Analytics events are published to a Cloud Pub/Sub topic.
- Logs are collected using Stackdriver Logging.
After ingestion from either source, data is put into either the hot path or the cold path, based on the latency requirements of the message. The hot path uses streaming input, which can handle a continuous flow of data, while the cold path is a batch process that loads the data on a schedule you determine.
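The hot/cold decision can be sketched as a simple routing predicate. This is a minimal illustration, not a GCP API: `Event` and `needs_realtime` are hypothetical names standing in for whatever latency signal your producers attach to a message.

```python
from dataclasses import dataclass

@dataclass
class Event:
    payload: dict
    needs_realtime: bool  # hypothetical latency flag set by the producer

def route(event: Event) -> str:
    """Return which ingestion path an event should take.

    'hot'  -> streamed into BigQuery as it arrives
    'cold' -> batched to Cloud Storage and loaded on a schedule
    """
    return "hot" if event.needs_realtime else "cold"
```

In a real deployment this decision is expressed as a Stackdriver Logging sink filter or a Cloud Dataflow job's filter, not client-side code.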
You can use Stackdriver Logging to ingest logging events generated by standard operating system logging facilities. Stackdriver Logging is available by default in a number of Compute Engine environments, including the standard images, and can also be installed on many operating systems by using the Stackdriver Logging agent. Stackdriver Logging is also the default destination for logs from App Engine and Kubernetes Engine.
In the hot path, critical logs required for monitoring and analysis of your services are selected by specifying a filter in the Stackdriver Logging sink and then streamed to multiple BigQuery tables. Use separate tables for high-severity logging levels such as WARN, and then split further by service if high volumes are expected. This best practice keeps the number of inserts per second per table under the 100,000-row limit and keeps queries against this data performing well.
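The table-splitting rule above can be sketched as a small naming function. This is an illustrative sketch only; the table-name scheme and the `high_volume_services` set are assumptions, not part of any GCP API.

```python
def table_for_log(severity: str, service: str, high_volume_services: set) -> str:
    """Pick a BigQuery table name for a log entry.

    Logs are split into one table per severity level, and known
    high-volume services get their own per-service table so that no
    single table approaches the per-table streaming-insert limit.
    """
    base = f"logs_{severity.lower()}"
    if service in high_volume_services:
        return f"{base}_{service}"
    return base
```

For example, WARN logs from a high-volume `checkout` service would land in `logs_warn_checkout`, while the same severity from a quieter service shares the common `logs_warn` table.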
For the cold path, logs that don't require near real-time analysis are selected using a Stackdriver Logging sink pointed at a Cloud Storage bucket. Logs are batched and written to log files in Cloud Storage in hourly batches. These logs can then be batch loaded into BigQuery using the standard Cloud Storage file import process, which can be initiated using the Google Cloud Platform Console, the command-line interface (CLI), or even a simple script. Batch loading does not affect the hot path's streaming ingestion or query performance. In most cases, it's best to merge cold-path logs directly into the same tables used by the hot-path logs, to simplify troubleshooting and report generation.
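Grouping cold-path logs into hourly batches comes down to a naming convention for the Cloud Storage objects, so that each hour's file can be loaded with a single BigQuery load job. A minimal sketch, assuming a JSON log format and a made-up `prefix`/`logs.json` layout:

```python
from datetime import datetime

def hourly_object_name(prefix: str, ts: datetime) -> str:
    """Build a Cloud Storage object name that groups logs into hourly
    batches, e.g. cold-logs/2019/05/01/13/logs.json, so each hour can
    be loaded into BigQuery with one load job."""
    return f"{prefix}/{ts:%Y/%m/%d/%H}/logs.json"
```

A scheduled script can then point a BigQuery load job at the previous hour's object (or a wildcard over it) and append the rows to the hot-path tables.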
Analytics events can be generated by your app's services in GCP or sent from remote clients. Ingesting these analytics events through Cloud Pub/Sub and then processing them in Cloud Dataflow provides a high-throughput system with low latency. Although it is possible to send the hot and cold analytics events to two separate Cloud Pub/Sub topics, you should send all events to one topic and process them using separate hot- and cold-path Cloud Dataflow jobs. That way, you can change the path an analytics event follows by updating the Cloud Dataflow jobs, which is easier than deploying a new app or client version.
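Because all events share one Cloud Pub/Sub topic, each Cloud Dataflow job applies its own filter to decide which events it handles. The predicates below are an illustrative sketch; the event `type` field and the `HOT_EVENT_TYPES` set are assumptions, and in a real pipeline these would be filter transforms inside the Dataflow jobs.

```python
# Hypothetical event types that warrant immediate analysis.
HOT_EVENT_TYPES = {"fraud_signal", "client_error"}

def is_hot(event: dict) -> bool:
    """Filter the hot-path Dataflow job would apply to the shared topic."""
    return event.get("type") in HOT_EVENT_TYPES

def is_cold(event: dict) -> bool:
    """The cold-path job takes everything the hot path doesn't."""
    return not is_hot(event)
```

Re-routing an event type is then a matter of updating `HOT_EVENT_TYPES` and redeploying the Dataflow jobs, with no change to the apps or clients publishing events.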
Some events need immediate analysis. For example, an event might indicate undesired client behavior or bad actors. You should cherry-pick such events from Cloud Pub/Sub by using an autoscaling Cloud Dataflow job and then send them directly to BigQuery. The Dataflow job can partition this data across tables to ensure that the 100,000 rows per second per table limit is not exceeded, which also keeps queries performing well.
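The partitioning step can be sketched as a deterministic shard assignment: spread streaming inserts over enough shard tables that no single table exceeds the per-table rate limit. This is a hypothetical helper, not Dataflow API; a real pipeline would do the equivalent inside a DoFn feeding BigQuery.

```python
import math
import zlib

def shard_table(base_table: str, key: str, expected_rows_per_sec: int,
                limit: int = 100_000) -> str:
    """Assign a row to one of N shard tables so that no single table
    receives more than `limit` streaming inserts per second.

    The shard is chosen with a stable hash (crc32) of a row key, so
    the same key always lands in the same shard table.
    """
    shards = max(1, math.ceil(expected_rows_per_sec / limit))
    return f"{base_table}_{zlib.crc32(key.encode()) % shards}"
```

At 250,000 expected rows per second, for example, this yields three shard tables, each staying under the 100,000-row limit.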
Events that need to be tracked and analyzed on an hourly or daily basis, but not immediately, can be pushed by Cloud Dataflow to objects on Cloud Storage. Loads from Cloud Storage into BigQuery can be initiated by using the Google Cloud Platform Console, the CLI, or even a simple script, and you can merge the results into the same tables as the hot-path events. As with the logging cold path, batch-loaded analytics events do not have an impact on reserved query resources, and keep the load on the streaming ingestion path reasonable.
- Have a look at our published Architecture for Complex Event Processing.
- For a game-specific pattern, check out Building a Mobile Gaming Analytics Platform — a Reference Architecture.
- Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.