This article describes an architecture for optimizing large-scale analytics ingestion on Google Cloud. For the purposes of this article, 'large-scale' means more than 100,000 events per second, or an aggregate event payload of more than 100 MB per second.
You can use Google Cloud's elastic and scalable managed services to collect vast amounts of incoming log and analytics events, and then process them for entry into a data warehouse, such as BigQuery.
Any architecture for ingestion of significant quantities of analytics data should take into account which data you need to access in near real-time and which you can handle after a short delay, and split them appropriately. A segmented approach has these benefits:
- Log integrity. You can see complete logs. No logs are lost due to streaming quota limits or sampling.
- Cost reduction. Streaming inserts of events and logs are billed at a higher rate than those inserted from Cloud Storage using batch jobs.
- Reserved query resources. Moving lower-priority logs to batch loading keeps them from having an impact on reserved query resources.
The following architecture diagram shows such a system, and introduces the concepts of hot paths and cold paths for ingestion:
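In this architecture, data originates from two possible sources: logging events, which are collected using Stackdriver Logging, and analytics events, which are published to a Pub/Sub topic.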
After ingestion from either source, based on the latency requirements of the message, data is put either into the hot path or the cold path. The hot path uses streaming input, which can handle a continuous dataflow, while the cold path is a batch process, loading the data on a schedule you determine.
You can use Stackdriver Logging to ingest logging events generated by standard operating system logging facilities. Stackdriver Logging is available in a number of Compute Engine environments by default, including the standard images, and can also be installed on many operating systems by using the Stackdriver Logging Agent. The logging agent is the default logging sink for App Engine and Google Kubernetes Engine.
In the hot path, critical logs required for monitoring and analysis of your services are selected by specifying a filter in the Stackdriver Logging sink and then streamed to multiple BigQuery tables. Use separate tables for ERROR and WARN logging levels, and then split further by service if high volumes are expected. This best practice keeps the number of inserts per second per table under the 100,000 limit and keeps queries against this data performing well.
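The table-splitting rule above can be sketched as a small planning helper. This is only an illustration: the `logs_<level>_<service>` naming scheme, the function name, and the example rates are assumptions, not part of the architecture. The helper keeps one table per logging level and splits a level by service only when its combined rate would reach the per-table limit.

```python
# Per-table streaming limit quoted in this article (rows per second).
STREAMING_LIMIT = 100_000


def plan_log_tables(rates, limit=STREAMING_LIMIT):
    """Map expected hot-path log rates to BigQuery table names.

    rates: {(level, service): expected_rows_per_second}
    Returns {table_name: expected_rows_per_second}.
    The table names are hypothetical; pick any scheme that suits your dataset.
    """
    by_level = {}
    for (level, service), rate in rates.items():
        by_level.setdefault(level, {})[service] = rate

    plan = {}
    for level, services in by_level.items():
        total = sum(services.values())
        if total < limit:
            # One table per level is enough.
            plan[f"logs_{level.lower()}"] = total
        else:
            # High volume: split this level further by service.
            for service, rate in services.items():
                plan[f"logs_{level.lower()}_{service}"] = rate
    return plan


example_rates = {
    ("ERROR", "api"): 20_000,
    ("ERROR", "web"): 10_000,
    ("WARN", "api"): 80_000,
    ("WARN", "web"): 70_000,
}
example_plan = plan_log_tables(example_rates)
```

In this example the ERROR level fits in a single table, while the WARN level is split into one table per service. A single service that exceeds the limit on its own would need a further split, which this sketch does not attempt.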
For the cold path, logs that don't require near real-time analysis are selected using a Stackdriver Logging sink pointed at a Cloud Storage bucket. Logs are batched and written to log files in Cloud Storage in hourly batches. These logs can then be batch loaded into BigQuery using the standard Cloud Storage file import process, which can be initiated using the Google Cloud Console, the command-line interface (CLI), or even a simple script. Batch loading affects neither the hot path's streaming ingestion nor query performance. In most cases, it's best to merge cold path logs directly into the same tables used by the hot path logs to simplify troubleshooting and report generation.
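As one possible shape for the "simple script" mentioned above, the following sketch builds a `bq load` command for one day of exported logs. The bucket, log, dataset, and table names are hypothetical, and the dated directory layout (`<log name>/YYYY/MM/DD/`) and JSON file format are assumptions about how the sink writes files; check your own bucket before relying on them. The function only builds the argument list and does not run anything.

```python
from datetime import date


def bq_load_args(bucket, log_name, day, dataset, table):
    """Build a `bq load` command that appends one day of log files to a table.

    Assumes newline-delimited JSON files under gs://<bucket>/<log_name>/YYYY/MM/DD/
    and an existing destination table whose schema matches the exported entries.
    """
    uri = f"gs://{bucket}/{log_name}/{day:%Y/%m/%d}/*.json"
    return [
        "bq",
        "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        f"{dataset}.{table}",
        uri,
    ]


# Hypothetical names, for illustration only.
load_cmd = bq_load_args("my-log-bucket", "syslog", date(2020, 1, 15), "logs", "logs_info")
```

The resulting list can be passed to `subprocess.run(load_cmd, check=True)` from a scheduled job on a machine where the `bq` tool is installed and authenticated.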
Analytics events can be generated by your app's services in Google Cloud or sent from remote clients. Ingesting these analytics events through Pub/Sub and then processing them in Dataflow provides a high-throughput system with low latency. Although it is possible to send the hot and cold analytics events to two separate Pub/Sub topics, you should send all events to one topic and process them using separate hot- and cold-path Dataflow jobs. That way, you can change the path an analytics event follows by updating the Dataflow jobs, which is easier than deploying a new app or client version.
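The single-topic design can be illustrated with a minimal sketch in plain Python rather than an actual Dataflow pipeline. Both jobs read every event from the same topic, and each applies its own filter. The `Event` type, the event kinds, and the rule that every event goes to exactly one path are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical event type label set by the publisher
    payload: dict


def make_filter(hot_kinds, want_hot):
    """Return a predicate that a hot-path or cold-path job applies to each event."""
    hot_kinds = frozenset(hot_kinds)

    def accept(event):
        return (event.kind in hot_kinds) == want_hot

    return accept


# The routing rule lives in the jobs, not in the apps or clients that publish.
HOT_KINDS = {"suspected_abuse", "client_crash"}   # hypothetical event kinds
hot_job_accepts = make_filter(HOT_KINDS, want_hot=True)
cold_job_accepts = make_filter(HOT_KINDS, want_hot=False)

topic = [
    Event("page_view", {"page": "/home"}),
    Event("suspected_abuse", {"user": "u1"}),
    Event("client_crash", {"build": "1.2.3"}),
]
hot_events = [e for e in topic if hot_job_accepts(e)]
cold_events = [e for e in topic if cold_job_accepts(e)]
```

Moving `page_view` to the hot path only requires adding it to `HOT_KINDS` and redeploying the two jobs; the publishers keep sending to the same topic unchanged.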
Some events need immediate analysis. For example, an event might indicate undesired client behavior or bad actors. You should cherry-pick such events from Pub/Sub by using an autoscaling Dataflow job and then send them directly to BigQuery. The Dataflow job can partition this data across tables so that no single table reaches the limit of 100,000 rows per second, which also keeps queries performing well.
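One way to reason about the partitioning is sketched below: estimate how many tables are needed for a given peak rate, then assign each event to a table by a stable hash of some key. The 80% headroom figure, the function names, and the `<base>_<nn>` table naming are assumptions for illustration; a real Dataflow job might partition by service or event type instead.

```python
import zlib

ROWS_PER_SEC_LIMIT = 100_000  # per-table limit quoted in this article


def shard_count(peak_rows_per_sec, limit=ROWS_PER_SEC_LIMIT, headroom_pct=80):
    """Number of tables needed so that each stays below the limit at peak.

    headroom_pct is an assumed safety margin: plan to use only that share of the limit.
    """
    usable = limit * headroom_pct // 100
    return max(1, -(-peak_rows_per_sec // usable))  # ceiling division


def table_for(event_key, base_table, shards):
    """Pick a destination table for an event using a stable hash of its key."""
    if shards == 1:
        return base_table
    index = zlib.crc32(event_key.encode("utf-8")) % shards
    return f"{base_table}_{index:02d}"


shards = shard_count(250_000)
destination = table_for("client-1234", "hot_events", shards)
```

`zlib.crc32` is used instead of the built-in `hash` because it returns the same value across processes, so every worker routes a given key to the same table.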
Events that need to be tracked and analyzed on an hourly or daily basis, but never immediately, can be pushed by Dataflow to objects on Cloud Storage. Loads can be initiated from Cloud Storage into BigQuery by using the Cloud Console, the command-line tools, or even a simple script. You can merge them into the same tables as the hot path events. Like the logging cold path, batch-loaded analytics events do not have an impact on reserved query resources, and keep the streaming ingest path load reasonable.
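To make the cold path concrete, here is a minimal sketch of grouping events into hourly objects, written in plain Python rather than as a Dataflow pipeline. The object naming scheme (`events/YYYY/MM/DD/HH.json`), the newline-delimited JSON format, and the function names are assumptions chosen for the example. Timestamps are expected to be timezone-aware.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def hourly_object_name(timestamp, prefix="events"):
    """Name of the Cloud Storage object that holds events for this UTC hour."""
    utc = timestamp.astimezone(timezone.utc)
    return f"{prefix}/{utc:%Y/%m/%d/%H}.json"


def batch_by_hour(events, prefix="events"):
    """Group (timestamp, payload) pairs into newline-delimited JSON per hour.

    Returns {object_name: file_contents}; uploading the objects is left out.
    """
    lines = defaultdict(list)
    for timestamp, payload in events:
        name = hourly_object_name(timestamp, prefix)
        lines[name].append(json.dumps(payload, sort_keys=True))
    return {name: "\n".join(rows) + "\n" for name, rows in lines.items()}


sample_events = [
    (datetime(2020, 1, 15, 10, 5, tzinfo=timezone.utc), {"kind": "page_view"}),
    (datetime(2020, 1, 15, 10, 59, tzinfo=timezone.utc), {"kind": "purchase"}),
    (datetime(2020, 1, 15, 11, 0, tzinfo=timezone.utc), {"kind": "page_view"}),
]
objects = batch_by_hour(sample_events)
```

Each resulting object can then be loaded into BigQuery as a batch, on whatever schedule suits the analysis.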
- Have a look at our published Architecture for complex event processing.
- For a game-specific pattern, check out Building a mobile gaming analytics platform — a reference architecture.
- Try out other Google Cloud features for yourself. Have a look at our tutorials.