This article describes an architecture for optimizing large-scale analytics ingestion on Google Cloud. For the purposes of this article, large-scale means more than 100,000 events per second, or an aggregate event payload of more than 100 MB per second.
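As a quick illustration of that definition, the following sketch checks a workload against both thresholds. The function name is hypothetical, and it assumes decimal megabytes (1 MB = 1,000,000 bytes), which the article does not specify.

```python
# Thresholds taken from the definition above.
EVENTS_PER_SECOND_THRESHOLD = 100_000
# Assumption: "100 MB" means decimal megabytes.
BYTES_PER_SECOND_THRESHOLD = 100 * 1_000_000


def is_large_scale(events_per_second: float, avg_event_bytes: float) -> bool:
    """Return True if either the event-rate or the payload threshold is exceeded."""
    payload_bytes_per_second = events_per_second * avg_event_bytes
    return (
        events_per_second > EVENTS_PER_SECOND_THRESHOLD
        or payload_bytes_per_second > BYTES_PER_SECOND_THRESHOLD
    )
```

For example, 20,000 events per second at an average of 8,000 bytes each is 160 MB per second, so the workload counts as large-scale even though the event rate alone is below the threshold.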
You can use Google Cloud's elastic and scalable managed services to collect vast amounts of incoming log and analytics events, and then process them for entry into a data warehouse, such as BigQuery.
Any architecture for ingestion of significant quantities of analytics data should take into account which data you need to access in near real-time and which you can handle after a short delay, and split them appropriately. A segmented approach has these benefits:
- Log integrity. You can see complete logs. No logs are lost due to streaming quota limits or sampling.
- Cost reduction. Streaming inserts of events and logs are billed at a higher rate than those inserted from Cloud Storage using batch jobs.
- Reserved query resources. Moving lower-priority logs to batch loading keeps them from having an impact on reserved query resources.
The following architecture diagram shows such a system, and introduces the concepts of hot paths and cold paths for ingestion:
In this architecture, data originates from two possible sources:
- Logs are collected using Cloud Logging.
- Analytics events are published to a Pub/Sub topic.
After ingestion from either source, data is put into either the hot path or the cold path, based on the latency requirements of the message. The hot path uses streaming input, which can handle a continuous flow of data, while the cold path is a batch process that loads the data on a schedule you determine.
You can use Cloud Logging to ingest logging events generated by standard operating system logging facilities. Cloud Logging is available in a number of Compute Engine environments by default, including the standard images, and can also be installed on many operating systems by using the Cloud Logging Agent. The logging agent is the default logging sink for App Engine and Google Kubernetes Engine.
In the hot path, critical logs required for monitoring and analysis of your services are selected by specifying a filter in the Cloud Logging sink and then streamed to multiple BigQuery tables. Use separate tables for ERROR and WARN logging levels, and then split further by service if high volumes are expected. This best practice keeps each table under the limit of 100,000 inserts per second and keeps queries against this data performing well.
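The table split described above could be sketched as a small naming function. This is only an illustration: the table prefixes, the set of high-volume services, and the function name are all hypothetical, and how tables are actually created depends on how your sink is configured.

```python
import re
from typing import Optional

# Hypothetical table prefixes, one per hot-path severity level.
# Both "WARN" and "WARNING" are accepted because naming varies between loggers.
HOT_SEVERITY_TABLES = {
    "ERROR": "logs_error",
    "WARN": "logs_warn",
    "WARNING": "logs_warn",
}

# Hypothetical services expected to log at high volume.
HIGH_VOLUME_SERVICES = {"checkout", "search-api"}


def hot_path_table(severity: str, service: str) -> Optional[str]:
    """Return the hot-path table for a log entry, or None if it is not hot."""
    prefix = HOT_SEVERITY_TABLES.get(severity.upper())
    if prefix is None:
        return None  # Not a hot-path severity; leave it to the cold path.
    if service in HIGH_VOLUME_SERVICES:
        # Split further by service; replace characters that are unsafe in table names.
        safe_service = re.sub(r"[^A-Za-z0-9_]", "_", service)
        return f"{prefix}_{safe_service}"
    return prefix
```

Low-volume services share one table per severity level, while each high-volume service gets its own table so that no single table carries the full insert rate.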
For the cold path, logs that don't require near real-time analysis are selected using a Cloud Logging sink pointed at a Cloud Storage bucket. Logs are batched and written to log files in Cloud Storage in hourly batches. These logs can then be batch loaded into BigQuery using the standard Cloud Storage file import process, which can be initiated using the Google Cloud console, the command-line interface (CLI), or even a simple script. Batch loading does not affect the hot path's streaming ingestion or query performance. In most cases, it's probably best to merge cold path logs directly into the same tables used by the hot path logs to simplify troubleshooting and report generation.
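The "simple script" option might look something like the following sketch, which only builds a `bq load` command for one day of exported log files. The bucket, log ID, dataset, and table names are hypothetical, and the date-based directory layout under the bucket is an assumption about how the exported files are organized; check your own bucket before relying on it.

```python
import shlex
from datetime import date
from typing import List


def build_bq_load_command(
    bucket: str, log_id: str, day: date, dataset: str, table: str
) -> List[str]:
    """Build a bq CLI command that loads one day of exported JSON log files."""
    # Assumed layout: gs://<bucket>/<log_id>/YYYY/MM/DD/<hourly files>.json
    source_uri = f"gs://{bucket}/{log_id}/{day:%Y/%m/%d}/*.json"
    return [
        "bq",
        "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        f"{dataset}.{table}",
        source_uri,
    ]


if __name__ == "__main__":
    command = build_bq_load_command(
        "example-log-bucket", "syslog", date(2024, 1, 13), "logs", "logs_error"
    )
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```

Pointing the destination at a table that the hot path already writes to is one way to do the merge described above, provided the schemas match.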
Analytics events can be generated by your app's services in Google Cloud or sent from remote clients. Ingesting these analytics events through Pub/Sub and then processing them in Dataflow provides a high-throughput system with low latency. Although it is possible to send the hot and cold analytics events to two separate Pub/Sub topics, you should send all events to one topic and process them using separate hot-path and cold-path Dataflow jobs. That way, you can change the path an analytics event follows by updating the Dataflow jobs, which is easier than deploying a new app or client version.
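The single-topic approach can be sketched as two filter predicates, one for each job, driven by a shared configuration value. This is plain Python rather than Dataflow pipeline code; the event shape (a mapping with a `type` field), the event type names, and the assumption that the cold path takes every event the hot path does not are all illustrative.

```python
from typing import Any, FrozenSet, Mapping

# Hypothetical configuration: event types that need near real-time analysis.
# Changing this set re-routes events without touching apps or clients.
HOT_EVENT_TYPES: FrozenSet[str] = frozenset({"fraud_signal", "client_error"})


def is_hot(event: Mapping[str, Any], hot_types: FrozenSet[str] = HOT_EVENT_TYPES) -> bool:
    """Filter predicate for the hot-path job."""
    return event.get("type") in hot_types


def is_cold(event: Mapping[str, Any], hot_types: FrozenSet[str] = HOT_EVENT_TYPES) -> bool:
    """Filter predicate for the cold-path job (assumed to be everything else)."""
    return not is_hot(event, hot_types)
```

Because both jobs read the same topic and differ only in their predicate, moving an event type from the cold path to the hot path is a configuration change to the jobs.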
Some events need immediate analysis. For example, an event might indicate undesired client behavior or bad actors. You should select such events from Pub/Sub by using an autoscaling Dataflow job and then send them directly to BigQuery. The Dataflow job can partition this data across tables to ensure that the limit of 100,000 rows per second per table is not reached. This also keeps queries performing well.
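One way to reason about that partitioning is to work out how many tables an expected event rate needs and then assign each event to a table with a stable hash. The sketch below is an assumption-laden illustration, not the article's method: the 50% headroom factor, the table naming scheme, and the use of CRC32 as the hash are all choices made for this example.

```python
import math
import zlib

ROWS_PER_SECOND_LIMIT = 100_000  # Per-table limit cited in this article.


def shard_count(expected_rows_per_second: float, headroom: float = 0.5) -> int:
    """Number of tables needed to keep each one at or below headroom * limit."""
    usable_rate = ROWS_PER_SECOND_LIMIT * headroom
    return max(1, math.ceil(expected_rows_per_second / usable_rate))


def table_for_event(base_table: str, event_key: str, shards: int) -> str:
    """Pick a table for an event using a stable hash of its key."""
    # zlib.crc32 is deterministic across processes, unlike the built-in hash().
    shard = zlib.crc32(event_key.encode("utf-8")) % shards
    return f"{base_table}_{shard:03d}"
```

With these assumptions, an expected rate of 250,000 rows per second gives five tables at 50% headroom, and the same event key always maps to the same table.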
Events that need to be tracked and analyzed on an hourly or daily basis, but never immediately, can be pushed by Dataflow to objects on Cloud Storage. Loads can be initiated from Cloud Storage into BigQuery by using the Google Cloud console, the Google Cloud CLI, or even a simple script. You can merge them into the same tables as the hot path events. Like the logging cold path, batch-loaded analytics events do not have an impact on reserved query resources, and they keep the load on the streaming ingest path reasonable.
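To make hourly loads straightforward, the cold-path job could group its output objects under one prefix per hour. The following sketch shows only the naming logic; the prefix layout and function name are hypothetical, and it assumes events carry timezone-aware timestamps.

```python
from datetime import datetime, timezone


def cold_path_prefix(base: str, event_time: datetime) -> str:
    """Return the hourly object prefix (in UTC) that an event belongs under."""
    if event_time.tzinfo is None:
        # Naive timestamps are ambiguous, so reject them instead of guessing.
        raise ValueError("event_time must be timezone-aware")
    utc_time = event_time.astimezone(timezone.utc)
    return f"{base}/{utc_time:%Y/%m/%d/%H}/"
```

A batch load for a given hour can then target every object under that hour's prefix.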
For more information about loading data into BigQuery, see Introduction to loading data.
- Learn more about setting up a secure BigQuery Data Warehouse.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.