Using BigQuery Omni to reduce log ingestion and analysis costs in a multi-cloud environment
Rodrigo Vale
Data Analytics Engineer
TJ Mai
AppMod Customer Engineer
In today's data-centric businesses, it’s not uncommon for companies to operate hundreds of individual applications across a variety of platforms. These applications can produce a massive volume of logs, presenting a significant challenge for log analytics. Additionally, the broad adoption of multi-cloud solutions complicates accuracy and retrieval, as the distributed nature of the logs can inhibit the ability to extract meaningful insights.
BigQuery Omni was designed to effectively solve this challenge, and help reduce the overall costs when compared to a traditional approach. This blog post will dive into the details.
Log analysis involves various steps, namely:
- Log data collection: Collect log data from the organization's infrastructure and applications. A common approach is to write the log records in JSONL format and save them to an object storage service such as Google Cloud Storage. In a multi-cloud environment, moving raw log data between clouds can be cost prohibitive.
- Log data normalization: Different applications and infrastructure generate different JSONL files. Each file has its own set of fields linked to the application or infrastructure that created it. To facilitate data analysis, these different fields are unified into a common set, allowing data analysts to conduct analyses efficiently and comprehensively across the entire environment.
- Indexing and storage: Normalized data should be stored efficiently to reduce storage and query costs and to increase query performance. A common approach is to store logs in a compressed columnar file format like Parquet.
- Querying and visualization: Allows organizations to execute analytics queries that identify anomalies, anti-patterns, or known threats in the log data.
- Data lifecycle: As log data ages, its utility decreases while it still incurs storage costs. To optimize expenses, it's crucial to establish a data lifecycle process. A widely adopted strategy involves archiving logs after a month (querying log data older than a month is uncommon) and deleting them after a year. This approach effectively manages storage costs while ensuring that essential data remains accessible.
A common architecture
To implement log analysis in a multi-cloud environment, many organizations adopt the following architecture:
This architecture has its pros and cons.
On the plus side:
- Data lifecycle: It's relatively easy to implement data lifecycle management by leveraging existing features of object storage solutions. For example, in Cloud Storage you can define the following data lifecycle policy: (a) delete any object older than a week (used to remove the JSONL files produced during the collection step); (b) archive any object older than a month (applied to your Parquet files); and (c) delete any object older than a year (also for your Parquet files). A sketch of such a policy is shown after this list.
- Low egress costs: By keeping the data local, you avoid sending high volumes of raw data between cloud providers.
On the con side:
- Log data normalization: As you collect logs from different applications, you will need to code and maintain a separate Apache Spark workload for each one. In an age where (a) engineers are a scarce resource and (b) microservices adoption is growing rapidly, this is best avoided.
- Querying: Spreading your data across different cloud providers drastically reduces your analysis and visualization capabilities.
- Querying: Excluding archived files created earlier in the data lifecycle is non-trivial and prone to human error when relying on WHERE clauses to avoid partitions that contain archived files. One solution is to use Iceberg tables and manage each table's manifest by adding and removing partitions as needed. However, manually manipulating the Iceberg manifest is complicated, and adopting a third-party solution to do it just increases costs.
An improved architecture
Based on these factors, an improved solution is to use BigQuery Omni to handle all of these problems, as presented in the architecture below.
One of the core benefits of this approach is that it eliminates the separate Spark workloads and the software engineering effort required to code and maintain them. Another benefit is that a single product (BigQuery) handles the entire process, apart from storage and visualization. You also gain cost savings. We explain each of these points in detail below.
A simplified normalization process
BigQuery's ability to create an external table pointing to JSONL files and automatically determine their schema adds significant value. This feature is particularly useful when dealing with numerous log schema formats: for each application, a straightforward CREATE TABLE statement is enough to access its JSONL content. From there, you can schedule BigQuery to export the JSONL external table into compressed Parquet files partitioned by hour in Hive format. The query below is an example of an EXPORT DATA statement that can be scheduled to run every hour. Its SELECT statement captures only the log data ingested during the last hour and converts it into Parquet files with normalized fields.
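A minimal sketch of both statements is shown below; it is illustrative rather than a production query. The Omni connection name, S3 bucket paths, and raw field names (json_timestamp, level, message) are hypothetical, and the partition values in the export URI would be derived from the scheduled run time rather than hard-coded:

```sql
-- External table over one application's raw JSONL logs; omitting the schema
-- lets BigQuery auto-detect it from the JSON content.
CREATE EXTERNAL TABLE `my_project.logs.app_a_raw`
WITH CONNECTION `aws-us-east-1.s3_logs_connection`
OPTIONS (
  format = 'JSON',
  uris = ['s3://example-log-bucket/raw-jsonl/app-a/*.jsonl']
);

-- Hourly export: normalize the last hour of logs into compressed Parquet files,
-- written to a Hive-style path partitioned by date and hour.
EXPORT DATA
WITH CONNECTION `aws-us-east-1.s3_logs_connection`
OPTIONS (
  uri = 's3://example-log-bucket/parquet/app=app-a/dt=2024-06-01/hr=13/*.parquet',
  format = 'PARQUET',
  compression = 'SNAPPY',
  overwrite = true
)
AS
SELECT
  TIMESTAMP(json_timestamp) AS event_time,   -- hypothetical raw fields mapped
  UPPER(level)              AS severity,     -- to the normalized common schema
  message                   AS message,
  'app-a'                   AS source_application
FROM `my_project.logs.app_a_raw`
WHERE TIMESTAMP(json_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```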
A unified querying process across cloud providers
Having a single data warehouse platform that spans multiple cloud providers already benefits the querying process, but BigQuery Omni can also execute cross-cloud joins, which are a game changer in log analytics. Before BigQuery Omni, combining log data from different cloud providers was a challenge: due to the volume of data, sending the raw data to a single primary cloud provider generates significant egress costs, while pre-processing and filtering it reduces your ability to perform analytics on it. With cross-cloud joins, you can run a single query across multiple clouds and analyze its results.
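As an illustration, a cross-cloud join over the normalized log tables might look like the sketch below. The project, dataset, and field names are hypothetical, with one dataset assumed to live in an AWS Omni region and the other in a Google Cloud region:

```sql
-- Correlate error activity observed in AWS with activity observed in Google Cloud
-- over the last day; only the data needed for the join leaves each cloud.
SELECT
  a.source_ip,
  COUNT(DISTINCT a.user_id) AS users_seen_on_aws,
  COUNT(DISTINCT g.user_id) AS users_seen_on_gcp
FROM `my_project.aws_logs.normalized_logs` AS a
JOIN `my_project.gcp_logs.normalized_logs` AS g
  ON a.source_ip = g.source_ip
WHERE a.event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND g.event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND a.severity = 'ERROR'
GROUP BY a.source_ip
ORDER BY users_seen_on_aws DESC;
```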
Helps to reduce TCO
The final, and probably most important, benefit of this architecture is that it helps to reduce the total cost of ownership (TCO). This can be measured in three ways:
- Reduced engineering resources: Removing Apache Spark from this process brings two benefits. First, there's no need for a software engineer to write and maintain Spark code. Second, the deployment process is faster and can be executed by the log analytics team using standard SQL queries. As a PaaS with a shared responsibility model, BigQuery, together with BigQuery Omni, extends that model to data stored in AWS and Azure.
- Reduced compute resources: Apache Spark may not always offer the most cost-effective environment. An Apache Spark solution comprises multiple layers: the virtual machine (VM), the Apache Spark platform, and the application itself. In contrast, BigQuery utilizes slots (virtual CPUs, not VMs), and an export query that is converted into C-compiled code during the export process can result in faster performance for this specific task when compared to Apache Spark.
- Reduced egress costs: BigQuery Omni allows you to process data in situ and egress only results through cross-cloud joins, avoiding the need to move raw data between cloud providers to get a consolidated view of the data.
How should you use BigQuery in this environment?
BigQuery offers a choice of two compute pricing models for running queries:
- On-demand pricing (per TiB) - With this pricing model, you are charged for the number of bytes processed by each query, and the first 1 TiB of query data processed per month is free. Because log analytics tasks consume a large volume of data, we do not recommend using this model.
- Capacity pricing (per slot-hour) - With this pricing model, you are instead charged for the compute capacity used to run queries, measured in slots (virtual CPUs) over time. This model takes advantage of BigQuery editions. You can use the BigQuery autoscaler or purchase slot commitments, which are dedicated capacity always available for your workloads, at a lower price than on-demand.
We executed an empirical test and allocated 100 slots (baseline 0, max slots 100) to a project focused on exporting JSONL log data into a compressed Parquet format. With this setup, BigQuery was able to process 1 PB of data per day without consuming all 100 slots.
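For reference, a capacity setup like the one used in this test could be provisioned with BigQuery's reservation DDL. The sketch below is illustrative only: the admin project, Omni region, assignee project, and edition are placeholders to adapt to your own environment.

```sql
-- Autoscaling reservation: a baseline of 0 slots that can scale up to 100.
CREATE RESERVATION `admin_project.aws-us-east-1.log-analytics`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 0,
  autoscale_max_slots = 100
);

-- Route the log-export project's query jobs to that reservation.
CREATE ASSIGNMENT `admin_project.aws-us-east-1.log-analytics.log-export-assignment`
OPTIONS (
  assignee = 'projects/log-export-project',
  job_type = 'QUERY'
);
```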
In this blog post, we presented an architecture that aims to reduce the TCO of log analytics workloads in a multi-cloud environment by replacing Apache Spark applications with SQL queries running on BigQuery Omni. This approach helps to reduce engineering, compute, and egress costs while minimizing overall DevOps complexity, which can bring value to your unique data environment.