Go from logs to security insights faster with Dataform and Community Security Analytics
Hunter Lynne
Director of Cloud Security and Compliance, Onix
Roy Arsan
Cloud Solutions Architect, Google
Making sense of security logs such as audit and network logs can be a challenge, given the volume, variety and velocity of valuable logs from your Google Cloud environment. To help accelerate the time to get security insights from your logs, the open-source Community Security Analytics (CSA) provides pre-built queries and reports you can use on top of Log Analytics powered by BigQuery. Customers and partners use CSA queries to help with data usage audits, threat detection and investigation, behavioral analytics and network forensics. It’s now easier than ever to deploy and operationalize CSA on BigQuery, with significant queries performance gains and cost savings.
In collaboration with Onix, a premier Google Cloud service partner, we’re delighted to announce CSA can now be deployed via Dataform, a BigQuery service and an open-source data modeling framework to manage the Extraction, Loading, and Transformation (ELT) process for your data. Now, you can automate the rollout of CSA reports and alerts with cost-efficient summary tables and entity lookup tables (e.g. unique users and IP addresses seen). Dataform handles the infrastructure and orchestration of ELT pipelines to filter, normalize and model log data starting with the raw logs in Log Analytics into curated and up-to-date BigQuery tables and views for the approximately 50 CSA use cases, as shown in the dependency tree below.
The best of Google Cloud for your logs analysis
Dataform alongside Log Analytics in Cloud Logging and BigQuery provide the best of Google Cloud for your logs management and analysis.
First, BigQuery provides the fully managed petabyte-scale centralized data warehouse to store all your logs (but also other security data like your SCC findings).
Then, Log Analytics from Cloud Logging provides a native and simple solution to route and analyze your logs in BigQuery by enabling you to analyze in place without exporting or duplicating logs, or worrying about partitioning, clustering or setting up search indexes.
Finally, Dataform sets up the log data modeling necessary to report, visualize, and alert on your logs using normalized, continuously updated summary tables derived from the raw logs.
Why deploy CSA with Dataform?
1. Optimize query cost and performance
By querying logs from the summary tables, the amount of data scanned is significantly reduced as opposed to querying the source BigQuery _AllLogs view. The data scanned is often less than 1% based on our internal test environment (example screenshot below) and in line with the nature of voluminous raw logs.
This leads to significant cost savings for read-heavy workloads such as reporting and alerting on logs. This is particularly important for customers leveraging BigQuery scheduled queries for continuous alerting, and/or a business intelligence (BI) tool on top of BigQuery such as Looker, or Grafana for monitoring.
Note: Unlike reporting and alerting, when it comes to ad-hoc search for troubleshooting or investigation, you can do so via Log Analytics user interface at no additional query cost.
2. Segregate logs by domains and drop sensitive fields
Logs commonly contain sensitive and confidential information. By having your log data stored in separate domain-specific tables, you can help ensure authorized users can only view the logs they need to view to perform their job. For example, a network forensics analyst may only need access to network logs as opposed to other sensitive logs like data audit logs. With Dataform for CSA, you can ensure that this separation of duties is enforced with table-level permissions, by providing them with read-only access to network activity summary tables (for CSA 6.*), but not to data usage summary tables (for CSA 5.*).
Furthermore, by summarizing the data over time — hourly or daily — you can eliminate potentially sensitive low-level information. For example, request metadata including caller IP and user agent is not captured in the user actions summary table (for CSA 4.01). This way, for example, an ML researcher performing behavioral analytics, can focus on user activities over time to look for any anomalies, without accessing personal user details such as IP addresses.
3. Unlock AI/ML and gen AI capabilities
Normalizing log data into simpler and smaller tables greatly accelerates time to value. For example, analyzing the summarized and normalized BigQuery table for user actions, that is 4_01_summary_daily
table depicted below, is significantly simpler and delivers more insights than trying to analyze the _AllLogs
BigQuery view in its original raw format. The latter has a complex (and sometimes obscure) schema including several nested records and JSON fields, which limits the ability to parse the logs and identify patterns.
The normalized logs allows you to scale ML opportunities both computationally (because the dataset is smaller and simpler), as well as for ML researchers, who don’t need to be familiar with Cloud Logging log schema or LogEntry definition to analyze summary tables such as daily user actions.
This also enables gen AI opportunities such as using LLM models to generate SQL queries from natural language based on a given database schema. There’s a lot of ongoing research about using LLM models for text-to-SQL applications. Early research has shown promising results where simpler schema and distinct domain-specific datasets yielded reasonably accurate SQL queries.
How to get started
Before leveraging BigQuery Dataform for CSA, aggregate your logs in a central log bucket and create a linked BigQuery dataset provided by Log Analytics. If you haven’t done so, follow the steps to route your logs to a log bucket (select the Log Analytics tab) as part of the security log analytics solution guide.
See Getting started in the CSA Dataform README to start building your CSA tables and views off of the source logs view, i.e., the BigQuery view _AllLogs from Log Analytics. You can run Dataform through Google Cloud console (more common) or via the Dataform CLI.
You may want to use the Dataform CLI for a quick one-time or ad hoc Dataform execution to process historical logs, where you specify a desired lookback time window (default is 90 days).
However, in most cases, you need to set up a Dataform repository via the Cloud console, as well as Dataform workflows for scheduled Dataform executions to process historical and streaming logs. Dataform workflows will continuously and incrementally update your target dataset with new data on a regular schedule, say hourly or daily. This enables you to continuously and cost-efficiently report on fresh data. The Cloud console Dataform page also allows you to manage your Dataform resources, edit your Dataform code inline, and visualize your dataset dependency tree (like the one shown above), all with access control via fine-grained Dataform IAM roles
Leverage partner delivery services
Get on the fast path to building your own security data lake and monitoring on BigQuery by using Dataform for CSA and leveraging specialized Google Cloud partners.
Onix, a Premier Google Cloud partner and a leading provider of data management and security solutions, is available to help customers leverage this new Dataform for CSA functionality including:
Implementing security foundations to deploy this CSA solution
Setting up your Google Cloud logs for security visibility and coverage
Deploying CSA with Dataform following Infrastructure as Code and data warehousing best practices.
Managing and scaling your reporting and alerting layer as part of your security foundation.
In summary, BigQuery natively stores your Google Cloud logs via Log Analytics, as well as your high-fidelity Security Command Center alerts. No matter the size of your organization, you can deploy CSA with Dataform today to report and alert on your Google Cloud security data. By leveraging specialized partners like Onix to help you design, build and implement your security analytics with CSA, your security data lake can be built to meet your specific security and compliance requirements today.