Statistics and aggregations in UDM search using YARA-L 2.0
This page describes how to run statistical queries on UDM events and group the results for analysis using YARA-L 2.0.
When handling a large volume of UDM events generated in your environment, understanding the trends in your UDM search data is important. You can use statistics and aggregate functions to gain actionable insights from your UDM logs. UDM search supports all the aggregate functions in YARA-L 2.0.
Use cases for statistical queries
You can use statistical queries for the following use cases:
Track critical metrics: You can measure the distribution and frequency of UDM events and associated assets, such as hosts communicating with known malicious IP addresses.
Detect anomalous behaviour: You can identify activity spikes that may indicate security incidents, such as unexpected network traffic surges or logins during off-hours.
Analyze trends over time: You can assess security posture changes to evaluate control effectiveness or identify areas for improvement, such as monitoring fluctuations in vulnerability counts over time.
YARA-L 2.0 query structure in search
You can group and order UDM search query results using syntax similar to the YARA-L structured used in detection engine rules. For more information, see YARA-L 2.0 language syntax.
The YARA-L 2.0 query structure is as follows:
Filtering statement: specifies the conditions to filter events.
Match (optional): defines the fields to group by. For more information, see Match section syntax.
Outcome: specifies the outputs of the query. For more information, see Outcome section syntax.
Order: determines the order of query results as
asc
(ascending) ordesc
(descending). If the order (asc
ordesc
) is not specified, it will defaults toasc
.Limit (optional): sets the maximum number of rows the query returns.
Here is an example of order and limit usage:
metadata.log_type = "OKTA"
match:
principal.ip
Outcome:
$user_count_by_ip = count(principal.user.userid)
order:
$user_count_by_ip desc
limit:
20
Aggregations
The UDM search supports the following aggregate functions:
array
array(expression)
Description
The array
function returns all the values in the form of a list. It truncates
the list to a maximum of 25 random elements.
Param data types
STRING
Return type
LIST
Code Samples
Example
Return an array containing event types.
$event_type = metadata.event_type
outcome:
$event_type_array = array($event_type)
array_distinct
array_distinct(expression)
Description
The array_distinct
function returns all the distinct values in the form of a
list. It truncates the list to a maximum of 25 random elements. The
deduplication to get a distinct list is applied before truncation.
Param data types
STRING
Return type
LIST
Code Samples
Example
Return an array containing distinct event types.
$event_type = metadata.event_type
outcome:
$event_type_array = array_distinct($event_type)
avg
avg(numericExpression)
Description
The avg
function returns the average of values within a numeric column. It
ignores NULL
values during the calculation. It is often used
with match
to calculate the averages within specific groups in the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the average of metadata.event_timestamp.seconds
in a variable called avg_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$avg_seconds = avg(metadata.event_timestamp.seconds)
count
count(expression)
Description
The count
function returns the number of rows within a group. It is often used
with match
to get counts for specific groups in the data.
Param data types
STRING
Return type
NUMBER
Code Samples
Example
Return the count of successful user logins over time.
metadata.event_type = "USER_LOGIN"
$security_result = security_result.action
$security_result = "ALLOW"
$date = timestamp.get_date(metadata.event_timestamp.seconds, "America/Los_Angeles")
match:
$security_result, $date
outcome:
$event_count = count(metadata.id)
count_distinct
count_distinct(expression)
Description
The count_distinct
function returns the number of rows that have distinct
values within a group. It is often used with match
to get counts for
specific groups in the data.
Param data types
STRING
Return type
NUMBER
Code Samples
Example
Return the count of distinct successful user logins over time.
metadata.event_type = "USER_LOGIN"
$security_result = security_result.action
$security_result = "ALLOW"
$date = timestamp.get_date(metadata.event_timestamp.seconds, "America/Los_Angeles")
match:
$security_result, $date
outcome:
$event_count = count_distinct(metadata.id)
max
max(numericExpression)
Description
The max
function returns the maximum of the values within a numeric column.
It is often used with match
to get the maximum value within each group in
the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the maximum of metadata.event_timestamp.seconds
in a variable called max_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$max_seconds = max(metadata.event_timestamp.seconds)
min
min(numericExpression)
Description
The min
function returns the minimum of the values within a numeric column. It
is often used with match
to get the minimum value within each
group in the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the minimum of metadata.event_timestamp.seconds
in a variable called min_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$min_seconds = min(metadata.event_timestamp.seconds)
sum
sum(numericExpression)
Description
The sum
function returns the sum of values within a numeric column. It ignores
NULL
values during the calculation. It is often used with match
to calculate the sums within different groups in the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store a sum of network.sent_bytes
in a variable
called sent_bytes
.
target.ip != ""
match:
principal.ip
outcome:
$sent_bytes = sum(network.sent_bytes)
stddev
stddev(numericExpression)
Description
The stddev
function returns the standard deviation over all the possible
values.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the standard deviation of
metadata.event_timestamp.seconds
in a variable called stddev_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$stddev_seconds = stddev(metadata.event_timestamp.seconds)
YARA-L 2.0: search versus UDM usage
The
over
keyword, used for event window searches, is not supported in search.UDM search queries don't include the
condition
andoption
sections.
Group by time granularity
You can group event fields and placeholders in the match
section by a
specified time granularity, similar to grouping a column in SQL.
The syntax is as follows:
match:
... [BY|OVER EVERY] [FIRST] [TIME_GRANULARITY]
To group by time granularity, you can either use the keyword by
or over
every
. The allowed time granularities are as follows:
MINUTE
orm
HOUR
orh
DAY
ord
WEEK
orw
MONTH
ormo
Both the by
and the over every
keywords are functionally equivalent. You can
use one over the other.
Examples
Group IP address and hostname by the hour.
$hostname = principal.hostname
match:
$hostname, target.ip by hour
Group the count of all events by hostname and by the day the event occurred.
$hostname = target.hostname
match:
$hostname over every day
outcome:
$events_count = count($hostname)
Some data sources, like the entity context, are valid over a time range
(<start_time>
, <end_time>
) and don't have singular timestamps.
The first
keyword is optional and it applies to a single timestamp. This means that for a data source valid over a time range,
the keyword first
considers only the start time (<start_time>
).
For example, consider an entity with a time range of (1m, 5m
) with a time
granularity of 1m
. If the results are grouped by hosts (h1
,h2
), the returned columns will be (h1
, 1m
) and (h2
, 1m
), with the rest of the time range ignored.
The first
keyword can be added to both by
and over every
, resulting in
the same behavior for both. The use of by first
is equivalent to
over every first
.
The following is an example of a query that uses the by
operator with the
entity context data source that is valid over a time range. In this query, the
entire time range is considered because the first
keyword is omitted.
graph.entity.hostname != ""
match:
graph.entity.ip by hour
outcome:
$min_seconds = min(graph.metadata.event_metadata.event_timestamp.seconds)
Create and save visualizations in search
The Google SecOps Unified Data Model (UDM) search offers powerful data visualization capabilities. These capabilities let Security Operations Center (SOC) analysts to efficiently detect, investigate, and respond to threats by creating visualizations from search results and saving them to dashboards.
Create and save visualizations to the Native Dashboard
To create and save visualizations to add to the Native Dashboard, do the following:
Write a YARA-L query with
match
andoutcome
sections.Select a date range, and then click the Run Search to run the query. View the results on the Statistics and Visualize tabs.
On the Visualize tab, do the following: a. Select a chart type from the Chart type list. b. Adjust settings under Data Settings to customize the chart.
On the Add to dashboard screen, do the following: a. Enter a chart name, description, and time range. b. Choose to add the chart to an existing dashboard or create a new dashboard.
Click Add to Dashboard to add the chart to dashboard.
Need more help? Get answers from Community members and Google SecOps professionals.