Enhancing Digital Threat Monitoring with Machine Learning

June 14, 2022

Mandiant

Written by: Mandiant Data Science

Traditional cyber security defenses are designed to protect assets that exist within an organization’s network. But those assets often extend beyond network perimeters, increasing a company’s risk of exposure, theft, and financial loss. As part of the Mandiant digital risk protection solution, Mandiant Advantage Digital Threat Monitoring (DTM) automatically collects and analyzes content streamed from external online sources, and subsequently alerts defenders whenever a potential threat is detected. This capability allows organizations to expose threats early, and more effectively identify potential breaches and exposures before they escalate – without adding operational complexity for already overburdened security teams.

What is Digital Threat Monitoring, and Why is it So Hard to Get Right?

The newly released Mandiant Advantage DTM module alerts customers to threats emanating from social media, the deep and dark web, paste sites, and other online channels. A customer can use this module to monitor and gain visibility, in real time, into digital threats that directly or indirectly target their assets. DTM can also provide pivoting opportunities for further enrichment, context, or threat hunting. If it sounds like DTM can support a wide variety of use cases, that is exactly what it was designed to do. For instance:

As a threat intelligence analyst, I want to discover threat actors actively targeting our infrastructure, so I can prioritize defenses and remediation.
As a CISO, I want to identify threats to our vendors and supply chain, so I can proactively mitigate that risk.
As a threat hunter, I want to identify possible data leaks and breaches, so I can uncover attackers in my environment and minimize dwell time.

DTM is a continuous process, shown in Figure 1, that loops through data collection, content analysis, alerting, remediation and takedowns, and subsequent search refinement and collection. As such, the Mandiant Advantage DTM module needs to continuously evolve so that customers can be proactive about the digital threats targeting them.

https://storage.googleapis.com/gweb-cloudblog-publish/images/figure-1-dtm-0.max-500x500.png

Figure 1: Mandiant Advantage DTM Process

In addition to the dynamically changing nature of ingested content and the threat landscape itself, the diversity of ingested sources presents another significant technical challenge. While a customer wants a seamless, consistent end-to-end experience for each new source plumbed through DTM, documents derived from different sources can vary widely in terms of their structure, semantic composition, language, and length. For example, social media posts are short, laden with jargon and slang, and stuffed with syntactic symbols like hashtags and mentions. These features rarely appear in emails, which contain more formal narrative and structural elements like headers, footers, signatures, and multi-part bodies.

Legacy solutions rely primarily upon keyword matching to address the issues outlined above. However, individual keywords can match documents in a variety of irrelevant contexts. For example, the word “breach” is often used colloquially in non-security settings, so simple keyword matching could return documents about breaches of trust or breached ship hulls. Moreover, much like signature-based anti-virus, keyword matching is a brittle approach that inevitably fails to recognize novel entities and threats as they evolve.

To make matters worse, trying to define complex threat concepts, such as credential dumps or release of new exploits, using simple combinations of keywords can be an impossible task. Often, this results in huge, totally unmanageable monitoring rules with hundreds or thousands of independent keywords. Given these challenges, it is necessary to take a data-driven approach using machine learning to extract valuable information and present it in a user-friendly way.
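
The limits of keyword matching described above can be illustrated with a minimal sketch. The sample documents and the `keyword_match` helper are invented for this illustration and are not part of DTM.

```python
import re

# Hypothetical sample documents, invented for illustration only.
DOCUMENTS = [
    "Database breach exposes customer credentials on a paste site",
    "The storm caused a breach in the ship's hull",
    "Her lawyer called it a serious breach of trust",
    "Attackers breached the vendor portal last night",
]


def keyword_match(text: str, keywords: list[str]) -> bool:
    """Return True if any keyword occurs as a whole word, ignoring case."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# A single-keyword rule, as a legacy monitor might define it.
hits = [doc for doc in DOCUMENTS if keyword_match(doc, ["breach"])]

for doc in hits:
    print(doc)
```

The rule matches the ship hull and breach-of-trust sentences, which are irrelevant, and misses the last document because “breached” is not the exact keyword. Patching such gaps by adding more keywords is how rules grow to hundreds or thousands of terms.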

Augmenting Keyword Matching with Natural Language Processing

Behind the scenes, the new DTM module leverages machine learning (ML) and natural language processing (NLP) to continuously analyze and extract actionable patterns from millions of documents each day. This empowers DTM customers to craft custom monitoring rules to expeditiously identify content that matters most to their organization.

DTM is underpinned by a series of seven conditionally gated machine learning models that have been implemented, evaluated, and deployed to production by the Mandiant Data Science team (Figure 2). Together, these models form an end-to-end, cloud-based NLP pipeline that enriches ingested documents with entity extractions and classifications. These enrichments make it convenient for customers to query proprietary Mandiant data stores and customize alerts to what they care about most.
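
One way to picture “conditionally gated” models is a pipeline in which each stage runs only if a gate over the earlier enrichments allows it. The sketch below is an assumption about the general pattern, not Mandiant’s implementation. The `Document`, `Stage`, and `run_pipeline` names, and the two toy stages standing in for real models, are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Document:
    text: str
    enrichments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Stage:
    name: str
    gate: Callable[[Document], bool]  # decides whether this stage runs
    run: Callable[[Document], Any]    # computes the enrichment value


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Run stages in order, skipping any stage whose gate returns False."""
    for stage in stages:
        if stage.gate(doc):
            doc.enrichments[stage.name] = stage.run(doc)
    return doc


IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SECURITY_TERMS = ("exploit", "credentials", "malware")

# Two toy stages standing in for real models (hypothetical).
STAGES = [
    Stage(
        name="is_security",
        gate=lambda d: True,  # the first stage always runs
        run=lambda d: any(t in d.text.lower() for t in SECURITY_TERMS),
    ),
    Stage(
        name="ipv4",
        # Extraction runs only when the earlier stage flagged the document.
        gate=lambda d: d.enrichments.get("is_security", False),
        run=lambda d: IPV4_RE.findall(d.text),
    ),
]

relevant = run_pipeline(Document("New exploit served from 203.0.113.7"), STAGES)
noise = run_pipeline(Document("Grandma's apple pie recipe"), STAGES)
print(relevant.enrichments)
print(noise.enrichments)
```

Gating keeps later, costlier stages from running on documents that an earlier stage has already ruled out.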

https://storage.googleapis.com/gweb-cloudblog-publish/images/figure-2-dtm.max-800x800.png

Figure 2: DTM’s NLP Analysis Pipeline

From a technical point of view, this architecture also brings immediate benefits. It allows us to:

Measurably reduce false positives and improve the quality of dispositioned alerts
Scale horizontally to handle arbitrary increases in document volume
Quickly capture errors and feedback so that we can iterate rapidly
Expose entities and classifications produced by individual models to populate global views and historical trends

Mandiant Data Science has integrated recent, state-of-the-art neural network-based NLP techniques into the individual machine learning models that make up the pipeline in Figure 2. In previous research, our team has applied Transformer neural networks to security tasks like detecting social media information operations, malicious URLs, and even malware binaries. Transformers learn context in parallel by tracking long-distance relationships among sequential data, like words in a document. This gives them an edge over the previous generation of models, which inefficiently processed words within a limited window and produced more errors when related words occurred far apart.
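
To see why distance matters less to a Transformer, consider scaled dot-product self-attention, the core operation of that architecture. The sketch below is a deliberately stripped-down illustration: it assumes identity projections (no learned weights), a single head, and no positional encoding, and the two-dimensional toy vectors are invented. It does not represent the models used in DTM.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list[list[float]]):
    """Scaled dot-product self-attention with identity projections."""
    d = len(vectors[0])
    weights = []
    for q in vectors:
        # Every token is scored against every other token, near or far.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(d)]
        for row in weights
    ]
    return weights, outputs


# Toy sequence: the first and last tokens are related (same vector),
# while the three tokens in between are unrelated to them.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
weights, outputs = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

The first token attends to the last token as strongly as it attends to itself, and more strongly than to its immediate neighbor. Relatedness, not position, drives the weights.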

Additionally, we make use of a novel semi-supervised topic classifier that combines subject matter knowledge from Mandiant experts with a data-driven ML approach to identify high-level threat topics within each document. Within the DTM pipeline, we have achieved remarkably high levels of accuracy and noise reduction in practice by utilizing Transformer models and topic modeling.

By the same token, it’s worthwhile to note that there are processing steps within the pipeline that aren’t driven by advanced NLP approaches. In the same way that machine learning can improve but not entirely remove the need for keyword matching, we have also taken care to deploy such heavy-duty models only when simpler heuristics do not suffice. In this way, our pipeline mixes the best of both worlds, allowing for greater flexibility, extensibility, and feedback-driven improvements.
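
The idea of reserving heavy models for cases where simple heuristics fall short can be sketched as a two-step classifier. Everything here is hypothetical: the label names, the `cheap_heuristic` rules, and the `stub_model` standing in for an expensive model are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# A SHA-256 digest is 64 hexadecimal characters.
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")


def cheap_heuristic(text: str) -> Optional[str]:
    """Return a label when a simple rule is decisive, otherwise None."""
    if not text.strip():
        return "irrelevant"     # empty documents need no model
    if SHA256_RE.search(text):
        return "contains-hash"  # a regular expression is enough here
    return None


def classify(text: str, heavy_model: Callable[[str], str]) -> str:
    """Try the cheap rules first; call the model only if they abstain."""
    label = cheap_heuristic(text)
    if label is not None:
        return label
    return heavy_model(text)


model_calls: list[str] = []


def stub_model(text: str) -> str:
    """Stand-in for an expensive model; records each call it receives."""
    model_calls.append(text)
    return "needs-review"


labels = [
    classify("", stub_model),
    classify("sample hash: " + "a" * 64, stub_model),
    classify("selling access to a corporate VPN", stub_model),
]
print(labels, len(model_calls))
```

Only the third document reaches the model, so the expensive step is paid for one input out of three.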

The Benefits of Machine Learning for DTM

Higher levels of accuracy resulting from the pipeline’s machine learning models translate into improved experiences for customers using DTM. When entity types like organization are reliably extracted from ingested documents, customers on the lookout for supply chain vulnerabilities affecting Apple products do not need to scroll past noisy documents mentioning apple pie recipes or apple juice nutrition facts. Entities help customers cut through the noise that is naturally present within large volumes of documents. Furthermore, the pipeline currently supports over 40 distinct entity types, with more planned, giving customers a rich set of accurately detected entities from which to craft precise monitors that alert only on the most relevant information. An additional benefit of high-quality entity extraction is that it allows DTM alerts to be enriched with Mandiant intelligence sources. Good examples are the Mandiant indicator confidence score (IC-Score) and threat actor and malware context for IP addresses, hashes, domains, and URLs.
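
The Apple example can be made concrete with a small sketch that contrasts a keyword filter with a filter on extracted entities. The `Entity` and `EnrichedDoc` structures and the sample documents are hypothetical. The entity annotations are written by hand to stand in for model output, and this is not the DTM data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    type: str
    value: str


@dataclass
class EnrichedDoc:
    text: str
    entities: list[Entity]


# Hand-written annotations standing in for extraction results (hypothetical).
DOCS = [
    EnrichedDoc("Supply chain flaw affects Apple devices",
                [Entity("organization", "Apple")]),
    EnrichedDoc("Best apple pie recipe", []),
    EnrichedDoc("Apple juice nutrition facts", []),
]


def has_entity(doc: EnrichedDoc, entity_type: str, value: str) -> bool:
    """True if the document carries an entity of the given type and value."""
    return any(
        e.type == entity_type and e.value.lower() == value.lower()
        for e in doc.entities
    )


keyword_hits = [d for d in DOCS if "apple" in d.text.lower()]
entity_hits = [d for d in DOCS if has_entity(d, "organization", "Apple")]
print(len(keyword_hits), len(entity_hits))
```

The keyword filter returns all three documents, while the entity filter returns only the one in which “Apple” was recognized as an organization.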

Lastly, machine learning also simplifies the creation of monitors by empowering customers to filter documents by high-level topics. Documents flowing through the NLP Analysis Pipeline are tagged with up to 40 industry or threat topic labels, allowing customers to tailor the alerts they receive to common threats and categorized security-related content, or to those pertaining specifically to their industry vertical. Topics give DTM customers yet another way to refine their alerts beyond simple keyword matching. For example, incoming documents about life hacks or growth hacking are filtered out when a monitor condition requires documents to be associated with the information-security/compromised topic.
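
A monitor condition that combines a keyword with a required topic might look like the following sketch. The document structure and the `monitor_matches` helper are hypothetical and do not reflect DTM’s actual monitor syntax. Only the information-security/compromised label comes from the text above, and the topic tags are assigned by hand to stand in for classifier output.

```python
# Hand-tagged sample documents (hypothetical); untagged ones get an empty set.
DOCS = [
    {"text": "Forum post: hack dumps employee credentials",
     "topics": {"information-security/compromised"}},
    {"text": "10 life hacks for a tidy kitchen", "topics": set()},
    {"text": "Growth hacking tips for startups", "topics": set()},
]


def monitor_matches(doc: dict, keyword: str, required_topic: str) -> bool:
    """True only if the keyword appears and the topic label is present."""
    return keyword in doc["text"].lower() and required_topic in doc["topics"]


keyword_only = [d for d in DOCS if "hack" in d["text"].lower()]
alerts = [
    d for d in DOCS
    if monitor_matches(d, "hack", "information-security/compromised")
]
print(len(keyword_only), len(alerts))
```

All three documents contain “hack”, but only the one tagged with the required topic produces an alert.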

DTM has undergone thorough internal evaluation, so users can be confident that the entities and classifications that monitors are built from reflect best-in-breed NLP and threat intelligence capabilities. At the end of the day, though, the proof is in the pudding: we encourage you to test drive the newly released Mandiant Advantage DTM module for yourself.
