Achieving Autonomic Security Operations: Reducing toil
Iman Ghanizada
Global Head of Security Operations Solutions, Google Cloud
Anton Chuvakin
Security Advisor, Office of the CISO, Google Cloud
Almost two decades of Site Reliability Engineering (SRE) have proved the value of incorporating software engineering practices into traditional infrastructure and operations management. In a parallel world, we’re finding that similar principles can radically improve outcomes for the Security Operations Center (SOC), a domain plagued with infrastructure and operational challenges. As more organizations go through digital transformation, building a highly effective threat management function becomes one of their top priorities. In our paper, “Autonomic Security Operations — 10X Transformation of the Security Operations Center”, we’ve outlined our approach to modernizing Security Operations.
One of the core elements of the Security Operations modernization journey is a relentless focus on eliminating “toil.” Toil is an SRE term defined in the SRE book as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” If you’re a security analyst, you may realize that sifting through toil is one of the most significant and burdensome elements of your role. For some analysts, their entire workload fits the SRE definition of “toil.”
Another heuristic from the same source states “If your service remains in the same state after you have finished a task, the task was probably toil.” Sound familiar? Some would say that most SOC work is inherently like this: attackers come, alerts trigger, analysts triage, investigate, adjust, tune, and respond, then rinse and repeat. If our infrastructure remains in the same state after this, it may be the desired outcome, but we are still left with all of the operational challenges that make the work of the analyst cumbersome.
So, let’s talk about how you can make your SOC behave more like good SRE teams do.
First, where is that 10X improvement, mentioned in the paper, likely to come from? If attacks increase, the assets under protection grow, or your environment becomes more complex, a “toil-based” SOC will need to grow at least linearly with those changes. To handle 2X the attacks or 2X the scope (such as adding cloud to your SOC coverage), you will need 2X the people, and sometimes 2X the budget for tools.
However, if we transform the SOC based on the principles we discuss in the ASO paper, an increase in data and complexity may not require doubling your team and budget (two things that are quite an uphill battle for many security leaders!). The evolution of security operations in general, and SOC effectiveness in particular, depends heavily on driving an engineering-first mindset when operating secure systems at modern scale. You can’t “ops” your way to a modern SOC, but you can “dev” your way there! Using modern tools like Chronicle for detection and investigation can also help you reach that goal.
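To make the scaling argument concrete, here is a minimal back-of-the-envelope sketch. The function name, the alert volumes, and the per-analyst capacity are all illustrative assumptions, not figures from the ASO paper. It only shows that headcount tracks alert volume linearly when nothing is automated, and that automating part of the workload can absorb the growth.

```python
import math


def analysts_needed(alerts_per_day: float,
                    automated_fraction: float,
                    alerts_per_analyst_per_day: float = 50.0) -> int:
    """Analysts required for the alerts that automation does not resolve.

    All parameters are hypothetical planning inputs, not measured values.
    """
    if not 0.0 <= automated_fraction <= 1.0:
        raise ValueError("automated_fraction must be between 0 and 1")
    manual_alerts = alerts_per_day * (1.0 - automated_fraction)
    return math.ceil(manual_alerts / alerts_per_analyst_per_day)


# "Toil-based" SOC: no automation, so headcount grows linearly with volume.
baseline = analysts_needed(1000, automated_fraction=0.0)
doubled_with_toil = analysts_needed(2000, automated_fraction=0.0)

# Same doubled volume, but half of the alerts are resolved automatically.
doubled_with_automation = analysts_needed(2000, automated_fraction=0.5)

print(baseline, doubled_with_toil, doubled_with_automation)
```

Under these assumed numbers, doubling the alert volume doubles the required team unless automation takes on the difference.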
So, how can we put these and other SRE lessons to work in your SOC?
First, educate your team on how SRE philosophies can be implemented in the SOC. Find opportunities to do team-building exercises and empower your team to define the cultural transformation. Driving a cultural shift requires an inspired, motivated, and disciplined team.
Invest in learning programs that help your analysts develop more engineering skills. Investing in your team's careers will lead to more positive sentiment, a more motivated workforce, and a more solution-oriented team than a traditional operations team.
Aim to limit your ops time to 50%; try spending the remaining 50% on improving systems and detections with an “automate-first” mindset. BTW, engineering is not the same as writing code: “Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy.”
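One way to keep the 50% target honest is to measure it. The sketch below assumes a hypothetical weekly work log in which each item is tagged as toil or engineering; the `WorkItem` type, the sample entries, and the hours are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    description: str
    hours: float
    is_toil: bool  # manual, repetitive, automatable, no enduring value


def toil_fraction(items: list[WorkItem]) -> float:
    """Share of logged hours spent on toil (0.0 if nothing was logged)."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    return sum(item.hours for item in items if item.is_toil) / total


TOIL_CAP = 0.5  # the 50% target discussed above

# Hypothetical week for one analyst.
week = [
    WorkItem("Triage phishing alerts by hand", 18.0, True),
    WorkItem("Manually close duplicate alerts", 6.0, True),
    WorkItem("Write detection rule for a new log source", 10.0, False),
    WorkItem("Build an auto-enrichment playbook", 6.0, False),
]

fraction = toil_fraction(week)
over_budget = fraction > TOIL_CAP
print(f"toil: {fraction:.0%}, over budget: {over_budget}")
```

A report like this does not reduce toil by itself, but it shows when a team is drifting above the cap and which tasks are the best automation candidates.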
“Commit to eliminate a bit of toil each week with some good engineering” in your SOC. Here are some SOC examples: tweak that rule that produces non-actionable alerts, write a SOAR playbook that uses context data to auto-close some alerts, script a test that checks log collection is running optimally, etc.
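As an illustration of the second example, the following sketch shows the kind of logic an auto-close playbook might encode. It does not use any real SOAR product's API: the `Alert` type, the rule name, and the list of approved scanner IPs are all hypothetical stand-ins for whatever your platform and asset inventory provide.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    rule_name: str
    source_ip: str
    status: str = "open"
    close_reason: str = ""


# Hypothetical context data: addresses of approved vulnerability scanners.
APPROVED_SCANNERS = frozenset({"10.0.5.10", "10.0.5.11"})


def auto_close_known_scanners(alert: Alert,
                              approved: frozenset = APPROVED_SCANNERS) -> Alert:
    """Close port-scan alerts that originate from an approved scanner.

    Every other alert is left open for an analyst to review.
    """
    if alert.rule_name == "port_scan_detected" and alert.source_ip in approved:
        alert.status = "closed"
        alert.close_reason = "source is an approved vulnerability scanner"
    return alert


alerts = [
    Alert("port_scan_detected", "10.0.5.10"),    # approved scanner
    Alert("port_scan_detected", "203.0.113.7"),  # unknown source
    Alert("malware_detected", "10.0.5.10"),      # different rule
]

triaged = [auto_close_known_scanners(a) for a in alerts]
still_open = [a for a in triaged if a.status == "open"]
print(f"{len(still_open)} of {len(triaged)} alerts still need an analyst")
```

The rule is deliberately narrow: it closes only one alert type from a known-benign source and records the reason, so the decision stays auditable.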
Consider hiring security automation engineers who have operations experience or who can ramp up quickly. The right person can set the tone for leading your whole team through the evolution to an “SRE-inspired” 10X SOC.
We here at the Google Cybersecurity Action Team look forward to helping organizations of all sizes and capabilities to achieve Autonomic Security Operations. While the challenges that plague the SOC can at times seem insurmountable, incremental engineering improvements can drive exponential outcomes. As you look to develop your roadmap for modernizing your threat management capabilities, we’re here to partner with you along the journey.
Here are some additional resources that provide perspectives on the transition to more autonomic security operations:
“Modernizing SOC ... Introducing Autonomic Security Operations”
“Autonomic Security Operations — 10X Transformation of the Security Operations Center”
“SOC in a Large, Complex and Evolving Organization” (Google Cloud Security Podcast ep26) and “The Mysteries of Detection Engineering: Revealed!” (ep27)
“A SOC Tried To Detect Threats in the Cloud … You Won’t Believe What Happened Next”