Site Reliability Engineering helps New York City Cyber Command secure the city that never sleeps
About NYC Cyber Command
New York City Cyber Command (NYC3), originally created by executive order in 2017 by Mayor Bill de Blasio, is a centralized organization that leads New York City’s cyber defense efforts. NYC3 works across all city government agencies to prevent, respond to, and recover from cyberthreats. Cyber Command was officially added to the New York City charter by a unanimous vote of the City Council in September 2020.
Recognizing that critical security systems must be available and protecting the city around the clock, New York City Cyber Command adapted Google’s Site Reliability Engineering model to support its cloud operations.
Google Cloud results
- Becomes the first known public sector customer to model Google’s SRE practice
- Keeps New York City security systems available 24/7
- Ensures that NYC3 delivers timely and accurate cybersecurity services and data to 100+ city agencies
- Enables a small team to securely support cloud infrastructure and applications
- Supports a platform with near-infinite scalability for analyzing petabytes of data
- Creates a cohesive team across diverse IT disciplines that is dedicated to system reliability
24/7 protection for the cyber defenders of New York City
New York City is home to more than 8.5 million people. To support them, the city employs more than 330,000 civil servants. They are responsible for more than one million systems that deliver critical services, from education and health to social services and public safety.
Cyberattacks are a major concern. That’s why the New York City Cyber Command (NYC3) was created in 2017—to defend city infrastructure from cybercriminals and other threats from a centralized place.
To succeed at that mission, NYC3 has built a security log aggregation platform on Google Cloud. This platform includes a data pipeline that provides Cyber Command personnel with risk alerts, visualization, and analytics—along with the flexibility to scale up and down as needed to address sophisticated threats.
“Our users, security professionals in city agencies, depend utterly on our data and systems,” says Michael Makovoz, Chief Technology Officer for NYC3. According to Makovoz, the city requires 24/7 uninterruptible service, immediate recovery, a high degree of scalability, an enormous level of resiliency, fault tolerance, and the ability to handle huge jumps in traffic to fulfill its mission. “It’s a tall order,” he says.
So tall an order, in fact, that NYC3 knew it needed more than a traditional support model to keep its cloud-based security systems operational. The protectors of the city needed protection against downtime themselves. So they turned to Google’s Site Reliability Engineering (SRE).
Today, NYC3 is being celebrated as the first known Google Cloud public sector customer to create their own formal SRE team.
“We’ve found the Google SRE framework is ideal for keeping NYC3 security operations functional,” says Makovoz.
What is Site Reliability Engineering?
SRE was created in 2004 at Google for Google. The company needed a way to keep their many cloud-based, public-facing services—Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few—available and performing at maximum capacity and minimal latency on a global scale.
They realized that the way to do this was to treat operations downtime as a software problem. Reliability should be built into the design of the systems during development.
SRE is also based on the idea that no production system can reasonably be expected to be up 100% of the time. For each individual system, a certain amount of downtime is acceptable. Google called this the “error budget.”
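As a rough illustration of the arithmetic behind an error budget, the sketch below converts an availability target into the downtime it permits over a time window and checks how much of that allowance is left. The function names, the 99.9% target, and the 30-day window are assumptions chosen for the example; the article does not state the targets Google or NYC3 actually use.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` for an availability target.

    `slo_target` is a fraction, e.g. 0.999 for "99.9% available".
    (Hypothetical helper, not part of any NYC3 or Google tool.)
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime: timedelta
) -> timedelta:
    """Return the budget left after `downtime`; negative means overspent."""
    return error_budget(slo_target, window) - downtime


if __name__ == "__main__":
    # Assumed example values: a 99.9% target over a 30-day window.
    window = timedelta(days=30)
    print("allowed downtime:", error_budget(0.999, window))
    print("left after 20 min:", budget_remaining(0.999, window, timedelta(minutes=20)))
```

With these assumed numbers the allowance works out to roughly 43 minutes per 30 days, so a single hour-long outage would put the system over budget.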
Within the SRE model, developers are assigned an error budget for each project they work on. And, most importantly, system developers collaborate extraordinarily closely with the operations people responsible for keeping those systems running once they are in production. Google calls these operational support people site reliability engineers (SREs).
SREs and developers have a firm commitment to each other. SREs will accept responsibility for the uptime and healthy operation of a system if it has been developed in accord with SRE best practices and—most importantly—if the system doesn’t routinely go over its error budget.
This tight relationship between developers and SREs has helped create a culture of intense collaboration and cooperation that has led to both incredible reliability and super-fast innovation at Google.
The SRE approach proved especially beneficial to NYC3 during the initial months of the COVID-19 pandemic when millions of citizens needed the city to help with health, unemployment, and housing needs.
For NYC3 to fulfill its mission, all the systems supporting these services had to stay up at a time when adversaries were increasing their activities—especially targeting work-at-home employees. Thanks to SRE, NYC3 was able to keep systems up and even dramatically increase the throughput of the system while decreasing the average compute cost per logged event.
“And we did this without adding to the size of the team,” says Colin Ahern, Deputy Chief Information Security Officer for the City of New York. “That wouldn’t have been possible if we hadn’t dedicated ourselves to SRE.”
NYC3’s journey to SRE
NYC3’s interest in SRE began with a book club. The newly hired engineering staff read and discussed the Google “Site Reliability Engineering” book, which explains how Google came to develop its own SRE practice. This provided inspiration for the team. They came up with the idea of implementing SRE themselves.
The team members then met numerous times with their local Google SRE representative to help them more fully understand SRE.
NYC3 embarked on developing their own SRE team by training and cross-training their engineers on all aspects of the development life cycle—not just support, but design, development, networking, and operations. They vetted job applicants for their fitness for the complex role of SRE. And they started analyzing how the design of their systems affected their subsequent reliability, availability, and supportability.
The payoff came immediately.
“Because SRE commences at the moment when design begins on a system, instead of being reactive and fighting fires, we began to intervene before the first spark,” says Noam Dorogoyer, Data Engineer at NYC3.
It’s important to recognize that SRE is built on the foundation of continuous integration and continuous deployment (CI/CD) development methodologies, says Ahern. “It’s the natural outcome of a cloud-first approach to IT, because it enables you to use software engineering tools to operate infrastructure and applications and to change them on demand, as needed, to keep them running smoothly.”
NYC3 spent the next two and a half years building an SRE team capable not only of managing unexpected slowdowns or outages in its systems, but of understanding the very guts of the systems they supported. To that end, the SRE team worked hand in hand with developers to ensure that systems would be reliable once they moved into production. The SRE team was officially launched in January 2020.
Achieving non-linear returns at scale with SRE
By following SRE principles, NYC3 achieved substantial returns on their investments.
An early challenge came with Juggernaut, NYC3’s business-critical threat-management system. Developed internally, Juggernaut parses real-time cybersecurity data from many disparate sources and alerts the NYC3 team when something seems amiss. Because of the threat of multiple outages and data loss, NYC3 needed to guarantee to its users that Juggernaut was actually pulling the right data, analyzing it correctly, and delivering it on time.
Addressing the root cause of the software engineering problem itself was important, but it was also vital to provide metrics to prove to users that the problems had indeed been fixed. “To increase user confidence, you need to show numbers,” says Makovoz. “Measurability is everything with SRE.”
“Before we launched our internal SRE team, our users were not always completely satisfied,” says Alex Sedlin, a Data Engineer with NYC3. But after the launch, “[w]e achieved our goals by reducing the number of production incidents, decreasing the volume of data losses, and significantly improving the overall quality of the data that our team delivers to users.”
SRE practices not only help the team add functionality and capacity to Juggernaut and other internally developed core systems, but also facilitate onboarding of third-party applications.
“Today, when we evaluate third-party applications, in addition to functionality, we look at them from a supportability perspective,” says Makovoz. “And we see a lot of things that we previously wouldn’t have paid attention to.”
For example, when onboarding a third-party system, the team had to make sure it could reliably service a very large amount of data in real time.
With big projects, SRE is even more important. “And here was one major advantage of using the SRE framework,” says Sedlin. “We had built such strong bridges between multiple teams such as development, operations, and networking, that we now operate as a single unit.”
Looking forward
Hiring continues to be very important. “You need to select the right people for the mission,” says Makovoz. And that’s not easy, he says. Such candidates are part developer, part administrator, part database manager, and part operations professional, so they must wear multiple hats and dig deeply into complex technical problems.
“We’re looking for a triple threat,” agrees Ahern. “We need people with significant technical skills, with domain knowledge and a strategic outlook, who can envision how their individual day-to-day labors translate into business value.”
For training, new team members need to interact with each system—preferably first in a nonemergency environment—and participate in feature development to help them feel confident while on call, Ahern stresses.
NYC3 is also working out how to create best practices for nontraditional machine learning and artificial intelligence-based systems. “Defining service-level objectives for such systems requires a different paradigm altogether,” says Makovoz.
Another goal is to automate alerts and service-level objectives (SLOs) so that the team can handle more and more complex support initiatives without having to scale linearly.
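One common way to automate an SLO alert is to watch the burn rate, meaning how quickly the error budget is being spent relative to plan. The sketch below is a minimal illustration of that idea, not a description of NYC3's tooling: the function names, the event counts, and the default threshold of 10 are all assumptions.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic observed, so nothing has been spent
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_alert(
    bad_events: int,
    total_events: int,
    slo_target: float,
    threshold: float = 10.0,  # assumed value; tune per service
) -> bool:
    """Return True when the burn rate reaches the alerting threshold."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


if __name__ == "__main__":
    # Assumed example: 200 failed events out of 1,000 against a 99% target.
    print(should_alert(200, 1000, 0.99))
```

Because the check depends only on event counts and a target, the same function can be applied to any number of services, which is what lets alerting grow without the team growing at the same pace.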
And NYC3 continues to measure everything. That’s because even small advances in efficiency can be a big deal when you’re talking about hundreds and thousands of systems and billions of events per day, according to Ahern. “Even a tenth of a percent greater system efficiency makes a huge difference,” he says.
Ahern is eager to evangelize their practices, both internally within the city and to other state and municipal governments considering this approach.
“We’re proud to say that we’re successfully operating an extremely complex, high-velocity security data pipeline 24/7,” says Ahern. “And if other state or local municipal governments think there’s no way they could do that, we hope they look to our example and say, ‘Yes, we can.’”