Making your monolith more reliable

February 20, 2020

Eric Harvieux

SRE Systems Engineer

In cloud operations, we often hear about the benefits of microservices over monolithic architecture. Indeed, microservices help teams cope with hardware being abstracted away, and they push developers toward resilient, distributed designs. However, many enterprises still have monolithic architectures that they need to maintain.

For this post, we’ll use Wikipedia’s definition of a monolith: “A single-tiered software application in which the user interface and data access code are combined into a single program from a single platform.”

When and why to choose monolithic architecture is usually a matter of what works best for each business. Whatever the reason for using monolithic services, you still have to support them, and they bring their own reliability and scaling challenges. That’s what we’ll tackle in this post. At Google, we use site reliability engineering (SRE) principles to ensure that systems run smoothly, and these principles apply to monoliths as well as microservices.

Common problems with monoliths

We’ve noticed some common problems that arise in the course of operating monoliths. Particularly, as monoliths grow (either scaling with increased usage, or growing more complex as they take on more functionality), there are several issues we commonly have to address:

Code base complexity: Monoliths contain a broad range of functionality, meaning they often have a large amount of code and many dependencies, as well as hard-to-follow code paths, including RPCs that are not load-balanced. (These RPCs either call back into the same binary or, if the data is sharded, call between different instances of the binary.)
Release process difficulty: Frequently, monoliths consist of code submitted by contributors across many different teams. With more cooks in the kitchen and more code being cooked up every release cycle, the chances of failure increase. A release could fail QA or fail to deploy into production. These services often have difficulty reaching a mature state of automation where we can safely and continuously deploy to production, because the services require human decision-making to promote them into production. This puts additional burden on the monolith owners to detect and resolve bugs, and slows overall velocity.
Capacity: Monolithic servers typically serve various types of requests, and each type needs a different mix of compute resources (CPU, memory, storage I/O, and so on) to complete. For example, an RDBMS-backed server might handle view-only requests that read from the database and are reasonably cacheable, but may also serve RPCs that write to the database, which must be committed before returning to the user. The impact on CPU and memory consumption can vary greatly between these two. Let’s say you load-test and determine your deployment handles 100 queries per second (qps) of your typical traffic. What happens if usage or features change, resulting in a higher number of expensive write queries? These changes are easy to introduce: they happen organically when your users decide to do something different, and they can threaten to overwhelm your system. If you don’t check your capacity regularly, you can gradually become underprovisioned.
Operational difficulty: With so much functionality in one monolithic system, the ability to respond to operational incidents becomes more consequential. Business-critical code shares a failure domain with low-priority code and features. Our Google SRE guidelines require changes to our services to be safe to roll back. In a monolith with many stakeholders, we need to coordinate more carefully than with microservices, since the rollback may revert changes unrelated to the outage, slow development velocity, and potentially cause other issues.
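
To make the capacity point concrete, here is a minimal back-of-the-envelope sketch in Python. The per-request costs (5 ms of CPU for a read, 55 ms for a write) and the single-core budget of 1,000 CPU-milliseconds per second are illustrative assumptions chosen so that the baseline matches the 100 qps figure above. They are not measurements from any real system.

```python
def sustainable_qps(cpu_budget_ms_per_s, read_cost_ms, write_cost_ms, write_percent):
    """Estimate the qps a deployment can sustain for a given read/write mix.

    All inputs are hypothetical: cpu_budget_ms_per_s is the CPU time available
    per wall-clock second, and the *_cost_ms values are average CPU cost per
    request. write_percent is an integer from 0 to 100.
    """
    read_percent = 100 - write_percent
    # Weighted average CPU cost of one request under this traffic mix.
    avg_cost_ms = (read_percent * read_cost_ms + write_percent * write_cost_ms) / 100
    return cpu_budget_ms_per_s / avg_cost_ms


# Load-tested mix: 10% writes.
baseline = sustainable_qps(1000, read_cost_ms=5, write_cost_ms=55, write_percent=10)

# Same deployment after users organically shift to 30% writes.
shifted = sustainable_qps(1000, read_cost_ms=5, write_cost_ms=55, write_percent=30)

print(baseline, shifted)
```

Under these assumed costs, tripling the share of writes halves the sustainable load from 100 qps to 50 qps with no code change at all, which is why the traffic mix is worth rechecking regularly.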

How does an SRE address the issues commonly found in monoliths? The rest of this post discusses some best practices, but these can be distilled down to a single idea: Treat your monolith as a platform. Doing so helps address the operational challenges inherent in this type of design. We’ll describe this monolith-as-a-platform concept to illustrate how you can build and maintain reliable monoliths in the cloud.

Monolith as a platform

A software platform is essentially a piece of software that provides an environment for other software to run. Taking this platform approach toward how you operate your monolith does a couple of things. First, it establishes responsibility for the service. The platform itself should have clear owners who define policy and ensure that the underlying functionality is available for the various use cases. Second, it helps frame decisions about how to deploy and run code in a way that balances reliability with development velocity.

Having all the monolith code contributors share operational responsibility sets individuals against each other as they try to launch their particular changes. Instead of sharing operational responsibility, however, the goal should be to have a knowledgeable arbiter who ensures that the health of the monolith is represented when designing changes, and also during production incidents.

Scaling your platform

Monoliths that are run well converge on some common best practices. The list below is not meant to be complete and is in no particular order. We recommend considering these solutions individually to see if they might improve monolith reliability in your organization:

Plug-in architecture: One way to manifest the platform mindset is to structure your code to be modular, in a way that supports the service’s functional requirements. Differentiate between core code needed by most or all features and dedicated feature code. The platform owners can be gatekeepers for changes to core code, while feature owners can change their own code without platform-owner oversight. Isolate different code paths so you can still build and run a working binary with some chosen features disabled.
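
One possible shape for this separation is sketched below in Python. The `Platform` class, the `register` and `handle` methods, and the `search` and `thumbnails` features are all hypothetical names invented for illustration. The idea is only that core code owns routing, feature code plugs into it, and a disabled feature leaves the rest of the service working.

```python
from typing import Callable, Dict, Iterable

Handler = Callable[[str], str]


class Platform:
    """Core code (hypothetical): owns routing and decides which features load."""

    def __init__(self, disabled: Iterable[str] = ()):
        self._disabled = set(disabled)
        self._handlers: Dict[str, Handler] = {}

    def register(self, feature: str, handler: Handler) -> None:
        # A disabled feature is simply never wired in; core code is unaffected.
        if feature in self._disabled:
            return
        self._handlers[feature] = handler

    def handle(self, feature: str, payload: str) -> str:
        handler = self._handlers.get(feature)
        if handler is None:
            return f"{feature}: unavailable"
        return handler(payload)


# Feature code, owned by feature teams rather than platform owners.
def search(payload: str) -> str:
    return f"results for {payload}"


def thumbnails(payload: str) -> str:
    return f"thumbnail of {payload}"


# Start the service with one feature switched off.
platform = Platform(disabled={"thumbnails"})
platform.register("search", search)
platform.register("thumbnails", thumbnails)

print(platform.handle("search", "cats"))
print(platform.handle("thumbnails", "cats"))
```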

Policies for new code and backends: Platform owners should be clear about the requirements for adding new functionality to the monolith. For example, to be resilient to outages in downstream dependencies, you may set a latency requirement stating that new backend calls must time out within a reasonable time span (milliseconds or seconds) and may only be retried a limited number of times before returning an error. This prevents a serving thread from getting stuck waiting indefinitely on an RPC to a backend, and possibly exhausting CPU or memory.

Similarly, you might require developers to load test their changes before committing or enabling a new feature in production, to ensure there are no performance or resource requirement regressions. You may also want to restrict new endpoints from being added without your operations team’s knowledge.
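
The timeout and retry policy described above could be enforced with a small wrapper that every new backend call must go through. The following Python sketch is one way to do that. `call_backend`, `BackendError`, and the default values (a 0.5 second timeout, three attempts) are assumptions made for illustration, and `rpc` stands for any callable that accepts a timeout in seconds and raises `TimeoutError` when it is exceeded.

```python
import time


class BackendError(Exception):
    """Raised when a backend call fails after all permitted attempts."""


def call_backend(rpc, timeout_s=0.5, max_attempts=3, backoff_s=0.05):
    """Call rpc(timeout_s) with a bounded number of attempts.

    The serving thread waits at most roughly
    max_attempts * timeout_s plus the backoff sleeps, never indefinitely.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return rpc(timeout_s)
        except TimeoutError as err:
            last_error = err
            # Back off exponentially between attempts, but not after the last.
            if attempt + 1 < max_attempts:
                time.sleep(backoff_s * (2 ** attempt))
    raise BackendError(f"gave up after {max_attempts} attempts") from last_error
```

Returning an error after a fixed number of attempts keeps a slow dependency from tying up serving threads, and it also keeps retries from amplifying load on a backend that is already struggling.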

Bucket your SLOs: For a monolith serving many different types of requests, there’s a tendency to define a new service level indicator (SLI) and service level objective (SLO) for each request type. As the number of SLOs increases, however, they become more confusing to track, and it gets harder to assess the impact of error budget burn for one SLO vs. all the others. To overcome this issue, try bucketing requests based on the similarity of the code path and performance characteristics. For example, we can often bucket latency for most “read” requests into one group (usually lower latency), and create a separate SLO bucket for “write” requests (usually higher latency). The idea is to create groupings that indicate when your users are suffering from reliability issues.

Which team owns a particular SLO, and whether an SLO is even needed for each feature, are important considerations. While you want your on-call engineer to respond to business-critical outages, it’s fine to decide that some parts of the service are lower-priority or best-effort, as long as they don’t threaten the overall stability of the platform.
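
As a rough illustration of bucketing, the Python sketch below maps RPC methods to a "read" or "write" bucket and computes one latency SLI per bucket rather than one per method. The method names, the latency thresholds, and the `bucket_compliance` function are hypothetical. Methods that are not mapped to a bucket are treated as best-effort and left out of the SLO calculation.

```python
from collections import defaultdict

# Hypothetical mapping from RPC method to SLO bucket.
BUCKET_BY_METHOD = {
    "GetItem": "read",
    "ListItems": "read",
    "UpdateItem": "write",
    "DeleteItem": "write",
}

# Hypothetical latency thresholds: a request is "good" if it succeeds
# within the threshold for its bucket.
LATENCY_THRESHOLD_MS = {"read": 100, "write": 500}


def bucket_compliance(requests):
    """Return the fraction of good requests per bucket.

    requests is an iterable of (method, latency_ms, succeeded) tuples.
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for method, latency_ms, succeeded in requests:
        bucket = BUCKET_BY_METHOD.get(method)
        if bucket is None:
            continue  # best-effort traffic carries no SLO
        total[bucket] += 1
        if succeeded and latency_ms <= LATENCY_THRESHOLD_MS[bucket]:
            good[bucket] += 1
    return {bucket: good[bucket] / total[bucket] for bucket in total}


sample = [
    ("GetItem", 20, True),
    ("ListItems", 150, True),    # too slow for the read bucket
    ("UpdateItem", 300, True),   # fine for the write bucket
    ("RenderPreview", 9000, True),  # unmapped, so best-effort
]
print(bucket_compliance(sample))
```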

Set up traffic filtering: Make sure you have the ability to filter traffic by various characteristics, using a web application firewall (WAF) or similar method. If one RPC method experiences a Query of Death (QoD), you can temporarily block similar queries, thereby mitigating the situation and giving you time to fix the issue.
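
In practice this filtering usually lives in a WAF or load balancer configuration rather than in application code, but the logic can be sketched in a few lines of Python. `BlockRule`, `is_blocked`, and the request layout (a dictionary with `method` and `params` keys) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlockRule:
    """Hypothetical rule: block one RPC method when a parameter has a given value."""
    method: str
    param: str
    value: str


def is_blocked(request: dict, rules: Iterable[BlockRule]) -> bool:
    """Return True if the request matches any temporary block rule."""
    params = request.get("params", {})
    return any(
        request.get("method") == rule.method and params.get(rule.param) == rule.value
        for rule in rules
    )


# Temporary rule added during an incident to stop a query of death.
rules = [BlockRule(method="Export", param="format", value="legacy-xml")]

print(is_blocked({"method": "Export", "params": {"format": "legacy-xml"}}, rules))
print(is_blocked({"method": "Export", "params": {"format": "csv"}}, rules))
```

Keeping the rule narrow (one method, one parameter value) blocks the harmful queries while the rest of the traffic for that method continues to be served.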

Use feature flags: As described in the SRE book, giving specific features a knob to disable all or some percentage of traffic is a powerful tool for incident response. If a particular feature threatens the stability of the whole system, you can throttle it down or turn it off, and continue serving all your other traffic safely.
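
A percentage-based knob of this kind might look like the following Python sketch. The `FLAGS` table, the `image_preview` feature, and the `is_enabled` function are hypothetical. A real deployment would typically load flag values from a configuration system that can be updated without a release. Hashing the request ID gives each request a stable position between 0 and 99, so lowering the percentage sheds a predictable share of traffic.

```python
import hashlib

# Hypothetical flag table: feature name -> percentage of traffic to serve.
FLAGS = {"image_preview": 100}


def is_enabled(feature, request_id, flags=FLAGS):
    """Return True if this request should be served by the given feature.

    Unknown features default to 0 percent, which means fully disabled.
    """
    percent = flags.get(feature, 0)
    digest = hashlib.sha256(f"{feature}:{request_id}".encode()).digest()
    # Map the request to a stable slot in [0, 100).
    slot = int.from_bytes(digest[:4], "big") % 100
    return slot < percent


# During an incident, throttle the feature to half of its traffic,
# or set it to 0 to turn it off entirely.
incident_flags = {"image_preview": 50}
served = sum(is_enabled("image_preview", f"req-{i}", incident_flags) for i in range(1000))
print(served)
```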

Flavors of monoliths: This last practice is important, but should be carefully considered, depending on your situation. Once you have feature flags, it’s possible to run different pools of the same binary, with each pool configured to handle different types of requests. This helps tremendously when a reliability issue requires you to re-architect your service, which may take some time to develop. Within Google, we once ran different pools of the same web server binary to serve web search and image search traffic separately, because performance profiles were so different. It was challenging to support them in a single deployment but they all shared the same code, and each pool only handled its own type of request.

There are downsides to this mode of operation, so it’s important to approach this thoughtfully. Separating services this way may tempt engineers to fork services, in spite of the large amount of shared code, and running separate deployments increases operational and cognitive load. Therefore, instead of indefinitely running different pools of the same binary, we suggest setting a limited timeframe for running the different pools, giving you time to fix the underlying reliability issue that caused the split in the first place. Then, once the issue is resolved, merge serving back to one deployment.
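
The split-then-merge approach can be pictured as a small piece of routing configuration. In the Python sketch below, every pool runs the same binary and differs only in which request types it is configured to serve. The pool names, the request types, and the `pool_for` and `merge_pools` helpers are hypothetical and are not a description of how Google configured its own pools.

```python
from typing import Dict, Set

# Hypothetical temporary split: same binary, different enabled request types.
SPLIT_POOLS: Dict[str, Set[str]] = {
    "web-pool": {"web_search"},
    "image-pool": {"image_search"},
}


def pool_for(request_type: str, pools: Dict[str, Set[str]]) -> str:
    """Return the name of the pool configured to serve this request type."""
    for name in sorted(pools):
        if request_type in pools[name]:
            return name
    raise LookupError(f"no pool serves {request_type!r}")


def merge_pools(pools: Dict[str, Set[str]], name: str = "main-pool") -> Dict[str, Set[str]]:
    """Collapse all pools back into one deployment serving every request type."""
    merged: Set[str] = set()
    for request_types in pools.values():
        merged |= request_types
    return {name: merged}


print(pool_for("image_search", SPLIT_POOLS))

# Once the underlying reliability issue is fixed, merge back to one deployment.
merged = merge_pools(SPLIT_POOLS)
print(pool_for("image_search", merged))
```

Because the split lives purely in configuration, merging back requires no code change, which makes it easier to treat the separate pools as temporary.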

Regardless of where your code sits on the monolith-microservice spectrum, your service’s reliability and users’ experience are what ultimately matter. At Google, we’ve learned, sometimes the hard way, from the challenges that various design patterns bring. In spite of these challenges, we continue to serve our users 24/7 by calling to mind SRE principles and putting them into practice.
