Designing distributed systems using NALSD flashcards

May 1, 2020

Dan Lüdtke

Site Reliability Manager

There are many ways to design distributed systems. One way involves growing systems organically—components are rewritten or redesigned as the system handles more requests. Another method starts with a proof of concept. Once the system adds value to the business, a second version is designed from the ground up.

At Google, we use a method called non-abstract large system design (NALSD). NALSD describes an iterative process for designing, assessing, and evaluating distributed systems, such as Borg cluster management for distributed computing and the Google distributed file system. Designing systems using NALSD can be a bit daunting at first, so in this post, we introduce a nifty strategy to make things easier: flashcards. We describe how you can use flashcards to connect the most important numbers around constrained resources when designing distributed systems. These numbers include educated estimates concerning the CPU, memory, storage, and network latencies and throughputs.

Let’s look at two examples illustrating the use of these numbers.

For the first example, say you have a server designed to store images. We are most interested in the write throughput of the underlying storage layer, which might be limited by the write speed of its disks. Knowing disk seek times and write throughput helps us spot the bottleneck in the overall system.
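
As a rough illustration of that reasoning, the sketch below estimates how many images one disk could write per second and how many disks a target write rate would need. It borrows the 10 ms seek and the roughly 5 ms per sequential megabyte quoted later in this post. It assumes, beyond what the post states, that sequential writes cost about the same as sequential reads and that each image needs one seek. The function names, the 2 MB image size, and the target rate are hypothetical.

```python
import math

# Back-of-envelope inputs; treat all of them as assumptions for illustration.
DISK_SEEK_S = 10e-3            # about 10 ms per disk seek (figure quoted in this post)
SEQ_TRANSFER_S_PER_MB = 5e-3   # about 5 ms per MB; quoted for reads, assumed similar for writes


def images_per_second_per_disk(image_size_mb, seeks_per_image=1):
    """Estimate how many images a single disk can write per second."""
    time_per_image_s = seeks_per_image * DISK_SEEK_S + image_size_mb * SEQ_TRANSFER_S_PER_MB
    return 1.0 / time_per_image_s


def disks_needed(target_images_per_s, image_size_mb):
    """Estimate how many disks are needed to sustain the target write rate."""
    per_disk = images_per_second_per_disk(image_size_mb)
    return math.ceil(target_images_per_s / per_disk)


if __name__ == "__main__":
    # Hypothetical workload: 2 MB images arriving at 1,000 images per second.
    print(images_per_second_per_disk(2.0))
    print(disks_needed(1000, 2.0))
```

With these assumed inputs, each image costs about 20 ms (10 ms to seek plus 10 ms to transfer), so one disk handles roughly 50 images per second and the hypothetical workload needs about 20 disks. The estimate ignores replication, caching, and file system overhead.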

For the next example, say you have another server that may be responsible for serving low-latency metadata search queries. Here, potential bottlenecks might be memory consumption or CPU utilization. The memory consumption is from holding an index, and CPU utilization is from performing the actual search. To find out which one is the bottleneck, we have to consult latency numbers on CPU cache and main memory access. We are probably less concerned with network throughput, because we expect requests and responses to be small in size. However, as we scale the system up on the drawing board, the bottlenecks may change. So it’s best to always assign educated estimates to all components in a distributed system.
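
To make the idea of a shifting bottleneck concrete, here is a minimal sketch that assigns an estimate to each resource and reports which one requires the most machines. Every number in it is a made-up assumption rather than a figure from this post: the per-machine capacities, the index bytes per document, the CPU time per query, and the bytes per query. It also assumes the index can be sharded evenly across machines.

```python
import math

# Hypothetical per-machine capacities (assumptions, not figures from this post).
MACHINE_CAPACITY = {
    "memory_bytes": 64e9,            # 64 GB of RAM for the index
    "cpu_seconds_per_s": 16.0,       # 16 cores, so 16 CPU-seconds per second
    "network_bytes_per_s": 1.25e9,   # 10 Gb/s link
}


def demand(num_docs, qps, index_bytes_per_doc=200, cpu_s_per_query=1e-3, bytes_per_query=2000):
    """Estimate the total load on each resource for a given scale."""
    return {
        "memory_bytes": num_docs * index_bytes_per_doc,
        "cpu_seconds_per_s": qps * cpu_s_per_query,
        "network_bytes_per_s": qps * bytes_per_query,
    }


def machines_needed(load):
    """Machines required to satisfy each resource on its own."""
    return {name: math.ceil(load[name] / MACHINE_CAPACITY[name]) for name in load}


def bottleneck(load):
    """The resource that forces the largest machine count."""
    needed = machines_needed(load)
    return max(needed, key=needed.get)


if __name__ == "__main__":
    small_index = demand(num_docs=100_000_000, qps=50_000)
    large_index = demand(num_docs=10_000_000_000, qps=50_000)
    print(machines_needed(small_index), bottleneck(small_index))
    print(machines_needed(large_index), bottleneck(large_index))
```

Under these assumptions, CPU is the limiting resource for the smaller index, while memory becomes the limit once the index grows a hundredfold at the same query rate. Network stays far from saturation in both cases, which matches the expectation that requests and responses are small.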

NALSD helps identify potential bottlenecks as systems scale up. We address the bottlenecks early on—for example, by iterating on the design until we find an overall more scalable architecture.

‘The numbers everyone should know’
So what are the magical numbers we’ve alluded to? According to long-time Google engineer Jeff Dean, there are “numbers everyone should know.” These include numbers that describe common actions performed by the machines that servers and other components of a distributed system run on. (Numbers have changed since this video was recorded. In this post, we’re using the most recent figures.) Here are some examples:

An L1 cache reference takes a nanosecond.
A branch misprediction is roughly three times as expensive as an L1 cache reference, and takes three nanoseconds.
Locking or unlocking a mutex (a resource-guarding structure used for synchronizing concurrency) costs about 17 nanoseconds, more than five times the cost of a branch misprediction.
Referencing main memory is more expensive still, costing roughly 100 nanoseconds, or about six times the cost of a mutex operation.
Sending two kilobytes over a 10 Gb/s network takes 1.6 microseconds, or 1600 nanoseconds. Stuff gets expensive here!
A round trip within the same data center takes only 500 microseconds, while a round trip from California to the Netherlands takes roughly 300 times as long (150 milliseconds).
A disk seek takes about 10 milliseconds. That’s quite expensive compared to reading 1 MB sequentially from disk, which takes about 5 milliseconds.
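
The figures above can be written down as constants so that the comparisons are easy to recheck. The following is a small sketch; the dictionary keys and helper names are our own labels, and the transfer-time helper assumes that one kilobyte is 1,000 bytes and ignores protocol overhead.

```python
# Latency figures as quoted in the list above, converted to nanoseconds.
LATENCY_NS = {
    "l1_cache_reference": 1,
    "branch_misprediction": 3,
    "mutex_lock_unlock": 17,
    "main_memory_reference": 100,
    "send_2kb_over_10gbps": 1_600,
    "datacenter_round_trip": 500_000,
    "read_1mb_sequentially_from_disk": 5_000_000,
    "disk_seek": 10_000_000,
    "california_netherlands_round_trip": 150_000_000,
}


def ratio(slower, faster):
    """How many times more expensive one operation is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


def transfer_time_ns(num_bytes, link_bits_per_s):
    """Time to push num_bytes onto a link, ignoring protocol overhead."""
    return num_bytes * 8 / link_bits_per_s * 1e9


if __name__ == "__main__":
    print(ratio("mutex_lock_unlock", "branch_misprediction"))
    print(ratio("california_netherlands_round_trip", "datacenter_round_trip"))
    print(ratio("disk_seek", "read_1mb_sequentially_from_disk"))
    print(transfer_time_ns(2_000, 10e9))
```

Running the checks reproduces the relationships in the list: a mutex operation costs a little over five branch mispredictions, the intercontinental round trip is 300 times the round trip within a data center, a seek costs twice as much as reading 1 MB sequentially, and 2 KB on a 10 Gb/s link takes about 1,600 nanoseconds.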

Memorizing these numbers may come naturally to some, but others, like us, may prefer flashcards to help remember the numbers that engineers use to design and maintain large systems. As an added bonus, the flashcards can be used as an entertaining, on-the-spot quiz for fellow site reliability engineers (SREs), or as a preparation tool for an NALSD interview with Google’s SRE team.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Flashcards.max-2000x2000.jpg

If you're interested, you can download your own set of flashcards for site reliability engineers. Follow these easy steps to turn the download into a handy deck: