SRE classroom: Exercises for non-abstract large systems design

September 23, 2020

Jenny Liao

Software Engineer

Salim Virji

Technical Program Manager, Google

Have you ever tried your hand at designing a resilient distributed software system? If you have, you likely found that there are many factors that contribute to the overall reliability of a system. Different parts of the system can fail in varied and unexpected ways. Certain architecture patterns work well in some situations, but poorly in others. There are many tradeoffs to be made about which parts of the system to optimize and when to optimize them.

Navigating the many nuances of designing a distributed system can be daunting. However, with the right tools and practice, anyone can be equipped to tackle these problems. There are many ways to design distributed systems. One way involves growing systems organically, adding and rewriting components as the system handles more requests or changes scope. At Google, we use a method called non-abstract large system design (NALSD). NALSD is an iterative process for designing, assessing, and evaluating distributed systems, such as the Borg cluster management system and the Google distributed file system. With this in mind, we’ve developed exercises to provide hands-on experience with the NALSD techniques.

NALSD exercises are designed to equip engineers with the foundational knowledge and problem-solving skills needed to design planet-scale systems. You’ll learn how to evaluate whether a particular design achieves a service’s required service-level objectives (SLOs). These workshops challenge you to translate abstract designs into concrete plans using back-of-the-envelope calculations. Most importantly, they provide a chance for you to put these abstract concepts into practice.
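As a rough illustration of checking a design against an SLO with a back-of-the-envelope calculation, the sketch below estimates the availability of a component that is replicated several times. It assumes replicas fail independently and that the component is up as long as at least one replica is up. The function names, the 99% per-replica availability, and the 99.95% target are hypothetical values chosen for illustration; they are not taken from the workshop material.

```python
def replicated_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a component that is up if at least one replica is up.

    Assumes replicas fail independently (a simplification; real failures
    are often correlated).
    """
    return 1.0 - (1.0 - replica_availability) ** replicas


def meets_slo(availability: float, slo_target: float) -> bool:
    """Return True if the estimated availability reaches the SLO target."""
    return availability >= slo_target


if __name__ == "__main__":
    slo_target = 0.9995          # hypothetical availability SLO
    replica_availability = 0.99  # hypothetical availability of one replica

    for replicas in (1, 2, 3):
        estimate = replicated_availability(replica_availability, replicas)
        verdict = "meets" if meets_slo(estimate, slo_target) else "misses"
        print(f"{replicas} replica(s): {estimate:.6f} ({verdict} the SLO)")
```

Under these assumptions a single replica misses the target while two replicas reach it, which is the kind of quick comparison that helps decide whether a design is worth refining further.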

Planet-scale system (noun): A system that delivers services to users, no matter where they are around the world. Such a system delivers its services reliably, with high performance and availability, to all of its users.

SRE Classroom and the first NALSD workshop

Developed by Google engineers, SRE Classroom is a workshop series designed to drive understanding of concepts like NALSD and other core SRE principles. Over the past few years, these workshops—taught within Google and at external conferences—have helped numerous engineers improve their system design and thinking skills. Our mission is to ensure engineering teams everywhere can understand and apply these concepts and best practices to their own systems.

We’re pleased to make available all of the materials for our Distributed Pub/Sub workshop—the first of our NALSD-focused exercises from SRE Classroom. You can now freely use and re-use this material, available under the Creative Commons CC-BY 4.0 license, as long as Google is credited as the original author. Run your own version of this workshop and teach your coworkers, customers, or conference attendees about how to design large-scale distributed systems!

What’s covered in the Distributed Pub/Sub workshop

The Pub/Sub exercise is about designing a planet-scale asynchronous publish-subscribe communication system. The workshop presents the problem statement, describes the requirements and available infrastructure, and walks through a sample solution.

The workshop and its materials are broken into three stages:

Design a working solution for a single data center.
Extend that design to multiple data centers.
Provision the system (i.e., how much hardware and bandwidth do we need?).
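The provisioning stage can be approached with simple arithmetic. The following is a minimal sketch, not the workshop's sample solution: every number in it (message rate, message size, fan-out, replica count, retention period, NIC capacity, utilization cap) is a hypothetical assumption, and the `Workload` class and function names are invented for illustration. It estimates only network bandwidth and storage, and ignores CPU, disk throughput, and replication traffic between servers.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical pub/sub workload parameters (all values are assumptions)."""
    messages_per_second: int
    message_bytes: int
    subscribers_per_message: int  # fan-out: deliveries per published message
    storage_replicas: int         # copies of each message kept on disk
    retention_seconds: int        # how long messages are retained


def ingress_bytes_per_second(w: Workload) -> int:
    """Bandwidth consumed by publishers sending messages in."""
    return w.messages_per_second * w.message_bytes


def egress_bytes_per_second(w: Workload) -> int:
    """Bandwidth consumed by delivering each message to every subscriber."""
    return ingress_bytes_per_second(w) * w.subscribers_per_message


def stored_bytes(w: Workload) -> int:
    """Disk needed to hold all replicas for the whole retention period."""
    return ingress_bytes_per_second(w) * w.retention_seconds * w.storage_replicas


def machines_for_bandwidth(bytes_per_second: int,
                           nic_bytes_per_second: int,
                           max_utilization: float) -> int:
    """Machines needed if each NIC is used only up to max_utilization."""
    usable = nic_bytes_per_second * max_utilization
    return math.ceil(bytes_per_second / usable)


if __name__ == "__main__":
    workload = Workload(
        messages_per_second=1_000_000,
        message_bytes=1_000,
        subscribers_per_message=5,
        storage_replicas=3,
        retention_seconds=86_400,  # one day
    )
    nic = 1_250_000_000  # 10 Gbit/s expressed in bytes per second

    print("ingress B/s:", ingress_bytes_per_second(workload))
    print("egress B/s: ", egress_bytes_per_second(workload))
    print("stored TB:  ", stored_bytes(workload) / 1e12)
    print("machines for egress:",
          machines_for_bandwidth(egress_bytes_per_second(workload), nic, 0.5))
```

Writing the estimate as small functions makes it easy to change one assumption, such as the fan-out, and see how the hardware requirement moves.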

For each stage of the workshop, participants work through their own solution first. After they have had a chance to explore their own ideas, the workshop leader presents a sample solution and explains why certain design decisions were made.

The exercise covers a wide variety of topics related to distributed system design, including scaling, replication, sharding, consensus, availability, consistency, distributed architecture patterns (such as microservices), and more. We present these concepts in contexts where they are useful for solving the problem at hand: designing a system to meet specific requirements. This helps clarify where and why a particular concept might be useful for solving a particular problem.

Typically, when we run this workshop, we break participants up into groups of four to six to work collaboratively toward a solution. Each group is paired with an experienced SRE volunteer who facilitates the discussion, encourages participation, and keeps the group on track.

Run your own Pub/Sub workshop!

If this sounds interesting, check out the Presenter Guide and the Facilitator Guide, which have a lot more information on how to organize a Distributed Pub/Sub workshop. If you don't have a whole team to educate, you can also work through this exercise with a buddy or on your own. Exploring multiple solutions to the problem and identifying the pros and cons of each solution may also be a meaningful exercise.

Learn more about SRE and industry-leading practices for service reliability.
