Designing and implementing reliable systems, whether code, infrastructure, or anything in between, is the discipline of systems engineering, and it’s practiced extensively by Google site reliability engineers (SREs). To help you learn more about systems engineering and get hands-on with best practices, we’ve assembled some resources to get you started.
The Systems Engineering Side of Site Reliability Engineering (USENIX paper)
What is a systems engineer, and how do they differ from an SRE, a software engineer, or a sysadmin? This paper explores the key perspectives and approaches that systems engineers take, the way they look at intersecting sets of services, and how they continue to grow their own knowledge. Read this short report for an inside view of how Google SREs investigate and architect applications.
Non-abstract Large System Design (in the SRE Workbook)
Designing systems reliably and scalably requires a focused, practical approach, which we call non-abstract large system design (NALSD). Doing this well means iteratively refining designs for feasibility, resilience, and efficiency, while grounding them in real-world resource constraints and expectations. If you want to learn more about NALSD after reading this chapter, take a look at some other real-world examples (Reliable Cron across the Planet, Making “Push on Green” a Reality, or SRE Best Practices for Capacity Management) and dig into the research behind capacity planning, component isolation, and graceful degradation.
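To make that concrete, here’s a minimal sketch of the back-of-the-envelope sizing NALSD encourages. All of the inputs (upload volume, image size, peak factor) are illustrative assumptions, not figures from the chapter:

```python
# Back-of-the-envelope sizing for a hypothetical image-serving system,
# in the spirit of NALSD. Every input below is an illustrative assumption.

DAILY_UPLOADS = 100_000_000          # images/day (assumed)
AVG_IMAGE_BYTES = 2 * 1024 * 1024    # 2 MiB per image (assumed)
PEAK_TO_MEAN = 3                     # peak traffic ~3x the daily mean (assumed)
SECONDS_PER_DAY = 86_400

mean_qps = DAILY_UPLOADS / SECONDS_PER_DAY
peak_qps = mean_qps * PEAK_TO_MEAN
daily_storage_tib = DAILY_UPLOADS * AVG_IMAGE_BYTES / 2**40
yearly_storage_pib = daily_storage_tib * 365 / 1024

print(f"mean upload rate : {mean_qps:,.0f} QPS")     # ~1,157 QPS
print(f"peak upload rate : {peak_qps:,.0f} QPS")     # ~3,472 QPS
print(f"storage per day  : {daily_storage_tib:,.1f} TiB")   # ~190.7 TiB
print(f"storage per year : {yearly_storage_pib:,.2f} PiB")  # ~68 PiB
```

Numbers like these immediately shape the design: a system ingesting tens of petabytes per year needs sharded storage and explicit capacity planning, not a single database.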
Distributed ImageServer workshop (in the SRE Classroom)
Want to put these systems design and engineering concepts into practice? This self-guided workshop helps you code and deploy a large-scale system using NALSD principles, bridging the gap between theory and practice. You will build a robust, scalable, reliable system and see what it takes to iterate on a design. To go further, check out the other workshops in SRE Classroom or join an SRE community in your area.
Google Production Environment (YouTube talk)
Curious how Google runs its production environment? This lively 15-minute talk digs into the software, hardware, and numerous subsystems that power the online services used by billions of people every day. Watch to learn about our physical network infrastructure, the Borg cluster management system, persistent file systems, a massive monorepo, and much more. Hungry for more? Hear Google SREs discuss capacity management in this tech talk.
Reliable Data Processing with Minimal Toil (research paper)
Ensuring reliability in data processing can be tricky, especially with batch jobs and automation. While batching can save costs, it also adds risks around data corruption and downstream delays; a structured approach to safety, validation, and testing makes batch pipelines more reliable and consistent. To dive deeper, check out the video, learn more about A/B testing and canary deployments, and try out managed data tools such as BigQuery and Dataflow.
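For a flavor of what structured validation can look like, here’s a minimal Python sketch of pre-publish checks for a batch job. The check logic, field names, and thresholds are illustrative assumptions, not the paper’s implementation:

```python
# A minimal sketch of pre-publish validation for a batch pipeline step.
# Checks and thresholds here are illustrative assumptions.

def validate_batch(new_rows: list[dict], previous_count: int) -> None:
    """Reject a batch before it reaches downstream consumers."""
    # Sanity check: an empty batch usually means an upstream failure,
    # not a genuinely empty day of data.
    if not new_rows:
        raise ValueError("batch is empty; refusing to overwrite output")

    # Volume check: flag sudden large swings against the previous run.
    if previous_count and abs(len(new_rows) - previous_count) / previous_count > 0.5:
        raise ValueError(
            f"row count changed by >50% ({previous_count} -> {len(new_rows)})"
        )

    # Content check: every row must carry the fields promised downstream.
    required = {"id", "timestamp", "value"}
    for row in new_rows:
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {row.get('id')} missing fields: {missing}")

# Only after validation passes would the job atomically publish the batch,
# e.g. by swapping a pointer to the new output rather than mutating the
# old output in place.
```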
How to design a distributed system in 3 hours (YouTube talk)
Learn about the importance of Service Level Objectives (SLOs) and how they factor into defining a system’s reliability. Using the same photo-handling system from the Distributed ImageServer workshop above, you’ll learn how to think through the storage, thumbnail, and download services as parts of the overall product. Finally, dig into caching and scalability to learn how to improve performance and the overall design. Don’t worry: despite the title, the talk itself is only 60 minutes.
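If SLOs are new to you, the arithmetic behind them is refreshingly simple. Here’s a small example that turns an availability SLO into a concrete error budget; the 99.9% target and 30-day window are assumed for illustration:

```python
# Translating an availability SLO into a concrete error budget.
# The 99.9% target and 30-day window are illustrative choices.

slo = 0.999                      # availability target (assumed)
window_minutes = 30 * 24 * 60    # a 30-day rolling window (assumed)

error_budget_minutes = (1 - slo) * window_minutes
print(f"Allowed downtime per 30 days: {error_budget_minutes:.1f} minutes")
# -> Allowed downtime per 30 days: 43.2 minutes
```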
Implementing Service Level Objectives (O’Reilly book)
O’Reilly’s Portuguese Water Dog book delivers a ton of insights, but if you read just one part, make it the chapter on how systems design supports SLOs: Chapter 10, “Architecting for Reliability.” Here, you’ll learn about the value of using SLOs as a design tool while incorporating reliability principles, managing risk with error budgets, and establishing feedback loops for continuous improvement. You’ll also learn about other tools for measuring, improving, and communicating reliability metrics, along with tips for implementation and stakeholder management.
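Error budgets lend themselves to a simple feedback signal: the burn rate, or how fast you’re consuming the budget relative to plan. Here’s a minimal sketch; the SLO, event counts, and the notion that 2.5x is paging-worthy are all illustrative assumptions, not the book’s prescriptions:

```python
# A sketch of an error-budget burn-rate check, one common reliability
# feedback loop. All numbers below are illustrative assumptions.

def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """How fast the error budget is being consumed: 1.0 means exactly
    on budget; >1.0 means the budget will run out early."""
    if total_events == 0:
        return 0.0
    observed_error_ratio = bad_events / total_events
    allowed_error_ratio = 1 - slo
    return observed_error_ratio / allowed_error_ratio

# Example: 50 failed requests out of 20,000 against a 99.9% SLO.
rate = burn_rate(bad_events=50, total_events=20_000, slo=0.999)
print(f"burn rate: {rate:.1f}x")  # 2.5x: worth alerting on in many setups
```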
Making “Push On Green” a Reality (research paper)
Automation and consistency lead to faster, more reliable releases, and a “release early, release often” mentality is healthy for developers and reliability engineers alike. This rapid release process can work in different ways, with varying degrees of manual control, automation, and configuration. As always, pay attention to testing, monitoring, and change management in the rollout process to maintain reliability.
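To illustrate the core idea, here’s a hypothetical sketch of a push-on-green gate in Python. The check functions are stand-ins you’d wire to your own CI, canary analysis, and SLO tooling; this is not the paper’s actual implementation:

```python
# A minimal sketch of a "push on green" gate: promote a release only
# when all automated signals are green. Each check is a hypothetical
# stand-in for real CI, monitoring, and rollout tooling.

def tests_green(build_id: str) -> bool:
    # Stand-in: query your CI system for the build's test results.
    return True

def canary_healthy(build_id: str) -> bool:
    # Stand-in: ask canary analysis whether key metrics look normal.
    return True

def error_budget_remaining(service: str) -> bool:
    # Stand-in: check the SLO dashboard before spending more budget.
    return True

def maybe_push(service: str, build_id: str) -> None:
    checks = {
        "tests": tests_green(build_id),
        "canary": canary_healthy(build_id),
        "error budget": error_budget_remaining(service),
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        print(f"holding {build_id}: red signals: {', '.join(failed)}")
        return
    print(f"all signals green; promoting {build_id} to production")

maybe_push(service="photos", build_id="build-42")  # hypothetical names
```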
Canary Analysis Service (research paper)
This paper describes an automated system that we use at Google to assess the safety of production changes before a full rollout. It compares key metrics between two groups: a small test group (the canary) that receives the change, and a larger unchanged group (the control). If the canary group performs worse, the change is deemed unsafe and rolled back for further iteration. Canary analysis pairs well with SLOs and reliable rollout processes.
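Here’s a toy version of that comparison logic in Python. The single metric and the 10% relative tolerance are simplifying assumptions; the real service evaluates many metrics with more sophisticated statistics:

```python
# Toy canary analysis: compare one key metric between canary and
# control, and fail the rollout if the canary is meaningfully worse.
# The 10% relative tolerance is an assumed threshold.

def canary_passes(canary_error_rate: float,
                  control_error_rate: float,
                  tolerance: float = 0.10) -> bool:
    """Return True if the canary's error rate is within `tolerance`
    (relative) of the control group's."""
    # Guard against a zero-error control group.
    baseline = max(control_error_rate, 1e-9)
    return canary_error_rate <= baseline * (1 + tolerance)

# Example: canary at 0.30% errors vs control at 0.25% -> roll back.
if not canary_passes(canary_error_rate=0.0030, control_error_rate=0.0025):
    print("canary worse than control; rolling back for further iteration")
```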
Let us know which of these you found most useful, and how you’ll be applying them. Need even more? Check out our SRE Classroom, a collection of workshops developed by Google’s SRE group.