Poll three Site Reliability Engineers with the question “What is SRE?” and you’re likely to get five different answers: an implementation of DevOps, a role, a set of practices, a cultural shift, a snazzy title. While these definitions may not necessarily align with those in the SRE books, there is one throughline differentiating SRE from other ways of working: Service Level Objectives (SLOs). While simple to understand – intentionally! – SLOs are frequently challenging to define in practice. And even though the specifics of an SLO vary across industries and verticals, we have found there are a number of practices and strategies common amongst teams that have successfully implemented SLOs for their workloads.
Bringing together product, development, and SRE teams to achieve a common understanding of the workload in question, and in particular its critical user journeys (CUJs), is a key first step. For many teams this means writing down, often for the first time, detailed sequence or flow diagrams for these CUJs. The maturity of the three “legs of the stool” (development, product, and SRE), and the relationship between them, will play a role in the level of effort required to complete this first step of the journey. Having a common understanding of your users’ expectations of your workload is a prerequisite to writing effective SLOs.
While modeling user journeys and decomposing them into SLOs is an art and no two applications are alike, there are a few key aspects upon which to anchor your discussion. The main question we recommend you keep top-of-mind when going through this process is “What do my users care about?” Framing your thought process in this way prevents implementation snafus and strategies that don’t approximate user expectations. Other aspects to consider include:
Are there breakpoints where the user may choose not to take an action?
Which parts of the interaction are we capable of measuring and which are we not (e.g., third-party dependencies)?
Which parts of the user journey are common across many user journeys and thus are possible candidates for factoring out as their own CUJ (example: login)?
Which parts of the journey can be measured in aggregate, and which must be separated because of differences in criticality, request rates, or other factors?
Which steps of the journey have strict dependencies between one another?
Armed with answers to these questions, a detailed request diagram, and your application code, you’re ready to start putting pen to paper! Before jumping into your monitoring consoles, we recommend writing up an SLO design document that lays out the technical details of your chosen SLOs. We’ve made a template available to jump-start this process (if you have a Google account, you can make a copy using this link). It contains an empty template along with worked examples for reference as you create your own specifications. Whether you use this template or not, we recommend the following as you document your SLOs:
Be pedantic with technical specifications – they will matter during implementation
Maintain a section outlining clarifications, caveats, and/or tradeoffs made as a part of the design process
Consider where you’re measuring – make sure it’s feasible
Beware of summaries, averages, and other non-aggregatable statistics for latency SLOs
Keep compliance periods consistent across your workload(s). We recommend the following defaults: 28 days rolling for operational needs (error budget alerting), and fixed calendar quarters for prioritization and lookback
Include a changelog, even if your documentation tool has version history, so you can track major changes
Put your SLO documentation in a location accessible by your team and company stakeholders
Once your SLO design document is finalized, treat your implementation as code and store it in your version control system
Remember – DRY (Don’t Repeat Yourself)!
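To illustrate the warning above about averages and other non-aggregatable statistics, here is a minimal Python sketch. The two servers, their request counts, and the 300 ms threshold are made-up values chosen for illustration, and the helper names are hypothetical. It compares averaging per-server 95th percentiles with counting "good" and "total" requests, which can be summed safely across servers.

```python
from statistics import mean

THRESHOLD_MS = 300  # hypothetical latency threshold for a "good" request


def p95(samples):
    """Nearest-rank 95th percentile of a list of latencies in milliseconds."""
    ordered = sorted(samples)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def good_and_total(samples, threshold_ms=THRESHOLD_MS):
    """Count requests at or under the threshold, plus the total count."""
    good = sum(1 for s in samples if s <= threshold_ms)
    return good, len(samples)


def combined_sli(per_server_samples, threshold_ms=THRESHOLD_MS):
    """Sum good/total counts across servers, then divide once."""
    good = total = 0
    for samples in per_server_samples:
        g, t = good_and_total(samples, threshold_ms)
        good += g
        total += t
    return good / total


# Made-up traffic: a busy, fast server and a quiet, partly slow one.
server_a = [100] * 900
server_b = [100] * 50 + [900] * 50

# Averaging percentiles ignores how much traffic each server handled.
naive = mean([p95(server_a), p95(server_b)])

# The percentile over all requests tells a different story.
true_p95 = p95(server_a + server_b)

# Counts of good and total requests aggregate without distortion.
sli = combined_sli([server_a, server_b])

print(f"average of per-server p95s: {naive} ms")
print(f"p95 over all requests:      {true_p95} ms")
print(f"requests within {THRESHOLD_MS} ms:   {sli:.1%}")
```

With these numbers the averaged percentile suggests a far slower service than the one most users experienced, which is why a latency SLO is usually easier to reason about when expressed as a proportion of requests faster than a threshold.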
We hope these recommendations and the template give you a head start in bringing your SLOs to production. If you find yourself in need of a tool to implement your SLOs, consider Google Cloud SLO Monitoring, which allows you to create SLOs for any metric available in Google Cloud Monitoring and computes your error budget automatically, enabling burn rate-based alerting. If this process still feels daunting or you find your team in need of help with any of the above, our reliability engineering professional services team can assist. For more information, visit cloud.google.com/sre or contact your Google Cloud account team.
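For readers new to the terms, the arithmetic behind an error budget and a burn rate can be sketched in a few lines of Python. This is an illustrative sketch only, not the Google Cloud Monitoring API: the function names, the 99.9% target, and the request counts are assumptions made up for the example, and it uses the common definition of burn rate as the observed error rate divided by the error rate the SLO allows.

```python
COMPLIANCE_PERIOD_DAYS = 28  # rolling window suggested above for operational use


def error_budget(slo_target, total_events):
    """Number of bad events the SLO tolerates over the compliance period."""
    return (1 - slo_target) * total_events


def budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    bad_events = total_events - good_events
    return 1 - bad_events / error_budget(slo_target, total_events)


def burn_rate(slo_target, good_events, total_events):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1 spends the budget exactly over the compliance period;
    higher values spend it proportionally faster.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    observed_error_rate = (total_events - good_events) / total_events
    return observed_error_rate / (1 - slo_target)


# Hypothetical numbers: a 99.9% SLO with 1,000,000 requests in the window,
# 400 of which failed.
remaining = budget_remaining(0.999, good_events=999_600, total_events=1_000_000)

# Hypothetical last hour: 10,000 requests, 50 of which failed.
rate = burn_rate(0.999, good_events=9_950, total_events=10_000)
days_to_exhaustion = COMPLIANCE_PERIOD_DAYS / rate

print(f"budget remaining: {remaining:.0%}")
print(f"burn rate: {rate:.1f}x, full budget spent in {days_to_exhaustion:.1f} days")
```

In this example about 60% of the budget remains, but the last hour burned at roughly five times the sustainable pace, which is the kind of signal a burn rate-based alert is meant to surface.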