Adopting SRE: Standardizing your SLO design process

By mullaned2002

March 10, 2023

Poll three Site Reliability Engineers with the question “What is SRE?” and you’re likely to get five different answers: an implementation of DevOps, a role, a set of practices, a cultural shift, a snazzy title. While these definitions may not necessarily align with those in the SRE books, there is one throughline differentiating SRE from other ways of working: Service Level Objectives (SLOs). While simple to understand – intentionally! – SLOs are frequently challenging to define in practice. And even though the specifics of an SLO vary across industries and verticals, we have found there are a number of practices and strategies common amongst teams that have successfully implemented SLOs for their workloads.

Bringing together product, development, and SRE teams to achieve a common understanding of the workload in question, and in particular its critical user journeys (CUJs), is a key first step. For many teams this means writing down, often for the first time, detailed sequence or flow diagrams for these CUJs. The maturity of the three “legs of the stool” (development, product, and SRE), and the relationship between them, will determine how much effort this first step of the journey requires. Having a common understanding of your users’ expectations of your workload is a prerequisite to writing effective SLOs.

While modeling user journeys and decomposing them into SLOs is an art and no two applications are alike, there are a few key aspects upon which to anchor your discussion. The main question we recommend you keep top-of-mind when going through this process is “What do my users care about?” Framing your thought process in this way prevents implementation snafus and strategies that don’t approximate user expectations. Other aspects to consider include:

Are there breakpoints where the user may choose not to take an action?

Which parts of the interaction are we capable of measuring and which are we not (e.g., third-party dependencies)?

Which parts of the user journey are common across many user journeys and thus are possible candidates for factoring out as their own CUJ (example: login)?

Which parts of the journey can be measured in aggregate, and which must be separated because of differences in criticality, request rates, or other factors?

Which steps of the journey have strict dependencies between one another?
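One way to capture the answers to these questions is to write each journey down as structured data. The following is a minimal sketch, not a prescribed format: the `Step` and `Journey` classes, their field names, and the example journeys are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a critical user journey (all fields are illustrative)."""
    name: str
    measurable: bool = True   # False for e.g. third-party dependencies
    optional: bool = False    # a breakpoint where the user may choose to stop
    depends_on: list[str] = field(default_factory=list)  # strict dependencies


@dataclass
class Journey:
    name: str
    steps: list[Step]


def shared_steps(journeys: list[Journey]) -> set[str]:
    """Steps that appear in more than one journey: candidates for their own CUJ."""
    seen: dict[str, int] = {}
    for journey in journeys:
        for step_name in {s.name for s in journey.steps}:
            seen[step_name] = seen.get(step_name, 0) + 1
    return {name for name, count in seen.items() if count > 1}


def unmeasurable_steps(journey: Journey) -> list[str]:
    """Steps we cannot instrument ourselves."""
    return [s.name for s in journey.steps if not s.measurable]


# Hypothetical journeys for an online shop.
checkout = Journey("checkout", [
    Step("login"),
    Step("add_to_cart", depends_on=["login"]),
    Step("pay", measurable=False, depends_on=["add_to_cart"]),  # third-party processor
])
browse = Journey("browse", [
    Step("login"),
    Step("search", optional=True),
])

print(shared_steps([checkout, browse]))  # {'login'}
print(unmeasurable_steps(checkout))      # ['pay']
```

Here `login` surfaces as a step shared by both journeys, and `pay` is flagged as something that cannot be measured directly.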

Armed with answers to these questions, a detailed request diagram, and your application code, you’re ready to start putting pen to paper! Before jumping into your monitoring consoles, we recommend writing up an SLO design document that lays out the technical details of your chosen SLOs. We’ve made a template available to you to jump-start this process (if you have a Google account, you can make a copy using this link). In it, you’ll find an empty template along with worked examples for reference as you create your own specifications. Whether you use this template or not, we recommend the following as you document your SLOs:

Be pedantic with technical specifications – they will matter during implementation

Maintain a section outlining clarifications, caveats, and/or tradeoffs made as a part of the design process

Consider where you’re measuring – make sure it’s feasible

Beware of summaries, averages, and other non-aggregatable statistics for latency SLOs

Keep compliance periods consistent across your workload(s). We recommend the following defaults: 28 days rolling for operational needs (error budget alerting), and fixed calendar quarters for prioritization and lookback

Changelog: Include one, even if your documentation tool has version history, so you can track major changes

Put your SLO documentation in a location accessible by your team and company stakeholders

Once your SLO design document is finalized, treat your implementation as code and store it in your version control system. Remember: don’t repeat yourself (DRY)!
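To illustrate the caution above about summaries and averages in latency SLOs, here is a small sketch with made-up numbers. It assumes two hypothetical servers and a hypothetical 500 ms threshold. Averaging each server's 99th percentile gives a misleading figure, while counts of good and total requests can be added across servers without loss.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling division
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds from two servers.
server_a = [100] * 995 + [2000] * 5  # busy server, few slow requests
server_b = [100] * 5 + [2000] * 5    # quiet server, many slow requests

# Averaging per-server percentiles ignores how many requests each one served.
avg_of_p99 = (percentile(server_a, 99) + percentile(server_b, 99)) / 2
true_p99 = percentile(server_a + server_b, 99)

# Counts of "good" and "total" events, by contrast, can simply be summed.
THRESHOLD_MS = 500  # assumed latency threshold for this example


def good_and_total(latencies, threshold_ms):
    good = sum(1 for ms in latencies if ms <= threshold_ms)
    return good, len(latencies)


counts = [good_and_total(s, THRESHOLD_MS) for s in (server_a, server_b)]
good = sum(g for g, _ in counts)
total = sum(t for _, t in counts)
fast_fraction = good / total

print(avg_of_p99)               # 1050.0
print(true_p99)                 # 100
print(round(fast_fraction, 4))  # 0.9901
```

The averaged percentile (1050 ms) suggests a serious problem, yet the 99th percentile over all requests is 100 ms and about 99% of requests met the threshold. Specifying a latency SLO as "the proportion of requests faster than a threshold" keeps the measurement aggregatable.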

We hope these recommendations and template give you a head start in bringing your SLOs to production. If you find yourself in need of a tool to implement your SLOs, consider Google Cloud SLO Monitoring, which allows you to create SLOs for any metric available in Google Cloud Monitoring and computes your error budget automatically, enabling burn rate-based alerting. If this process still feels daunting or you find your team in need of help with any of the above, our reliability engineering professional services team can assist. For more information, visit cloud.google.com/sre or contact your Google Cloud account team.
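For readers new to the terms error budget and burn rate, the arithmetic can be sketched as follows. This uses the commonly cited definitions (the budget is one minus the SLO target; the burn rate is the observed error rate divided by that budget) and is not a description of how any particular monitoring product computes them. The function names, the 99.9% target, and the request counts are assumptions chosen for illustration.

```python
def error_budget_fraction(slo_target):
    """Fraction of events allowed to be bad, e.g. about 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget_fraction(slo_target)


def days_to_exhaustion(rate, period_days=28):
    """Days until the whole budget is spent if the current burn rate persists."""
    return float("inf") if rate == 0 else period_days / rate


SLO_TARGET = 0.999  # assumed target: 99.9% of requests succeed

# Budget over a 28-day rolling window with a hypothetical 1,000,000 requests.
allowed_bad = error_budget_fraction(SLO_TARGET) * 1_000_000

# Burn rate observed over a recent window: 50 failures out of 10,000 requests.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=SLO_TARGET)

print(round(allowed_bad))                   # 1000
print(round(rate, 2))                       # 5.0
print(round(days_to_exhaustion(rate), 2))   # 5.6
```

A burn rate of 5 means the 28-day budget would be gone in under six days if nothing changes, which is the kind of signal burn rate-based alerting is meant to surface.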
