“Users don’t care why something is not working; they care that it is not working.”
If we believe this to be true, why aren’t we monitoring what they care about? How did we get here?
Let’s start with a traditional model, where Ops focuses on infrastructure and we wait for customers to tell us something is wrong. A set of reasonable assumptions and values makes this view intuitive at first:
Certainty is better than ambiguity, so we want to focus on what we know most directly: the infrastructure.
The goal is to understand whether the cause of a problem is an Ops responsibility, or someone else’s.
Setting up many alerts on infrastructure is the best way to ensure Ops is only responding to things that it is responsible for.
Automation is a luxury and a side project; Ops primarily budgets time for cutting through tickets and responding manually to alerts.
While we may end up with a different set of priorities down the road, some of these are hard to ignore. For now, let’s consider the worst-case scenario in this traditional state.
Users experience an issue: the business made a promise to users, and it isn’t being kept. But the infrastructure is fine.
Users don’t care, though.
The Ops team may only have a small, partial view, and this partial view leads to another potential issue.
Perhaps things are going well for the business, and more users start using the service. Traditional Ops might be panicking even when something good is happening for users, and they may have a legitimate reason to be concerned: growing load strains the infrastructure even while the user experience stays healthy.
Operations is ultimately a business problem, not just a technical one.
We need to be able to see the causal chain between different layers of a system.
We see a chain of dependencies surfacing differently as a mix of clear and ambiguous causes. We also see layers of redundancy that allow lower-level infrastructure failures to occur without impacting users. Moving from this conceptual awareness, we can identify and measure different areas of interest and, based on how apparent they are to users, group them into symptoms and causes.
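To make that grouping concrete, here is a minimal sketch of how a team might tag what it measures. The metric types are common Cloud Monitoring metrics, but the grouping itself is an assumption each team should revisit for its own system:

```python
# Illustrative only: classify signals by how directly users can feel them.
SYMPTOMS = {
    # What users experience at the edge of the system
    # (error rate would filter request_count by 5xx response codes).
    "error_rate": 'metric.type="loadbalancing.googleapis.com/https/request_count"',
    "latency": 'metric.type="loadbalancing.googleapis.com/https/total_latencies"',
}

CAUSES = {
    # Lower-level signals that explain why a symptom appears, but that
    # redundancy can often absorb without users noticing.
    "cpu_utilization": 'metric.type="compute.googleapis.com/instance/cpu/utilization"',
    "disk_reads": 'metric.type="compute.googleapis.com/instance/disk/read_ops_count"',
}

def is_user_facing(signal_name: str) -> bool:
    """Return True if a signal is something users can feel directly."""
    return signal_name in SYMPTOMS
```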
Now that we have a model of the causal order, Ops can focus more on the same area of concern as the rest of the business: the users.
When issues arise, starting from a few symptoms, Ops can find the cause more efficiently than before. But if we know that causes precede symptoms, don’t we want to know in advance when causes start to look wrong? Isn’t a symptoms-first approach more reactive and less predictive, regardless of whether we know the causal chain?
These concerns are valid if causes are as consequential as before, that is, if we still haven’t done the work to mitigate the impact of a failure deep within our system.
So suppose instead of those mitigations, we alert on causes. We run a risk of being overwhelmed with causal failures. Alert fatigue and a high noise-to-signal ratio do not help us fix things faster. Firefighting hardly seems more manageable if we’re merely aware of more fires.
How do we improve this situation?
Ideally, we would ask, “What would it take to only alert on symptoms and not causes?”
We would build in layers of automation that obviate the need for alerts.
Why? Because alerts need to be actionable: if the system is already prepared to handle a failure on its own, there is no action left for a human to take, and no need for the alert.
With the ultimate goal of turning off alerts for causes, we automate as much as possible and progressively move closer to alerting on just the symptoms. For an example of how to automate responses to a particular type of alert, see the reference architecture for detecting and responding to Cloud Logging events in real time. At no point are we turning off monitoring, and there are more ways to silence an alert than to remove it entirely. Google Cloud Monitoring offers the ability to snooze alerts as an organization becomes comfortable with automation handling issues. We still need to monitor causes for troubleshooting, cost control, and so forth, but we grow increasingly confident in our ability to focus primarily on symptoms.
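As a rough sketch of that pattern (not the published reference architecture), a Pub/Sub-triggered Cloud Function can receive log entries exported by a Cloud Logging sink and run a remediation instead of paging a human. The function name, the gce_instance match, and remediate_instance are hypothetical placeholders:

```python
import base64
import json

def handle_log_event(event, context):
    """Pub/Sub-triggered Cloud Function (hypothetical name) that reacts to a
    log entry exported by a Cloud Logging sink, so a known cause is handled
    automatically instead of becoming a page."""
    # Pub/Sub delivers the exported log entry as base64-encoded JSON.
    log_entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    resource_type = log_entry.get("resource", {}).get("type", "")
    severity = log_entry.get("severity", "DEFAULT")

    if resource_type == "gce_instance" and severity == "ERROR":
        # Placeholder remediation: a real system might restart the instance,
        # drain it from a load balancer, or open a low-priority ticket,
        # anything that keeps this cause from waking a human.
        remediate_instance(log_entry["resource"]["labels"]["instance_id"])

def remediate_instance(instance_id: str) -> None:
    """Hypothetical helper; the actual action depends on your runbook."""
    print(f"Remediating instance {instance_id}")
```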
Even with automation and monitoring in place, we accepted earlier that any technical system guarantees some failures. Beyond the types of failures we can prepare for, there are still unknown potential causes. With a pattern for handling newly discovered causes, we avoid the need to obsess over them: a bit of project work saves us from a lot of future toil, and before long we can return our focus to users. We do so with the expectation that failure is inevitable and that we’re ready to discover the unknown causes still ahead.
Apply this perspective to orient discussions about expected improvements to Ops:
Consider what happens when an IT leader says, “We want complete, end-to-end visibility.”
In that case, though, what is the main priority?
“We want to be aware when something goes wrong.”
If you’ve designed a system to handle failure, what does it mean to “go wrong”? There is a provocative way to get people to think about these issues:
“Starting tomorrow, turn off all alerts except for user-facing symptoms. Any objections?”
You will get a litany of objections: hidden dependencies, a lack of redundancy, and gaps in monitoring. It would be too abrupt to make this move all at once.
Likewise, some non-symptoms, such as saturation, matter a great deal when resources are limited. Teams need to be alerted early if they’re approaching capacity or a quota limit (a simple check is sketched below), and in some cases, automation isn’t possible today. The point is really to ask:
“What will it take to work towards that ideal state?”
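As one example of a cause that still deserves early attention, here is a minimal sketch of the capacity check mentioned above. It uses the google-cloud-compute client to flag Compute Engine regional quotas that are close to their limits; the 80% threshold and the project and region names are assumptions to adjust for your own environment:

```python
from google.cloud import compute_v1

def quotas_near_limit(project_id: str, region: str, threshold: float = 0.8):
    """Return regional Compute Engine quotas whose usage exceeds `threshold`
    of their limit. The 0.8 default is illustrative, not a recommendation."""
    region_info = compute_v1.RegionsClient().get(project=project_id, region=region)
    warnings = []
    for quota in region_info.quotas:
        if quota.limit and quota.usage / quota.limit >= threshold:
            warnings.append((quota.metric, quota.usage, quota.limit))
    return warnings

if __name__ == "__main__":
    # Hypothetical project and region: surface anything above 80% so a human
    # can plan ahead, since requesting a quota increase is not yet automated.
    for metric, usage, limit in quotas_near_limit("my-project", "us-central1"):
        print(f"{metric}: {usage}/{limit}")
```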
In Conclusion
In our Reliability discussion group, we share successes and failures, learn about best practices, and network with others who are also on a journey to implement a sustainable practice of reliability engineering.
While the path may be different for each team, there is one key similarity: a change in perspective. It’s up to Ops to care more about why something isn’t working, even if users don’t. The change in perspective isn’t merely about transitively caring about the same things users care about.
Instead, what a user-centric perspective gives us is a different set of values:
Accept ambiguity and focus on what is most relevant. Remember: there are more possible causes of issues in our system than there are possible moves in chess.
Listening to “business concerns” is how we discover technical issues we didn’t previously see.
Alerting less, and only on symptoms, should be our goal. Starting with users and alerting Ops on symptoms is the sanest way to approach debugging.
Automation is the best means to obtain our goal confidently. Automation isn’t a side project or a luxury.