We published another episode of “VM End to End,” which is a series of curated conversations between a “VM skeptic” and a “VM enthusiast”. Every episode, join Brian, Carter, and a special guest as they explore why VMs are some of Google’s most trusted and reliable offerings, and how VMs benefit companies operating at scale in the cloud. Here is a transcript of the episode:
Carter Morgan: Hi, and welcome back to VM End to End, a show where we have a VM skeptic, myself, and a VM enthusiast or two come onto the show to talk about why VMs are interesting and useful for a cloud-native future. Last time we spoke about reasoning about reliability, but I feel like I still don’t know enough to do anything with it in the real world. So we’ve brought Steve and Brian back on today to help hammer that out. Hi Steve. Hi Brian. Thanks for being here.
Steve McGhee: Hey Carter. Nice to be here.
Carter: Glad to have you. Brian, I got to start with you, buddy. After last week’s episode, do you think I know enough to actually go and take a small SaaS offering and go from like three nines reliability to four nines, to maybe five nines?
Brian Dorsey: I think we’ve got a lot of the components. We talked about some common patterns, and we can dig into that a little more as well, but the question is kind of like, when and how? What if we got a little more specific here? What kind of app do you imagine building?
Carter: Okay. Let’s go with a payment processing app. So it’s a SaaS application, will exist in the cloud, and people will want to make payments. Right now, that’s what it’s going to focus on. That said, I don’t even know how to reason about how reliable that is by itself. So let’s start there. Steve.
Steve: Yeah. This is a good place to start. In SRE, we call this a non-abstract systems design problem, which is the best way to do it. Think about it as if you’re building a system on your laptop or your desktop. You’re getting it functionally correct first, and then you have to think about how much you want to invest to scale it up so it does something for someone else, because having it just run tests on your laptop doesn’t make you a lot of money. That doesn’t do much good for society.
You want to get it out there, right? You want to put it into production. And so what you have to do is think about how reliable you want it to be. And the answer should always be just reliable enough. Right? That’s a jokey way of responding, but think about who your customers are and what they need out of this. So you’re talking about a payments processor, right? So like, who are your customers? Are your customers like your friend who sells bracelets at the farmer’s market? Or is it a Fortune 500 company that’s going to be throwing transactions at you all the time from around the world? Those are very different situations. Do you know what I mean?
Brian: Okay. So I love this specific thing. So it’s always a business or a goal kind of trade-off is the takeaway here.
Brian: Do you have any rules of thumb for how to know? For example, businesses care about money and time. How do you know the money and time costs when moving between these thresholds?
Steve: We don’t have a lot of good data. It’s more just like a sense, but the rule of thumb that we use is that every additional nine you add to a system will cost you ten times as much as the last one did. So say you move from something like 90% available, just running on your laptop, where you close the lid and it’s not available anymore, to the cloud, where the single VM that it’s running on is 99% or more like 99.9% available. That takes more effort. And the way to reason about it is really like 10X per nine. So-
Brian: That’s huge.
Steve: Yeah. It’s a big deal.
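To put rough numbers on the rule of thumb Steve describes, here is a minimal Python sketch. It converts a count of nines into a yearly downtime budget and applies the 10X-per-nine multiplier. The function names and the baseline of one cost unit at one nine are illustrative assumptions, not figures from the episode.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(nines: int) -> float:
    """Availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime budget implied by a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


def relative_cost(nines: int, baseline_nines: int = 1,
                  baseline_cost: float = 1.0) -> float:
    """Rule-of-thumb cost: each extra nine costs 10x the previous one.

    The baseline (1 cost unit at 1 nine) is a hypothetical starting point.
    """
    return baseline_cost * 10 ** (nines - baseline_nines)


if __name__ == "__main__":
    for n in range(1, 6):
        print(
            f"{n} nines: {availability(n):.5%} available, "
            f"~{downtime_minutes_per_year(n):,.1f} min/year down, "
            f"~{relative_cost(n):,.0f}x baseline cost"
        )
```

Read this way, three nines leaves a budget of roughly 526 minutes of downtime a year, and five nines leaves only about five minutes, which is why each step up is so much more expensive to reach.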
Carter: I like this little, simple heuristic to reason about it, but I’m still a bit lost. Like take the internet, how reliable is the internet? Can we ground these nines in something simple, like just browsing the web? How reliable is that?
Steve: Yeah. Great question. Have you ever turned off your phone?
Carter: Rarely, but yes.
Steve: Yeah. The battery sometimes dies, right? Phones lose connectivity. Have you ever been in a dead spot?
Carter: Mm-hm (affirmative).
Steve: Right? So your phone doesn’t have 100% reliability in letting you do stuff on the internet. So if you think about it that way, if you’re building a system just for yourself on the cloud, you wouldn’t want to invest in making it five-nines reliable because your pipe to that service is over like your phone and like cell towers and stuff. You’re going to be driving through a tunnel sometimes.
Steve: And so why make a five-nines service for a three-nines tunnel? We start to have higher availability because many of these services are actually for server-to-server communication in a data center, where they have much better connectivity than you do between your computer or your phone and the service. And the other part of it is like, most of the time you have more than one customer. So if my connectivity to the service is no good, most of the other people’s service is still good during that minute or during that second.
Steve: So, to have these high-end services, you have to think of the big picture. Who’s the customer? How many of the customers are there? How often do they expect it to be up or not? And the last thing you want to consider is that your service or your business probably consists of many different services. And if you have many endpoints that your service or your customer may hit, some of them might be super important. Some of them might be way less important.
Steve: So if I need to sign up for a new account and it’s down, I can retry. Not a big deal. But if I’m sending you thousands of transactions per second and I want you to give me money back, that should probably be more up than the signup flow. Do you know what I mean?
Brian: Okay. So this makes sense. So we’re building this kind of more reliable system out of layers of less reliable bits. And because we have lots of users, our business needs to do it. But like, so we talked about a little bit of it, like in the docs, you can find out roughly how reliable a VM is. But when you start to build these shapes on top of that, how do I know how reliable a particular pattern I’ve built is? Like a load balancer in front of a service or something.
Steve: Yeah. Great question. Because most interesting businesses, at least that I know of, are not running on a single VM. They’re not just like one thing. And if you look at the cloud offerings, there are all kinds of stuff. There are Pub/Sub and databases and all kinds of stuff. And your job as an architect or designer or an engineer at a company that’s using the cloud is to come up with what we call an architecture, right? The set of services that all kind of stick together, some of them are like, you put code onto a compute platform of some kind.
Steve: Another one is like, you put a schema onto a database or a data store of some kind. Sometimes it’s like a message passing system. So you have to define the message queues. So this is your architecture. So the trick with architecture is you’ve got a whole bunch of components, and you can compose them in tons of different ways. And some of them are complete disasters. Some of them do not work well. So now the question you asked is like, how do we know if we’re building a good one? And just like any software development, the best plan is to use a well-known pattern.
Steve: So we have these design patterns in computer science. We have similar design patterns in cloud architecture as well. So there’s a paper that came out a while ago, sometime this year. And it was about the deployment archetypes in the cloud. And it gives you a couple, like, I think there’s like six or eight different models. And it starts simple, build one thing in one region, and it builds up from there. So some of the more common architectures are like if you have a stack of services, and by services, I mean your web server and your database and things like that, you can have one stack in one region and one stack in another region and a load balancer in front of those two things.
Steve: And you can get complicated beyond that. You can have things like failover between those two be automated. You can have like hot and cold. You can have hot and warm. You can have hot and hot, and you can have more than just two, right? You could have different deployments and have them all be active at any given time. And you can shard your customers across them and all that. So take a look at that paper. It’s not GCP architecture-specific. It gives you the pattern that lets you know what you can get out of this archetype.
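One rough way to reason about the two-region pattern Steve describes is to combine component availabilities. The Python sketch below is a back-of-the-envelope model, not a method from the paper he mentions: it assumes components fail independently and that failover between regions is instant and perfect, both of which are optimistic. The availability figures and the names `serial` and `parallel` are hypothetical.

```python
def serial(*availabilities: float) -> float:
    """All components must be up (e.g. web server AND database in one stack)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """At least one replica must be up (e.g. region A OR region B)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1 - a
    return 1 - all_down


if __name__ == "__main__":
    # Hypothetical numbers: a three-nines web server and a three-nines database.
    one_stack = serial(0.999, 0.999)
    # The same stack deployed in two regions, either of which can serve traffic.
    two_regions = parallel(one_stack, one_stack)
    # A (hypothetical) four-nines load balancer sits in front of both regions.
    whole_system = serial(0.9999, two_regions)

    print(f"one stack:     {one_stack:.6f}")
    print(f"two regions:   {two_regions:.6f}")
    print(f"with balancer: {whole_system:.6f}")
```

Two things fall out of this model. Chaining components in one stack lowers availability, while replicating the stack raises it sharply. And once the regions are redundant, the shared load balancer in front becomes the limiting factor, so it needs to be the most reliable piece.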
Carter: I love that. And I love learning general principles for architecting and designing systems. So I’m going to check that out. But this leads me back to something you said earlier, getting a little less abstract. If I were to take my payment processor application just for myself and my friends from before, now I’ve got my first big customer. And they’re going to throw thousands of receipts and invoices at me. How do I start going from my current offering to where it needs to be to meet these new demands?
Steve: I love pinning it to this real problem. And the pattern you’re talking about where you start with a few customers, and you slowly grow, and you start to make business agreements with more customers, that is common across everybody, right? This is how businesses grow, and it’s the right way to do it because you don’t want to spend 15 years building this thing before you have your first customer. So growing with your customer demands is the right way to do it.
Steve: In terms of actually getting you to land that contract with BigCo and handle the five nines or whatever it is that they’re putting in your contract, this takes a lot of what we call resilience engineering or reliability engineering. The way to think about it is that you’re going to be building what we call a platform, and this is going to be a platform of capabilities. So these capabilities are going to be technical things like you can deploy code to your services quickly and effectively and safely. So you’ve heard of CI/CD, for example. A CI/CD pipeline is a set of capabilities in your platform.
Steve: Other things like observability and monitoring, which means determining what is going on [in your system and] why is this one weird thing slow? Being able to look at a graph, drill into it and check out a trace or a profile of a live running system and understand it, and having your team be able to grok what’s going on under the covers and use the tools they have at their disposal. This is all part of your platform. And you’re building out capabilities as a team to run this super complicated complex system. And the more capabilities you have, the better shot you have at achieving it.
Brian: So if we want to handle those requests from BigCo coming in, everything downstream of that needs to be highly reliable to the same degree, right?
Steve: The answer’s no. It’s one of these surprising things where what matters to your business is the front door, right? So BigCo has a contract with us, and whenever they send us a thing, we have to respond within some time. But does that mean that the database behind our API needs to be six-nines and the infrastructure that database is on needs to be at seven-nines and then the power supply behind that and —
Steve: No, that is not the case. So this is good news. You’re able to build more reliable components on top of less reliable components. This is like a general abstraction that you can achieve. So if your front door, you want it to be five nines, you can get away with having like four nines or even a three nines backend. And the way you do that is through some resilience capabilities. A simple one is purely adding retries to your system.
Steve: So say your front end is up and your backend is down briefly. When your front end makes a request to your backend and it goes, ugh, error, instead of returning a response to the customer you have an agreement with that says, I don’t know what happened, that’s a bummer, just wait. Wait for 200 milliseconds or some amount of time and try it again. If it still doesn’t work, wait a little longer. As long as you’re not exceeding some sort of contractual deadline, you can keep retrying, and you can still give back a correct response, and the customer will see that as a success.
Steve: So this is another form of how we kind of hide errors behind layers of abstractions. And basically, you’re trading off time for reliability. And if that’s okay with your business, do it; it’s just another tool in your toolbox.
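The retry idea Steve outlines can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production client: `call_with_retries`, `DeadlineExceeded`, the 2-second deadline, and the doubling backoff are all hypothetical choices, with the 200-millisecond first wait taken from his example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeadlineExceeded(Exception):
    """Raised when another retry would not fit inside the deadline."""


def call_with_retries(
    request: Callable[[], T],
    deadline_s: float = 2.0,       # hypothetical contractual deadline
    initial_wait_s: float = 0.2,   # "wait for 200 milliseconds"
    backoff: float = 2.0,          # "wait a little longer" each time
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call request(), retrying on errors until the deadline would be exceeded."""
    start = clock()
    wait = initial_wait_s
    while True:
        try:
            return request()
        except Exception as err:
            elapsed = clock() - start
            # Give up if the next wait would push us past the deadline.
            if elapsed + wait > deadline_s:
                raise DeadlineExceeded(
                    f"gave up after {elapsed:.2f}s"
                ) from err
            sleep(wait)
            wait *= backoff


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_backend() -> str:
        # Hypothetical backend that fails twice, then recovers.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("backend briefly down")
        return "payment accepted"

    print(call_with_retries(flaky_backend), "after", attempts["count"], "attempts")
```

Under the optimistic assumption that attempts fail independently, a backend that fails 1 request in 1,000 fails two attempts in a row about 1 time in 1,000,000, which is how a three-nines backend can sit behind a higher-nines front door. Real outages are often correlated, so treat that as an upper bound on the benefit. A real client would usually also add random jitter to the waits and retry only errors that are safe to retry, which matters for payments, where a repeated request must not charge twice.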
Carter: I love this idea of basically planning for the imperfect. We talked about this before, and I’m sure there’s a bunch of different avenues that we could go and look at this, like geography, there could be points of failures there, or like you said, with time, using time as a trade-off. The thing I’m curious about is it sounds like there’s a lot that goes into this. You talked about a team of people being involved, and you are an SRE.
Carter: So it seems like there’s more than just technology. And I’ve been very focused on that, but there are also people involved, and maybe even other elements involved in reliability. Can you talk about that a little bit?
Steve: Yeah. When we talk about large systems, we tend to say people, process, and technology. So this is an excellent example of it. So being able to understand a system, improve the system, and build the system, that’s all being done by humans, right? And the model that Google came up with is called SRE. So many of the principles Google used to build systems, scale them, and not burn out all the humans in the meantime have been written down. And the specifics of what Google SREs would do in the past inside of Google don’t apply directly to what customers would do because it’s like an entirely different system.
Steve: And customers aren’t running web search on Borg or anything like that. They’re running the payment processor on Kubernetes on the cloud. But the principles are the same, can be applied anywhere, and are consistent. So these tools, processes, and culture of how people work together are bundled together as SRE. And there’s a bunch of books out there you can get that explain all this, and it’s becoming more and more popular.
Brian: Awesome. So we’ll put some links to the SRE books and the article you mentioned earlier in the show notes. As far as doing this, when you want to push your app up to another threshold of nines, we’ve got the people, process, tech part, and we’ve got these reliability patterns. And some of those patterns, I think, are built into the system, as we talked about last time, in the VM abstraction. And some of them are things we have to handle with our app architecture itself. Does that sound fair?
Steve: Totally. There’s always the idea of build versus buy. And a lot of these capabilities, you’re just getting out of the platform. So these capabilities that I talked about, if you’re running on-prem on machines that you bought a couple of years ago and running your own software on them, you’re not going to get live migration out of it. You could build it, or something like it, and it might take a while. But instead, if you take that concern and hand it off to the cloud, you get this capability. So there are a lot of things out there that you can get as a product, and that gives you that capability for your platform. And now you can do that thing.
Steve: Another similar one is if you have a layer seven load balancer, a global L7 load balancer, you’re able to siphon traffic from one region to another kind of magically. It’s amazing. And doing this on the wild west internet is hard because it’s like, you got to deal with different autonomous systems and BGP and DNS and all this crazy stuff. But with Google’s L7 load balancer, it’s pretty straightforward. And you take advantage of this massive set of systems out there, this front-end technology that Google uses for its entire set of services.
Carter: I’ve got to thank you, Steve and Brian. This is my first time diving into SRE principles and the thinking behind reliability. And so I learned so much, whether it’s about the people and the process and the technology, whether it’s about having non-abstract or less abstract architectures and thinking it through. So thank you. I’m going to check out that paper about the different patterns and for people listening at home, if you check that paper out, tell us what you liked or didn’t like about it in the comments. Thank you so much.
Special thanks to Steve McGhee, Reliability Advocate at Google, for being this episode’s guest!