In late October, Roblox’s global online game network went down in an outage that lasted three days. The site is used by 50 million gamers daily. Finding and fixing the root causes of the disruption took a massive effort by engineers at both Roblox and its main technology supplier, HashiCorp.
Roblox eventually published a detailed analysis in a blog post at the end of January. As it turned out, Roblox was bitten by a strange coincidence of several events. The processes Roblox and HashiCorp went through to diagnose and ultimately fix the problem are instructive for any company running a large-scale infrastructure-as-code installation or making heavy use of containers and microservices.