Editor’s note: Today we hear from the Lowe’s SRE team. They share about how they have been able to increase the number of releases they can support by adopting Google’s Site Reliability Engineering (SRE) framework and leveraging their partnership with Google Cloud.
At Lowe’s, we’ve made significant progress in our multiyear technology transformation. To modernize our systems and build new capabilities for our customers and associates, we leverage Google’s SRE framework and Google Cloud, which helps us meet their needs faster and more effectively. With these efforts, we’ve been able to go from one release every two weeks to 20+ releases daily—about 20X more releases per month.
Our SRE transformation didn’t happen overnight, though. Every step along the way brought some challenges. But looking back, we are excited to see how much we have accomplished for our customers as a result.
Back in 2018, before adopting SRE practices, we were more reactive than proactive, following an “eyes on glass” approach. On-call structures and incident management efficiency were not at optimal levels with too many repetitive and manual tasks, resulting in operational toil. Production concerns were not surfaced into the product roadmap, which resulted in delays in making fixes.
Bootstrapping SRE at Lowe’s
As we moved from on-prem to Google Cloud, we decided to move from a monolithic- to microservices-based architecture. And to better manage this new architecture, we embarked on an SRE journey.
Then as COVID-19 hit, we really had to accelerate this journey as customers increasingly moved to online ordering and delivery to meet their Total Home Improvement needs. To do so, we followed four key principles that allowed us to meet changing customer needs quickly and release fast and reliably.
Automate away toil
As we moved from traditional Ops to an SRE ecosystem, our biggest opportunity was reducing toil, so that engineers can spend time on activities that drive business impact and customer outcomes. We think of toil as work that is manual, repetitive, tactical, devoid of enduring value—but automatable. So, to tackle toil, we focused on automating away the need for manual intervention. As an example, we made sure engineers were not the first point of contact for any alert. Any triage or resolution that an engineer can perform, a machine can be trained to do the same. We used supervised and unsupervised learning techniques to automate our toil. With a long-term goal of “no toil,” our SREs work on identifying and reducing toil to a manageable level across the organization.
Engineer alignment through roadmaps
Our goal is to maximize the engineering velocity of developer teams while keeping products reliable. We want an engagement model where product, SRE and development teams are closely aligned. A key way we’ve been able to create this alignment is by having our SREs embedded into domain and product teams. Each domain has an SRE, who is involved at the beginning stages of product development to ensure that the domain stakeholders are in alignment with the SRE initiatives. As such, SREs are able to improve the reliability, performance, scalability and launch velocity of the services throughout all phases of the service lifecycle.
Adopt one-touch releases
Our path to production used to contain many manual steps and validations, slowing the rate at which we released features. Additionally, we used to bulk all our releases together to deploy at once, which increased the risk of failure and created a longer feedback loop from production. To tackle this with an SRE mindset, we created a one-touch release process in which SREs review the product team’s pull requests. When approved, this triggers a DevSecOps pipeline that deploys the approved changes to production securely. This process created a safe, reliable and sustainable continuous delivery pipeline with quick feedback loops. Striking the right balance between speed, innovation and stability, we were able to increase our releases exponentially for the year, taking less than 30 minutes per release to deliver quality code, including various automated quality checks and processes, all in just one click.
Embrace capacity planning
To ensure our services have enough spare capacity to handle any surge in traffic patterns, our SREs emphasize capacity planning, making recommended capacity changes in the continuous delivery (CD) pipeline. They constantly monitor performance to make sure the service is robust, stable and available. And when there’s a sudden surge beyond the forecasted volume, SREs change the capacity on demand and document changes for the performance and domain teams.
Capacity planning is especially important for us during peak holiday times such as Black Friday and Cyber Monday (BFCM). We lay out our SRE stability plan three months in advance and surface into the domain team’s product roadmap. This way development teams are able to allocate sufficient engineering time to reliability. We do performance testing to ensure the environment is able to sustain increased load over long periods of time and also handle sudden surges in traffic. We also do region failover testing at a global scale to validate the automatic failover duration of service level agreements (SLAs), SRE and domain readiness. Additionally, we conduct Black Friday and Cyber Monday-specific destructive testing to validate customer experience, reliability and more.
Google Cloud’s Black Friday and Cyber Monday white-glove service played a key role in ensuring our success in both BFCM 2019 and BFCM 2020. This service included on-site visits from Google’s Customer Reliability Engineering (CRE) team who reviewed Lowe’s web architecture, capacity planning, operations practices for event risks, and presented workshops on topics such as incident response best practices.
There is always room for improvement, and at Lowe’s we aim to continuously improve our SRE practices. One thing that has worked well for us, which we plan to continue, has been our road shows, where senior SRE leads present to other SREs and application domain teams on the latest SRE principles and best practices, and to get input in real-time from them.
Google’s tools and methodology have played an instrumental role in helping reshape our SRE practices and better serve our customers. We look forward to building on the momentum and partnership as we continue our SRE journey at Lowe’s.
If you want to learn more about how to adopt SRE best practices on Google Cloud, check out our documentation. If you want to learn more about Google SRE, visit our website. Stay tuned for the next blogs with Lowe’s discussing how they trained their engineering talent to adopt SRE practices and tooling, and how they improved MTTR using SRE principles.
Cloud BlogRead More