A homeowner’s to-do list may be never-ending, but for Lowe’s, the busiest period is by far the week of Black Friday and Cyber Monday (BF/CM). As Site Reliability Engineers (SREs), we work hard to provide customers with a flawless experience from first click to checkout — especially during times of high demand.
As part of our Total Home Strategy, Lowe’s continues modernizing its online business following its 2019 digital transformation with Google Cloud. After implementing the SRE Framework back in 2020, Lowe’s SRE team launched a new BF/CM readiness strategy to take full advantage of automation and microservices. This year, we began planning and strategizing with Google Cloud months in advance, leading to another successful BF/CM.
Our readiness strategy entails five core pillars:
- Collaboration with business and cross-functional teams
- Chaos engineering
- Performance engineering
- Capacity planning
- Bot management
Each of these five pillars is critical to maintaining the reliability and availability of the Lowes.com website, and for BF/CM to succeed without impacting customer experiences, all must go off without a hitch.
Collaboration and communication
At the core of any successful event is clear communication across the different teams, stakeholders, and vendors.
Business team partnership
As SREs, we calculate how business decisions impact the site’s traffic by maintaining high visibility between the business’s goals and how IT can carry them out. For instance, if the marketing department plans to send a push notification advertising holiday deals at 3:30 pm on Friday, our team is aware of the schedule and anticipates the traffic increase to the different Lowes.com shopping and purchase funnels.
Once the SRE team has insight into business marketing strategies and forecasts for BF/CM, we begin capacity planning.
Managing change through communication
Maintaining clear lines of communication and hierarchy is essential to executing a successful shopping event. As part of Lowe’s culture change, we have a change management process and a governance board to centralize decision-making and mitigate system errors. Because most issues or incidents come from changes, we keep every modification across the site observable, and stakeholders follow established procedures to assess, deploy, and roll back changes if problems arise.
To ensure optimum efficiency during our Black Friday events, on November 1st, we implement a sitewide frost — changes are allowed, but only those critical to the ecosystem. To prevent any change-related vulnerabilities, we enter our sitewide freeze around mid-November — we don’t deploy any changes and instead enter a hyper-care mode with our internal planning partners to determine if they need additional scaling or resources.
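To make the frost and freeze rules concrete, here is a minimal Python sketch of a deployment gate. It is an illustration only, not Lowe’s actual tooling: the names `Phase`, `phase_on`, and `change_allowed` are hypothetical, the freeze start date is assumed to be November 15 because the text says only "around mid-November", and the freeze end date is a guess.

```python
from datetime import date
from enum import Enum


class Phase(Enum):
    NORMAL = "normal"
    FROST = "frost"    # only critical changes are allowed
    FREEZE = "freeze"  # no changes are deployed


# Assumed dates for illustration. The text gives November 1 for the frost
# and only "around mid-November" for the freeze; the end date is a guess.
FROST_START = date(2022, 11, 1)
FREEZE_START = date(2022, 11, 15)
FREEZE_END = date(2022, 11, 29)


def phase_on(day: date) -> Phase:
    """Return the change-control phase in effect on a given day."""
    if FREEZE_START <= day <= FREEZE_END:
        return Phase.FREEZE
    if FROST_START <= day < FREEZE_START:
        return Phase.FROST
    return Phase.NORMAL


def change_allowed(day: date, is_critical: bool) -> bool:
    """Decide whether a change may be deployed on a given day."""
    phase = phase_on(day)
    if phase is Phase.FREEZE:
        return False
    if phase is Phase.FROST:
        return is_critical
    return True
```

A deployment pipeline could call `change_allowed(date.today(), is_critical=...)` before each release and route rejected changes to the governance board instead of failing silently.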
In the months leading to BF/CM, Lowe’s SREs and Google Cloud conduct engineering tabletop games to replicate previous high-pressure cases. We run these simulations so each team member knows their role in the event of an incident and can rehearse procedures in a controlled environment. Furthermore, the exercises reinforce the reporting and communication hierarchy in high-stress situations, a critical feature in reducing our mean time to acknowledge incidents from 30 minutes in 2019 to one minute in 2022 – a 97% decrease.
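The 97% figure follows directly from the two acknowledgment times quoted above. A short check, where the helper name `percent_decrease` is ours:

```python
def percent_decrease(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


# Figures from the text: 30 minutes (2019) down to 1 minute (2022).
mtta_reduction = percent_decrease(30, 1)
print(f"{mtta_reduction:.1f}%")  # prints 96.7%, which rounds to 97%
```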
Downstream and third-party vendor interactions
Even with the website and services fleet effectively optimized and prepared for the influx of customers during the BF/CM event, there’s always something for SREs to do. We partner with more than 20 enterprise and third-party vendor teams on initiatives that ensure a seamless browsing experience.
Once our team establishes alignment across the different stakeholders, it’s time to begin stress-testing and optimizing our infrastructure.
Building game-days momentum (chaos engineering)
Although planning for the BF/CM event formally begins in June, our SRE team starts testing the resilience of our technology ecosystem well before then. At the beginning of February 2022, the team began instituting weekly chaos engineering game days to identify shortcomings within the software components powering Lowes.com selling channels. Chaos engineering is the practice of intentionally introducing failures, traffic spikes, and disruptions into a network environment to understand how it behaves under adverse conditions. Before 2022, our team ran chaos game days only three or four times in advance of the BF/CM events. By running weekly chaos exercises against different aspects of the technology ecosystem and services, our team proactively identified critical vulnerabilities for engineers to fix while improving resiliency in real time.
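The core idea of fault injection can be shown in a few lines of Python. This is a toy sketch under stated assumptions, not the tooling our game days use: `fetch_price` is a hypothetical stand-in for a downstream service, and `with_chaos`, `price_with_fallback`, and `run_game_day` are names invented for the example. The experiment checks that the caller degrades gracefully when a share of dependency calls fail.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper in place of a real dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(sku):
    """Hypothetical downstream call; stands in for a real pricing service."""
    return {"sku": sku, "price": 19.99}


def price_with_fallback(fetch, sku):
    """Behavior under test: serve a degraded response when the call fails."""
    try:
        return fetch(sku)
    except InjectedFault:
        return {"sku": sku, "price": None, "degraded": True}


def run_game_day(calls=1000, failure_rate=0.3, seed=42):
    """Run the experiment and count how many calls were served degraded."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_chaos(fetch_price, failure_rate, rng)
    results = [price_with_fallback(flaky_fetch, f"sku-{i}") for i in range(calls)]
    degraded = sum(1 for r in results if r.get("degraded"))
    return {"calls": calls, "degraded": degraded}
```

If `price_with_fallback` lacked its `except` branch, the injected faults would surface as unhandled exceptions, which is exactly the kind of shortcoming a game day is meant to expose before peak traffic does.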
Regular and varied exercises, such as chaos game days and traffic spikes, prepare the system for the worst while keeping our team agile and responsive.
Engineering for performance
At Lowe’s, we use continuous performance engineering techniques throughout the year to identify bottlenecks within the system architecture. BF/CM-specific performance exercises began in August, and by the time we approached October, Lowe’s SRE team had conducted more than 35 separate performance tests. These covered several industry-standard variations, such as stress tests under extreme workloads and endurance tests to identify long-term performance issues.
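For readers unfamiliar with these test types, the following Python sketch shows the basic shape of a load driver: fire many concurrent requests and report a tail-latency percentile. It is a simplified illustration, not our test harness. `handle_request` is a hypothetical stub for the service under test, and the nearest-rank percentile is one of several common definitions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(_request_id):
    """Hypothetical stand-in for the service under test."""
    time.sleep(0.001)  # simulate roughly 1 ms of work
    return 200


def timed_call(request_id):
    """Call the handler once and return (status, latency in seconds)."""
    start = time.perf_counter()
    status = handle_request(request_id)
    return status, time.perf_counter() - start


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def run_load(total_requests, concurrency):
    """Send total_requests to the handler using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    latencies = [latency for _, latency in results]
    errors = sum(1 for status, _ in results if status != 200)
    return {
        "requests": total_requests,
        "errors": errors,
        "p95_seconds": percentile(latencies, 95),
    }
```

In this framing, a stress test raises `concurrency` well past the expected peak, while an endurance test holds a moderate level for a long time and watches whether latency or error counts drift upward.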
Like one would exercise a muscle, managing a massive fleet of services powering online selling channels requires consistent effort, maintenance, and attention.
Capacity planning
Capacity planning determines the resources needed to support expected traffic and user activity levels, such as server capacity and bandwidth. Throughout the year, we continuously adjust our plans based on the changing needs of our customers and systems, but it’s a different experience preparing for the biggest sales week of the year. We score all the business goals, prioritizing them based on available resources, and schedule increases in server capacity and compute in line with product promotions.
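A back-of-the-envelope version of this calculation might look like the Python sketch below. Every number and name in it is an assumption made for illustration: the `Promotion` fields, the 30% headroom, the per-instance throughput, and the instance budget are hypothetical, not Lowe’s figures.

```python
import math
from dataclasses import dataclass


@dataclass
class Promotion:
    name: str
    expected_peak_rps: int  # forecast requests per second at peak
    priority: int           # lower number means higher business priority


def instances_needed(peak_rps, rps_per_instance, headroom_pct=30):
    """Instances required to serve peak_rps with spare headroom."""
    return math.ceil(peak_rps * (100 + headroom_pct) / (100 * rps_per_instance))


def plan_capacity(promotions, rps_per_instance, max_instances):
    """Allocate instances to promotions in priority order within a budget."""
    plan, remaining = {}, max_instances
    for promo in sorted(promotions, key=lambda p: p.priority):
        wanted = instances_needed(promo.expected_peak_rps, rps_per_instance)
        granted = min(wanted, remaining)
        plan[promo.name] = granted
        remaining -= granted
    return plan


promotions = [
    Promotion("email-campaign", expected_peak_rps=2000, priority=2),
    Promotion("push-notification", expected_peak_rps=5000, priority=1),
]
print(plan_capacity(promotions, rps_per_instance=100, max_instances=80))
# prints {'push-notification': 65, 'email-campaign': 15}
```

Here the lower-priority campaign receives 15 of the 26 instances it would need, which is the signal to either raise the budget or reschedule the promotion.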
With SREs having visibility into business goals, planning for seasonal traffic growth becomes easier, while optimizing our engineering resources.
Blocking bad actors using improvised bot management
Today, a variety of bots, such as search engine crawlers, social networking bots, aggregator crawlers, or other monitoring bots, make up two-thirds of all internet traffic. However, the malicious bots that attack user accounts, scrape data, and bombard infrastructure are hidden amongst the routine scanning and monitoring bots. Implementing anti-bot software tools, such as a Web Application Firewall (WAF), not only provides granular control over which bots can gain entry to our site, but also automatically excludes malicious algorithms and evasive bots.
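The two ideas in this paragraph, admitting known good bots and excluding abusive ones, can be sketched in Python as an allowlist plus a per-client sliding-window rate limit. This is a deliberately simplified illustration and not how a commercial WAF works internally. The `BotFilter` class, the allowlist entries, and the thresholds are assumptions for the example.

```python
from collections import defaultdict, deque

# Hypothetical allowlist. Real deployments verify crawlers more rigorously
# than by user-agent string, which is trivial to spoof.
ALLOWED_BOT_MARKERS = ("Googlebot", "bingbot")


class BotFilter:
    """Allow known crawlers and rate-limit every other client."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times

    def decide(self, client_id, user_agent, now):
        """Return "allow" or "block" for one request at time `now` (seconds)."""
        if any(marker in user_agent for marker in ALLOWED_BOT_MARKERS):
            return "allow"
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return "block"
        hits.append(now)
        return "allow"
```

With `BotFilter(max_requests=3, window_seconds=10)`, a client sending four requests in three seconds has its fourth request blocked, then is allowed again once the earlier requests fall outside the window.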
It takes a village
While software tools are critical in addressing unforeseen issues, people matter just as much. Our Technical Account Manager (TAM) set up shop in Lowe’s recently opened Tech Hub to provide on-site support, which made a real difference. With the TAM by our side, we have a real-time advocate within Google Cloud to ensure we receive the highest-priority support during the most critical moments of the week-long event.
With the 2022 Black Friday/Cyber Monday event in the rearview mirror, our team is already preparing for the 2023 holiday. In partnership with Google Cloud, the Lowe’s SRE team is fulfilling Lowe’s Total Home Strategy by providing customers with the best Lowe’s experience online. The success, and continuing availability, of the Lowe’s website during BF/CM proves that collaboration, communication, and optimization are crucial tentpoles to an enjoyable website experience.
Special thanks to Prasanna Singaraju, Rajat Khanna and the entire Lowe’s E-commerce Site Reliability Engineering team for contributing to this blog post.