The top-level theme of this topic is Engineering for Failure, which presumes that in a world of complex distributed systems supporting a business that is also constantly changing, failure happens. It just does.
Failure happens all the time. If you think you can prevent it, you’re thinking about the problem wrong.
And the more the system scales horizontally and vertically, the more likely failure is to happen. I’m going to do a shallow dive into one practice that is a bedrock, foundational practice for driving learnings and improvements from system failure. That practice is Correction of Errors (COE). I’m going to attempt this in one post that is as short as it can possibly be, and no shorter. But bear in mind this is a summary of a subset of the process.
For many years at Amazon, the Correction of Errors process was not publicly talked about. This process, and the vast internal repository of learnings accumulated over the years, is more than likely one of the most valuable operational assets the company owns. Amazon has an internal system that stores, manages, and runs its COE process, with machine learning that links new findings to previous COEs and automated ticketing that aligns engineering organizations around findings and drives completion of action items. In a data-driven culture, it is essential to have real data that can be pointed to as the basis for core operating and engineering principles. The data contained in the COE system is used by engineers across the company to back up engineering and tradeoff decisions around why this feature or that attribute of a system needs to be built in a certain way. The COE repository provides real, documented evidence of failures that happened, the reasons why, and what actions were taken to make the affected systems more resilient or to further limit the blast radius of failure.
The wrong way to do COEs: It’s important to understand that human nature and bad practices can introduce human bias into the COE process. For example, post-accident attribution to a ‘root cause’ is fundamentally wrong. Complex systems often have multiple failures that occur together to contribute to an incident, and it can be very tempting to point to one thing and completely overlook how bad all of the other contributing factors were. Hindsight bias is also very easy to fall into, and it causes us to look at the problem as if we understood what the system looked like before the incident happened. We miss a lot when we look at an incident that way. The danger of both of these tendencies is a false sense of security: the belief that if we “just fix that one thing” or “if we had just done this other thing,” the entire system will be safer as a result. I highly recommend reading Dr. Cook’s simple treatise, How Complex Systems Fail, for a high-level view of how hubris leads to improper practices in dealing with failure.
Blast Radius: We can’t really talk about Correction of Error without talking about a more basic and fundamental concept that is assumed and is critical to engineering for failure: limiting blast radius. Since nearly all outages are caused by changes, blast radius is all about limiting the potential impact of any given change that causes failure. There are at least two dimensions to limiting blast radius: exposure and time. Limiting blast radius along the exposure dimension means you have built the ability to roll a change out gradually, so that if a defect is detected, the impact is only a subset of what it would have been had the change gone out everywhere at once. The second dimension, time, examines how quickly an incident can be detected and mitigated, and looks at the measures available and/or used to both detect problems and quickly roll back the affected changes.
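To make the exposure dimension a bit more concrete, here is a minimal sketch in Python of a percentage-based rollout gate. All of the names (`in_rollout`, `new_code_path`, the stage values) are made up for illustration; the only point is that a change is visible to a deterministic slice of traffic, and that slice is widened in stages as confidence grows.

```python
import hashlib

# Sketch: limit the "exposure" dimension of blast radius by showing a change
# to only a percentage of traffic, widened in stages as confidence grows.
# Stage values and function names are illustrative.
ROLLOUT_STAGES = [1, 5, 25, 50, 100]  # percent of traffic; advance after each stage bakes

def in_rollout(user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into the current rollout percentage."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

def handle_request(user_id: str, rollout_percent: int) -> str:
    if in_rollout(user_id, rollout_percent):
        return new_code_path(user_id)   # the change being rolled out
    return old_code_path(user_id)       # known-good behavior

def new_code_path(user_id: str) -> str:
    return f"new behavior for {user_id}"

def old_code_path(user_id: str) -> str:
    return f"old behavior for {user_id}"

# At the 5% stage, a defect in new_code_path affects roughly 1 in 20 users.
print(handle_request("user-42", rollout_percent=5))
```

Real systems usually pair this with per-stage bake times and automated rollback, but the shape is the same: exposure is a dial, not a switch.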
Given an understanding that finding root cause is not the most important outcome of a COE, a belief that failure in complex systems happens all the time, and the recognition that the key to making change safe is limiting blast radius, we arrive at the three most important questions asked in the COE process. In most COE reviews, these questions need answers before any others. It can be very tempting to focus on root cause or the PR that caused or fixed the issue, but these questions are ultimately the most important, and they drive the intentional design and structure of a well-written COE document.
“What Happened?” COEs are about forensic analysis, and that doesn’t happen without data. Data is critical to a good COE document. Good data eliminates hunches, discourages bias, and focuses the discussion on customer and business impact. The most important data are: (1) The one metric that clearly demonstrates the outage. If this can’t be produced, it likely points to an observability gap that needs correction. A good COE has a graph, with a link to the source data, that clearly demonstrates the impact based on some operational metric, and it is even better if a financial impact can also be tallied. This data demonstrates the “exposure” dimension of the incident’s blast radius. (2) The timeline. Without a precise timeline, it is very hard to understand the “time” dimension and ask the right questions about all of the factors that contributed to how long it took to mitigate the incident. The timeline is key to many of the best learnings from a COE, and it’s important to refer to it often when asking the two most important sets of questions, which come next…
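As an illustration of why the timeline matters, here is a small Python sketch that keeps the timeline as structured data so time-to-detect and time-to-mitigate fall straight out of it. The events, timestamps, and field names are all hypothetical, not a prescribed COE format.

```python
from dataclasses import dataclass
from datetime import datetime

# Sketch: keep the COE timeline as structured data so the "time" dimension
# (time to detect, time to mitigate) is computed, not eyeballed.
# Events, timestamps, and field names are illustrative only.
@dataclass
class TimelineEvent:
    at: datetime
    description: str

timeline = [
    TimelineEvent(datetime(2024, 3, 1, 14, 2), "Deployment of build 1234 begins"),
    TimelineEvent(datetime(2024, 3, 1, 14, 9), "Checkout error rate exceeds 5% (impact start)"),
    TimelineEvent(datetime(2024, 3, 1, 14, 31), "On-call paged by high-severity alarm"),
    TimelineEvent(datetime(2024, 3, 1, 14, 55), "Deployment rolled back; error rate recovers"),
]

impact_start, detected, mitigated = timeline[1].at, timeline[2].at, timeline[3].at
print(f"Time to detect:   {detected - impact_start}")    # 0:22:00
print(f"Time to mitigate: {mitigated - impact_start}")   # 0:46:00
```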
“How did you detect the incident? As a thought exercise, what could you have done to cut that time in half?” Incident detection is the key first step to restoring service. A sure sign of poor monitoring is the system breaking and it taking hours for a customer or some random employee to happen to discover it’s down. Meanwhile, customers have likely been unable to use the system for those hours with no way to tell us it’s broken. This question often raises important questions about whether the right monitors are in place and whether they are tuned appropriately. For a primary system outage, on-call engineers should ideally be paged within 2-5 minutes. If the metrics are noisy, it’s a good time to ask whether we have the right data to build a monitor that goes off at the right time.
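As a rough sketch of that tuning tradeoff, the snippet below (all names and thresholds are assumptions, not a real monitoring API) pages only when several of the most recent datapoints breach the threshold, so a noisy metric doesn’t page on a single blip, but a sustained outage still pages within a few minutes.

```python
from collections import deque

# Sketch of the tuning tradeoff: page only when M of the last N one-minute
# datapoints breach the threshold, so noise doesn't page but a sustained
# outage still does within a few minutes. Names and thresholds are made up.
class ErrorRateAlarm:
    def __init__(self, threshold: float, window: int = 5, breaches_to_fire: int = 3):
        self.threshold = threshold
        self.breaches_to_fire = breaches_to_fire
        self.datapoints = deque(maxlen=window)  # most recent N one-minute datapoints

    def record(self, error_rate: float) -> bool:
        """Record a datapoint; return True if the on-call should be paged."""
        self.datapoints.append(error_rate)
        breaches = sum(1 for r in self.datapoints if r > self.threshold)
        return breaches >= self.breaches_to_fire

alarm = ErrorRateAlarm(threshold=0.05)
for minute, rate in enumerate([0.01, 0.08, 0.02, 0.09, 0.11, 0.12]):
    if alarm.record(rate):
        print(f"Page on-call at minute {minute}")  # first page at minute 4
```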
“How did you mitigate the incident? As a thought exercise, what could you have done to cut that time in half?” Time to mitigate is key to understanding how well incident management is done. During the time from detection to mitigation, root cause analysis should not necessarily be the focus. If the team knows what is not working, hopefully they have a lever in place that can mitigate quickly before root cause analysis is done. This could include rolling back a deployment, turning a feature gate off, or pulling a lever in dynamic config. Good on-call practices emphasize, where possible: mitigate first, diagnose root cause later. This question generally probes the operational levers that have been built to stabilize the system or turn off problematic behaviors.
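Here is a minimal sketch of one such lever, assuming a hypothetical service that reads a feature gate from a dynamic-config file (the path, gate name, and functions are invented for the example; real systems typically use a dynamic-config service). The point is that the risky code path can be switched off in seconds, without a deploy and without waiting on a root-cause diagnosis.

```python
import json
from pathlib import Path

# Sketch of a mitigation lever: the risky code path sits behind a feature
# gate read from dynamic configuration, so on-call can turn it off in
# seconds without a deploy. The path, gate name, and functions are made up.
CONFIG_PATH = Path("/etc/myservice/dynamic_config.json")

def gate_enabled(gate_name: str, default: bool = False) -> bool:
    """Read the gate from dynamic config, failing closed to the default."""
    try:
        config = json.loads(CONFIG_PATH.read_text())
        return bool(config.get("feature_gates", {}).get(gate_name, default))
    except (OSError, json.JSONDecodeError):
        return default

def compute_recommendations(user_id: str) -> list:
    if gate_enabled("new_ranking_model"):
        return new_ranking_model(user_id)   # the risky change, behind the gate
    return baseline_ranking(user_id)        # known-good fallback

def new_ranking_model(user_id: str) -> list:
    return [f"new-ranked-item-for-{user_id}"]

def baseline_ranking(user_id: str) -> list:
    return [f"baseline-item-for-{user_id}"]

# Flipping "new_ranking_model" to false in the config mitigates immediately;
# root cause analysis can follow once customers are no longer affected.
print(compute_recommendations("user-42"))
```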
After incident response is covered, it’s a good time to look at the root cause and ask another important question: What was the blast radius of the change that caused the outage? Is there any way the blast radius of that change could have been cut in half or less? Tools and practices around reducing blast radius probably deserve a lot more discussion, but you know you are getting better at it when you never have scheduled downtime (the very definition of a large blast radius) and teams are using feature gates or dials for any change of significance or any change in an area of risk to the system.
In summary, one of the key elements of a good COE review is a solid focus on the various questions around blast radius. To that end, the timeline, incident detection, and incident response are important areas to dive deep on. A second key area is examining the blast radius of the change that initiated the incident; unlike the incident itself, that change was planned up front. Is there any way its blast radius could have been reduced?
Pro tip: Preparing for a big launch? Wondering what you need to do to prepare for that next huge release? Try this: create a simulated COE for a potential failure of the system, anything from someone pushing a bad commit to a database running out of memory. Even though you’ll likely never guess what actually will go wrong, it’s surprising what can be learned by thinking through the process of incident response for a failure you think may never happen, and what the timeline would look like if it actually did. This is a great team-level exercise, and it helps greatly with those key areas of monitoring, incident response, dependencies on other teams, and fallback strategies.
External References:
“Correction of Error” (AWS Well-Architected Framework)
“How Complex Systems Fail” (Richard I. Cook, MD) - https://how.complexsystems.fail/