Availability Monitoring
Service monitoring is part art, part science. In this series of posts on metrics, I attempt to lay down some basic principles that I have found helpful without being too prescriptive.
Failure happens, all the time. If I had to pick one metric to monitor my service, it would be availability. I want this metric to be accurate, unvarnished, and real. This is the metric that any service owner should crave in order to observe and monitor their service, and it's the best way to define what "good" means from a resiliency perspective. The goal of monitoring this metric is two-fold: (1) For notable availability drops (below some desirable threshold), we need to be notified immediately so that we can restore service. (2) For long-tail availability drops (failures that happen all the time, but not at a level that triggers an operational response), we can use this metric to observe, find, and diagnose where the weak edges of our service are, and come up with new approaches, automated remediations, or necessary fixes that add more armor and resiliency to the service.
How we define availability goes a long way towards defining how we monitor it. Availability for any given request generally means that we have been able to send the service a request and we have received a valid response. How this is actually computed may slightly vary by service and use case. I would like to start by describing a general method for computing availability, and then look at how it may be adapted for specific use cases.
A common heuristic used internally at Amazon for all Tier 1 services is:
Availability = (SUM of successful responses) / (SUM of valid requests)
This computation is done using time series data so that availability can be computed for any time frame. The key to utilizing this formula for any given service is defining "successful response" and "valid request". This is where the service team should weigh in with a (hopefully concise) definition of how they have chosen to compute these data points through their service's metrics. A "valid request" generally means that the request passed AuthN/AuthZ checks and input validation checks. A "successful response" generally means that the service processed the request and returned the expected result.
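To make this concrete, here is a minimal sketch in Python of how this heuristic might be computed over time-series data. The one-minute buckets and variable names are my own assumptions for illustration, not any particular monitoring system's API.

from typing import Sequence

def availability(successful_responses: Sequence[int],
                 valid_requests: Sequence[int]) -> float:
    """Compute availability over a window of per-minute count buckets.

    Summing the buckets first is what lets us compute availability
    for any time frame (5 minutes, 1 hour, 1 month, ...).
    """
    total_success = sum(successful_responses)
    total_valid = sum(valid_requests)
    if total_valid == 0:
        # No valid traffic in the window; some teams report "no data"
        # instead of assuming full availability here.
        return 1.0
    return total_success / total_valid

# Example: five one-minute buckets of counts taken from the service's metrics.
success_per_minute = [998, 1000, 997, 999, 1000]
valid_per_minute = [1000, 1000, 1000, 1000, 1000]
print(f"Availability: {availability(success_per_minute, valid_per_minute):.5f}")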
Now let me turn to a real-world example of how I have seen these metrics computed for a Tier 1 web page service on a high-traffic website. The service is available to the public internet, has dozens of upstream dependencies that are used to render the page, and is called through a client proxy. In this case we used metrics from the client proxy and computed availability using the following implementation of the above heuristic:
Availability = (SUM of HTTP 2xx and 3xx responses) / (all requests - HTTP 4xx responses)
For this formula, we had landed on the following definitions:
“Successful Response” = We were able to successfully render a web page and return it with an HTTP 2xx or 3xx response.
“Valid Request” = Any request that did not result in an HTTP 4xx response.
We eliminated HTTP 4xx responses from this calculation primarily to keep spurious invalid requests caused by bot traffic out of the computation. For a public web page on a high-traffic website, there is the potential for a non-trivial amount of bot traffic that can cause spurious errors. We wanted the availability of our service to be computed against known valid requests, that is, real customers. We didn't want the presence or absence of traffic from invalid requests to skew our availability metric. It's important to note that we separately monitored HTTP 4xx errors by type in order to observe anomalies and find content errors. However, we considered that use case to be orthogonal to the availability of our service. Including that traffic in the computation could skew the statistic, and our goal was to maintain five 9s of availability for real customers and valid use cases, not bots or content errors.
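As a rough illustration of that formula (not our exact proxy-side implementation), here is how the same calculation might look in Python, assuming we can tally HTTP status codes from the client proxy's access logs:

from collections import Counter

def http_availability(status_counts: Counter) -> float:
    """Availability from HTTP status code counts, with 4xx responses
    removed from the denominator so invalid requests don't skew the metric."""
    total = sum(status_counts.values())
    success = sum(n for code, n in status_counts.items() if 200 <= code < 400)
    client_errors = sum(n for code, n in status_counts.items() if 400 <= code < 500)
    valid_requests = total - client_errors
    return success / valid_requests if valid_requests else 1.0

# Example: status codes tallied over one window of client-proxy access logs.
counts = Counter({200: 95000, 301: 2000, 404: 1500, 500: 45, 503: 5})
print(f"Availability: {http_availability(counts):.5f}")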
NOTE: Using HTTP status codes in the manner above to compute availability only works if all availability issues result in an HTTP 5xx error, including failed critical dependency calls and timeouts. I have found that, in practice, a developer might choose to obscure availability issues and return a creative HTTP code other than 5xx, which makes this method problematic to monitor at the HTTP layer.
It is a best practice, where possible, to collect service availability metrics from the downstream client of your service. This is generally possible in cases where you control the client and the environment. Internal services are good candidates for monitoring availability from a client that has been vended and instrumented by the service team. For external endpoints, however, client-side monitoring is usually not possible. In that case, the best you can do is emit metrics either from the service itself or from a load balancer or proxy layer downstream from the service. The caveat is that this monitoring will be blind to infrastructure and network connectivity issues that clients may have in reaching the service. Service teams should have a means of detecting these issues, if possible through secondary client-side metrics or synthetic/canary-style monitoring.
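As one sketch of the synthetic/canary approach, the loop below periodically calls an endpoint from outside the service and records a pass/fail datapoint. The endpoint, probe period, and print-based "emit" are placeholders for whatever your environment actually uses.

import time
import urllib.request

ENDPOINT = "https://example.com/"  # placeholder for the service's external endpoint
PERIOD_SECONDS = 60

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint responds successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # Timeouts, connection errors, and HTTP errors all count as failures.
        return False

def run_canary() -> None:
    while True:
        ok = probe(ENDPOINT)
        # In a real setup this datapoint would be emitted to your metrics
        # system; printing stands in for that here.
        print(f"{int(time.time())} canary_success={int(ok)}")
        time.sleep(PERIOD_SECONDS)

if __name__ == "__main__":
    run_canary()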
One additional point is worth calling out. In addition to monitoring the service itself, the same technique can be used to monitor the availability of a service's upstream dependencies. A well-built dashboard and monitoring system for a service will cover the availability of critical dependencies at the same incident response level as the service itself.
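For example, under the same assumptions as the earlier sketches, the per-dependency version is the identical formula applied to each dependency's call counts (the dependency names below are hypothetical):

# Success / valid-call counts per upstream dependency over one window
# (dependency names are hypothetical).
dependency_counts = {
    "pricing-service": {"success": 49990, "valid": 50000},
    "inventory-service": {"success": 49800, "valid": 50000},
}

for name, c in dependency_counts.items():
    print(f"{name}: {c['success'] / c['valid']:.5f}")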