<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Still Compiling...]]></title><description><![CDATA[Thoughts and learnings from 30 years of software engineering and leadership. Some may be useful.]]></description><link>https://compiling.enstaria.com</link><image><url>https://substackcdn.com/image/fetch/$s_!kiO2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf51854-cc46-4a24-bebd-51f1eda69fe7_650x650.png</url><title>Still Compiling...</title><link>https://compiling.enstaria.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 05:15:32 GMT</lastBuildDate><atom:link href="https://compiling.enstaria.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Andrew Elmhorst]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[andrewelmhorst@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[andrewelmhorst@substack.com]]></itunes:email><itunes:name><![CDATA[Andrew Elmhorst]]></itunes:name></itunes:owner><itunes:author><![CDATA[Andrew Elmhorst]]></itunes:author><googleplay:owner><![CDATA[andrewelmhorst@substack.com]]></googleplay:owner><googleplay:email><![CDATA[andrewelmhorst@substack.com]]></googleplay:email><googleplay:author><![CDATA[Andrew Elmhorst]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Config is Code]]></title><description><![CDATA[Test and deploy config the same way you deploy code]]></description><link>https://compiling.enstaria.com/p/config-is-code</link><guid isPermaLink="false">https://compiling.enstaria.com/p/config-is-code</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sat, 11 Jan 2025 13:07:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/11b29ffe-feef-4b4a-9e18-d2078aac73ef_1018x495.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I feel like the one of the distinguishing characteristics of a senior engineer is a healthy fear of change to production. I think it&#8217;s just that the combination of all of the things I&#8217;ve broken over time accumulates to the point where I am skeptical of any and all changes I make. This is why I love unit testing, and the practice of writing code that is testable. While some issues that can break production are still hard to test at a unit test level, at a very a minimum, good test coverage assures that the code itself that I am writing handles all known issues I can think of to throw at it.</p><p>A long time ago, I learned the hard way that config is code. I used to think config was safe and code was not safe. But I think the opposite is true. I have found that bad config can break things just as fast or faster than bad code.</p><p>That&#8217;s why I always prefer to find a way to test my config. At a minimum, load the config using the same code that loads it in production, parse it, poke at it, verify it is doing exactly as you think it is. 
<p>While it is still possible for that config to break something, at least the thing that breaks will not be the config itself.</p><p>Your results may vary, but I prefer to test my config before pushing it to prod.</p>]]></content:encoded></item><item><title><![CDATA[Key Service Metrics]]></title><description><![CDATA[Service monitoring, part art, part science. What I attempt to do in this series of posts on metrics is to lay down some basic principles that I have found to be helpful without being too prescriptive.]]></description><link>https://compiling.enstaria.com/p/key-service-metrics</link><guid isPermaLink="false">https://compiling.enstaria.com/p/key-service-metrics</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sat, 04 Jan 2025 13:03:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c3fa23c9-44be-447b-a058-ac5cdae06845_1018x592.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;d like to round out this series for now by summarizing some metrics I have found helpful for monitoring any service.</p><h3>Availability</h3><p>We spent a fair amount of time in the past three posts on availability. The value of computing, displaying and monitoring availability is that it most easily defines what <strong>&#8220;good&#8221;</strong> is across any service.</p><p><code>              (SUM of successful responses)<br>Availability = &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br>                (SUM of valid requests)</code></p><p>Drops in availability have a measurable business cost that can easily be converted to $ and &#162;.</p><p><code>Total Revenue <br>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-&#8212;&#8212;&#8212;&#8212;&#8212;  - Total Revenue ~= REVENUE IMPACT<br> Availability</code></p><p>Now let&#8217;s take a look at additional metrics that are important to monitor for any service.</p><h3><strong>Latency</strong></h3><p><em>Caveat: For the purposes of this article, I am not going to cover end-user experience latency, which is a complex and highly specialized topic. For this article, assume the latency of a microservice&#8230;</em></p><p>Latency is the total amount of time it takes to complete a single transaction, and is typically measured in milliseconds for consistency across services. While availability is more of a high-level gauge, latency is a window into the application that can reveal a lot about what&#8217;s happening under the hood. It&#8217;s like taking an X-ray. Take, for example, the following graph.
The first question I would ask is: what is causing the spikes at p90?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!DPk1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc85e8451-4542-4b36-8968-1cb0344bab6c_577x381.png" width="577" height="381" alt=""></figure></div><p>If I were to wager a guess, more than likely it&#8217;s contention for a shared resource, such as a database running into I/O limits. Is this a big deal? Well, it depends. How much traffic? How much headroom is available in the fleet? How often does it happen?</p><p>Latency allows us to determine a number of different things about a service. Sudden drops in latency can often indicate an availability drop, as errors sometimes cause abnormal termination of the request. Sudden spikes in latency are indications of exhaustion of a shared resource, whether it&#8217;s a consistency lock on a database table, availability of a new connection from a connection pool, or a lack of available CPU.</p><p>Latency can also be used to compute concurrency using <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little&#8217;s law</a>:</p><p><code>Concurrency = Arrival rate * Wait time</code></p><p>Concurrency can be used as a computational aid to understand the concurrent resources needed to satisfy traffic at a specific level. An example would be connection limits on a load balancer as a simplistic <strong>load shedding</strong> device.</p><p>One of the strongest signals latency can give comes during stress testing, where it can signal the breaking point of an application on a specific resource configuration. The breaking point, in the spirit of <a href="https://en.wikipedia.org/wiki/Amdahl's_law">Amdahl&#8217;s law</a>, is the point at which a specific application on a specific hardware configuration becomes overwhelmed. It is at this point that latency spikes dramatically. Some sophisticated <strong>load shedding</strong> techniques use latency rather than concurrency to predict when load shedding should occur, to enable fleet protection in traffic overload scenarios.</p><h3>Errors</h3><p>Using metrics to monitor errors can be extremely helpful. However, there is one caveat: high-cardinality metrics can be both expensive and noisy. It can be really helpful to summarize errors by type into a lower-cardinality list. Timeouts are a really good example error metric. Any timeout of a service endpoint is an availability drop.</p>
<p>Any other availability-causing errors (or <a href="https://boltpay.atlassian.net/wiki/spaces/~6363ec0aa04e906250c9c093/blog/2023/02/23/2611511297/A+Separation+of+Errors">server faults</a>) should have some way to be summarized on a dashboard at a low cardinality for monitoring purposes.</p><h3><strong>Physical Resources</strong></h3><p>All services are dependent on physical resources in order to run and respond to traffic. It is important to monitor critical resources that may reach physical limits due to increased traffic or other application issues. CPU and memory are good examples, but it can sometimes be important to monitor other resources that are either heavily optimized (i.e. limited) or known to cause contention issues, such as thread pools, connection pools, and dependency rate limits. If caching is an important part of the service design, a good graph of cache-hit ratio is a must. Latency can in many cases act as a proxy for resource contention signals, but it is valuable to have the important resource metrics on a dashboard as well, to enable rapid diagnosis of where the contention is once it arises.</p><h3><strong>Dependencies</strong></h3><p>For each critical dependency, whether it&#8217;s a storage resource such as a database or search index, or a 3rd-party or internal service, a couple of quick graphs per dependency should be on the primary service dashboard. It makes sense to monitor availability, latency, and timeouts per dependency. If the number of dependencies is large, it can sometimes make sense to put these on a secondary dashboard used for deep dives during operational issues.</p><h3>Monitoring Each Endpoint</h3><p>When monitoring a service, an overall monitor across all endpoints is really only useful if all endpoints are doing similar work (such as multiple web pages on a web server). Generally, services with multiple endpoints require individual latency, availability, throughput, and error metrics per endpoint.</p><p>In summary, understanding availability, latency, throughput, and errors should be a key design goal of a service dashboard. Monitoring by endpoint is important. Understanding utilization and issues with physical resources and dependencies can help to quickly diagnose and respond to issues.</p>]]></content:encoded></item><item><title><![CDATA[A Correction of Errors]]></title><description><![CDATA[Maximize learnings from errors, while minimizing impact]]></description><link>https://compiling.enstaria.com/p/a-correction-of-errors</link><guid isPermaLink="false">https://compiling.enstaria.com/p/a-correction-of-errors</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sat, 28 Dec 2024 13:06:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7b78add0-c993-4239-b8a8-e5c94854eff4_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The top-level theme of this topic is Engineering for Failure, which presumes that in a world of complex distributed systems supporting a business that is also constantly changing, failure happens. It just does.</p><div class="pullquote"><p><em><strong>Failure happens all the time.</strong> If you think you can prevent it, you&#8217;re <a href="https://how.complexsystems.fail/">thinking about the problem wrong</a>.</em></p></div><p>And the more the system scales horizontally and vertically, the more likely failure is to happen.
I&#8217;m going to do a shallow dive into one practice that is a bedrock, foundational practice for driving learnings and improvements from system failure. That practice is Correction of Errors (COE). I&#8217;m going to attempt to do this in one post that is as short as it possibly can be, and no shorter. But bear in mind this is a summary of a subset of the process.</p><p>For many years at Amazon, the Correction of Errors process was not publicly talked about. This process, and the vast internal repository of learnings which has been accumulated over the years, is more than likely one of the most valuable operational assets owned by the company. Amazon has an internal system that stores, manages and runs their COE process, machine learning that links new findings to previous COEs, and automated ticketing systems that align engineering organizations around findings and drive completion of action items. In a data-driven culture, it is essential to have real data that can be pointed to as the basis for core operating and engineering principles. The data contained in the COE system is used by engineers across the company to back up engineering and tradeoff decisions around why this feature or that attribute of a system needs to be built in a certain way. The COE repository provides real, documented evidence of failures that happened, the reasons why, and the actions that were taken to make the affected systems more resilient or to further limit the blast radius of failure.</p><p><strong>The wrong way to do COEs:</strong> It&#8217;s important to understand that human nature and bad practices can introduce human bias into the COE process. For example, <a href="https://how.complexsystems.fail/#7">post-accident attribution to a &#8216;root cause&#8217; is fundamentally wrong.</a> Complex systems often have multiple failures that occur together to contribute to an incident. It can be very tempting to point to one thing and completely overlook how bad all of the other contributing factors were. <a href="https://how.complexsystems.fail/#8">Hindsight bias</a> is also very easy to fall into, and causes us to look at the problem as if we had understood what it looked like before the incident happened. We miss a lot when we look at an incident that way. The danger of both of these tendencies is a false sense of security that if we &#8220;just fix that one thing&#8221; or &#8220;if we had just done this other thing&#8221; then the entire system will be safer as a result. I highly recommend reading Dr. Cook&#8217;s simple treatise, <a href="https://how.complexsystems.fail/">How Complex Systems Fail</a>, for a very high-level view of how hubris leads to improper practices in dealing with failure.</p><p><strong>Blast Radius:</strong> We can&#8217;t really talk about Correction of Errors without talking about a more basic and fundamental concept that it assumes, one critical to engineering for failure: limiting blast radius. Since nearly all outages are caused by changes, blast radius is all about limiting the potential impact of any given change that causes failure. There are at least two (possibly more) dimensions to limiting blast radius: exposure and time. Limiting blast radius along the exposure dimension means that you have built the ability to gradually roll out a change, such that if a defect is detected, the impact is only a subset of what it would have been had the change not been rolled out gradually.
The second dimension, time, examines how quickly an incident can be detected and mitigated, and what measures are available and/or used to both detect issues and quickly roll back the affected changes.</p><p>Given an understanding that finding root cause is <em>not</em> the most important outcome from a COE, a belief that failure in complex systems happens all the time, and the view that the key to making change safe is limiting blast radius, we arrive at the three most important questions asked in the COE process. In most COE reviews, these questions need answers first, before any others. It can be very tempting to focus on the root cause or the PR that caused or fixed the issue, but these questions are ultimately the most important, and they lead to the intentional design and structure of a well-written COE document.</p><p>&#8220;<strong>What Happened?</strong>&#8221; COEs are about forensic analysis. This doesn&#8217;t happen without data. Data is critical to a good COE document. Good data eliminates hunches, discourages bias, and focuses on customer and business impact. The most important data are: (1) The one metric that clearly demonstrates the outage. Obviously, if this cannot be produced, it likely points to an observability failure that needs correction. A good COE has a graph, with a link to source data, that clearly demonstrates the impact based on some operational metric; even better if a financial impact can also be tallied. This data can be used to demonstrate the &#8220;exposure&#8221; dimension of the blast radius of the incident. (2) Timeline. Without a precise timeline, it is really hard to understand the &#8220;time&#8221; dimension and ask the right questions about all of the factors that contributed to the amount of time it took to mitigate the incident. <em><strong>The timeline is key to many of the best learnings from a COE, and it&#8217;s important to refer often to the timeline when asking the two most important sets of questions, which come next&#8230;</strong></em></p><p>&#8220;<em><strong>How did you detect the incident?</strong></em> <em>As a thought exercise, what could you have done to cut that time in half?</em>&#8221; Incident detection is the key first step to restoring service. A sure sign of poor monitoring is the system breaking and it taking hours for a customer or some random employee to happen to discover it&#8217;s down. Meanwhile, customers have likely been unable to use the system for those hours, with no ability to tell us it&#8217;s broken. This question often surfaces whether the right monitors are in place and whether they are tuned appropriately. For a primary system outage, on-call engineers should ideally be paged within 2-5 minutes. If the metrics are noisy, it&#8217;s a good time to ask whether we have the right data to create the right monitor that can go off at the right time.</p><p>&#8220;<em><strong>How did you mitigate the incident?</strong> As a thought exercise, what could you have done to cut that time in half?</em>&#8221; Time to mitigate is key to understanding how well incident management is done. During the time from detection to mitigation, root cause analysis should not necessarily be the focus. If the team knows what is not working, hopefully they have a lever in place that can mitigate quickly before root cause analysis is done. This could include rollback of a deployment, turning a feature gate off, or pulling a lever in dynamic config.</p>
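<p>As a minimal sketch of one such lever (a feature gate backed by dynamic config; the file path, flag name, and handlers are hypothetical):</p><pre><code>import json
import pathlib

CONFIG_PATH = pathlib.Path("/etc/myservice/dynamic.json")  # illustrative path

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean flag from dynamic config, failing safe to the default."""
    try:
        flags = json.loads(CONFIG_PATH.read_text())
        return bool(flags.get(name, default))
    except (OSError, ValueError):
        return default  # a broken config file must not take the service down

def new_pricing_engine(request):
    ...  # risky new behavior, behind the gate

def legacy_pricing_engine(request):
    ...  # known-good fallback path

def handle_request(request):
    # Flipping the flag in dynamic config mitigates without a deployment.
    if flag_enabled("use_new_pricing_engine"):
        return new_pricing_engine(request)
    return legacy_pricing_engine(request)
</code></pre>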
<p>Good on-call practices emphasize, where possible: mitigate first, diagnose root cause later. This question generally probes the operational levers that have been built to stabilize the system or turn off problematic behaviors.</p><p>After incident response is covered, it&#8217;s a good time to look at the root cause and ask another important question. <em>What was the blast radius of the change that caused the outage? Is there any way the blast radius of the change could have been cut in half or less?</em> Tools and practices around reducing blast radius probably deserve a lot more discussion, but you know you are getting better at it when you never have scheduled downtime (the very definition of large blast radius) and teams are using feature gates or dials for any change of significance or any change in an area of risk to the system.</p><div class="pullquote"><p><em>In summary, one of the key elements of a good COE review is a solid focus on questions around blast radius. To that end, timeline, incident detection, and incident response are important areas to dive deep on. A second key area is examining the blast radius of the change that initiated the incident. This part of the incident was planned up front. Is there any way blast radius could have been reduced?</em></p></div><p><strong>Pro tip:</strong> Preparing for a big launch? Wondering what you need to do to prepare for that next huge release? Try this: create a simulated COE for a potential failure of the system, anything from someone pushing a bad commit to a database running out of memory. Even though you&#8217;ll likely never guess what actually will go wrong, it&#8217;s surprising what can be learned if you think through the process of incident response for a failure that you think may never happen, and what the timeline would look like if it actually did. This is a great team-level exercise and it helps greatly with those key areas of monitoring, incident response, dependencies on other teams, and fallback strategies.</p><p>External References:</p><ol><li><p>&#8220;Correction of Error&#8221; (AWS Well-Architected Framework) &#8211; <a href="https://wa.aws.amazon.com/wat.concept.coe.en.html">AWS Well Architected Framework - Correction of Error</a></p></li><li><p>&#8220;How Complex Systems Fail&#8221; (Richard I. Cook, MD) &#8211; <a href="https://how.complexsystems.fail/">https://how.complexsystems.fail/</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[A Separation of Errors]]></title><description><![CDATA[Service monitoring, part art, part science. What I attempt to do in this series of posts on metrics is to lay down some basic principles that I have found to be helpful without being too prescriptive.]]></description><link>https://compiling.enstaria.com/p/a-separation-of-errors</link><guid isPermaLink="false">https://compiling.enstaria.com/p/a-separation-of-errors</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sat, 21 Dec 2024 12:40:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lIeG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F448e4666-a965-4b39-bccd-5ccf34954135_680x580.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Monitoring a service for errors has a specific set of concerns that run somewhat orthogonally to how programming languages expose errors.
To a programming language, an error is an error, and as developers, we are left to deal with it once it happens. Service errors are also orthogonal to protocol/transport-level errors. A network transport protocol is concerned with the reliability of communication of data and is unconcerned with any errors unrelated to transport. To a service owner, however, one error is not the same as another error. Solving for a network partition error might require a completely different approach than solving for a missing API key. Furthermore, some errors are a normal part of programming against a physical world with physical limits and are part of the application processing flow itself. Problems arise, however, when all errors are just treated as errors and there is no clear heuristic for how to tell them apart.</p><p>It can be helpful to think of an application as a black box that has gauges on the outside that tell you what&#8217;s going on inside. We as developers are responsible for wiring those gauges to the things that we want to know about. And when the gauges start spiking, we would prefer that the gauge actually tell us something about where things are going wrong, not just that something is going wrong. And if your black box is dependent on another black box, you want to know if it is your black box that is causing the error or if the problem is in another black box at the other end of a wire.</p><p>I generally believe that effective service operations involve being able to (1) tell at a glance on a service dashboard if the service is operating within expected parameters, and (2) have faith that if a new health condition affects my service, I can see it in the dashboard and classify it into one of three domains: server fault, client fault, or processing error.
I like these domains because they cleanly partition the error space into separate domains of responsibility.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lIeG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F448e4666-a965-4b39-bccd-5ccf34954135_680x580.png" width="680" height="580" alt=""></figure></div><p>A <strong>Server Fault</strong> has the following characteristics: (1) It is 100% the responsibility of the service owner to fix, (2) it counts against the service&#8217;s availability metric, (3) while it is not possible to drive it to zero, driving it to zero should always be a goal, and (4) it is always a reflection of the sum of the system design and engineering that have been applied to the service and the availability of its upstream dependencies. These errors count because they reduce the availability of the overall business and can have a direct impact on revenue, for better or for worse, in a big way.</p><p>In the worst possible case, if the availability of a service drops significantly, it has a huge impact on all customers of our service, causing a direct drop in revenue. In the best possible case, a service with high availability can easily beat out any other competitor in an A/B test even with inferior CX, simply by being more available.</p><p>Driving to a high service-availability posture can produce a sense of pride in the owning team. Every order-of-magnitude increase in availability (adding a &#8216;9&#8217;) requires a whole new set of engineering skills, and advantages in availability can lead to overwhelmingly crushing defeats when comparing like competitors. When a server fault alarm goes off, the service owner is 100% responsible for fixing it.</p><p>A <strong>Client Fault</strong> has the following characteristics: (1) It is 100% the responsibility of the client to fix (putting aside service-vended client scenarios). (2) It almost always indicates a bug or misconfiguration in the client code, or a lack of data validation. When a client fault alarm goes off, the owner of the client is 100% responsible for fixing it.</p><p>A <strong>Processing Error</strong> has the following characteristics: (1) unlike server and client faults, processing errors are intentionally produced by the system and are usually expected and explainable based on some physical resource state or constraint, and (2) there will always be some rate of these errors, which may or may not be predictable. Processing errors happen because, as a normal fact of life, the application has to honor all of the rules of the system, which usually map to resources in the physical world. Products cannot be sold that are not in inventory. Credit cards cannot be honored if they are expired or over limit. Addresses cannot be shipped to if they do not exist.</p>
<p>Monitoring these errors is important from a business perspective, to understand &#8220;how&#8221; the system is operating. Monitoring generally requires some understanding of what the &#8220;normal&#8221; error rate is, and may require investigation only when deviance from normal is observed.</p><h3>Service Endpoint Error Handling</h3><p>In order for a service endpoint to reliably emit observable metrics, here are some opinionated heuristics for service response behavior and metrics behavior related to errors:</p><ol><li><p>The handler for a service endpoint should emit a reliable, observable metric stream that classifies errors into (a) Server Faults, (b) Client Faults, and (c) Processing Errors.</p></li><li><p>Error metrics are emitted with a separate (more useful) cardinality than HTTP status codes.</p></li><li><p>Separately from emitting metrics, in order to build predictable HTTP clients, services have a predictable response behavior at the HTTP layer and map</p><ol><li><p>Server Faults to HTTP status codes in the 5xx range</p></li><li><p>Client Faults to HTTP status codes in the 4xx range</p></li><li><p>Processing Errors to HTTP status codes in the 2xx range</p></li></ol></li><li><p>Processing Errors are always serialized into and returned reliably in the application response, so that (a) the client application has a chance of observing them and (b) the application can in turn return them to downstream systems.</p></li><li><p>Processing errors from an upstream dependency should be (a) expected and (b) passed through back to the client of the service endpoint, either unchanged or translated. There should be no confusion in the code between an expected processing error and an unexpected client or server fault.</p></li><li><p>Any server or client faults from calling an upstream dependency should be <strong>wrapped</strong> into a new Server Fault.</p></li><li><p>If a service cannot get a valid response from a critical dependency and fails the transaction, it should directly count against the service&#8217;s availability.</p></li><li><p>What counts as a 'critical dependency' is up to the service owner, which allows for degradation and fallback behaviors.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Rates scale better than counts]]></title><description><![CDATA[Service monitoring, part art, part science. What I attempt to do in this series of posts on metrics is to lay down some basic principles that I have found to be helpful without being too prescriptive.]]></description><link>https://compiling.enstaria.com/p/rates-scale-better-than-counts</link><guid isPermaLink="false">https://compiling.enstaria.com/p/rates-scale-better-than-counts</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sat, 14 Dec 2024 12:39:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/28ed8ab0-2d5f-4344-92c8-4bfe9466da17_1440x160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One observability approach I have often seen gotten wrong is attempting to use counts to understand how a service is operating.
There are a number of issues with this approach, so let&#8217;s start with dashboards and monitors and then work our way back to the code.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yu3z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe51fe146-e25f-4921-902f-485015ea55e4_1440x160.png" width="1440" height="160" alt=""></figure></div><p><strong>Counts require too much context:</strong> As a service owner, if I look at my dashboard every single day, I might have a gut feeling on whether &#8220;83&#8221; or
&#8220;2.22k&#8221; is a good or a bad number for a specific metric on my service, but usually I don&#8217;t. Because in addition to the count, I also need to ensure I am looking at the correct timeframe. Is &#8220;83&#8221; a good number when I&#8217;m looking at a day&#8217;s worth of data? 5 minutes? 2 weeks? What should that look like over 90 days? Even if I know the answer to that question, it is highly unlikely everybody else on my team does. And what about anyone else who comes to look at my dashboard? Are they going to understand what these counts mean and whether they are good or bad? Or am I going to have to always be present to interpret the data for them? That doesn&#8217;t scale.</p><p><strong>Counts don&#8217;t scale:</strong> As a business changes over time, and hopefully grows exponentially, counts will inevitably change. A count that looked good yesterday at 100 tps might not look good tomorrow at 200 tps or in the middle of the night at 50 tps, and it certainly will be quite different when traffic reaches 12,000 tps.</p><p><strong>Monitoring with &#8220;magic numbers&#8221;:</strong> Another problem with counts is when we need to build some sort of logic to do something when a count reaches a certain threshold. This can include monitors, and can also include automations like blocklists. If I build a monitor that pages when the number of errors reaches &#8220;10&#8221; every 5 minutes, how do I know that&#8217;s the right threshold? When does it need to change? Is 10 errors every 5 minutes OK during peak traffic in the middle of the day? Is the same number OK when traffic drops significantly in the middle of the night?</p><p><strong>Normalization of deviance:</strong> Using &#8220;magic numbers&#8221; for monitors is almost always the wrong approach, because as humans we tend to base these numbers on how much pain we are willing to endure. If I set my monitor to page me in the middle of the night, you better believe I&#8217;m going to make sure that magic number is set high enough that I won&#8217;t get woken up very often. It is normal practice to tune monitor thresholds so they are neither too noisy nor too quiet. However, because counts don&#8217;t scale, what ends up happening in practice is there is a lot more noise than signal to work with. Therefore, we normalize the deviance by setting the thresholds higher than they need to be to account for noise, and as traffic increases we are constantly fighting with thresholds rather than searching for signal and understanding whether the problem we are dealing with is getting better or worse.</p><p><strong>Rates Scale Better:</strong> Rates are a pretty simple concept. Let me start with a simple example.</p><p><code>Rate = Signal Count / Total Transactions * 100</code></p><p>This approach scales to any volume of transactions and any timeframe. If you&#8217;ve established what the normal error rate is, such as x%, you can now do a number of different things that you cannot do with counts.</p><ol><li><p><strong>Signal becomes constant over timescale:</strong> Viewing the dashboard for any timescale requires no interpretation. You can view rate over 5 minutes, 60 minutes, 2 weeks, or 90 days, and it remains much more constant.</p></li><li><p><strong>Signal over noise is much easier to visualize:</strong> If you look at a graph and the error rate spikes, you know that you have deviance over the norm without interpretation.
If you are displaying counts, you may see spikes, but without context, you might not know if those spikes are caused by traffic spikes or by actual deviation from the normal error rate.</p></li><li><p><strong>Much easier to tune thresholds</strong>: Rates smooth out traffic spikes, so traffic patterns based on time of day or day of year don&#8217;t matter so much. You are monitoring the rate at which something happens. You still need to tune the monitor to make sure action happens at the proper threshold, but over time, it tends to be a much more stable number.</p></li><li><p><strong>Deviance becomes less normal: </strong>If you have a constant, expected error rate and it changes with a given deployment, you know it. If the error rate goes up, you know you might have a problem. It might be good, it might not be good, but at least you can easily see it, react to it, determine what&#8217;s happening, and respond accordingly. It&#8217;s hard to see that with &#8220;magic number&#8221; counts, which are usually set too high to reveal deviance.</p></li><li><p><strong>Easier to convey context:</strong> Once you understand your rates much better, understand what &#8220;normal&#8221; looks like, and have set your monitor thresholds accordingly, you can do something to even better convey context to those outside of your service who come and take a look at your dashboard. You can add indicators to your graphs to include thresholds for Sev1/Sev2 notifications, and in some cases it is best practice to include green/orange/red bands or lines to indicate expected performance or desired targets. This allows anyone who comes to see how your service is operating to interpret it easily, and makes for much easier monitoring for on-call operations.</p></li></ol><p><strong>Overall &#8220;Goodness&#8221; Rate:</strong> The above formula makes sense for determining the <em>rate of occurrence</em> of some signal in your system, such as a known error, authorization failure, etc. However, there is another approach that works even better for high-level &#8220;goodness&#8221; measurement. There are two primary &#8220;goodness&#8221; indicators every service owner should know: availability and success rate. These are binary. The service is either available or it is not. The transaction was either successful or it was not. For determining &#8220;goodness&#8221; of availability, my preferred approach (when I have control over the metrics being emitted) is to compute availability by emitting a metric from the service client called &#8220;Available&#8221; with only two possible states:</p><p><code>Available=1 // Request was successfully processed <br>Available=0 // Request failed</code></p><p>Turning these into a rate is much simpler, using a formula like this:</p><p><code>Availability = AVG(Available)*100</code></p><p>Requests that I choose not to count towards availability do not emit a metric, so there is no need to have context and an understanding of all the possible states. There are only two states. It&#8217;s binary. Simple to dashboard, simple to monitor.</p>
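<p>A minimal sketch of the idea (the metrics client is a stand-in; only the binary <code>Available</code> samples matter):</p><pre><code>def emit_available(metrics, request_valid: bool, succeeded: bool) -> None:
    if not request_valid:
        return  # invalid requests emit nothing, so they can't skew the average
    metrics.put("Available", 1 if succeeded else 0)

def availability(samples: list) -> float:
    """Availability over any window is just the mean of the binary samples."""
    return 100.0 * sum(samples) / len(samples) if samples else 100.0

assert availability([1, 1, 1, 0]) == 75.0
</code></pre>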
]]></content:encoded></item><item><title><![CDATA[Availability Monitoring]]></title><description><![CDATA[Service monitoring, part art, part science. What I attempt to do in this series of posts on metrics is to lay down some basic principles that I have found to be helpful without being too prescriptive.]]></description><link>https://compiling.enstaria.com/p/availability-monitoring</link><guid isPermaLink="false">https://compiling.enstaria.com/p/availability-monitoring</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sun, 08 Dec 2024 11:03:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/69f1394d-590a-4f56-9b2c-4e0672b39d82_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Failure happens, all the time. If I had to pick one metric to use to monitor my service, it would be availability. I want this metric to be accurate, unvarnished, and real. This is the metric that any service owner should crave in order to observe and monitor their service. It&#8217;s the best way to define what &#8220;good&#8221; is from a resiliency perspective. The goal of monitoring this metric is two-fold: (1) For notable availability drops (below some desirable threshold), we need to be notified immediately so that we can restore service. (2) For the long-tail availability drops (failures that happen all the time, but not at a level that triggers operational response), we can use this metric to observe, find, and diagnose where the weak edges are in our service and come up with new approaches, automated remediations, or necessary fixes in order to add more armor and resiliency to our service.</p><p>How we define availability goes a long way towards defining how we monitor it. Availability for any given request generally means that we have been able to send the service a request and we have received a valid response. How this is actually computed may vary slightly by service and use case. I would like to start by describing a general method for computing availability, and then look at how it may be adapted for specific use cases.</p><p>A common heuristic that is used internally at Amazon for all tier 1 services is:</p><p><code>               (SUM of successful responses)<br> Availability = &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br>                 (SUM of valid requests)</code></p><p>This computation is done using time series data in order to be able to compute availability for any time frame. The key to utilizing this formula for any given service is defining &#8220;successful response&#8221; and &#8220;valid request&#8221;. This is where the service team should be able to weigh in with a (hopefully concise) definition of how they have chosen to compute these data points through their service&#8217;s metrics. A &#8220;valid request&#8221; generally means that the request passed AuthN/AuthZ checks and input validation checks. A &#8220;successful response&#8221; generally means that a request was successfully processed by the service and it returned a successful response.</p><p>Let me start with a real-world example of how I have seen these metrics computed for a Tier 1 web page service on a high-traffic website. The service is available to the public internet, has dozens of upstream dependencies that are used to render the page, and is called through a client proxy.
In this case we used metrics from the client proxy and computed availability using the following implementation of the above heuristic:</p><p><code>                (SUM of Http 2xx and 3xx Responses)<br>Availability = &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br>                (All requests - Http 4xx responses)</code></p><p>For this formula, we had landed on the following definitions:</p><ul><li><p>&#8220;Successful Response&#8221; = We were able to successfully render a web page and return it with an HTTP 2xx or 3xx response.</p></li><li><p>&#8220;Valid Request&#8221; = Any request that did not result in an HTTP 4xx response.</p></li></ul><p>We eliminated HTTP 4xx responses from the calculation for this use case primarily to keep spurious invalid requests caused by bot traffic out of the computation. For a public web page on a high-traffic website, there is potential for a non-trivial amount of bot traffic that can cause spurious errors. We wanted the availability of our service to be computed against known valid requests, that is, real customers. We didn&#8217;t want the presence or absence of traffic from invalid requests to skew our availability metric. It&#8217;s important to note that we separately monitored HTTP 4xx errors by type in order to observe anomalies and find content errors. However, we considered that use case to be orthogonal to the availability of our service. If we included this traffic in the computation, it could potentially skew the statistic, and our goal was to maintain five 9&#8217;s of availability against real customers and valid use cases, not bots or content errors.</p><p><strong>NOTE</strong><em>: Using HTTP status codes in the manner above to compute availability only works if all availability issues result in an HTTP 5xx error. This includes critical dependency calls and timeouts. I have found that in practice, it is possible that a developer might choose to obscure availability issues and return a creative HTTP code other than 5xx, which would make this method problematic to monitor at the HTTP layer. </em></p><p>It is a best practice, where possible, to collect service availability metrics from the downstream client of your service. This is generally possible in cases where you have control over the client and the environment. Internal services are good candidates for monitoring availability from a client that has been vended and instrumented by the service team. However, for external endpoints, client-side monitoring is usually not possible. In that case, the best you can do is emit metrics either from the service itself or from a load balancer or proxy layer downstream from the service. The caveat is that this monitoring will be blind to infrastructure/network connectivity issues that clients may have in reaching the service. Service teams should have a means of detecting these issues, if possible through secondary client-side metrics or synthetics/canary-style monitoring.</p><p>Just one additional point that is worth calling out. In addition to monitoring the service itself, the same technique can be used to monitor the availability of a service&#8217;s upstream dependencies.</p>
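<p>To make the arithmetic concrete, here is a rough sketch of the formula above applied to a window of status-code counts (the function and counter shapes are illustrative, not from any particular monitoring stack):</p><pre><code>def availability_from_status_counts(counts: dict) -> float:
    """counts maps an HTTP status code to the number of responses seen."""
    total = sum(counts.values())
    successful = sum(n for code, n in counts.items() if 200 &lt;= code &lt; 400)
    invalid = sum(n for code, n in counts.items() if 400 &lt;= code &lt; 500)
    valid = total - invalid  # exclude 4xx (e.g. bot noise) from the base
    return 100.0 * successful / valid if valid else 100.0

# e.g. 9,950 OK, 30 redirects, 15 bot 404s, 5 server errors:
print(availability_from_status_counts({200: 9950, 301: 30, 404: 15, 500: 5}))
# -> ~99.95; five 9s would require far fewer faults
</code></pre>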
<p>A well-built dashboard and monitoring system for a service will cover the availability of critical dependencies at the same incident-response level as the service itself.</p>]]></content:encoded></item><item><title><![CDATA[It's not about you]]></title><description><![CDATA[growing yourself means growing others]]></description><link>https://compiling.enstaria.com/p/its-not-about-you</link><guid isPermaLink="false">https://compiling.enstaria.com/p/its-not-about-you</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Tue, 03 Dec 2024 14:54:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3a85d413-9866-41e6-bf05-c381979ed731_1024x576.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Success isn't about the job title on your profile &#8211; it's about the impact you leave on the people who&#8217;ve travelled with you. One of the key attributes of true leaders is kindness. Not the superficial kind that's reserved for those who can benefit us, but genuine kindness that puts the other person first and makes their needs the priority over anything else.</p><p>In today's world, some have twisted success into a competitive sport, where winners stand on the shoulders of those they've pushed down. But here's the truth: real success isn't measured by how high you climb, but by how many people you bring up with you.</p><p>Bringing people with you means prioritizing their needs and bringing them along on your growth journey. A growth journey comes in three steps:</p><ol><li><p><strong>Commitment to constant learning.</strong> Every person you meet is a potential teacher, every challenge a hidden lesson.</p></li><li><p><strong>Investment in yourself not for personal gain, but to better serve others. </strong>Think of it as filling your cup so you have more to share.</p></li><li><p><strong>Focus on growing those around you.</strong> If you're a leader, this isn't just part of your job &#8211; it's your primary mission.</p></li></ol><p>When someone on your team faces a challenge, don't see a problem &#8211; see potential waiting to be unlocked. When they struggle, don't see failure &#8211; see an opportunity for growth. Your role isn't to judge but to nurture, guide, and elevate. Here's the irony: when you make others' success your priority, your own success becomes inevitable. Not because you chased it, but because it's the natural byproduct of making your environment better.</p><p>True success isn't about reaching the summit alone &#8211; it's about how many people you've inspired to climb their own mountains. That's the kind of success that is long-lasting and makes the world a better place to live in.</p>]]></content:encoded></item><item><title><![CDATA[Five Facets of Engineering Leadership]]></title><description><![CDATA[an introduction to a framework, which may be useful]]></description><link>https://compiling.enstaria.com/p/five-facets-of-engineering-leadership</link><guid isPermaLink="false">https://compiling.enstaria.com/p/five-facets-of-engineering-leadership</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Thu, 21 Nov 2024 14:58:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/78a59d8b-c67e-4b5f-b3ca-60f662245efe_1011x568.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Engineering leadership isn't a science. It's a set of practices and skills.
Leading engineering teams requires navigating a balance of tradeoffs between people, technology, and business outcomes. I've been putting some thought lately into hiring and assessing the performance of engineering leaders. None of this is completely unique to me. I've absorbed content from a lot of different sources, including an interview with <a href="https://www.linkedin.com/in/emileifrem/">Emil Eifrem</a> on the <a href="https://www.developingleadership.co/episode/episode-4-rethinking-your-role-as-a-leader-with-emil-eifrem-from-neo4j">Developing Leadership Podcast</a> that prompted me to start writing some thoughts down. While I'm sure no framework is perfect, this one is landing well with me at the present stage. I'm currently thinking along the lines of five facets that form the foundation of great engineering leadership: People, Delivery, Operations, Strategy, and Business Impact.</p><p>These facets don't exist in isolation; they connect. Excellence in one area often enables success in others, while overemphasis on one area can force tradeoffs with the rest. And most importantly, neglect in any single facet can undermine the entire organization. An over-emphasis on delivery without strong people development creates burnout and turnover. Excessive focus on operations without strategic thinking leads to reshuffling deck chairs on a ship headed to nowhere. Too much strategy without solidly delivering results is like building a map to the end of the rainbow but never leaving the port.</p><p><strong>The facets in brief...</strong></p><p>The <strong>People</strong> facet is the starting point for any tech lead growing into a people manager. At its core, it's about creating environments where technical talent thrives through meaningful growth opportunities and psychological safety. However, mastering this skill also involves dealing with poor performance and addressing issues quickly and with empathy. A great people manager pushes each of his team members to be a better version of themselves, points out where they are not taking agency over their own destiny, and takes away excuses that are holding them back. It's not always about keeping individuals happy in a corner; it's about helping them win on the main stage.</p><p>As scope expands, people management evolves into building leadership benches, eliminating single points of failure, emphasizing fungibility, developing organizational culture, and creating scalable systems for growth and development across multiple teams. More senior levels of engineering leadership require balancing psychological needs with a sociological view of building an organization that drives business impact. This requires backbone and willingness to do and say the hard things.</p><p>The <strong>Delivery</strong> facet is where the rubber meets the road and is the most externally visible aspect of engineering. Delivery is essential to an engineering leader's success. Delivery requires going beyond being a nice people manager to being the coach who can take his team and drive the ball towards the goal. One of the most important delivery skills a manager can learn is saying "no". "No" sets a standard, defines a boundary, and enables the ability to prioritize. Saying no is really just the start of having the right conversation on focus and impact. Delivery also includes keeping an eye on quality. Quality is part of what you deliver.
Quality and delivery time quite often end up as tradeoffs within any delivery cycle, and it's important for engineering managers to continuously help their teams strike the right balance between them.</p><p><strong>Operations</strong> excellence starts with system reliability and a focus on metrics. Metrics are a proxy for understanding the customer experience. New leaders learn to build and maintain reliable systems, and to pay daily attention to the metrics flowing out of them in order to understand how system performance is impacting customers. Failure detection becomes measured in minutes, not hours. Change is viewed as constant, but not allowed to degrade performance. Operations is not about preventing failure; rather, it is a mindset focused on taming failure. Good operational practices require you to think about failure as something you try to coax into a state of minimal impact, while maximizing the learnings.</p><p>The <strong>Strategy</strong> facet is an important muscle, but it takes time to develop well. I would encourage any engineering manager to master the first three before worrying whether they are strategic. However, by the time a leader is managing managers, they should be able to build a view on where the tech needs to head in an ever-changing landscape, keep an eye on key technological advances, and find every opportunity to skate to where the puck is moving.</p><p>Being able to drive <strong>Business Impact</strong> is ultimately the goal of every engineering team. Every member of an engineering organization should understand how their role directly relates to moving the needle for the business, which unlocks autonomy and decentralized decision-making for the benefit of all. Good engineering leadership is able to pass enough business context on to teams that individuals can become empowered to make the right choices on where they focus and how they solve problems. That doesn't mean the team will always get it right. There is no magic sauce to decentralization, but the closest ingredient is lots of communication of context. Be real, be hands-on, and help the team, but help them by getting out of their way until you need to get in their way.</p><p>The power of this framework is that it can be used for hiring and performance conversations, and it scales across leadership levels. Each facet provides clear paths for growth and development, whether you're leading a single team or an entire organization. The weight given to each facet naturally shifts based on leadership level. Engineering managers and senior engineering managers typically concentrate on people development, day-to-day execution, and operational excellence. Their focus remains closer to the ground&#8212;building high-performing teams, delivering projects efficiently, and ensuring system reliability. More senior managers and directors should already be proficient in these areas and then elevate their perspective. Their people focus changes from psychological to sociological. Delivery becomes more focused on go-to-market (GTM). Operational concerns expand to include risk management, security, and scale.
Most importantly, they must dedicate significant energy to strategy and business impact&#8212;understanding market dynamics, maintaining talent relevance, driving tech innovation, and ensuring engineering efforts align with business objectives.</p><p>If this works out, I am hoping that in a set of future posts, we can spend more time doing a deep dive into each facet, exploring specific practices, metrics, and tools that engineering leaders can use to develop their capabilities. We'll examine how to assess your current state in each area and create actionable plans for improvement. Whether you're just starting your leadership journey or seeking to elevate your impact at a higher level, these facets provide a clear framework for continuous growth and development.</p><p>Stay tuned for our first deep dive into the People facet, where we'll explore some thoughts and ideas on performance and growth. Because great engineering leadership isn't about being perfect in all areas; it's about always learning.</p><p>I'd love to learn from you all. What parts of this don't quite resonate with you? Any frameworks you are using that you like better? I'd love to hear other perspectives.</p>]]></content:encoded></item><item><title><![CDATA[I Deleted My LinkedIn Account. Here's Why]]></title><description><![CDATA[Growing up on a farm in Central Wisconsin, I had three life goals: eventually make a certain amount of money, build a house, and write a book.]]></description><link>https://compiling.enstaria.com/p/i-deleted-my-linkedin-account-heres</link><guid isPermaLink="false">https://compiling.enstaria.com/p/i-deleted-my-linkedin-account-heres</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sat, 14 Sep 2024 14:06:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fb1c51f0-668d-4086-8fbe-913d336f4cdd_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Growing up on a farm in Central Wisconsin, I had three life goals: eventually make a certain amount of money, build a house, and write a book. Pretty modest dreams for a kid in the 80s, right? Little did I know that the money goal, once reached, really wasn't the right goal.</p><p>Enter LinkedIn, circa 2012. Suddenly, I'm obsessed. My LinkedIn profile became my digital shrine, a testament to every job, certification, and recommendation I'd ever received. I was like a digital hoarder, but instead of old newspapers and cats, I was collecting endorsements and connection requests. Hitting the magic 500 in those days was a feat; these days, not so much.</p><p>Fast forward to 2016. I've hit the jackpot - Vice President at a Fortune 500 company. My LinkedIn profile was so shiny you could see your reflection in it. I'd made it, baby!</p><p>Or so I thought.</p><p>That year was a rollercoaster ride through corporate hell. Restructuring, politics, and more buzzwords than a TED Talk convention. By the end of it, my boss was fired, my team was gone, and I was left wondering what the $%&amp;* just happened.</p><p>Reality check time. I realized two things:</p><ol><li><p>I wasn't the leadership god I thought I was.</p></li><li><p>I'd been so busy climbing the corporate ladder, I'd forgotten how to actually build it.</p></li></ol><p>Eight years of management had left me about as technically relevant as a floppy disk at a blockchain conference. I'd never personally worked on highly scalable distributed systems or built anything truly reliable. The cloud had happened, and I was just an observer.
I was all title, no substance.</p><p>So, I did what any self-respecting VP would do. I quit.</p><p><strong>I deleted my LinkedIn profile. </strong>I realized my profile had accomplished nothing for my career beyond serving as an anxiety-inducing shrine of irrelevance.</p><p>I did something that I have done multiple times in my career. I pivoted. I took a massive step back and became an individual contributor at Amazon. For five years, I immersed myself in the world of cloud and highly reliable, scalable systems, and learned how a massively scaled company operated under the hood. I learned more in those five years than in the previous decade of management.</p><p>Coming out of Amazon, I had a revelation. Titles? They're just words. What really matters are two simple questions:</p><ol><li><p>Am I going to learn?</p></li><li><p>Am I going to have an impact?</p></li></ol><p>That's it. That's the secret sauce.</p><p>Now, here's where it gets juicy. I recreated my LinkedIn account, but this time with a completely different perspective. And you know what? I realized something profound:</p><p>In the industry, our hiring process is more broken than a chocolate teapot.</p><p>LinkedIn has become a proxy for who we are as professionals. But does it really show our ability to learn, teach, and make an impact?</p><p>So, here's my radical suggestion: Delete your LinkedIn account.</p><p>Yes, you heard me right. In this job market, where opportunities are scarcer than honest politicians, it might seem insane. But hear me out.</p><p>By deleting your LinkedIn, you're forcing yourself to redefine who you are as a professional. You're breaking free from the shackles of titles and buzzwords. You're giving yourself permission to focus on what really matters: your ability to learn, teach, and make a real impact.</p><p>Is this for everyone? Of course not. But it worked for me, and it might just work for you.</p><p>So, here's my challenge to you: Stop obsessing over your digital resume. Stop chasing titles like they're the last slice of pizza at a party. Instead, focus on becoming someone who can learn, adapt, and make a difference.</p><p>Because at the end of the day, that's what really matters. Not the fancy title on your LinkedIn profile, but the impact you can make in the real world.</p><p>Are you ready to take the plunge? Are you brave enough to define yourself by your abilities rather than your digital persona?</p><p>The delete button is waiting. The choice is yours.</p><p>(credit to my buddy Claude for helping me get this story on a page and CG for supplying artwork)</p>]]></content:encoded></item><item><title><![CDATA[Combating Bikeshedding]]></title><description><![CDATA[how to keep your team focused on what matters]]></description><link>https://compiling.enstaria.com/p/combating-bikeshedding</link><guid isPermaLink="false">https://compiling.enstaria.com/p/combating-bikeshedding</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Thu, 15 Aug 2024 14:10:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/67e77815-ee31-43ee-9a43-29a1fd727acd_1280x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In any business that&#8217;s moving fast, decision-making speed is critical. Decisions can run the gamut from trivial to business-critical, and getting good at scaling and distributing decision-making is hard but ultimately rewarding, for both morale and efficiency.
Engineering teams can sometimes get stuck on minor decisions that should be easy to federate, but that for some reason devolve into bikeshedding and just need to be made by a leader.</p><p>Bikeshedding refers to the phenomenon, which has some behavioral research behind it, where teams spend too much time discussing trivial issues while ignoring more important, complex ones. The term originates from the idea that if you assign a committee to build a nuclear plant, it will tend to spend a disproportionate amount of time debating how to build the bike shed rather than tackling the design of the plant itself. In essence, bikeshedding occurs when people focus on small, simple decisions that need to be made because they are easier to reason about than the more complex designs that truly need attention. Examples of possible bikeshedding include a long, protracted discussion on naming something, or turf wars over using tech A vs. tech B when the two are equivalent.</p><p>Bikeshedding is an anti-pattern. It can be overt, but usually it isn&#8217;t. It slows teams down and leads to suboptimal outcomes. While I am open to some diversity in approaches to engineering, there are things that I hold closely to and that are important. One of those things is that if teams are bikeshedding, it's a clear sign that we need a decision. There are only two possibilities when bikeshedding appears: (1) the issue isn&#8217;t trivial, and it&#8217;s actually important that we get it right; or (2) it&#8217;s truly not important, and we need to make a decision and move on. So in either case, the best thing we can do is make the decision quickly and go.</p><p>In my role, I manage a series of meetings throughout the week that are designed to keep our team focused and aligned. (Full disclosure, I have enjoyed employing recommendations from <a href="https://www.linkedin.com/in/will-larson-a44b543/">Will Larson</a>'s writing). On Mondays, we have a staff leadership meeting where my product leader and I meet with our leadership teams. On Tuesdays we have the Eng Ops Review meeting where we dive deep into metrics, initiatives and postmortems. On Wednesdays we have our Tech Spec Review, which allows us to take a close look at the design work being done. These are my favorite meetings of the week. They allow me to dive deep with the teams across all aspects of engineering regularly.</p><p>Team members know that these meetings are where decisions can be made and are actually starting to use the process. So it&#8217;s highly likely that if they need help making a decision, they will raise the issue in one of these meetings. It&#8217;s hugely rewarding to see that happening. It makes the meeting valuable to me and to everyone else when we use it for its intended purpose instead of it being a one-way meeting. Most decisions are either (1) architectural, (2) product-oriented, or (3) bikeshedding. I highly encourage debate in these meetings to make sure every side is heard, but it&#8217;s also important to end the discussion with a promise on how the decision will be made.</p><p>Just last week, during one of our tech spec reviews, an engineer asked a simple question: "What should we name this feature?" It was a classic bikeshedding moment. Naming something may seem trivial, but it's the kind of question that developers could spend hours debating. My favorite approach to naming things like services is to avoid a name that represents what the service does.
Names can lock you, or those who come later, into a mental model that limits utility or expansion. However, in some cases it makes sense to name things descriptively, such as schema objects and attributes. This could have turned into a bikeshedding moment, but it was easily solvable in this case.</p><p>There are really just two effective options for any bikeshedding moment. The first is to apply Occam's Razor, a principle that suggests the simplest solution is often the best one, and just make the decision and move on. The second option, when the answer is less clear, is to delegate the decision to someone who owns that aspect of the project and ask them to lay out the tradeoffs so the decision can be made quickly. The ideal timeframe for any such decision is a small number of days. In this particular instance, I chose to delegate the decision to the product team. Although I could have easily used Occam's Razor to settle the debate, I recognized that naming the feature was something that the product team needed to own. It was important for product, not engineering, to own this part of the project. By delegating, I placed the decision in the right hands and reinforced the importance of ownership within our organization. I also asked them to come back with an answer within a day.</p><p>Naming things is an easy example, but these techniques also work with architecture or tech stack decisions. These sorts of decisions can actually be more gnarly, especially if code has been written. My favorite quote on this is from <a href="https://www.linkedin.com/in/behorowitz/">Ben Horowitz</a>'s book, <em>The Hard Thing About Hard Things</em>:</p><blockquote><p>Early in my career as an engineer, I'd learned that <strong>all decisions were objective until the first line of code was written</strong>. After that, all decisions were emotional.</p></blockquote><p>Decisions made after code has been written almost always have an emotional element to them. Sometimes it helps to remind teams of the emotional part of the decision and ask them for objectivity. It can help to ask them to come up with a short list of the most important tradeoffs involved in the decision and which of those are most important to the business. Ultimately, decisions that involve code already written almost always have to be made by the most senior engineer or engineering leader.</p><p>Using Occam&#8217;s Razor or delegating to a reasonable owner and asking for quick turnaround are both effective tools for making decisions quickly. Deferring a decision is another technique, but failing to make a decision that needs to be made is the worst possible outcome for a team and can leave lasting impacts. It&#8217;s better to come to a decision, even if it&#8217;s tempered with &#8220;for now&#8221;, so everyone can move on.</p><p>Bikeshedding is a common challenge in any team.
So the next time a bikeshedding moment arises, remember to stay focused, delegate when necessary, help your team get to an answer quickly, and move on.</p>]]></content:encoded></item><item><title><![CDATA[Core Engineering Problems]]></title><description><![CDATA[I was thinking today about a core set of engineering problems that have emerged more than once in systems I've worked on.]]></description><link>https://compiling.enstaria.com/p/core-engineering-problems</link><guid isPermaLink="false">https://compiling.enstaria.com/p/core-engineering-problems</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Wed, 10 Aug 2022 14:04:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7ce19912-30dd-40d0-853b-29bb6fa923ac_1024x682.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was thinking today about a core set of engineering problems that have emerged more than once in systems I've worked on. I'm not going to say or even pretend that this is a comprehensive list. This list doesn't even contain a single one of the really hard-core computer science problems out there such as cache invalidation, naming things, or eventual consistency algorithms. No, this list is probably best described as a Maslow's hierarchy of needs for building internet software. If they are in place, the software has a chance of being something customers will love. If they are not in place, pain will ensue. What is this hierarchy of needs for software, you ask? I break it down into 4 software needs and 3 engineering needs.</p><p>4 things software needs before building features:</p><ol><li><p><strong>Trust (safety)</strong>: This is so obvious that I at first didn't have it on the list. In order for customers to experience the delight of your new service and actually invest a part of themselves in using it, they need to trust it. They need to trust you. They need to trust that if they install your app, you won't do something bad to them or to their device. Whether they ask for it or not, a customer who stays with you wants an offering that prioritizes their safety over all else.</p></li><li><p><strong>Scaling (+resiliency)</strong>: This is often overlooked early, and in some cases that's OK at an early stage. However, before too long, economics + demand will require attention to scaling and resiliency. Customers expect you to be up when they want to use your service, at their convenience, not yours. This needs upfront design once demand emerges.</p></li><li><p><strong>Engagement (+personalization)</strong>: I can sense resistance to this requirement already. Why am I putting engagement and personalization in a list of otherwise nonfunctional features? It's intentional. In order to build features in the right way, we need to understand how our customers engage with our product, why they keep coming back, and how our service is personalized to them and their needs. These are not features; they are core fundamentals that need to be understood and engineered as first-class citizens before we consider adding anything else. If we start with these, features are just additive to the core.</p></li><li><p><strong>Rendering (+globally):</strong> Understanding how you plan to render your experience to your customers across all of the devices your customer has in all regions of the world is often overlooked.
I have seen so many systems that use HTML as some sort of content standard that I don't even blink anymore, but it shows a critical misunderstanding of how the internet works today, or an over-eagerness to jump on the latest JavaScript library before thinking about the customer experience. Customers expect to be able to access your service on any device they have handy (more often than not a mobile device), at their preference, not yours. Building a service that scales globally across mobile (first) + web takes intention in the design, and doing it well doesn't come for free.</p></li></ol><p>3 things engineering teams need to build features:</p><ol><li><p><strong>Operations (minimize):</strong> Engineers need to own their code in production, but that shouldn't mean they are spending all of their time doing operations. Take time up front to build an automated set of operational capabilities, and always expect failure to happen and have a plan of action for when it does.</p></li><li><p><strong>Safety (maximize):</strong> The number one thing, by far, that slows many engineering teams down is protocol that is intended to act as a safety net to keep failures from making it to production. It's the most frustrating thing in the world for everyone involved, including your customers, to intentionally throw up safety protocols that require manual checks and sign-offs or manual testing before making a change. It's a defect, not a feature. Instead, make your deployments safe through automated testing, continuous integration and continuous deployments. The only safety nets you need (in most cases) are (1) instrumented observability that reliably tells you the availability of your experience (and wakes you up if it's down), (2) limiting your blast radius (rolling out change slowly), and (3) (most important) making rollbacks as fast as possible and automatic.</p></li><li><p><strong>Tooling (stop coding stupid):</strong> Don't reinvent the wheel, embrace frameworks, use tools that auto-generate the stupid code so that you can focus on features and functionality. Stop building snowflakes, and instead build what customers really need.</p></li></ol><p>I'm 100% positive this is not a normative list, happy to hear other thoughts in the comments...</p>]]></content:encoded></item><item><title><![CDATA[What is Business Value?]]></title><description><![CDATA[Attempting to define business value as a software architect and technology leader has been a challenge.]]></description><link>https://compiling.enstaria.com/p/what-is-business-value</link><guid isPermaLink="false">https://compiling.enstaria.com/p/what-is-business-value</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Wed, 03 Aug 2022 14:21:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kiO2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf51854-cc46-4a24-bebd-51f1eda69fe7_650x650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Attempting to define <em>business value</em> as a software architect and technology leader has been a challenge. A software engineer&#8217;s default tendency is to want to sink their teeth into something new. &#8220;Fresh meat&#8221;, as it has been called: writing cool new code is always more fun than maintaining that &#8220;legacy&#8221; code of the past. I&#8217;ve learned that the term &#8220;legacy code&#8221; really means &#8220;code someone else wrote that I don&#8217;t like&#8221;.
Of course, as Uncle Bob (Robert Martin) would say, the code we are writing today is tomorrow&#8217;s legacy code, but that&#8217;s a challenging lesson to teach an engineer.</p><p>One of the skills that I think differentiates a software developer from an architect is being able to negotiate a technical decision path that achieves the company&#8217;s goals and delivers maximum business value. Business value is a complex topic, but in some respects I think that agile coach David Hussman describes it best in his &#8220;<a href="https://devjam.com/2016/06/06/dudes-law-don-reinertsen-and-wallmart/">Dude&#8217;s Law</a>&#8221;. He asserts that V = W / H, where V is value, W is why (intent), and H is how (mechanics). Go read his post (or better yet, have David come coach your team) for more details because he can describe it better than I. However, I think his simple formula has some profound implications for business.</p><p>I&#8217;m reading Josh Kaufman&#8217;s book, <em><a href="https://personalmba.com/">The Personal MBA</a></em>. In his chapter, &#8220;Playing with Fire&#8221;, he discusses some of the pitfalls that many modern, especially large, businesses fall into around using statistics, financial numbers, and algorithms to attempt to predict the future. Business executives have gotten used to using complex statistical models to attempt to divine the future, and in many respects these models are not much better than reading tea leaves.</p><p>He espouses that <em>creating and delivering value</em> is essential for business success, but that many top executives come from a finance background and spend more time looking at and trusting their numbers than thinking about creating and delivering value. This can be exacerbated when the company doesn&#8217;t have enough people on the management team who have come through the ranks from the area where the value gets created. When the statistical models are trusted and attain higher weight than building and delivering value, I think the company&#8217;s future becomes suspect. It gets worse when a statistical model is used to drive &#8220;efficiencies&#8221;, which in many cases end up lessening, not building, value.</p><p>I also think that many companies today are far too short-sighted in their planning, especially those that are public. Planning is based around annual financial statements (forced in some part by the SEC).
Every decision is very heavily predicated on how the financial statements will look for the year, rather than on building the overall value proposition.</p><p>So, it appears to me that learning how to quantify business value and use it in decision making is a skill important not only for software architects, but for business leaders as well.</p>]]></content:encoded></item><item><title><![CDATA[How are excellence and integrity related?]]></title><description><![CDATA[For a couple of years now, I have had as my motto Lenny Bennett&#8217;s quote &#8220;Excellence is the result of habitual integrity&#8221;.]]></description><link>https://compiling.enstaria.com/p/how-are-excellence-and-integrity</link><guid isPermaLink="false">https://compiling.enstaria.com/p/how-are-excellence-and-integrity</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Fri, 03 Jan 2020 15:20:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kiO2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf51854-cc46-4a24-bebd-51f1eda69fe7_650x650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For a couple of years now, I have had as my motto Lenny Bennett&#8217;s quote &#8220;Excellence is the result of habitual integrity&#8221;. You can find it as the subtitle to this blog. It all started a couple of years ago when I was running a small development group at my company that was a center of excellence. We were tasked with being the thought leaders, scientists, educators, and experts at what we did. I was looking for ways to motivate my team and demonstrate some characteristics of what I thought was key to our success. That quote seemed to sum it all up.</p><p>As time has passed, I&#8217;ve become more and more fond of the quote. I don&#8217;t know anything about Mr. Bennett (he was before my time), or why he coined this phrase. However, I think it is fairly significant, and I&#8217;m still in the process of unwrapping all that it means. In fact, I&#8217;ve purposed to study and think on the topic over the next year to see where it leads. I&#8217;m convinced that integrity is key to success in life on many different fronts. Excellence is really more an accidental outcome, but more on that later.</p><p>In this post, I&#8217;m not going to be able to unwrap the meaning of integrity, how it applies to life on many fronts, and how one of its many outcomes is excellence. However, let me start with something simple: the definition of the word integrity.</p><p><a href="https://1828.mshaffer.com/d/word/integrity">Webster&#8217;s 1828 dictionary</a>, which is one of my favorite dictionaries, defines integrity in this way:</p><blockquote><p><em>1. <strong>Wholeness; entireness; unbroken state.</strong></em></p><p><em>The constitution of the United States guaranties to each state the integrity of its territories.</em></p><p><em>The contracting parties guarantied the integrity of the empire.</em></p><p><em>2.
<strong>The entire, unimpaired state of any thing, particularly of the mind; moral soundness or purity; incorruptness; uprightness; honesty.</strong></em></p><p><em>Integrity comprehends the whole moral character, but has a special reference to uprightness in mutual dealings, transfers of property, and agencies for others.</em></p><p><em>The moral grandeur of independent integrity is the sublimest thing in nature, before which the pomp of eastern magnificence and the splendor of conquest are odious as well as perishable.</em></p><p><em>3. <strong>Purity; genuine, unadulterated, unimpaired state; as the integrity of language.</strong></em></p></blockquote><p>The more recent <a href="https://www.merriam-webster.com/dictionary/integrity">Merriam-Webster Dictionary</a> defines it this way:</p><blockquote><p><em>Definition of INTEGRITY</em></p><p><em>1.<strong> firm adherence to a code of especially moral or artistic values : incorruptibility</strong></em></p><p><em>2. <strong>an unimpaired condition : soundness</strong><br>3. <strong>the quality or state of being complete or undivided : completeness</strong></em></p></blockquote><p>Just a quick comparison of what nearly 200 years has done to this word in the Webster dictionary.</p><p>1) Both mention completeness, entireness. Integrity is the state of being complete without holes, without brokenness.</p><p>2) Adherence to morality is mentioned in both, although the 1828 edition has much more to say about what that means. (Why this is true is surprising and worth a post in and of itself).</p><p>Just reading through these definitions of integrity gives me a lot of things to think about. The next thing that comes to mind is how these definitions drive me to believe that integrity is related to reality, which is the very definition of truth itself. Definitely much more to think on for a future post.</p>]]></content:encoded></item><item><title><![CDATA[Integrity is hard, but makes life easy]]></title><description><![CDATA[As I was thinking about the topic of integrity this week, I received a newsletter from Mark Bouman, a missionary to Cambodia.]]></description><link>https://compiling.enstaria.com/p/integrity-is-hard-but-makes-life</link><guid isPermaLink="false">https://compiling.enstaria.com/p/integrity-is-hard-but-makes-life</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Thu, 31 Jan 2019 15:19:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kiO2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf51854-cc46-4a24-bebd-51f1eda69fe7_650x650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As I was thinking about the topic of integrity this week, I received a newsletter from Mark Bouman, a missionary to Cambodia. He had a pretty profound statement in his newsletter that I think relates directly to integrity.</p><blockquote><p><em>&#8220;If you&#8217;re only willing to do what&#8217;s easy, life will be hard. If you&#8217;re willing to do what&#8217;s hard, life will be easy.&#8221; Doing hard things, going the extra mile, making things right in relationships, keeping commitments even when no one is looking, these are the things that make us who we are&#8230;</em></p><p><em>Mark Bouman</em></p></blockquote><p>Integrity is being true to who you are, and always doing the right thing, even if it&#8217;s not the easiest route.
It&#8217;s providing the value your customers paid you for even if it costs you more than you originally thought to deliver. It&#8217;s tidying up your code before you commit it so that your coworkers don&#8217;t have to inherit a mess. It means being consistent with a methodology that you&#8217;ve committed to, even when everyone around you is getting by with under-achieving.</p><p>Integrity&#8230; It&#8217;s a hard thing to commit to, but if you do, life becomes a whole lot easier.</p><p>Integrity makes you real.</p>]]></content:encoded></item><item><title><![CDATA[Prayer]]></title><description><![CDATA[God, give me grace to accept with serenity the things that cannot be changed, Courage to change the things which should be changed, and the Wisdom to distinguish the one from the other.]]></description><link>https://compiling.enstaria.com/p/prayer</link><guid isPermaLink="false">https://compiling.enstaria.com/p/prayer</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Wed, 26 Sep 2018 14:18:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kiO2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf51854-cc46-4a24-bebd-51f1eda69fe7_650x650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>God, give me grace to accept with serenity the things that cannot be changed, Courage to change the things which should be changed, and the Wisdom to distinguish the one from the other. Living one day at a time, Enjoying one moment at a time, Accepting hardship as a pathway to peace, Taking, as Jesus did, This sinful world as it is, Not as I would have it, Trusting that You will make all things right, If I surrender to Your will, So that I may be reasonably happy in this life, And supremely happy with You forever in the next.</p><p>Amen.</p></blockquote><p>&#8212; Reinhold Niebuhr</p>]]></content:encoded></item><item><title><![CDATA[What is important is seldom urgent and what is urgent is seldom important]]></title><description><![CDATA[I was reminded of this quote from Eisenhower the other day:]]></description><link>https://compiling.enstaria.com/p/what-is-important-is-seldom-urgent</link><guid isPermaLink="false">https://compiling.enstaria.com/p/what-is-important-is-seldom-urgent</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sun, 26 Aug 2018 14:16:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/96cb9f06-2851-44b9-a8a4-40060da2d879_500x254.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was reminded of this quote from Eisenhower the other day:</p><blockquote><p><em>What is important is seldom urgent and what is urgent is seldom important</em></p></blockquote><p>I was looking at a long list of tasks that all seemed to urgently stare me in the face. Each of them was blinking like the red light screaming at Sandra Bullock in Gravity, calling to me to be worked on.</p><p>In times like this, I think the Time Management Matrix from Covey&#8217;s 7 Habits book is a good exercise to start with
(succinctly described <a href="https://en.wikipedia.org/wiki/Time_management#The_Eisenhower_Method">here</a>). It&#8217;s a good way to look at what&#8217;s screaming at you, separate the important from the unimportant, and critically analyze whether you are taking the time to work on important things that are not urgent.</p><p>It&#8217;s really the latter (quadrant 2) that worries me the most sometimes. If we are not taking the time to strategically think through what are the most important things we need to get done this week, this month, this quarter, or this year, we tend to let those little blinking red lights tell us what to do and we end up being automatons, following whatever fire-drill happens to have the most sirens.</p><p>It&#8217;s not all that bad to respond to fire-drills. Certainly the tasks in quadrant 1 (urgent and important) deserve attention. But, for a lot of us, if the source of the tasks we choose to do is entirely driven by quadrant 1, then I think it&#8217;s a good time to stop. Think about what our strategy and goals are for the year, for the month, for the quarter, whatever, and then make sure that we are forcing what is important to the forefront whether or not it is the loudest voice.</p>]]></content:encoded></item><item><title><![CDATA[How to drive innovation]]></title><description><![CDATA[I&#8217;m in the middle of several books right now, one of them being Game Changer, an operating manager&#8217;s guide to turning innovation into strategic advantage by A.G. Lafley and Ram Charan.]]></description><link>https://compiling.enstaria.com/p/how-to-drive-innovation</link><guid isPermaLink="false">https://compiling.enstaria.com/p/how-to-drive-innovation</guid><dc:creator><![CDATA[Andrew Elmhorst]]></dc:creator><pubDate>Sun, 26 Aug 2018 14:14:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kiO2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf51854-cc46-4a24-bebd-51f1eda69fe7_650x650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m in the middle of several books right now, one of them being <em>Game Changer, an operating manager&#8217;s guide to turning innovation into strategic advantage</em> by A.G. Lafley and Ram Charan. The basic premise of the book (co-written by the former CEO of Procter &amp; Gamble) is that innovation can be teased out, encouraged, measured, and integrated into everything done within a company, including its operating plan.</p><p>Here are a few ideas I thought worth saving so far:</p><ul><li><p>There is a difference between invention and innovation. &#8220;<em>An invention is simply a new idea. An innovation is the conversion of a new idea into revenues and profits. An idea that looks great in the lab and fails in the market is not an innovation. It is, at best, a curiosity.&#8221;</em></p></li><li><p>Innovation enables you to be on the offensive. Not innovating puts you on the defensive. &#8220;<em>Innovate or Die</em>&#8221;.</p></li><li><p>&#8220;<em>Collaboration is essential; failure is a regular visitor. Innovation leaders are comfortable with uncertainty and have an open mind; they are receptive to ideas from very different disciplines.
They have organized innovation into a disciplined process that is replicable.</em>&#8221; They can manage risks.</p></li><li><p>&#8220;<em>Every company has a budgeting process that is repetitive, refined and ingrained &#8230; with every manager participating in the process&#8230;. But few corporations can say the same when it comes to innovation.&#8221;</em></p></li></ul>]]></content:encoded></item></channel></rss>