God, grant me the serenity to accept the things I cannot change,

courage to change the things I can,

and wisdom to know the difference.

 

The Serenity Prayer 

 

At Button we run a SaaS platform. It's a distributed system, and like most distributed systems it's almost always experiencing some sort of failure. Offering a reliable product in spite of this requires knowing when and where failures occur so we can build resiliency around them. 

 

This isn't always easy. Even with a centralized view of all those failures, sense has to be made of them. Are those HTTP timeouts okay or suggestive of a burgeoning outage? Are they drowning out a critical null pointer dereference? Is a lack of errors a good sign or cause for concern (because the monitoring system is down)?

 

There's lots of signal and lots of noise, and we need help to know the difference. For us, that help arrived in the form of a simple model and some tooling tweaks. With confidence and calm, we're alerted to the things we can change and accept the things we cannot.

 

A Simple Model

The bedrock of our approach is a basic taxonomy. Errors are generally one of two types, each of which we need to understand quite differently:

  • Deterministic Errors are those that happen consistently given the same system state and input. A programming bug like calling a non-existent function is a deterministic error, as is assuming the wrong payload shape in an HTTP request. No matter how many times a procedure is retried, the same failure will always occur.

  • Transient Errors are those that happen spuriously given the same system state (at least, locally) and input. These are typically seen at process boundaries, for instance when an HTTP request times out or a database connection fails. Unlike a deterministic error, retrying the same procedure will often not yield the same failure, and may in fact succeed. These errors usually emanate from operations that take a lot of time; accordingly, it's interesting to observe how long attempting them takes. This moves us away from a simple, binary understanding of system behavior (success vs failure) and towards a more enlightening gradient (1%, 10%, 50% within normal operating bounds).

 

Conflating these two types in an error handling scheme will likely cause a good deal of frustration, since they have distinct causes and distinct remedies. Deterministic errors reveal an honest mistake or invalid assumption on the part of the programmer. They can be corrected by updating the software. Transient errors are a fact of life: the network and its associates are just unreliable. They are problematic only in aggregate (e.g., if >= 1% of network requests are failing), and actionable only in that they can likely be tuned (increase that timeout) or avoided (drain that data center).

 

Handled separately, the way forward is a bit clearer. We need a precision tool for surfacing deterministic errors with enough supporting context to make a fix obvious. Every time that tool shows up with a new error, we need to triage and prep a fix with urgency. Then, we need an aggregate tool to track the rate of transient errors of a given type. Every time that tool reports a rate crossing some healthy threshold, we similarly need to triage and evaluate.

 

Some Tooling Tweaks

Somewhere, somehow, we'll have to make a determination when a failure occurs: was it transient or not? In our programming environment of choice, that might look something like this:
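A minimal sketch of that wrapper in TypeScript (the recordAttempt hook is a hypothetical stand-in for the metrics client discussed later, and setting a transient property is just one convention for marking an error):

```typescript
// Hypothetical metrics hook; in practice this would update in-process metric
// state that Prometheus later scrapes (see below).
type AttemptRecord = { success: boolean; transient: boolean; durationMs: number };

function recordAttempt(record: AttemptRecord): void {
  // Illustrative only.
}

async function attempt<T>(
  operation: () => Promise<T>,
  isTransient: (err: Error) => boolean
): Promise<T> {
  const startedAt = Date.now();
  try {
    // 1. Attempt the operation that might exhibit a transient failure.
    const result = await operation();
    recordAttempt({ success: true, transient: false, durationMs: Date.now() - startedAt });
    return result;
  } catch (err) {
    // 2. Inspect the error; if it looks transient, mark it as such.
    const transient = err instanceof Error && isTransient(err);
    if (transient) {
      (err as Error & { transient?: boolean }).transient = true;
    }
    // 3. Record the attempt, its outcome, and its duration, then always re-throw.
    recordAttempt({ success: false, transient, durationMs: Date.now() - startedAt });
    throw err;
  }
}
```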

 

 

  1. We attempt some operation that might exhibit a transient failure. 
  2. If it fails, we inspect the error to determine whether it was transient. If it was, we mark it as such (helpful later on). In either case, we re-throw the error to preserve any handling logic further up the callstack.
  3. We note that we attempted an operation, whether or not it was successful, whether or not a failure was transient, and how long it took.

Tedious though it would be to repeat this ceremony every time we invoked an operation that can fail spuriously, we are software engineers and abstraction is our game. Bundled within a function, this boilerplate only needs to be parameterized by a few values depending on its context:

  • operation — You know, that actual thing we're trying to accomplish.
  • isTransient — A function that evaluates an error and returns true or false. Its implementation will depend heavily on what operation is. For example, if operation is making an HTTP call, isTransient might return true for HTTP 5XXs, socket errors, and timeouts, and false otherwise.
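For an HTTP-flavored operation, that check might be sketched like so (the statusCode and code fields are assumptions about the error shape surfaced by the HTTP client in use):

```typescript
// Illustrative isTransient for HTTP calls.
function isTransientHttpError(err: Error & { statusCode?: number; code?: string }): boolean {
  if (err.statusCode !== undefined && err.statusCode >= 500) return true;      // HTTP 5XXs
  if (err.code === 'ECONNRESET' || err.code === 'ECONNREFUSED') return true;   // socket errors
  if (err.code === 'ETIMEDOUT' || err.code === 'ESOCKETTIMEDOUT') return true; // timeouts
  return false; // everything else (4XXs, parse errors, ...) is treated as deterministic
}
```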

Elided in the code snippet above for brevity, a couple other descriptive parameters will go a long way in distinguishing different classes of transient failures in our tooling:

  • target — The target of a given operation, most likely an external system. Think: "rpc", "postgres", "redis", etc.
  • action — If target is the noun, action is the corresponding verb. With this, we can differentiate between things like read vs write failures. This is best demonstrated with a little table:
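(The values below are illustrative, not an exhaustive list.)

  target      action
  rpc         GET /v1/users
  rpc         POST /v1/orders
  postgres    read
  postgres    write
  redis       get
  redis       set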

Despite getting some of the boilerplate out of the way, all of this might still seem a bit tedious: every time we make an rpc call we need to specify isTransient, target, and action? Well, no: we can again reach for abstraction.

 

It's never a bad idea for a company to have a standard library of sorts, specific to the needs of your team. Such a library is a suitable home for a general purpose "rpc call" module that conveniently packages all of this (and more!) behind an interface that asks very little of the typical developer. isTransient can assume a default implementation appropriate to most rpc calls, target is implicit to the interface, and since a developer will likely need to specify the hostname, method, and pathname of an HTTP request anyway, they get action for free. The same trick can be applied to any other operation; most folks will never need to sweat all the instrumentation and categorization happening behind the scenes.
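A sketch of such a module, building on the attempt() wrapper from earlier (here assumed to accept the target/action parameters described above; httpRequest is a hypothetical placeholder for whatever HTTP client the library wraps):

```typescript
// Assumed to exist elsewhere in the standard library.
declare function attempt<T>(
  operation: () => Promise<T>,
  isTransient: (err: Error) => boolean,
  meta: { target: string; action: string }
): Promise<T>;
declare function isTransientHttpError(err: Error): boolean;
declare function httpRequest<T>(opts: { hostname: string; method: string; pathname: string }): Promise<T>;

interface RpcOptions {
  hostname: string;
  method: string;
  pathname: string;
  isTransient?: (err: Error) => boolean; // override only when the default doesn't fit
}

// Every rpc call goes through here, so the instrumentation and categorization
// come for free at the call site.
async function rpcCall<T>(opts: RpcOptions): Promise<T> {
  return attempt(
    () => httpRequest<T>(opts),
    opts.isTransient ?? isTransientHttpError, // sensible default for most rpc calls
    {
      target: 'rpc',                             // implicit to this interface
      action: `${opts.method} ${opts.pathname}`, // already supplied by the caller
    }
  );
}

// A typical (hypothetical) call site then reads simply:
//   rpcCall({ hostname: 'orders.internal', method: 'GET', pathname: '/v1/orders' })
```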

 

Still unresolved, however, is which tools we use to route and track our two failure types. The specific tools are unmistakably the least important aspect of this scheme (many will do admirably), but to complete the illustration we'll reach for Sentry and Prometheus.

 

 

Sentry is an excellent precision tool for recording our deterministic failures. Errors are typically reported with fully hydrated stack traces, then coherently grouped together under an "Issue" complete with all the accessories you'd expect (histograms, summary statistics, …). Triaging, discussing, and ultimately resolving Issues is as seamless as you'd hope, and Sentry is happy to dispatch alerts via whatever channel a team prefers.

 

To route deterministic failures into Sentry, we place a catch-all error handler high in our application's callstack. Since errors that bubble up to it will have already been evaluated as deterministic or transient (and marked accordingly), we can most simply forward the deterministic ones to Sentry and discard the transient ones.
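As a sketch, assuming an Express application and the transient mark set by the attempt() wrapper above (Express and the response handling here are illustrative, not prescriptive):

```typescript
import * as Sentry from '@sentry/node';
import express from 'express';

const app = express();
Sentry.init({ dsn: process.env.SENTRY_DSN });

// ... application routes ...

// Catch-all error handler, high in the stack. Transient errors were already
// marked further down, so only deterministic ones are forwarded to Sentry.
app.use((err: Error & { transient?: boolean }, _req: express.Request, res: express.Response, _next: express.NextFunction) => {
  if (!err.transient) {
    Sentry.captureException(err); // deterministic: someone needs to ship a fix
  }
  // Transient failures are tracked only in aggregate (see Prometheus below),
  // so they are quietly discarded here.
  res.status(500).send('Internal Server Error');
});
```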

 

Prometheus, meanwhile, is an excellent aggregate tool for recording every attempt of an operation that can fail transiently. It centers on the notion of a metric: any number that can be observed to change over time. Metrics can have labels associated with them, which is a convenient way to indicate, for a given operation, whether it succeeded or failed, whether a failure was transient, and which target/action it was for. There's plenty more to say on this topic, but for our purposes it means we can easily derive charts like the one below, showing the 90th percentile duration of all operations, grouped by target and action.
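First, a sketch of the instrumentation that would feed such a chart, using prom-client (the Prometheus client for Node; the metric name, buckets, and extended recordAttempt signature are illustrative). The PromQL behind the chart is included as a comment:

```typescript
import client from 'prom-client';

// One histogram records every attempt; labels carry the categorization.
const operationDuration = new client.Histogram({
  name: 'operation_duration_seconds',
  help: 'Duration of operations that may fail transiently',
  labelNames: ['target', 'action', 'success', 'transient'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// The recordAttempt hook from earlier, extended with target/action labels.
function recordAttempt(record: {
  target: string;
  action: string;
  success: boolean;
  transient: boolean;
  durationMs: number;
}): void {
  operationDuration.observe(
    {
      target: record.target,
      action: record.action,
      success: String(record.success),
      transient: String(record.transient),
    },
    record.durationMs / 1000
  );
}

// A 90th percentile duration chart, grouped by target and action, can then be
// derived with PromQL along these lines:
//   histogram_quantile(0.9,
//     sum(rate(operation_duration_seconds_bucket[5m])) by (le, target, action))
```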

[Chart: 90th percentile operation duration, grouped by target and action]

This may seem like a lot of data to export for every operation an application might need to invoke. Fortunately, Prometheus operates by polling applications for their metrics on a fixed sampling interval. All that's required of an application is to update some bit of in-process state when an operation is attempted, then expose that state over a well-known HTTP route. We've never feared instrumenting even the most performance-critical applications this way.
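Exposing that state can be as small as a single route; a sketch, again assuming Express and prom-client's default registry:

```typescript
import express from 'express';
import client from 'prom-client';

const app = express();

// Prometheus polls this route on its sampling interval; the application just
// serializes its in-process metric state on demand.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000); // illustrative port
```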

 

Alerting is as simple as we described earlier: we set a threshold failure rate above which we wish to be paged to investigate. If our threshold were set at 0.01% per target, the following metric would have paged us shortly after 14:00:

[Chart: per-target failure rate, crossing the alert threshold shortly after 14:00]

Undoubtedly such subtle shifts in failure rates are hard to detect without the right tooling. With it, big changes in system performance leap off the page.

 

Where Does All This Leave Us?

With good system coverage of this scheme, it's reasonable and safe to expect a few dramatic outcomes.

 

First, expect to spend way less time sifting through logs or idly wondering if something is breaking without your knowing. You'll be alerted early when the characteristics of your system change for the worse. When they do, you'll have excellent supporting information to guide your investigation. But — you won't be paged when the odd HTTP request times out while things are otherwise healthy.

 

Second, despite the overhead of the scheme, expect your engineering team to spend less time dithering about how to handle error conditions while they're making routine updates. This scheme does the heavy lifting for them. In almost all cases, folks will just want to use the standard library abstractions provided and let any errors bubble up to the catch-all handler.

 

Finally, expect your customers to see a higher quality of service. Measuring something is the first step towards optimizing it.

 

 

Special thanks to Chris Maddern and Adam Berger for reviewing drafts of this post.