Scaling challenges are inherently fun. Properly motivated, they mean so many people find value in what you’re building that you’re compelled to address them. They knock elbows with some of the most technically interesting topics in computer engineering. Maybe above all else, they tend to be ruthlessly measurable, so your success — when it comes — is obvious.
For the first four years of our business and platform (in essence, a set of APIs and client libraries used by the companies we partner with), we at Button resisted the siren song of an elaborate architecture that could scale to billions of customers with zero downtime. We had other, more urgent priorities at the time, like earning our first thousands of customers by shipping and iterating on our product. Yet somewhere in the salad days of 2019 it became obvious we had outgrown aspects of our initial design and, for reasons we’ll discuss, concluded the moment was right to re-author our topology and go “multi-region”.
Now, with two years of operational experience running the result of that effort, we feel qualified to share the juiciest nuggets of what we’ve learned. In this post we’ll try to cover the most salient and transferable aspects, namely:
- Why do you go multi-region?
- When do you go multi-region?
- How do you go multi-region?
This will often be through the lens of our journey, with an earnest attempt to generalize where possible. Let’s dive in!
Why Do You Go Multi-Region?
Before we address this question directly, let’s take a quick detour to an even more fundamental question: what do we even mean when we say “multi-region?”
Consider the following relatively simple hosted service, running out of a single data center in Oregon:
Requests from the internet arrive at a public-facing load balancer in a data center. The load balancer forwards requests to an application (there may be many instances running to choose from), which in turn consults a database and another internal service to respond.
Even with such a simple setup, we’re already in great position to motivate the idea behind going multi-region.
First, consider the resiliency of this design. Put your thumb over the database and imagine if it were down. If every request depends on access to that database, then we’ve identified a single point of failure: it goes down and our business goes down. The same goes for the internal service and many things in between: networking devices, physical machines, power supplies, etc. Going multi-region isn’t the only way to address these problems, but as we’ll see it’s a very effective way to defend against an entire class of them in one fell swoop.
Second, consider the latency of this design. Our customers on the west coast of the United States are probably perfectly happy with how long it takes to talk to our servers in Oregon (likely in the low tens of milliseconds per request). Folks in South Africa probably feel differently, however, since the physical distance bits have to travel is so much larger. They likely experience hundreds of milliseconds of latency per request, which, multiplied a few times to account for establishing a secure connection and actually transferring data, will yield noticeably peeved humans. Unlike resiliency, latency seems to be uniquely solvable by going multi-region. There’s just no wiggling around the speed of light.
Given these deficiencies, the concept of a multi-region design shouldn’t be all that surprising: host your software in many data centers!
Here we’ve copied everything from our Oregon data center and fork-lifted it into another running in South Africa. With careful design of our application, we can be completely indifferent as to which data center serves any given request. From this property, we can develop a policy that determines which data center a given request is routed to. Something like:
Route a request to the nearest data center that is fully operational.
Our resiliency characteristics are dramatically different under this design. Not only can we place a thumb over the database in Oregon and imagine it offline, we can place our hand over the entire region (maybe intrepid Oregon state decides to disconnect from the continental US and float off into the Pacific) and still see a way for our service to be operational. Per our routing rule, as long as we can detect malfunctions in Oregon, traffic should be routed away from the failing region and to the nearest operational one.
This is profound. Out of the dizzying possibilities of things that could fail catastrophically in a region, all have the exact same fix: detect the failure and route folks away while we correct it. And while the chance that a specific component fails may be low, the chance that any component fails may be quite high.
The careful reader may note that we could achieve a similar resiliency story by spinning up redundant copies of a system in a single region. While true in a sense, it’s also the case that catastrophic failures tend to be strongly correlated by geography. Data centers, it turns out, aren’t immune to fires, and cloud providers like AWS opt to communicate the health of their services by one facet: region.
With respect to latency, we’ve solved the problem physically. Light (and thus, a byte) just doesn’t have to travel as far. Depending on where our customers are, we can identify where new data centers will yield the most upside. Notably, the benefits of reducing latency are often non-linear. Button for example partners with large, established companies who may only sign a deal subject to an agreement on acceptable latency.
At this point maybe you’re convinced that running an additional region (or N additional regions) has some understandable upside. While probably not sufficient to motivate the effort in themselves, there are a handful of other benefits to be reaped as well:
- Diagnostics — Whether an issue is or is not present in other regions swiftly constrains what the likely root causes of it could be. A salient example for us has been regional issues with our hosting provider: noticing pathological behavior occurring only in a single region suggests it’s unlikely the problem is within our application.
- Consistent Infrastructure — Since there will be new regions to manage, you’ll have the opportunity to revisit every decision you made during the initial build of your infrastructure and apply any found wisdom. Consistency across regions being paramount, you’ll be incentivized to reach for tools that can manage things for you (e.g., no more getting away with clicking around in a cloud console). That also means it’ll be cheap to build incremental regions.
- New Migration Strategies — Redundancy unlocks a whole chapter out of your migration playbook. Test risky changes by deploying to a single region at a time, capping the downside if things do go sour. Route traffic away from a region that needs a critical database migrated. Roll out a major networking change at the heart of your system by rebuilding the entire region from scratch and slowly transitioning traffic over to it.
We’ve notably trivialized some major challenges in our description, namely:
- How do the servers in other regions get there?
- How do you actually keep many distinct regions running identically?
- How do you implement the routing policy that determines which region receives a request?
- What if that database needs to stay in-sync across regions? How does that happen?
- What if my application is huge and rebuilding all of it would be prohibitively expensive?
If you’re convinced of the upside and can defer your skepticism on these points, read on! All will be answered.
When Do You Go Multi-Region?
If the last section reads mostly as why you should go multi-region, this one will read mostly as its rebuttal. After all, answering when is in many ways the same as answering should.
Regrettably, when will predominantly depend on requirements completely unique to your team and your business. To make progress we’ll have to dispense with any commitment to precision and resort to generalities.
On one hand, there’s the time-value of refactoring. Your system will only get more complex and it’ll never be cheaper to overhaul than now. Designing features for multiple regions will always be easier than designing them for single regions and retrofitting them later.
On the other, it’s an enormous undertaking that could consume your best engineers for months. It will pull from your complexity budget. Designing features for multiple regions will always be harder than designing them for single regions.
So we’re looking for a sweet spot. Can you afford to ship less product? Are you confident enough in the shape of your business that you’re willing to invest in scaling it? How consequential would a prolonged outage be on your business’ probability of success? If you’re looking for a rule of thumb, this is about as good as we can offer:
You should go multi-region at the exact instant it would be irresponsible to your customers not to, and not a moment earlier.
A number of signals helped us determine we had crossed the threshold, which we present in case they have some power as patterns to match from:
- Constant Risk, Increasing Consequence — From day one we’ve built thoughtfully, and didn’t feel like our growth was producing much added risk (the percent chance) that we’d fail catastrophically. But the consequence (the effect) of a bad failure was ratcheting ever higher as we garnered more and more wins. We had more revenue to lose during an outage, more partners to suffer reputational damage from, more contracts contingent on uptime requirements. To everyone in the room, downtime was flatly unacceptable, and nothing could excuse it, up to and including downtime in our hosting provider.
- Product Stability — We’d converged on a set of interfaces that had created value for us and our partners for years, and we couldn’t see a near future that would fundamentally alter that. We were by no means done trying new things, but a kernel had emerged we were ready to commit to.
- Operational Terror — We had kicked the can on a number of changes to our stack for fear of what would happen when we rolled them out. Easy examples include migrations to a hot database in the critical path of our customers or tweaks deep in the internals of our networking. Having regional redundancy gave us a stress-free path through these upgrades.
- Customer Expectations — A double-edged sword of operating business-to-business is big customers will let you know what they want. We started receiving screenshots from the UK asking why our latency was so poor, which helped make the case internally. Even if we operated business-to-consumer, there’s evidence that latency has a direct effect on the bottom line (no doubt why our partners cared).
Finally, going multi-region is not an all-or-nothing proposition. The answer may be a hard “no” — or at best unclear — taking in your technical surface area as a whole, but an enthusiastic “yes!” for a strict subset of it. If you can, identifying a subset to take multi-region reduces many of the costs suggested while also increasing your likelihood of pulling it off.
For our part, we aggressively scoped the project to just the essential components of our stack that would enjoy the most upside from the new architecture. The split breaks along who the client of our API is:
One large portion of traffic is from partners relaying various bits of data to us. On the other side of the connection is another computer, meaning if we have a lasting outage, the computer can replay all of the updates we missed (needless to say, we still take great pains for this never to be the case). It also turns out computers rarely care all that much about the end-to-end latency of a request.
Conversely, the other large bucket of traffic has a human on the other end: people beginning a shopping journey and requiring various resources from our API, like customized links. Here there is little hope folks will patiently wait out our downtime and try again when we recover. Outages translate directly to lost revenue and lost goodwill. This is all the more acute when our downtime makes our partners look bad in front of their own customers.
Very confidently, we took only the latter cohort of traffic multi-region.
How Do You Go Multi-Region?
Those disappointed by our hand waving in the last section are in for another bout. Of course there is no universal playbook detailing how you actually pull this off. But there are lots of patterns, and patterns are the way to enlightenment.
We’ll walk through some of the most consequential moves we played, which roughly fell into two tiers:
- Infrastructure — All the work required to run computers, set up networking, implement routing policies, and so forth consistently across many regions.
- Application — All the work required to redesign our application for multi-region deployment on top of the new infrastructure.
Infrastructure Move 1: Provisioning and Management
Ultimately, our ability to create fungible data centers in regions around the world without prohibitive investment is attributable to a marvel of the modern world: cloud hosting. Though there is plenty to say about its costs, quirks, and limitations, being an API call away from provisioning a computer in Ohio as easily as in Singapore approaches a miracle.
In standing up new infrastructure, we strove to satisfy two goals:
- Provisioning a new region is trivial. This fell from a broader project goal, which was to make regions as fungible as possible. We wanted it to be cheap and easy to add or remove regions at any time, for any reason.
- Auditing the consistency of a region is trivial. An evangelical devotion to consistency was the only way our relatively small team would be able to handle the maintenance burden running servers all over the world imposes. Confirming that every last switch, knob, and input was consistent across all regions had to be easy and fast.
Our cloud hosting provider offers programmatic access to nearly every capability they offer (these days, a table stakes feature). This enables higher-level tooling on top of their APIs, the most impactful of which for us has been Terraform.
Terraform exposes a declarative API well suited for our goals. You specify in code what resources you want (e.g., a computer, a networking subnet, a queue), the relationships between those resources (e.g., a service referencing the load balancer that fronts it), and instruct Terraform to apply what you’ve written. Terraform will observe what already exists and then create, update, or delete anything it needs to ensure your code matches what is provisioned.
We organized big chunks of resources under “modules” (a parameterized grouping of resources) and composed them all together in a single definition for everything running in a region.
Spinning up a new region amounts to creating a new instance of our module, passing the region and a CIDR range as arguments, and applying.
Auditing for consistency across all our regions amounts to asking Terraform if any diff exists between our code and the live infrastructure.
Changing something across all regions amounts to submitting a code change for review, then applying the difference once it’s been accepted and merged.
Infrastructure Move 2: Implementing a Routing Policy
We use DNS to load balance between regions. Clients looking to send us a request make a DNS query to translate our hostname to an IP address, and then initiate a connection with that address.
The DNS resolver is capable of knowing some interesting facts, which lets it make an intelligent decision about which IP to return:
- The full roster of regions we serve traffic from
- The operational status of each of those regions
- The approximate latency from the client to each region
- The result of a coin flip
From these, the resolver can implement our desired policy:
- Take the set of all regions.
- Filter out any that are not operational.
- Sort by latency from low to high.
- Return the first IP.
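The steps above can be sketched in a few lines of Python. This is purely illustrative: the region names, IPs, and latencies are invented, and in practice a managed DNS product evaluates this policy for you.

```python
def pick_region(regions):
    """Return the IP of the nearest fully operational region."""
    candidates = [
        (latency_ms, ip)
        for ip, healthy, latency_ms in regions.values()
        if healthy                      # filter out non-operational regions
    ]
    if not candidates:
        raise RuntimeError("no operational regions")
    candidates.sort()                   # sort by latency, low to high
    return candidates[0][1]             # return the first IP

# Illustrative roster: region -> (ip, operational?, latency to this client in ms)
regions = {
    "us-west-2":  ("203.0.113.10", True, 20),
    "af-south-1": ("203.0.113.20", False, 180),   # failing its health checks
    "eu-west-2":  ("203.0.113.30", True, 90),
}
print(pick_region(regions))  # 203.0.113.10, the nearest healthy region
```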
Importantly, we’re able to configure exactly what checks are required to pass for the resolver to consider a region operational. If something awful happens in the middle of the night, the DNS system is sophisticated enough to have already drained traffic from the troublesome region while our on-call team is still drowsily trying to open their laptops.
That last fact, however, probably looks like an odd duck: why would you care if a DNS resolver can flip a coin or not? Well, it lets you create a “weighted” routing policy for gradually rolling out a change. For example, if we wanted to slowly introduce traffic to a new region, we would create two parallel sets of records, one with the new region and one without. We’d then start trickling 0.1%, 1%, 5% of traffic out to the new set, on up until we got to 100%.
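The weighted choice itself is simple. Here is a sketch with a hypothetical pair of record sets (the rosters and weights are invented; real DNS providers implement the weighting internally):

```python
import random

# Two parallel record sets: the current roster at weight 99, and the roster
# including the new region at weight 1 (i.e., roughly 1% of lookups).
record_sets = [
    (99, ["us-west-2", "eu-west-2"]),
    (1,  ["us-west-2", "eu-west-2", "af-south-1"]),
]

def choose_roster(record_sets):
    """Pick a record set in proportion to its weight: the resolver's coin flip."""
    total = sum(weight for weight, _ in record_sets)
    roll = random.uniform(0, total)
    for weight, roster in record_sets:
        if roll < weight:
            return roster
        roll -= weight
    return record_sets[-1][1]   # guard against roll landing exactly on total
```

Ratcheting the rollout forward is then just a matter of editing the weights.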
This move ought to stress the importance of hostname allocation — something that is best gotten right on day one of a new company. We depended on different usage patterns of our API being discernible by hostname, such that we could route the things that needed multi-region support differently than other traffic. Once you publish a hostname to clients it can be hard or impossible to change it, so even if you plan to build for years out of a single region, it makes a lot of sense to create different hostnames for different workloads and route them all to the same place until the time for multi-region comes.
Infrastructure Move 3: Peering Connections
As noted earlier, we deployed a strict subset of our application to multiple regions. That subset still has dependencies back to other private resources in our original “home” region (everything that we didn’t plan to move with the project) though.
To resolve this, we reached for a technique generally known as VPC peering, which builds a private tunnel between regions. Once activated, services in one region can communicate over private IP addresses with services in our home region, so internal traffic never hits the public internet.
Though not optimal from an isolation perspective, the ability to selectively reach back into our home region for resources we weren’t ready to migrate was critical to keeping the project’s scope manageable. In the years since, we’ve one-by-one taken many of the reasons for the peering connection and deployed alternatives. A handful of requirements will likely always necessitate it, though.
Application Move 1: Topology Rules to Limit Fate Sharing
With all the new serving regions, your graph of connections and dependencies can quickly grow unwieldy. We arrived at a few hard and fast rules that cut down the chaos and make it easy to reason about the consequences of losing a region.
- Regions are allowed to send requests to our original “home” region if and only if it’s not a hard dependency for serving traffic. We do this to warm caches, report telemetry, flush writes to central locations, etc.
- The home region is not allowed to make requests to or know about any regions. Doing so would require some amount of configuration in the home region so it knows who to talk to, which works against our overarching goal of completely fungible regions: they should be able to come online or offline at any time for any reason.
- Regions are not allowed to talk to each other, full stop.
With these properties, it would take all of our regional data centers going offline to cause a service disruption for our customers.
Application Move 2: Monolith Mitosis
A favorite of all those who grew their company around a monolith: what do you do if half of a really big service needs to go multi-region and the other doesn’t?
Enter monolith mitosis: a time-honored technique for dividing and conquering. We used this technique to factor out a highly specialized service more appropriate for multi-region deployment.
Executed carefully, each step of this process is stress-free. Note that if a service has a dependency like a database, it should be clear who will inherit ownership of it. At the end of the process, only one service should have connectivity to it.
Application Move 3: State Management
Nearly all the work we took on at the application tier of our rewrite reduced to finding alternative ways to handle various bits of state. In plenty of ways that’s not surprising — all state answers to where, and when you spread your compute across multiple regions you’re compelled to find alternative wheres.
Here’s the short list of our favorite techniques:
3a: In-Memory Caches
If you require read access to low-cardinality, low-write data (think simple configuration values), an in-memory cache may be the way to go.
We had just this requirement in many places in our application. In the simple days of yore, we would ask a config service for whatever values we needed as a hard dependency of serving each request.
To remove the dependency, we instead require each service to persist an in-memory cache of the values it needs, and consult only that cache when serving customers. Out of band, the service then polls for updates on some interval. As a further optimization, we used a highly redundant document storage layer (S3) to cheaply escrow the data. We were quite willing to introduce a bit of replication lag for the resiliency and latency improvements this model affords.
Since we require access to this data to coherently serve requests, we make our health checks sensitive to at least one complete warming of this cache. This health check bubbles all the way up to our DNS resolver when determining which regions are operational.
If the data is high-cardinality but still low-write, the same pattern can be applied but with an external cache instead of an in-memory one.
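A minimal sketch of the in-memory variant, where a hypothetical `fetch` function stands in for the out-of-band read from an escrow store like S3 (the interval and names are illustrative, not our production values):

```python
import threading
import time

class ConfigCache:
    """In-memory cache of low-cardinality config, refreshed out of band."""

    def __init__(self, fetch, interval_s=30.0):
        self._fetch = fetch           # e.g., a read from S3
        self._interval_s = interval_s
        self._values = None
        self._warmed = threading.Event()

    def start(self):
        threading.Thread(target=self._poll, daemon=True).start()

    def _poll(self):
        while True:
            try:
                self._values = self._fetch()   # one remote round trip, off the request path
                self._warmed.set()             # records at least one complete warming
            except Exception:
                pass                           # keep serving the last good copy
            time.sleep(self._interval_s)

    def healthy(self):
        # Wired into the region's health check, and thus into DNS routing.
        return self._warmed.is_set()

    def get(self, key, default=None):
        # The request path consults only the local copy; no remote dependency.
        return (self._values or {}).get(key, default)
```

A service would call `start()` at boot, refuse traffic until `healthy()` returns true, and call `get()` freely while serving requests.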
3b: Direct Encoding
If you require a database lookup of a small, bounded number of fields to process a request, encoding those fields directly in an ID supplied by a client may be the way to go.
The canonical example for this is a “session ID”, much like you’d find in a session cookie supplied by your web browser while on a site you’ve logged into.
We use a session ID as a part of our authentication scheme, and originally wrote to a single database table when authenticating a new client and read from the same table when verifying an existing client. This gave us the basic properties we wanted (ability to calculate business metrics like DAUs, know which partner had initiated the session, block malicious clients, etc.) at the cost of a very precarious single point of failure.
The alternative was to drop the database entirely and encode what data we needed right in the value of the session ID:
In the before box, you can imagine “s-1234” being the primary key for some row in our database, and in the after box the lengthier ID being some base-x encoded equivalent of the entire row. Since we’d otherwise lose the ability to discern which session IDs were guaranteed to have originated from our servers, we included a cryptographic signature of the encoded data as a part of the ID.
This scheme is quite similar to JWT — in our case we just made things a bit less general and more compact.
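A sketch of such a scheme using an HMAC signature. The key, field names, and JSON/base64 encoding here are all illustrative (as noted above, our production format is more compact than this):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"not-a-real-key"  # illustrative; keep real keys in a secret store

def encode_session(fields, secret=SECRET):
    """Pack the session's fields and a signature into one opaque ID."""
    payload = base64.urlsafe_b64encode(
        json.dumps(fields, separators=(",", ":")).encode()
    )
    sig = base64.urlsafe_b64encode(
        hmac.new(secret, payload, hashlib.sha256).digest()
    )
    return (payload + b"." + sig).decode()

def decode_session(session_id, secret=SECRET):
    """Verify the signature, proving the ID originated from our servers."""
    payload, sig = session_id.encode().split(b".", 1)
    expected = base64.urlsafe_b64encode(
        hmac.new(secret, payload, hashlib.sha256).digest()
    )
    if not hmac.compare_digest(sig, expected):
        raise ValueError("forged or corrupted session ID")
    return json.loads(base64.urlsafe_b64decode(payload))
```

Verification needs only the secret, so any region can authenticate a session without a database round trip.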
3c: Queues and Eventual Consistency
If you require a database write in the critical path of serving a request, buffering that write through a queue may be the way to go.
One function of our api service involves receiving a click from a user and synchronously writing a row to an internal database. We didn’t want to shard this database across regions (a centralized background service could only consult a single source-of-truth; remember the Topology Rules), and we didn’t want to reach for expensive (in money and operational complexity) distributed databases that support multi-region writes. Still, we had to support these writes even if the central database was down.
In the tradition of Rich Hickey’s classic sermon, we “[stuck] a queue in there”.
A queue can be conceptualized as a durable, hosted array you push and pop items from. They do very well with many people pushing and popping from them at the same time, and usually come with a bevy of bells and whistles to implement things like retries.
For us, this was useful as an alternative write path to the database. The api service pushes records to the queue (the exact records it would have sent off to the click service) and responds to the customer the moment this succeeds (the queue can be trusted not to drop messages). Meanwhile, a fleet of “flushers” polls the queue for any items pushed to it. When a flusher finds one, its sole job is to relay it to the click service in our home region. If the click service is having downtime (say someone kicked off a chunky migration), the flushers can slow down processing to a trickle, occasionally retrying until it comes back online.
Crucial to our goals, the queue and flusher have to be local to a region. The queue can fail (in fact, it is our most common failure point), and cross-region writes to it would kill our latency budget. We tie the operational status of a cluster to its ability to write to its own queue: any problems, and DNS automatically starts routing customers away. Keeping the flusher local ensures compliance with the Topology Rules enumerated above: the home region is oblivious to where writes are coming from.
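The two halves can be sketched as follows, with `queue.Queue` standing in for a durable hosted queue (e.g., SQS) and `write_to_click_service` a hypothetical call back to the home region:

```python
import queue
import time

click_queue = queue.Queue()  # stand-in for a durable, regional hosted queue

def handle_click(record):
    """Request path: enqueue and respond immediately; no database in the way."""
    click_queue.put(record)   # the queue is trusted not to drop messages
    return {"status": "accepted"}

def run_flusher(write_to_click_service, max_backoff_s=60.0):
    """Out-of-band worker: relay records home, backing off while it's down."""
    backoff_s = 1.0
    while True:
        record = click_queue.get()
        while True:
            try:
                write_to_click_service(record)
                backoff_s = 1.0            # home region healthy again
                break
            except Exception:
                time.sleep(backoff_s)      # slow to a trickle during downtime
                backoff_s = min(backoff_s * 2, max_backoff_s)
```

The request handler never blocks on the home region; the flusher absorbs all the retry logic out of band.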
Even absent a multi-region architecture, this alternative decouples users from our single point of failure, the database. It’s not hard to imagine the implicit benefits this has. Migrations for example are a lot less ominous when you have a buffer.
The one wrinkle with this design is accommodating systems that read the records written to our database, in our case the “background service”. Since we added a buffer, our database holds what’s referred to as an eventually consistent view of the writes that have actually occurred across the entire system. It has most of them, but there might be some lingering ones in regional queues it’s not aware of yet.
Depending on the nature of your data, this could be problematic. In our case, readers of this database are background computers, themselves driven by queues. Accordingly, if the background service doesn’t find a record it expects to, it takes a quick break and tries again a short time later.
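That reader-side accommodation can be as simple as a retry loop. `lookup` here is a hypothetical database read returning `None` when a record hasn’t landed yet:

```python
import time

def fetch_with_retry(lookup, record_id, attempts=5, delay_s=2.0):
    """Read under eventual consistency: if the record isn't visible yet,
    take a quick break and try again a short time later."""
    for _ in range(attempts):
        record = lookup(record_id)
        if record is not None:
            return record
        time.sleep(delay_s)   # likely still in a regional queue; wait it out
    raise LookupError(f"record {record_id} not visible after {attempts} attempts")
```

In our case the caller is itself queue-driven, so giving up after a few attempts just means the work item gets redelivered later.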
Despite our best intentions, the topics covered in this post give a misleading portrayal of where our energy and attention actually went over the course of the project. Blogs seem to force any subject through a filter of what’s interesting that not even an author can completely resist.
So to correct the general impression, we submit the disclaimer that many boring (but critical) details were foreshortened or elided, and many engaging (but accessory) details were exaggerated.
Of us, this project demanded a keen attention to detail (exactly the requirements of our platform, the guarantees of a specific subsystem, the implicit expectations of another), an uncompromising devotion to communication (motivating and describing changes to our engineering team, documenting shifting responsibilities, rewriting on-call playbooks, broadcasting updates during critical migrations), and a willingness to slog through hundreds of small changes in preparation for the big ones.
It’s just hard work, but the upside is there if you and your team are ready for it. Technically, culturally, practically, we are left completely convinced of the benefits.
(Special thanks to Adam Berger for reviewing an early draft.)