Take It to the Limit: Considerations for Building Reliable Systems

Complex systems usually operate in failure mode. This is because a complex system typically consists of many discrete pieces, each of which can fail in isolation (or in concert). In a microservice architecture where a given function potentially comprises several independent service calls, high availability hinges on the ability to be partially available. This is a core tenet behind resilience engineering. If a function depends on three services, each with a reliability of 90%, 95%, and 99%, respectively, partial availability could be the difference between 99.995% availability and 84.6% availability (assuming failures are independent): requiring all three services to be up means multiplying their reliabilities together, while tolerating the loss of any of them means failing only when all three fail at once. Resilience engineering means designing with failure as the norm.
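
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in Go, a minimal illustration using the hypothetical reliabilities above and assuming independent failures:

    package main

    import "fmt"

    func main() {
        reliabilities := []float64{0.90, 0.95, 0.99}

        // If the function requires all three services, failures compound:
        // availability is the product of the individual reliabilities.
        all := 1.0
        for _, r := range reliabilities {
            all *= r
        }

        // If the function can degrade gracefully and serve with any one of
        // the three, it fails only when all three fail at the same time.
        allFail := 1.0
        for _, r := range reliabilities {
            allFail *= 1 - r
        }

        fmt.Printf("all three required: %.3f%%\n", all*100)         // ~84.645%
        fmt.Printf("any one suffices:   %.3f%%\n", (1-allFail)*100) // 99.995%
    }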

Anticipating failure is the first step to resilience zen, but the second is embracing it. Telling the client “no” and failing on purpose is better than failing in unpredictable or unexpected ways. Backpressure is another critical resilience engineering pattern. Fundamentally, it’s about enforcing limits. This comes in the form of queue lengths, bandwidth throttling, traffic shaping, message rate limits, max payload sizes, etc. Prescribing these restrictions makes the limits explicit when they would otherwise be implicit (eventually your server will exhaust its memory, but since the limit is implicit, it’s unclear exactly when or what the consequences might be). Relying on unbounded queues and other implicit limits is like someone saying they know when to stop drinking because they eventually pass out.
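
As a minimal sketch of what an explicit limit looks like in practice, consider a bounded queue that tells the client “no” instead of growing without bound. The types and numbers here are illustrative, not from any particular system:

    package main

    import (
        "errors"
        "fmt"
    )

    // ErrBackpressure tells the caller to back off or shed load.
    var ErrBackpressure = errors.New("queue full: backpressure applied")

    // BoundedQueue makes the limit explicit: a buffered channel whose
    // capacity is the queue's hard bound.
    type BoundedQueue struct {
        ch chan []byte
    }

    func NewBoundedQueue(limit int) *BoundedQueue {
        return &BoundedQueue{ch: make(chan []byte, limit)}
    }

    // Enqueue fails fast instead of blocking or growing without bound.
    func (q *BoundedQueue) Enqueue(msg []byte) error {
        select {
        case q.ch <- msg:
            return nil
        default:
            return ErrBackpressure // tell the client "no" on purpose
        }
    }

    func main() {
        q := NewBoundedQueue(2)
        for i := 0; i < 3; i++ {
            if err := q.Enqueue([]byte("work")); err != nil {
                fmt.Println("rejected:", err) // the third enqueue is rejected
            }
        }
    }

Rejecting work at the boundary is what lets the caller back off, shed load, or fail fast, instead of everyone discovering the limit when the process is OOM-killed.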

Rate limiting is important not just to prevent bad actors from DoSing your system, but also to prevent you from DoSing it yourself. Queue limits and message size limits are especially interesting because they seem to confuse and frustrate developers who haven’t fully internalized the motivation behind them. But really, these are just another form of rate limiting or, more generally, backpressure. Let’s look at max message size as a case study.

Imagine we have a system of distributed actors. An actor can send messages to other actors who, in turn, process the messages and may choose to send messages themselves. Now, as any good software engineer knows, the eighth fallacy of distributed computing is “the network is homogeneous.” This means not all actors are using the same hardware, software, or network configuration. We have servers with 128GB RAM running Ubuntu, laptops with 16GB RAM running macOS, mobile clients with 2GB RAM running Android, IoT edge devices with 512MB RAM, and everything in between, all running a hodgepodge of software and network interfaces.

When we choose not to put an upper bound on message sizes, we are making an implicit assumption (recall the discussion on implicit/explicit limits from earlier). Put another way, you and everyone you interact with (likely unknowingly) enter an unspoken contract from which neither party can opt out. This is because any actor may send a message of arbitrary size. This means any downstream consumers of this message, either directly or indirectly, must also support arbitrarily large messages.

How can we test something that is arbitrary? We can’t. We have two options: either we make the limit explicit or we keep this implicit, arbitrarily binding contract. The former allows us to define our operating boundaries and gives us something to test. The latter requires us to test at some undefined production-level scale. The second option is gambling away reliability for convenience. The limit is still there, it’s just hidden. When we don’t make it explicit, we make it easy to DoS ourselves in production. Limits become even more important when dealing with cloud infrastructure due to its multitenant nature. They prevent a bad actor (or yourself) from bringing down services or dominating infrastructure and system resources.
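
To illustrate, here is one hedged sketch of making the limit explicit at the transport boundary; MaxMessageSize and ReadMessage are hypothetical names for illustration, and 1MB is an arbitrary choice:

    package main

    import (
        "errors"
        "fmt"
        "io"
        "strings"
    )

    // MaxMessageSize is an explicit, testable bound (1MB here, arbitrarily).
    const MaxMessageSize = 1 << 20

    var ErrTooLarge = errors.New("message exceeds max size")

    // ReadMessage reads at most MaxMessageSize bytes and rejects anything
    // larger up front, rather than exhausting memory at some implicit,
    // unknown limit.
    func ReadMessage(r io.Reader) ([]byte, error) {
        buf, err := io.ReadAll(io.LimitReader(r, MaxMessageSize+1))
        if err != nil {
            return nil, err
        }
        if len(buf) > MaxMessageSize {
            return nil, ErrTooLarge
        }
        return buf, nil
    }

    func main() {
        oversized := strings.NewReader(strings.Repeat("x", MaxMessageSize+1))
        _, err := ReadMessage(oversized)
        fmt.Println(err) // message exceeds max size
    }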

In our heterogeneous actor system, we have messages bound for mobile devices and web browsers, which are often single-threaded or memory-constrained consumers. Without an explicit limit on message size, a client could easily doom itself by requesting too much data or simply receiving data outside of its control—this is why the contract is unspoken but binding.

Let’s look at this from a different kind of engineering perspective. Consider another type of system: the US National Highway System. The US Department of Transportation uses the Federal Bridge Gross Weight Formula as a means to prevent heavy vehicles from damaging roads and bridges. It’s really the same engineering problem, just a different discipline and a different type of infrastructure.

The August 2007 collapse of the Interstate 35W Mississippi River bridge in Minneapolis brought renewed attention to the issue of truck weights and their relation to bridge stress. In November 2008, the National Transportation Safety Board determined there had been several reasons for the bridge’s collapse, including (but not limited to): faulty gusset plates, inadequate inspections, and the extra weight of heavy construction equipment combined with the weight of rush hour traffic.

The DOT relies on weigh stations to ensure trucks comply with federal weight regulations, fining those that exceed restrictions without an overweight permit.

The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country’s highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.

Weight limits need to be enforced so civil engineers have a defined operating range for the roads, bridges, and other infrastructure they build. Computers are no different. This is the reason many systems enforce these types of limits. For example, Amazon clearly publishes the limits for its Simple Queue Service—the max in-flight messages for standard queues is 120,000 messages and 20,000 messages for FIFO queues. Messages are limited to 256KB in size. Amazon Kinesis, Apache Kafka, NATS, and Google App Engine pull queues all limit messages to 1MB in size. These limits allow the system designers to optimize their infrastructure and ameliorate some of the risks of multitenancy—not to mention it makes capacity planning much easier.

Unbounded anything—whether it’s queues, message sizes, queries, or traffic—is a resilience engineering anti-pattern. Without explicit limits, things fail in unexpected and unpredictable ways. Remember, the limits exist, they’re just hidden. By making them explicit, we restrict the failure domain, giving us more predictability, longer mean time between failures, and shorter mean time to recovery at the cost of more upfront work or slightly more complexity.

It’s better to be explicit and handle these limits upfront than to punt on the problem and allow systems to fail in unexpected ways. The latter might seem like less work at first but will lead to more problems long term. By requiring developers to deal with these limitations directly, they will think through their APIs and business logic more thoroughly and design better interactions with respect to stability, scalability, and performance.

You Are Not Paid to Write Code

“Taco Bell Programming” is the idea that we can solve many of the problems we face as software engineers with clever reconfigurations of the same basic Unix tools. The name comes from the fact that every item on the menu at Taco Bell, a company which generates almost $2 billion in revenue annually, is simply a different configuration of roughly eight ingredients.

Many people grumble at or reject the notion of using proven tools or techniques. It’s boring. It requires investing time to learn at the expense of shipping code. It doesn’t do this one thing that we need it to do. It won’t work for us. For some reason—and I continue to be completely baffled by this—everyone sees their situation as a unique snowflake despite the fact that a million other people have probably done the same thing. It’s a weird form of tunnel vision, and I see it at every level in the organization. I catch myself doing it on occasion too. I think it’s just human nature.

I was able to come to terms with this once I internalized something a colleague once said: you are not paid to write code. You have never been paid to write code. In fact, code is a nasty byproduct of being a software engineer.

Every time you write code or introduce third-party services, you are introducing the possibility of failure into your system.

I think the idea of Taco Bell Programming can be generalized further and has broader implications based on what I see in industry. There are a lot of parallels to be drawn from The Systems Bible by John Gall, which provides valuable commentary on general systems theory. Gall’s Fundamental Theorem of Systems is that new systems mean new problems. I think the same can safely be said of code—more code, more problems. Do it without a new system if you can.

Systems are seductive and engineers in particular seem to have a predisposition for them. They promise to do a job faster, better, and more easily than you could do it by yourself or with a less specialized system. But when you introduce a new system, you introduce new variables, new failure points, and new problems.

But if you set up a system, you are likely to find your time and effort now being consumed in the care and feeding of the system itself. New problems are created by its very presence. Once set up, it won’t go away; it grows and encroaches. It begins to do strange and wonderful things. It breaks down in ways you never thought possible. It kicks back, gets in the way, and opposes its own proper function. Your own perspective becomes distorted by being in the system. You become anxious and push on it to make it work. Eventually you come to believe that the misbegotten product it so grudgingly delivers is what you really wanted all the time. At that point, encroachment has become complete. You have become absorbed. You are now a systems person.

The last systems principle we look at is one I find particularly poignant: almost anything is easier to get into than out of. When we introduce new systems, new tools, new lines of code, we’re with them for the long haul. It’s like a baby that doesn’t grow up.

We’re not paid to write code, we’re paid to add value (or reduce cost) to the business. Yet I often see people measuring their worth in code, in systems, in tools—all of the output that’s easy to measure. I see it come at the expense of attending meetings. I see it at the expense of supporting other teams. I see it at the expense of cross-training and personal/professional development. It’s like full-bore coding has become the norm and we’ve given up everything else.

Another area I see this manifest is with the siloing of responsibilities. Product, Platform, Infrastructure, Operations, DevOps, QA—whatever the silos, it’s created a sort of responsibility lethargy. “I’m paid to write software, not tests” or “I’m paid to write features, not deploy and monitor them.” Things of that nature.

I think this is only addressed by stewarding a strong engineering culture and instilling the right values and expectations. For example, engineers should understand that they are not defined by their tools but rather the problems they solve and ultimately the value they add. But it’s important to spell out that this goes beyond things like commits, PRs, and other vanity metrics. We should embrace the principles of systems theory and Taco Bell Programming. New systems or more code should be the last resort, not the first step. Further, we should embody what it really means to be an engineer rather than measuring raw output. You are not paid to write code.

Shit Rolls Downhill

Building software of significant complexity is tough because a lot of pieces have to come together and a lot of teams have to work in concert to be successful. It can be extraordinarily difficult to get everyone on the same page and moving in tandem toward a common goal. Product development is largely an exercise in trust (or perhaps more accurately, hiring), but even if you have the “right” people—people you can trust and depend on to get things done—you’re only halfway there.

Trust is an important quality to screen for, difficult though it may be. However, a person’s trustworthiness or dependability doesn’t really tell you much about that person as an engineer. The engineering culture is something that must be cultivated. Etsy’s CTO, John Allspaw, said it best in a recent interview:

Post-mortem debriefings every day are littered with the artifacts of people insisting, the second before an outage, that “I don’t have to care about that.”

If “abstracting away” is nothing for you but a euphemism for “Not my job,” “I don’t care about that,” or “I’m not interested in that,” I think Etsy might not be the place for you. Because when things break, when things don’t behave the way they’re expected to, you can’t hold up your arms and say “Not my problem.” That’s what I could call “covering your ass” engineering, and it may work at other companies, but it doesn’t work here.

Allspaw calls this the distinction between hiring software developers and software engineers. This perception often results in heated debate, but I couldn’t agree with it more. There is a very real distinction to be made. Abstraction is not about boundaries of concern, it’s about boundaries of focus. Engineers need to have an intimate understanding of this.

Engineering, as a discipline and as an activity, is multi-disciplinary. It’s just messy. And that’s actually the best part of engineering. It’s not about everyone knowing everything. It’s about paying attention to the shared, mutual understanding.

But engineering is more than just technical aptitude and a willingness to “dig in” to the guts of something. It’s about having an acute awareness of the delicate structure upon which software is built. More succinctly, it’s about having empathy. It’s recognizing the fact that shit rolls downhill.

For things to work, the entire structure has to hold, and no one point is any more or less important than the others. It almost always starts off with good intentions at the top, but the shit starts to compound and accelerate as it rolls effortlessly and with abandon toward the bottom. There are a few aspects to this I want to explore.

Understand the Relationships

This isn’t to say that folks near the top are less susceptible to shit. Everyone has to shovel it, but the way it manifests is different depending on where you find yourself on the hill. The key point is that the people above you are effectively your customers, either directly or indirectly, and if you’re toward the top, maybe literally.

And, as all customers do, they make demands. This is a very normal thing and is to be expected. Some of these demands are reasonable, others not so much. Again, this is normal, but what do we make of these demands?

There are some interesting insights we can take from The Innovator’s Dilemma (which, by the way, is an essential read for anyone looking to build, run, or otherwise contribute to a successful business), which are especially relevant toward the top of the hill. Mainly, we should not merely take the customer’s word as gospel. When it comes to products, feature requests, and “the way things should be done,” the customer tends to have a very narrow and predisposed view. I find the following passage to be particularly poignant:

Indeed, the power and influence of leading customers is a major reason why companies’ product development trajectories overshoot demands of mainstream markets.

Essentially, too much emphasis can be placed on the current or perceived needs of the customer, resulting in a failure to meet their unstated or future needs (or if we’re talking about internal customers, the current or future needs of the business). Furthermore, we can spend too much time focusing on the customer’s needs—often perceived needs—culminating in a paralysis to ship. This is very anti-continuous-delivery. Get things out fast, see where they land, and make appropriate adjustments on the fly.

Giving in to customer demands is a judgment game, but depending on the demand, it can have a profound impact on the people further down the hill. Thus, these decisions should be made accordingly and in a way that involves a cross section of the hill. If someone near the top is calling all the shots, things are not going to work out, and in all likelihood, someone else is going to end up getting covered in shit.

An interesting corollary is the relationship between leadership and engineers. Even a single, seemingly innocuous question asked in passing by a senior manager can change the entire course of a development team. The manager may simply be seeking information, but the team interprets the question as a statement that “this thing needs to be done.” It’s important to recognize this interaction for what it is.

Set Appropriate Expectations

In truth, the relationship between teams is not equivalent to the relationship between actual customers and the business. You may depend on another team in order to provide a certain feature or to build a certain product. If the business is lagging, the customer might take their money elsewhere. If the team you depend on is lagging, you might not have the same liberty. This leads to the dangerous “us versus them” trap teams fall into as an organization grows. The larger a company gets, the more fingers get pointed because “they’re no longer us, they’re them.” There are more teams, they are more isolated, and there are more dependencies. It doesn’t matter how great your culture is, changing human nature is hard. And when pressure builds from above, the finger-pointing only intensifies.

Therefore, it’s critical to align yourself with the teams you depend on. Likewise, align yourself with the teams that depend on you, don’t alienate them. In part, this means having a realistic sense of urgency, setting realistic expectations, and planning accordingly. It’s not reasonable to submit a work item to another team and then turn around and call it a blocker. Doing so means you failed to plan, but to outside observers, it’s now the other team that’s the problem. Just as we prioritize the work precipitated by our customers, so does every other team. With few exceptions, you cannot expect a team to drop everything they’re doing to focus on your needs. This is the aforementioned “us versus them” mentality. Instead, align. Speak with the team you depend on, understand where your needs fit within their current priorities, and if it’s a risk, be willing to roll up your sleeves and help out. This is exactly what Allspaw was getting at when he described what a “software engineer” is.

Setting realistic expectations is vital. Just as products ship with bugs, so does everything else in the stack. Granted, some bugs are worse than others, but no amount of QA will fully prevent them from going to production. Bugs will only get worked out if the code actually gets used. You cannot wait until something is perfect before adopting it. You will wait forever. Remember that Agile is micro failure on a macro level. Adopt quickly, deploy quickly, fail quickly, adjust quickly. As Jay Kreps once said, “The only way to really know if a system design works in the real world is to build it, deploy it for real applications, and see where it falls short.”

While it’s important to set appropriate expectations downward, it’s also important to communicate upward. Ensure that the teams relying on you have the correct expectations. Establish what the team’s short-term and long-term goals are and make them publicly available. Enable those teams to plan accordingly, and empower them so that they can help out when needed. Provide adequate documentation such that another engineer can jump in at any time with minimal handoff.

Be Curious

This largely gets back to the quote by John Allspaw. The point is that we want to hire and develop software engineers, not programmers. Being an engineer should mean having an innate curiosity. Figure out what you don’t know and push beyond it.

Understand, at least on some level, the things that you depend on. Own everything. Similarly, if you built it and it’s running in production, it’s on you to support it. Throwing code over the wall is no longer acceptable. When there’s a problem with something you depend on, don’t just throw up your hands and say “not my problem.” Investigate it. If you’re certain it’s a problem in someone else’s system, bring it to them and help root cause it. Provide context. When did it start happening? What were the related events? What were the effects? Don’t just send an error message from the logs.

This is the engineering culture that gets you the rest of the way there. The people are important, especially early on, but it’s the core values and practices that will carry you. The Innovator’s Dilemma again provides further intuition:

In the start-up stages of an organization, much of what gets done is attributable to resources—people, in particular. The addition or departure of a few key people can profoundly influence its success. Over time, however, the locus of the organization’s capabilities shifts toward its processes and values. As people address recurrent tasks, processes become defined. And as the business model takes shape and it becomes clear which types of business need to be accorded highest priority, values coalesce. In fact, one reason that many soaring young companies flame out after an IPO based on a single hot product is that their initial success is grounded in resources—often the founding engineers—and they fail to develop processes that can create a sequence of hot products.

Summary

There will always be gravity. As such, shit will always roll downhill. It’s important to embrace this structure, to understand the relationships, and to set appropriate expectations. Equally important is fostering an engineering culture—a culture of curiosity, ownership, and mutual understanding. Having the right people is essential, but it’s only half the problem. The other half is instilling the right values and practices. Shit rolls downhill, but if you have the right people, values, and practices in place, that manure might just grow something amazing.

Abstraction Considered Harmful

“Abstraction is sometimes harmful,” he proclaims to the sound of anxious whooping and subdued applause from the audience. Peter Alvaro’s 2015 Strange Loop keynote, I See What You Mean, remains one of my favorite talks—not just because of its keen insight on distributed computing and language design, but because of a more fundamental, almost primordial, understanding of systems thinking.

Abstraction is what we use to manage complexity. We build something of significant complexity, we mask its inner workings, and we expose what we think is necessary for interacting with it.

Programmers are lazy, and abstractions help us be lazy. The builders of abstractions need not think about how their abstractions will be used—this would be far too much effort. Likewise, the users of abstractions need not think about how their abstractions work—this would be far too much effort. And now we have a nice, neatly wrapped package we can use and reuse to build all kinds of applications—after all, duplicating it would be far too much effort and goes against everything we consider sacred as programmers.

It usually works like this: in order to solve a problem, a programmer first needs to solve a sub-problem. This sub-problem is significant enough in complexity or occurs frequently enough in practice that the programmer doesn’t want to solve it for the specific case—an abstraction is born. Now, this can go one of two ways. Either the abstraction is rock solid and the programmer never has to think about the mundane details again (think writing loops instead of writing a bunch of jmp statements)—success—or the abstraction is leaky because the underlying problem is sufficiently complex (think distributed database transactions). Infinite sadness.

It’s kind of a cruel irony. The programmer complains that there’s not enough abstraction for a hard sub-problem. Indeed, the programmer doesn’t care about solving the sub-problem. They are focused on solving the greater problem at hand. So, as any good programmer would do, we build an abstraction for the hard sub-problem, mask its inner workings, and expose what we think is necessary for interacting with it. But then we discover that the abstraction leaks and complain that it isn’t perfect. It turns out, hard problems are hard. The programmer then simply does away with the abstraction and solves the sub-problem for their specific case, handling the complexity in a way that makes sense for their application.

Abstraction doesn’t magically make things less hard. It just attempts to hide that fact from you. Just because the semantics are simple doesn’t mean the solution is. In fact, it’s often the opposite, yet this seems to be a frequently implied assumption.

Duplication is far cheaper than the wrong abstraction. Just deciding which little details we need to expose on our abstractions can be difficult, particularly when we don’t know how they will be used. The truth is, we can’t know how they will be used because some of the uses haven’t even been conceived yet. Abstraction is a delicate balance of precision and granularity. To quote Dijkstra:

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

But, as we know, requirements are fluid. Too precise and we lose granularity, hindering our ability to adapt in the future. Too granular and we weaken the abstraction. But a strong abstraction for a hard problem isn’t really strong at all when it leaks.

The key takeaway is that abstractions leak, and we have to deal with that. There is never a silver bullet for problems of sufficient complexity. Peter ends his talk on a polemic against the way we currently view abstraction:

[Let’s] not make concrete, static abstractions. Trust ourselves to let ourselves peer below the facade. There’s a lot of complexity down there, but we need to engage with that complexity. We need tools that help us engage with the complexity, not a fire blanket. Abstractions are going to leak, so make the abstractions fluid.

Abstraction, in and of itself, is not harmful. On the contrary, it’s necessary for progress. What’s harmful is relying on impenetrable barriers to protect our precious programmers from hard problems. After all, the 21st century engineer understands that in order to play in the sand, we all need to be comfortable getting our feet a little wet from time to time.

From the Ground Up: Reasoning About Distributed Systems in the Real World

The rabbit hole is deep. Down and down it goes. Where it ends, nobody knows. But as we traverse it, patterns appear. They give us hope, they quell the fear.

Distributed systems literature is abundant, but as a practitioner, I often find it difficult to know where to start or how to synthesize this knowledge without a more formal background. This is a non-academic’s attempt to provide a line of thought for rationalizing design decisions. This piece doesn’t necessarily contribute any new ideas but rather tries to provide a holistic framework by studying some influential existing ones. It includes references which provide a good starting point for thinking about distributed systems. Specifically, we look at a few formal results and slightly less formal design principles to provide a basis from which we can argue about system design.

This is your last chance. After this, there is no turning back. I wish I could say there is no red-pill/blue-pill scenario at play here, but the world of distributed systems is complex. In order to make sense of it, we reason from the ground up while simultaneously stumbling down the deep and cavernous rabbit hole.

Guiding Principles

In order to reason about distributed system design, it’s important to lay out some guiding principles or theorems used to establish an argument. Perhaps the most fundamental of these is the Two Generals Problem, originally introduced by Akkoyunlu et al. in Some Constraints and Trade-offs in the Design of Network Communications and popularized by Jim Gray in Notes on Data Base Operating Systems in 1975 and 1978, respectively. The Two Generals Problem demonstrates that it’s impossible for two processes to agree on a decision over an unreliable network. It’s closely related to the binary consensus problem (“attack” or “don’t attack”) where the following conditions must hold:

  • Termination: all correct processes eventually decide some value (liveness property).
  • Validity: if all correct processes decide v, then v must have been proposed by some correct process (non-triviality property).
  • Integrity: all correct processes decide at most one value v, and it must be the “right” value (safety property).
  • Agreement: all correct processes must agree on the same value (safety property).

It becomes quickly apparent that any useful distributed algorithm consists of some intersection of both liveness and safety properties. The problem becomes more complicated when we consider an asynchronous network with crash failures:

  • Asynchronous: messages may be delayed arbitrarily long but will eventually be delivered.
  • Crash failure: processes can halt indefinitely.

Considering this environment actually leads us to what is arguably one of the most important results in distributed systems theory: the FLP impossibility result introduced by Fischer, Lynch, and Paterson in their 1985 paper Impossibility of Distributed Consensus with One Faulty Process. When we do not consider an upper bound on the time a process takes to complete its work and respond in a crash-failure model, it’s impossible to make the distinction between a process that has crashed and one that is merely taking a long time to respond. FLP shows there is no algorithm which deterministically solves the consensus problem in an asynchronous environment when it’s possible for at least one process to crash. Equivalently, we say it’s impossible to have a perfect failure detector in an asynchronous system with crash failures.

When talking about fault-tolerant systems, it’s also important to consider Byzantine faults, which are essentially arbitrary faults. These include, but are not limited to, attacks which might try to subvert the system. For example, a security attack might try to generate or falsify messages. The Byzantine Generals Problem is a generalized version of the Two Generals Problem which describes this fault model. Byzantine fault tolerance attempts to protect against these threats by detecting or masking a bounded number of Byzantine faults.

Why do we care about consensus? The reason is that it’s central to so many important problems in system design. Leader election implements consensus, allowing you to dynamically promote a coordinator to avoid single points of failure. Distributed databases implement consensus to ensure data consistency across nodes. Message queues implement consensus to provide transactional or ordered delivery. Distributed init systems implement consensus to coordinate processes. Consensus is a fundamentally important problem in distributed programming.

It has been shown time and time again that networks, whether local-area or wide-area, are often unreliable and largely asynchronous. As a result, these proofs impose real and significant challenges to system design.

The implications of these results are not simply academic: these impossibility results have motivated a proliferation of systems and designs offering a range of alternative guarantees in the event of network failures.

L. Peter Deutsch’s fallacies of distributed computing are a key jumping-off point in the theory of distributed systems. They are a set of incorrect assumptions which many new to the space frequently make, the first of which is “the network is reliable”:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

The CAP theorem, while recently the subject of scrutiny and debate over whether it’s overstated or not, is a useful tool for establishing fundamental trade-offs in distributed systems and detecting vendor sleight of hand. Gilbert and Lynch’s Perspectives on the CAP Theorem lays out the intrinsic trade-off between safety and liveness in a fault-prone system, while Fox and Brewer’s Harvest, Yield, and Scalable Tolerant Systems characterizes it in a more pragmatic light. I will continue to say unequivocally that the CAP theorem is important within the field of distributed systems and of significance to system designers and practitioners.

A Renewed Hope

The results detailed earlier would seem to imply that many distributed algorithms, including those which implement linearizable operations, serializable transactions, and leader election, are a hopeless endeavor. Is it game over? Fortunately, no. Carefully designed distributed systems can maintain correctness without relying on pure coincidence.

First, it’s important to point out that the FLP result does not indicate consensus is unreachable, just that it’s not always reachable in bounded time. Second, the system model FLP uses is, in some ways, a pathological one. Synchronous systems place a known upper bound on message delivery between processes and on process computation. Asynchronous systems have no fixed upper bounds. In practice, systems tend to exhibit partial synchrony, which is described as one of two models by Dwork, Lynch, and Stockmeyer in Consensus in the Presence of Partial Synchrony. In the first model of partial synchrony, fixed bounds exist but they are not known a priori. In the second model, the bounds are known but are only guaranteed to hold starting at some unknown time T. The authors present fault-tolerant consensus protocols for both partial-synchrony models combined with various fault models.

Chandra and Toueg introduce the concept of unreliable failure detectors in Unreliable Failure Detectors for Reliable Distributed Systems. Each process has a local, external failure detector which can make mistakes. The detector monitors a subset of the processes in the system and maintains a list of those it suspects to have crashed. Failures are detected by simply pinging each process periodically and suspecting any process which doesn’t respond to the ping within twice the maximum round-trip time for any previous ping. The detector makes a mistake when it erroneously suspects a correct process, but it may later correct the mistake by removing the process from its list of suspects. The presence of failure detectors, even unreliable ones, makes consensus solvable in a slightly relaxed system model.
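
A minimal sketch of such a ping-based detector, assuming a pingFn stand-in for the real transport (all names here are hypothetical):

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // pingFn is a hypothetical stand-in for the real transport: it returns
    // the round-trip time to a process, or an error on timeout.
    type pingFn func(process string, timeout time.Duration) (time.Duration, error)

    // monitor pings each process in rounds (periodically, in a real
    // detector), suspecting any process that doesn't respond within twice
    // the maximum round-trip time observed so far. A later successful ping
    // removes the process from the suspect list, correcting the mistake.
    func monitor(processes []string, ping pingFn, rounds int) map[string]bool {
        suspects := make(map[string]bool)
        maxRTT := 100 * time.Millisecond
        for i := 0; i < rounds; i++ {
            for _, p := range processes {
                rtt, err := ping(p, 2*maxRTT)
                if err != nil {
                    suspects[p] = true // maybe crashed, maybe just slow
                    continue
                }
                delete(suspects, p) // correct an earlier false suspicion
                if rtt > maxRTT {
                    maxRTT = rtt
                }
            }
        }
        return suspects
    }

    func main() {
        // A fake transport in which process "b" has stopped responding.
        ping := func(p string, timeout time.Duration) (time.Duration, error) {
            if p == "b" {
                return 0, errors.New("timeout")
            }
            return time.Duration(rand.Intn(50)) * time.Millisecond, nil
        }
        fmt.Println(monitor([]string{"a", "b", "c"}, ping, 3)) // map[b:true]
    }

The detector is wrong whenever it suspects a process that is merely slow, but that’s permitted; it only needs to correct its mistakes eventually for consensus to become solvable in the relaxed model.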

While consensus ensures processes agree on a value, atomic broadcast ensures processes deliver the same messages in the same order. This same paper shows that the problems of consensus and atomic broadcast are reducible to each other, meaning they are equivalent. Thus, the FLP result and others apply equally to atomic broadcast, which is used in coordination services like Apache ZooKeeper.

In Introduction to Reliable and Secure Distributed Programming, Cachin, Guerraoui, and Rodrigues suggest most practical systems can be described as partially synchronous:

Generally, distributed systems appear to be synchronous. More precisely, for most systems that we know of, it is relatively easy to define physical time bounds that are respected most of the time. There are, however, periods where the timing assumptions do not hold, i.e., periods during which the system is asynchronous. These are periods where the network is overloaded, for instance, or some process has a shortage of memory that slows it down. Typically, the buffer that a process uses to store incoming and outgoing messages may overflow, and messages may thus get lost, violating the time bound on the delivery. The retransmission of the messages may help ensure the reliability of the communication links but introduce unpredictable delays. In this sense, practical systems are partially synchronous.

We capture partial synchrony by assuming timing assumptions only hold eventually, without stating exactly when. Such a system is also called eventually synchronous. However, this does not guarantee the system is synchronous forever after a certain time, nor does it require the system to be initially asynchronous and then become synchronous after some period of time. Instead, it implies the system has periods of asynchrony which are not bounded, but there are periods where the system is synchronous long enough for an algorithm to do something useful or terminate. The key thing to remember with asynchronous systems is that they contain no timing assumptions.

Lastly, On the Minimal Synchronism Needed for Distributed Consensus by Dolev, Dwork, and Stockmeyer describes a consensus protocol as t-resilient if it operates correctly when at most t processes fail. In the paper, several critical system parameters and synchronicity conditions are identified, and it’s shown how varying them affects the t-resiliency of an algorithm. Consensus is shown to be provably possible for some models and impossible for others.

Fault-tolerant consensus is made possible by relying on quorums. The intuition is that any two majorities must overlap in at least one process, so as long as a majority of processes agrees on every decision, at least one process knows about the complete history even in the presence of faults.
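
The majority arithmetic is simple enough to show directly (a toy illustration, not any particular protocol):

    package main

    import "fmt"

    // majority returns the minimum quorum size for n processes.
    func majority(n int) int { return n/2 + 1 }

    func main() {
        for _, n := range []int{3, 5, 7} {
            q := majority(n)
            // Two quorums together contain 2q > n members, so they must
            // share at least 2q-n >= 1 process, which carries the history
            // from one decision to the next.
            fmt.Printf("n=%d quorum=%d overlap>=%d\n", n, q, 2*q-n)
        }
    }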

Deterministic consensus, and by extension a number of other useful algorithms, is impossible in certain system models, but we can model most real-world systems in a way that circumvents this. Nevertheless, it shows the inherent complexities involved with distributed systems and the rigor needed to solve certain problems.

Theory to Practice

What does all of this mean for us in practice? For starters, it means distributed systems are usually a harder problem than they let on. Failing to recognize this is often the cause of improperly documented trade-offs and, in many cases, data loss and safety violations. It also suggests we need to rethink the way we design systems by shifting the focus from system properties and guarantees to business rules and application invariants.

One of my favorite papers is End-To-End Arguments in System Design by Saltzer, Reed, and Clark. It’s an easy read, but it presents a compelling design principle for determining where to place functionality in a distributed system. The principal idea behind the end-to-end argument is that functions placed at a low level in a system may be redundant or of little value when compared to the cost of providing them at that low level. It follows that, in many situations, it makes more sense to flip guarantees “inside out”—pushing them outwards rather than relying on subsystems, middleware, or low-level layers of the stack to maintain them.

To illustrate this, we consider the problem of “careful file transfer.” A file is stored by a file system on the disk of computer A, which is linked by a communication network to computer B. The goal is to move the file from computer A’s storage to computer B’s storage without damage and in the face of various failures along the way. The application in this case is the file-transfer program which relies on storage and network abstractions. We can enumerate just a few of the potential problems an application designer might be concerned with:

  1. The file, though originally written correctly onto the disk at host A, if read now may contain incorrect data, perhaps because of hardware faults in the disk storage system.
  2. The software of the file system, the file transfer program, or the data communication system might make a mistake in buffering and copying the data of the file, either at host A or host B.
  3. The hardware processor or its local memory might have a transient error while doing the buffering and copying, either at host A or host B.
  4. The communication system might drop or change the bits in a packet, or lose a packet or deliver a packet more than once.
  5. Either of the hosts may crash part way through the transaction after performing an unknown amount (perhaps all) of the transaction.

Many of these problems are Byzantine in nature. When we consider each threat one by one, it becomes abundantly clear that even if we place countermeasures in the low-level subsystems, there will still be checks required in the high-level application. For example, we might place checksums, retries, and sequencing of packets in the communication system to provide reliable data transmission, but this really only eliminates threat four. An end-to-end checksum and retry mechanism at the file-transfer level is needed to guard against the remaining threats.
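
As a sketch of what that end-to-end mechanism might look like (transfer here is a hypothetical stand-in for the disk reads, network hops, and disk writes in between):

    package main

    import (
        "crypto/sha256"
        "errors"
        "fmt"
    )

    // transfer is a hypothetical stand-in for the actual copy: disk read,
    // network hops, disk write at the destination. Any step may corrupt
    // or lose data.
    type transfer func(src []byte) (dst []byte, err error)

    // carefulTransfer applies the end-to-end check: after the copy, the
    // destination's checksum is compared against the source's, and the
    // whole transfer is retried on mismatch or failure. Low-level
    // reliability (e.g., packet checksums) narrows threat four but does
    // not remove the need for this check.
    func carefulTransfer(src []byte, t transfer, maxRetries int) ([]byte, error) {
        want := sha256.Sum256(src)
        for i := 0; i < maxRetries; i++ {
            dst, err := t(src)
            if err != nil {
                continue // e.g., a host crashed mid-transfer; try again
            }
            if sha256.Sum256(dst) == want {
                return dst, nil // end-to-end integrity confirmed
            }
        }
        return nil, errors.New("transfer failed after retries")
    }

    func main() {
        copyOverNetwork := func(src []byte) ([]byte, error) {
            return append([]byte(nil), src...), nil // a well-behaved hop
        }
        out, err := carefulTransfer([]byte("file contents"), copyOverNetwork, 3)
        fmt.Println(string(out), err)
    }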

Building reliability into the low level has a number of costs involved. It takes a non-trivial amount of effort to build. It’s redundant and, while it may reduce the frequency of application-level retries, it adds overhead that every use of the system must pay. It also has no actual effect on correctness because correctness is determined and enforced by the end-to-end checksum and retries. The reliability and correctness of the communication system is of little importance, so going out of its way to ensure resiliency does not reduce any burden on the application. In fact, ensuring correctness by relying on the low level might be altogether impossible since threat number two requires writing correct programs, but not all programs involved may be written by the file-transfer application programmer.

Fundamentally, there are two problems with placing functionality at the lower level. First, the lower level is not aware of the application needs or semantics, which means logic placed there is often insufficient. This leads to duplication of logic as seen in the example earlier. Second, other applications which rely on the lower level pay the cost of the added functionality even when they don’t necessarily need it.

Saltzer, Reed, and Clark propose the end-to-end principle as a sort of “Occam’s razor” for system design, arguing that it helps guide the placement of functionality and organization of layers in a system.

Because the communication subsystem is frequently specified before applications that use the subsystem are known, the designer may be tempted to “help” the users by taking on more function than necessary. Awareness of end-to-end arguments can help to reduce such temptations.

However, it’s important to note that the end-to-end principle is not a panacea. Rather, it’s a guideline to help get designers to think about their solutions end to end, acknowledge their application requirements, and consider their failure modes. Ultimately, it provides a rationale for moving function upward in a layered system, closer to the application that uses the function, but there are always exceptions to the rule. Low-level mechanisms might be built as a performance optimization. Regardless, the end-to-end argument contends that lower levels should avoid taking on any more responsibility than necessary. The “lessons” section from Google’s Bigtable paper echoes some of these same sentiments:

Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used. For example, we initially planned to support general-purpose transactions in our API. Because we did not have an immediate use for them, however, we did not implement them. Now that we have many real applications running on Bigtable, we have been able to examine their actual needs, and have discovered that most applications require only single-row transactions. Where people have requested distributed transactions, the most important use is for maintaining secondary indices, and we plan to add a specialized mechanism to satisfy this need. The new mechanism will be less general than distributed transactions, but will be more efficient (especially for updates that span hundreds of rows or more) and will also interact better with our scheme for optimistic cross-datacenter replication.

We’ll see the end-to-end argument as a common theme throughout the remainder of this piece.

Whose Guarantee Is It Anyway?

Generally, we rely on robust algorithms, transaction managers, and coordination services to maintain consistency and application correctness. The problem with these is twofold: they are often unreliable and they impose a massive performance bottleneck.

Distributed coordination algorithms are difficult to get right. Even tried-and-true protocols like two-phase commit are susceptible to crash failures and network partitions. Protocols which are more fault tolerant like Paxos and Raft generally don’t scale well beyond small clusters or across wide-area networks. Consensus systems like ZooKeeper own your availability, meaning if you depend on one and it goes down, you’re up a creek. Since quorums are often kept small for performance reasons, this might be less rare than you think.

Coordination systems become a fragile and complex piece of your infrastructure, which seems ironic considering they are usually employed to reduce fragility. Likewise, message-oriented middleware largely relies on coordination to provide developers with strong guarantees: exactly-once, ordered, transactional delivery, and the like.

From transmission protocols to enterprise message brokers, relying on delivery guarantees is an anti-pattern in distributed system design. Delivery semantics are a tricky business. As such, when it comes to distributed messaging, what you want is often not what you need. It’s important to look at the trade-offs involved, how they impact system design (and UX!), and how we can cope with them to make better decisions.

Subtle and not-so-subtle failure modes make providing strong guarantees exceedingly difficult. In fact, some guarantees, like exactly-once delivery, aren’t even really possible to achieve when we consider things like the Two Generals Problem and the FLP result. When we try to provide semantics like guaranteed, exactly-once, and ordered message delivery, we usually end up with something that’s over-engineered, difficult to deploy and operate, fragile, and slow. What is the upside to all of this? Something that makes your life easier as a developer when things go perfectly well, but the reality is things don’t go perfectly well most of the time. Instead, you end up getting paged at 1 a.m. trying to figure out why RabbitMQ told your monitoring everything is awesome while proceeding to take a dump in your front yard.

If you have something that relies on these types of guarantees in production, know that this will happen to you at least once sooner or later (and probably much more than that). Eventually, a guarantee is going to break down. It might be inconsequential, it might not. Not only is this a precarious way to go about designing things, but if you operate at a large scale, care about throughput, or have sensitive SLAs, it’s probably a nonstarter.

The performance implications of distributed transactions are obvious. Coordination is expensive because processes can’t make progress independently, which in turn limits throughput, availability, and scalability. Peter Bailis gave an excellent talk called Silence is Golden: Coordination-Avoiding Systems Design which explains in great detail why coordination is costly and how it can be avoided. In it, he shows how distributed transactions can result in nearly a 400x decrease in throughput in certain situations.

Avoiding coordination enables infinite scale-out while drastically improving throughput and availability, but in some cases coordination is unavoidable. In Coordination Avoidance in Database Systems, Bailis et al. answer a key question: when is coordination necessary for correctness? They present a property, invariant confluence (I-confluence), which is necessary and sufficient for safe, coordination-free, available, and convergent execution. I-confluence essentially works by pushing invariants up into the business layer where we specify correctness in terms of application semantics rather than low-level database operations.

Without knowledge of what “correctness” means to your app (e.g., the invariants used in I-confluence), the best you can do to preserve correctness under a read/write model is serializability.

I-confluence can be determined given a set of transactions and a merge function used to reconcile divergent states. If I-confluence holds, there exists a coordination-free execution strategy that preserves invariants. If it doesn’t hold, no such strategy exists—coordination is required. I-confluence allows us to identify when we can and can’t give up coordination, and by pushing invariants up, we remove a lot of potential bottlenecks from areas which don’t require it.
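
As a toy illustration of these ideas (not the paper’s formalism), consider a deposits-only account replicated as a grow-only counter, with element-wise max as the merge function:

    package main

    import "fmt"

    // state is a G-counter: one grow-only slot per replica, where each
    // replica only ever increments its own slot as deposits arrive.
    type state map[string]int

    // merge reconciles divergent replicas by element-wise max.
    func merge(a, b state) state {
        out := state{}
        for _, s := range []state{a, b} {
            for id, n := range s {
                if n > out[id] {
                    out[id] = n
                }
            }
        }
        return out
    }

    func balance(s state) int {
        total := 0
        for _, n := range s {
            total += n
        }
        return total
    }

    // invariant: the balance is never negative. Deposits only grow the
    // slots and merge takes maxes, so if the invariant holds on both
    // sides it holds on the merge. No coordination required.
    func invariant(s state) bool { return balance(s) >= 0 }

    func main() {
        a := state{"r1": 50}           // replica 1 saw its own $50 deposit
        b := state{"r1": 50, "r2": 20} // replica 2 saw both deposits
        m := merge(a, b)
        fmt.Println(balance(m), invariant(m)) // 70 true
    }

Because deposits commute and the merge is idempotent, replicas can accept writes independently and still converge without violating the invariant. Add withdrawals and the invariant is no longer I-confluent, so coordination becomes necessary.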

If we recall, “synchrony” within the context of distributed computing is really just making assumptions about time, so synchronization is basically two or more processes coordinating around time. As we saw, a system which performs no coordination will have optimal performance and availability since everyone can proceed independently. However, a distributed system which performs zero coordination isn’t particularly useful or even possible, as I-confluence shows. Christopher Meiklejohn’s Strange Loop talk, Distributed, Eventually Consistent Computations, provides an interesting take on coordination with the parable of the car. A car requires friction to drive, but that friction is limited to very small contact points. Any other friction on the car causes problems or inefficiencies. If we think about physical time as friction, we know we can’t eliminate it altogether because it’s essential to the problem, but we want to reduce our use of it in our systems as much as possible. We can typically avoid relying on physical time by instead using logical time, for example, with the use of Lamport clocks or other conflict-resolution techniques. Lamport’s Time, Clocks, and the Ordering of Events in a Distributed System is the classical introduction to this idea.
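
A Lamport clock itself is only a few lines; this sketch follows the rules from the paper (increment on local events, carry timestamps on messages, take the max on receipt):

    package main

    import "fmt"

    // LamportClock follows Lamport's rules: increment on a local event,
    // attach the time to outgoing messages, and on receipt advance to
    // max(local, received) + 1.
    type LamportClock struct{ t uint64 }

    // Tick records a local event and returns its timestamp.
    func (c *LamportClock) Tick() uint64 { c.t++; return c.t }

    // Send timestamps an outgoing message.
    func (c *LamportClock) Send() uint64 { return c.Tick() }

    // Recv merges a received timestamp into the local clock.
    func (c *LamportClock) Recv(remote uint64) uint64 {
        if remote > c.t {
            c.t = remote
        }
        return c.Tick()
    }

    func main() {
        var a, b LamportClock
        a.Tick()               // a's local event: time 1
        m := a.Send()          // a sends: message carries time 2
        fmt.Println(b.Recv(m)) // b receives: max(0, 2)+1 = 3
    }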

Often, systems simply forgo coordination altogether for latency-sensitive operations, a perfectly reasonable thing to do provided the trade-off is explicit and well-documented. Sadly, this is frequently not the case. But we can do better. I-confluence provides a useful framework for avoiding coordination, but there’s a seemingly larger lesson to be learned here. What it really advocates is reexamining how we design systems, which seems in some ways to closely parallel our end-to-end argument.

When we think low level, we pay the upfront cost of entry—serializable transactions, linearizable reads and writes, coordination. This seems contradictory to the end-to-end principle. Our application doesn’t really care about atomicity or isolation levels or linearizability. It cares about two users sharing the same ID or two reservations booking the same room or a negative balance in a bank account, but the database doesn’t know that. Sometimes these rules don’t even require any expensive coordination.

If all we do is code our business rules and constraints into the language our infrastructure understands, we end up with a few problems. First, we have to know how to translate our application semantics into these low-level operations while avoiding any impedance mismatch. In the context of messaging, guaranteed delivery doesn’t really mean anything to our application which cares about what’s done with the messages. Second, we preclude ourselves from using a lot of generalized solutions and, in some cases, we end up having to engineer specialized ones ourselves. It’s not clear how well this scales in practice. Third, we pay a performance penalty that could otherwise be avoided (as I-confluence shows). Lastly, we put ourselves at the mercy of our infrastructure and hope it makes good on its promises—it often doesn’t.

Working on a messaging platform team, I’ve had countless conversations which resemble the following exchange:

Developer: “We need fast messaging.”
Me: “Is it okay if messages get dropped occasionally?”
Developer: “What? Of course not! We need it to be reliable.”
Me: “Okay, we’ll add a delivery ack, but what happens if your application crashes before it processes the message?”
Developer: “We’ll ack after processing.”
Me: “What happens if you crash after processing but before acking?”
Developer: “We’ll just retry.”
Me: “So duplicate delivery is okay?”
Developer: “Well, it should really be exactly-once.”
Me: “But you want it to be fast?”
Developer: “Yep. Oh, and it should maintain message ordering.”
Me: “Here’s TCP.”

If, instead, we reevaluate the interactions between our systems, their APIs, their semantics, and move some of that responsibility off of our infrastructure and onto our applications, then maybe we can start to build more robust, resilient, and performant systems. With messaging, does our infrastructure really need to enforce FIFO ordering? Preserving order with distributed messaging in the presence of failure while trying to simultaneously maintain high availability is difficult and expensive. Why rely on it when it can be avoided with commutativity? Likewise, transactional delivery requires coordination which is slow and brittle while still not providing application guarantees. Why rely on it when it can be avoided with idempotence and retries? If you need application-level guarantees, build them into the application level. The infrastructure can’t provide it.
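
To make the idempotence-and-retries point concrete, here is a minimal sketch of a consumer that deduplicates by message ID. The message type and in-memory seen set are illustrative assumptions; a real consumer would persist them alongside its results:

    package main

    import "fmt"

    type message struct {
        ID   string // unique ID assigned by the producer
        Body string
    }

    // consumer applies each message at most once by remembering processed
    // IDs, so redelivery is harmless. At-least-once delivery plus retries
    // on the producer side then yields effectively-once processing without
    // any exactly-once delivery from the infrastructure.
    type consumer struct {
        seen map[string]bool
    }

    func (c *consumer) handle(m message) {
        if c.seen[m.ID] {
            return // duplicate delivery: already applied, ignore
        }
        c.seen[m.ID] = true
        fmt.Println("processed:", m.Body)
    }

    func main() {
        c := &consumer{seen: make(map[string]bool)}
        m := message{ID: "42", Body: "charge card"}
        c.handle(m)
        c.handle(m) // redelivered after a lost ack: no double charge
    }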

I really like Gregor Hohpe’s “Your Coffee Shop Doesn’t Use Two-Phase Commit” because it shows how simple solutions can be if we just model them off of the real world. It gives me hope we can design better systems, sometimes by just turning things on their head. There’s usually a reason things work the way they do, and it often doesn’t even involve the use of computers or complicated algorithms.

Rather than try to hide complexities by using flaky and heavy abstractions, we should engage directly by recognizing them in our design decisions and thinking end to end. It may be a long and winding path to distributed systems zen, but the best place to start is from the beginning.

I’d like to thank Tom Santero for reviewing an early draft of this writing. Any inaccuracies or opinions expressed are mine alone.

Infrastructure Engineering in the 21st Century

Infrastructure engineering is an inherently treacherous problem space because it’s core to so many things. Systems today are increasingly distributed and increasingly complex but are built on unreliable components and will continue to be. This includes unreliable networks and faulty hardware. The 21st century engineer understands failure is routine.

Naturally, application developers would rather not have to think about low-level failure modes so they can focus on solving the problem at hand. Infrastructure engineers are then tasked with competing goals: provide enough abstraction to make application development tractable and provide enough reliability to make subsystems useful. The second goal often comes with an additional proviso in that there must be sufficient reliability without sacrificing performance to the point of no longer being useful. Anyone who has worked on enterprise messaging systems can tell you that these goals are often contradictory. The result is a wall of sand intended to keep the developer’s feet dry from the incoming tide. The 21st century engineer understands that in order to play in the sand, we all need to be comfortable getting our feet a little wet from time to time.

With the deluge of technology becoming available today, it’s tempting to introduce it all into your stack. Likewise, engineers are never happy. Left unchecked, we will hyper optimize and iterate into oblivion. It’s a problem I call “innovating to a fault.” Relying on “it’s done when it’s done” is a great way to ship vaporware. Have tangible objectives, make them high-level, and realize things change and evolve over time. Frame the concrete things you’re doing today within the context of those objectives. There’s a difference between Agile micromanagey roadmaps and having a clear vision. Determine when to innovate and when not to. Not Invented Here syndrome can be a deadly disease. Take inventory of what’s being built, make sure it ties back to your objectives, and avoid falling prey to tech pop culture. Optimize for the right problems. The 21st century engineer understands that you are not defined by your tools, you are defined by what you produce at the end of the day.

The prevalence of microservice architecture has made production tooling and instrumentation more important than ever. Teams should take ownership of their systems. If you’re not willing to stand by your work, don’t ship it. However, just because something falls outside of your system’s boundaries doesn’t mean it’s not your problem. If you rely on it, own it. Don’t be afraid to roll up your sleeves and dive into someone else’s code. The 21st century engineer understands that they live and die by the code they have in production, and if they don’t run anything in production, they aren’t really an engineer at all.

The way in which we design systems today is different from the way we designed them in the 20th century and the way we will design them in the future. A vast amount of research has gone into computer science and related fields dating back to the invention of the modern computer. Research from the ’50s and ’60s all the way up to today shows that system design is always an evolving process. Compiling this body of knowledge together provides an invaluable foundation from which we can build. The 21st century engineer understands that without a deep understanding of that foundation, or with blind trust in it, we are only as good as our sand castle.

It’s our responsibility as software engineers, as system designers, as programmers to use this knowledge. Our job is not to build systems or write code; our job is to solve problems, of which code is often a byproduct. No one cares about the code you write; they care about the problems you solve. More specifically, they care about the business problems you solve. The 21st century engineer understands that if we’re not thinking about our solutions end to end, we’re not really doing our job.

Engage to Assuage

Abstraction is important. It’s how humans deal with complexity. You shouldn’t have to understand every little intricate detail behind how your system works. It would take years to do so. But abstraction comes at a cost. You agree to the abstraction’s interface, you place your trust in it, and then you remove it from your mind. That is, until it fails—and abstractions of sufficient complexity will fail. After all, we are building atop unreliable components. And a layer of abstraction doesn’t extend any guarantees to the levels above it, which often leads to false assumptions.

We cannot understand how everything will work, but we should have enough understanding of how it will fail. More plainly, we should understand the cost of the abstractions we use so that we can pay for them with confidence. This doesn’t mean giving up on abstraction; it means engaging with the complexity the abstraction manages.

I’ve written before about how distributed systems are a UX problem. They’re also a design problem. And a development problem. And an ops problem. And a business problem. The point is they are everyone’s problem because they are complex, and things that are sufficiently complex eventually leak. There is no airtight abstraction in this world. Without understanding limitations and trade-offs, without using the knowledge and research that has come before us, without thinking end to end, we set ourselves up for failure. If we’re going to call ourselves engineers, let’s start acting like it. Nothing is a black box to the 21st century engineer.

You Own Your Availability

There’s been a lot of discussion around “availability” lately. It’s often trumpeted with phrases like “you own your availability,” meaning there is no buck-passing when it comes to service uptime. The AWS outage earlier this week served as a stark reminder that, while owning your availability is a commendable ambition, for many it’s still largely owned by Amazon and the like.

In order to “own” your availability, it’s important to first understand what “availability” really means. Within the context of distributed-systems theory, availability is usually discussed in relation to the CAP theorem. Formally, CAP defines availability as a liveness property: “every request received by a non-failing node in the system must result in a response.” This is a weak definition for two reasons. First, the proviso “every request received by a non-failing node” means that a system in which all nodes have failed is trivially available. Second, Gilbert and Lynch stipulate no upper bound on latency, only that operations eventually return a response. This means an operation could take weeks to complete and availability would not be violated.

Martin Kleppmann points out these issues in his recent paper “A Critique of the CAP Theorem.” I don’t think there is necessarily a problem with the formalizations made by CAP, just a matter of engineering practicality. Kleppmann’s critique recalls a pertinent quote from Leslie Lamport on the topic of liveness:

Liveness properties are inherently problematic. The question of whether a real system satisfies a liveness property is meaningless; it can be answered only by observing the system for an infinite length of time, and real systems don’t run forever. Liveness is always an approximation to the property we really care about. We want a program to terminate within 100 years, but proving that it does would require the addition of distracting timing assumptions. So, we prove the weaker condition that the program eventually terminates. This doesn’t prove that the program will terminate within our lifetimes, but it does demonstrate the absence of infinite loops.

Despite the pop culture surrounding it, CAP is not meant to neatly classify systems. It’s meant to serve as a jumping-off point from which we can reason from the ground up about distributed systems and the inherent limitations associated with them. It’s a reality check.

Practically speaking, availability is typically described in terms of “uptime,” or the proportion of time during which requests are successfully served. Brewer refers to this as “yield,” the probability of completing a request. This is the metric normally measured in “nines,” such as “five-nines availability.”
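(To put “nines” in perspective: five-nines availability permits 0.001% unavailability, or about 5.26 minutes of downtime in a 525,600-minute year.)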

In the presence of faults there is typically a tradeoff between providing no answer (reducing yield) and providing an imperfect answer (maintaining yield, but reducing harvest).
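For concreteness, Fox and Brewer’s “Harvest, Yield, and Scalable Tolerant Systems” paper defines both quantities as simple ratios:

    yield   = queries completed / queries offered
    harvest = data reflected in the response / complete data

A degraded system can trade one for the other: keep answering every query using only the replicas it can reach (lower harvest), or refuse the queries it can’t answer completely (lower yield).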

However, this definition is only marginally more useful than CAP’s since it still doesn’t provide an upper bound on computation.

CAP is better used as a starting point for system design and for understanding trade-offs than as a tool for reasoning about availability, because it doesn’t really account for real availability. “Harvest” and “yield” show that availability is really a probabilistic property and that the trade-off with consistency is usually a gradient. But availability is much more nuanced than CAP’s “are we serving requests?” and harvest/yield’s “how many requests?” In practice, availability equates to SLAs. How many requests are we serving? At what rate? At what latency? At what percentiles? These things can’t really be formalized into a theorem like CAP because they are empirically observed, not properties of an algorithm.

Availability is specified by an SLA but observed by outside users. Unlike consistency, which is a property of the system maintained by algorithmic invariants, availability is determined by the client. For example, if one user’s requests are served while another user’s are not, the system is completely available to the first user and completely unavailable to the second.

To truly own your availability, you have to own every piece of infrastructure from the client to you, in addition to the infrastructure your system uses. Therefore, you can’t own your availability any more than you can own Comcast’s fiber or Verizon’s 4G network. This is obviously impractical, if not impossible, but it might also be taking “own your availability” a bit too literally.

What “you own your availability” actually means is “you own your decisions.” Plain and simple. You own the decision to use AWS. You own the decision to use DynamoDB. You own the decision to not use multiple vendors. Owning your availability means making informed decisions about technology and vendors. “What is the risk/reward for using this database?” “Does using a PaaS/IaaS incur vendor lock-in? What happens when that service goes down?” It also means making informed decisions about the business. “What is the cost of our providers not meeting their SLAs? Is it cost-effective to have redundant providers?”

An SLA is not an insurance policy or a hedge against the business impact of an outage; it’s merely a refund policy. Use SLAs to set expectations and make intelligent decisions, but don’t bank the business on them. Availability is not a timeshare, and you can’t just pawn it off, any more than you can redirect your tech support to Amazon or Google.

It’s impossible to own your availability because there are too many things left to probability, too many unknowns, and too many variables outside of our control. Own as much as you can predict, as much as you can control, and as much as you can afford. The rest comes down to making informed decisions, hoping for the best, and planning for the worst.

Designed to Fail

When it comes to reliability engineering, people often talk about things like fault injection, monitoring, and operations runbooks. These are all critical pieces for building systems which can withstand failure, but what’s less talked about is the need to design systems which deliberately fail.

Reliability design has a natural progression which closely follows that of architectural design. With monolithic systems, we care more about preventing failure from occurring. With service-oriented architectures, controlling failure becomes less manageable, so instead we learn to anticipate it. With highly distributed microservice architectures where failure is all but guaranteed, we embrace it.

What does it mean to embrace failure? Anticipating failure is understanding the behavior when things go wrong, building systems to be resilient to it, and having a game plan for when it happens, either manual or automated. Embracing failure means making a conscious decision to purposely fail, and it’s essential for building highly available large-scale systems.

A microservice architecture typically means a complex web of service dependencies. One of SOA’s goals is to isolate failure and allow for graceful degradation. The key to being highly available is learning to be partially available. Frequently, one of the requirements for partial availability is telling the client “no.” Outright rejecting service requests is often better than allowing them to back up because, when dealing with distributed services, the latter usually results in cascading failure across dependent systems.

While designing our distributed messaging service at Workiva, we made explicit decisions to drop messages on the floor if we detect the system is becoming overloaded. As queues back up, incoming messages are discarded, a statsd counter is incremented, and a backpressure notification is sent to the client. Upon receiving this notification, the client can respond accordingly by failing fast, exponentially backing off, or using some other flow-control strategy. By bounding resource utilization, we maintain predictable performance and predictable (and measurable) lossiness, and we head off cascading failure.
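As a rough sketch of this pattern in Go (not the actual implementation; the queue type and drop counter are illustrative stand-ins for a real queue and a statsd client), a bounded channel lets the enqueue path shed load instead of buffering without bound:

    package main

    import "fmt"

    type Message struct{ Body string }

    // boundedQueue enforces an explicit limit on queued messages.
    type boundedQueue struct {
        ch      chan Message
        dropped int // stand-in for a statsd counter
    }

    func newBoundedQueue(capacity int) *boundedQueue {
        return &boundedQueue{ch: make(chan Message, capacity)}
    }

    // offer attempts to enqueue without blocking. If the queue is full, the
    // message is dropped and the caller gets a backpressure signal (false),
    // so it can fail fast, back off exponentially, or apply other flow
    // control.
    func (q *boundedQueue) offer(m Message) bool {
        select {
        case q.ch <- m:
            return true
        default:
            q.dropped++ // count the loss so it stays measurable
            return false
        }
    }

    func main() {
        q := newBoundedQueue(1)
        fmt.Println(q.offer(Message{"first"}))  // true: accepted
        fmt.Println(q.offer(Message{"second"})) // false: full, backpressure
    }

The essential design choice is the non-blocking enqueue: a full queue immediately signals backpressure rather than quietly buffering until memory runs out.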

Other techniques include building kill switches into service calls and routers. If an overloaded service is not essential to core business, we fail fast on calls to it to prevent availability or latency problems upstream. For example, a spam-detection service is not essential to an email system, so if it’s unavailable or overwhelmed, we can simply bypass it. Netflix’s Hystrix has a set of really nice patterns for handling this.
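Here is a minimal kill-switch sketch in Go, loosely in the spirit of Hystrix’s circuit breaker but not its actual API; the consecutive-failure threshold is an illustrative assumption.

    package main

    import (
        "errors"
        "fmt"
    )

    var errOpen = errors.New("circuit open: failing fast")

    // breaker trips after a number of consecutive failures, after which
    // calls fail fast instead of piling onto an overloaded service.
    type breaker struct {
        failures  int
        threshold int
    }

    func (b *breaker) call(fn func() error) error {
        if b.failures >= b.threshold {
            return errOpen // fail fast; let the caller degrade gracefully
        }
        if err := fn(); err != nil {
            b.failures++
            return err
        }
        b.failures = 0 // a success resets the count
        return nil
    }

    func main() {
        b := &breaker{threshold: 2}
        spamCheck := func() error { return errors.New("service overloaded") }
        for i := 0; i < 4; i++ {
            // After two consecutive failures, the breaker trips and the
            // email system can simply bypass spam detection.
            fmt.Println(b.call(spamCheck))
        }
    }

A production breaker would also “half-open” after a cooldown period to probe whether the downstream service has recovered.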

If we’re not careful, we can often be our own worst enemy. Many times, it’s our own internal services which cause the biggest DoS attacks on ourselves. By isolating and controlling failure, we can keep it from becoming widespread and unpredictable. By building in backpressure mechanisms and other types of intentional “failure” modes, we can ensure better availability and reliability for our systems through graceful degradation. Sometimes it’s better to fight fire with fire and failure with failure.

Service-Disoriented Architecture

“You can have a second computer once you’ve shown you know how to use the first one.” -Paul Barham

The first rule of distributed systems is don’t distribute your system until you have an observable reason to. Teams break this rule on the regular. People have been talking about service-oriented architecture for a long time, but only recently have microservices been receiving the hype.

The problem, as Martin Fowler observes, is that teams are becoming too eager to adopt a microservice architecture without first understanding the inherent overheads. A contributing factor, I think, is you only hear the success stories from companies who did it right, like Netflix. However, what folks often fail to realize is that these companies—in almost all cases—didn’t start out that way. There was a long and winding path which led them to where they are today. The inverse of this, which some refer to as microservice envy, is causing teams to rush into microservice hell. I call this service-disoriented architecture (or sometimes disservice-oriented architecture when the architecture is DOA).

The term “monolith” has a very negative connotation—unscalable, unmaintainable, unresilient. These qualities are not intrinsic to a single codebase, however, and there’s no reason a single system can’t be modular, maintainable, and fault tolerant at reasonable scale. It’s just less sexy. Refactoring modular code is much easier than refactoring architecture, and refactoring across service boundaries is harder still. Fowler describes this as monolith-first, and I think it’s the right approach (with some exceptions, of course).

As Fowler puts it: “Don’t even consider microservices unless you have a system that’s too complex to manage as a monolith. The majority of software systems should be built as a single monolithic application. Do pay attention to good modularity within that monolith, but don’t try to separate it into separate services.”

Service-oriented architecture is about organizational complexity and system complexity. If you have both, you have a case to distribute. If you have one of the two, you might have a case (although if you have organizational complexity without system complexity, you’ve probably scaled your organization improperly). If you have neither, you do not have a case to distribute. State, specifically distributed state, is hell, and some pundits argue SOA is satan—perhaps a necessary evil.

There are a lot of motivations for microservices: anti-fragility, fault tolerance, independent deployment and scaling, architectural abstraction, and technology isolation. When services are loosely coupled, the system as a whole tends to be less fragile. When instances are disposable and stateless, services tend to be more fault tolerant because we can spin them up and down, balance traffic, and failover. When responsibility is divided across domain boundaries, services can be independently developed, deployed, and scaled while allowing the right tools to be used for each.

We also need to acknowledge the disadvantages. Adopting a microservice architecture does not automatically buy you anti-fragility. Distributed systems are incredibly precarious. We have to be aware of things like asynchrony, network partitions, node failures, and the trade-off between availability and data consistency. We have to think about resiliency but also the business and UX implications. We have to consider the boundaries of distributed systems like CAP and exactly-once delivery.

When distributing, the emphasis should be on resilience engineering and adopting loosely coupled, stateless components—not microservices for microservices’ sake. We need to view eventual consistency as a tool, not a side effect. The problem I see is that teams often end up with what is essentially a complex, distributed monolith. Now you have two problems. If you’re building a microservice which doesn’t make sense outside the context of another system or isn’t useful on its own, stop and re-evaluate. If you’re designing something to be fast and correct, realize that distributing it will frequently take away both.

Like anti-fragility, microservices do not automatically buy you better maintainability or even scalability. Adopting them requires the proper infrastructure and organization to be in place. Without these, you are bound to fail. In theory, they are intended to increase development velocity, but in many cases the microservice premium ends up slowing it down while creating organizational dependencies and bottlenecks.

There are some key things which must be in place in order for a microservice architecture to be successful: a proper continuous-delivery pipeline, competent DevOps and Ops teams, and prudent service boundaries, to name a few. Good monitoring is essential. It’s also important we have a thorough testing and integration story. This isn’t even considering the fundamental development complexities associated with SOA mentioned earlier.

The better strategy is a bottom-up approach. Start with a monolith or small set of coarse-grained services and work your way up. Make sure you have the data model right. Break out new, finer-grained services as you need to and as you become more confident in your ability to maintain and deploy discrete services. It’s largely about organizational momentum. A young company jumping straight to a microservice architecture is like a golf cart getting on the freeway.

Microservices offer a number of advantages, but for many companies they are a bit of a Holy Grail. Developers are always looking for a silver bullet, but there is always a cost. What we need to do is minimize this cost, and with microservices, this typically means easing our way in rather than diving into the deep end. Team autonomy and rapid iteration are noble goals, but if we’re not careful, we can end up impeding both. Microservices require organizational and system maturity. Otherwise, they end up being a premature architectural optimization with a lot of baggage. They end up creating a service-disoriented architecture.