From the Ground Up: Reasoning About Distributed Systems in the Real World

The rabbit hole is deep. Down and down it goes. Where it ends, nobody knows. But as we traverse it, patterns appear. They give us hope, they quell the fear.

Distributed systems literature is abundant, but as a practitioner, I often find it difficult to know where to start or how to synthesize this knowledge without a more formal background. This is a non-academic’s attempt to provide a line of thought for rationalizing design decisions. This piece doesn’t necessarily contribute any new ideas but rather tries to provide a holistic framework by studying some influential existing ones. It includes references which provide a good starting point for thinking about distributed systems. Specifically, we look at a few formal results and slightly less formal design principles to provide a basis from which we can argue about system design.

This is your last chance. After this, there is no turning back. I wish I could say there is no red-pill/blue-pill scenario at play here, but the world of distributed systems is complex. In order to make sense of it, we reason from the ground up while simultaneously stumbling down the deep and cavernous rabbit hole.

Guiding Principles

In order to reason about distributed system design, it’s important to lay out some guiding principles or theorems used to establish an argument. Perhaps the most fundamental of which is the Two Generals Problem originally introduced by Akkoyunlu et al. in Some Constraints and Trade-offs in the Design of Network Communications and popularized by Jim Gray in Notes on Data Base Operating Systems in 1975 and 1978, respectively. The Two Generals Problem demonstrates that it’s impossible for two processes to agree on a decision over an unreliable network. It’s closely related to the binary consensus problem (“attack” or “don’t attack”) where the following conditions must hold:

  • Termination: all correct processes decide some value (liveness property).
  • Validity: if all correct processes decide v, then v must have been proposed by some correct process (non-triviality property).
  • Integrity: all correct processes decide at most one value v, and is the “right” value (safety property).
  • Agreement: all correct processes must agree on the same value (safety property).

It becomes quickly apparent that any useful distributed algorithm consists of some intersection of both liveness and safety properties. The problem becomes more complicated when we consider an asynchronous network with crash failures:

  • Asynchronous: messages may be delayed arbitrarily long but will eventually be delivered.
  • Crash failure: processes can halt indefinitely.

Considering this environment actually leads us to what is arguably one of the most important results in distributed systems theory: the FLP impossibility result introduced by Fischer, Lynch, and Patterson in their 1985 paper Impossibility of Distributed Consensus with One Faulty Process. This result shows that the Two Generals Problem is provably impossible. When we do not consider an upper bound on the time a process takes to complete its work and respond in a crash-failure model, it’s impossible to make the distinction between a process that is crashed and one that is taking a long time to respond. FLP shows there is no algorithm which deterministically solves the consensus problem in an asynchronous environment when it’s possible for at least one process to crash. Equivalently, we say it’s impossible to have a perfect failure detector in an asynchronous system with crash failures.

When talking about fault-tolerant systems, it’s also important to consider Byzantine faults, which are essentially arbitrary faults. These include, but are not limited to, attacks which might try to subvert the system. For example, a security attack might try to generate or falsify messages. The Byzantine Generals Problem is a generalized version of the Two Generals Problem which describes this fault model. Byzantine fault tolerance attempts to protect against these threats by detecting or masking a bounded number of Byzantine faults.

Why do we care about consensus? The reason is it’s central to so many important problems in system design. Leader election implements consensus allowing you to dynamically promote a coordinator to avoid single points of failure. Distributed databases implement consensus to ensure data consistency across nodes. Message queues implement consensus to provide transactional or ordered delivery. Distributed init systems implement consensus to coordinate processes. Consensus is fundamentally an important problem in distributed programming.

It has been shown time and time again that networks, whether local-area or wide-area, are often unreliable and largely asynchronous. As a result, these proofs impose real and significant challenges to system design.

The implications of these results are not simply academic: these impossibility results have motivated a proliferation of systems and designs offering a range of alternative guarantees in the event of network failures.

L. Peter Deutsch’s fallacies of distributed computing are a key jumping-off point in the theory of distributed systems. It presents a set of incorrect assumptions which many new to the space frequently make, of which the first is “the network is reliable.”

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

The CAP theorem, while recently the subject of scrutiny and debate over whether it’s overstated or not, is a useful tool for establishing fundamental trade-offs in distributed systems and detecting vendor sleight of hand. Gilbert and Lynch’s Perspectives on the CAP Theorem lays out the intrinsic trade-off between safety and liveness in a fault-prone system, while Fox and Brewer’s Harvest, Yield, and Scalable Tolerant Systems characterizes it in a more pragmatic light. I will continue to say unequivocally that the CAP theorem is important within the field of distributed systems and of significance to system designers and practitioners.

A Renewed Hope

Following from the results detailed earlier would imply many distributed algorithms, including those which implement linearizable operations, serializable transactions, and leader election, are a hopeless endeavor. Is it game over? Fortunately, no. Carefully designed distributed systems can maintain correctness without relying on pure coincidence.

First, it’s important to point out that the FLP result does not indicate consensus is unreachable, just that it’s not always reachable in bounded time. Second, the system model FLP uses is, in some ways, a pathological one. Synchronous systems place a known upper bound on message delivery between processes and on process computation. Asynchronous systems have no fixed upper bounds. In practice, systems tend to exhibit partial synchrony, which is described as one of two models by Dwork and Lynch in Consensus in the Presence of Partial Synchrony. In the first model of partial synchrony, fixed bounds exist but they are not known a priori. In the second model, the bounds are known but are only guaranteed to hold starting at unknown time T. Dwork and Lynch present fault-tolerant consensus protocols for both partial-synchrony models combined with various fault models.

Chandra and Toueg introduce the concept of unreliable failure detectors in Unreliable Failure Detectors for Reliable Distributed Systems. Each process has a local, external failure detector which can make mistakes. The detector monitors a subset of the processes in the system and maintains a list of those it suspects to have crashed. Failures are detected by simply pinging each process periodically and suspecting any process which doesn’t respond to the ping within twice the maximum round-trip time for any previous ping. The detector makes a mistake when it erroneously suspects a correct process, but it may later correct the mistake by removing the process from its list of suspects. The presence of failure detectors, even unreliable ones, makes consensus solvable in a slightly relaxed system model.

While consensus ensures processes agree on a value, atomic broadcast ensures processes deliver the same messages in the same order. This same paper shows that the problems of consensus and atomic broadcast are reducible to each other, meaning they are equivalent. Thus, the FLP result and others apply equally to atomic broadcast, which is used in coordination services like Apache ZooKeeper.

In Introduction to Reliable and Secure Distributed Programming, Cachin, Guerraoui, and Rodrigues suggest most practical systems can be described as partially synchronous:

Generally, distributed systems appear to be synchronous. More precisely, for most systems that we know of, it is relatively easy to define physical time bounds that are respected most of the time. There are, however, periods where the timing assumptions do not hold, i.e., periods during which the system is asynchronous. These are periods where the network is overloaded, for instance, or some process has a shortage of memory that slows it down. Typically, the buffer that a process uses to store incoming and outgoing messages may overflow, and messages may thus get lost, violating the time bound on the delivery. The retransmission of the messages may help ensure the reliability of the communication links but introduce unpredictable delays. In this sense, practical systems are partially synchronous.

We capture partial synchrony by assuming timing assumptions only hold eventually without stating exactly when. Similarly, we call the system eventually synchronous. However, this does not guarantee the system is synchronous forever after a certain time, nor does it require the system to be initially asynchronous then after a period of time become synchronous. Instead it implies the system has periods of asynchrony which are not bounded, but there are periods where the system is synchronous long enough for an algorithm to do something useful or terminate. The key thing to remember with asynchronous systems is that they contain no timing assumptions.

Lastly, On the Minimal Synchronism Needed for Distributed Consensus by Dolev, Dwork, and Stockmeyer describes a consensus protocol as t-resilient if it operates correctly when at most t processes fail. In the paper, several critical system parameters and synchronicity conditions are identified, and it’s shown how varying them affects the t-resiliency of an algorithm. Consensus is shown to be provably possible for some models and impossible for others.

Fault-tolerant consensus is made possible by relying on quorums. The intuition is that as long as a majority of processes agree on every decision, there is at least one process which knows about the complete history in the presence of faults.

Deterministic consensus, and by extension a number of other useful algorithms, is impossible in certain system models, but we can model most real-world systems in a way that circumvents this. Nevertheless, it shows the inherent complexities involved with distributed systems and the rigor needed to solve certain problems.

Theory to Practice

What does all of this mean for us in practice? For starters, it means distributed systems are usually a harder problem than they let on. Unfortunately, this is often the cause of improperly documented trade-offs or, in many cases, data loss and safety violations. It also suggests we need to rethink the way we design systems by shifting the focus from system properties and guarantees to business rules and application invariants.

One of my favorite papers is End-To-End Arguments in System Design by Saltzer, Reed, and Clark. It’s an easy read, but it presents a compelling design principle for determining where to place functionality in a distributed system. The principle idea behind the end-to-end argument is that functions placed at a low level in a system may be redundant or of little value when compared to the cost of providing them at that low level. It follows that, in many situations, it makes more sense to flip guarantees “inside out”—pushing them outwards rather than relying on subsystems, middleware, or low-level layers of the stack to maintain them.

To illustrate this, we consider the problem of “careful file transfer.” A file is stored by a file system on the disk of computer A, which is linked by a communication network to computer B. The goal is to move the file from computer A’s storage to computer B’s storage without damage and in the face of various failures along the way. The application in this case is the file-transfer program which relies on storage and network abstractions. We can enumerate just a few of the potential problems an application designer might be concerned with:

  1. The file, though originally written correctly onto the disk at host A, if read now may contain incorrect data, perhaps because of hardware faults in the disk storage system.
  2. The software of the file system, the file transfer program, or the data communication system might make a mistake in buffering and copying the data of the file, either at host A or host B.
  3. The hardware processor or its local memory might have a transient error while doing the buffering and copying, either at host A or host B.
  4. The communication system might drop or change the bits in a packet, or lose a packet or deliver a packet more than once.
  5. Either of the hosts may crash part way through the transaction after performing an unknown amount (perhaps all) of the transaction.

Many of these problems are Byzantine in nature. When we consider each threat one by one, it becomes abundantly clear that even if we place countermeasures in the low-level subsystems, there will still be checks required in the high-level application. For example, we might place checksums, retries, and sequencing of packets in the communication system to provide reliable data transmission, but this really only eliminates threat four. An end-to-end checksum and retry mechanism at the file-transfer level is needed to guard against the remaining threats.

Building reliability into the low level has a number of costs involved. It takes a non-trivial amount of effort to build it. It’s redundant and, in fact, hinders performance by reducing the frequency of application retries and adding unneeded overhead. It also has no actual effect on correctness because correctness is determined and enforced by the end-to-end checksum and retries. The reliability and correctness of the communication system is of little importance, so going out of its way to ensure resiliency does not reduce any burden on the application. In fact, ensuring correctness by relying on the low level might be altogether impossible since threat number two requires writing correct programs, but not all programs involved may be written by the file-transfer application programmer.

Fundamentally, there are two problems with placing functionality at the lower level. First, the lower level is not aware of the application needs or semantics, which means logic placed there is often insufficient. This leads to duplication of logic as seen in the example earlier. Second, other applications which rely on the lower level pay the cost of the added functionality even when they don’t necessarily need it.

Saltzer, Reed, and Clark propose the end-to-end principle as a sort of “Occam’s razor” for system design, arguing that it helps guide the placement of functionality and organization of layers in a system.

Because the communication subsystem is frequently specified before applications that use the subsystem are known, the designer may be tempted to “help” the users by taking on more function than necessary. Awareness of end-to end arguments can help to reduce such temptations.

However, it’s important to note that the end-to-end principle is not a panacea. Rather, it’s a guideline to help get designers to think about their solutions end to end, acknowledge their application requirements, and consider their failure modes. Ultimately, it provides a rationale for moving function upward in a layered system, closer to the application that uses the function, but there are always exceptions to the rule. Low-level mechanisms might be built as a performance optimization. Regardless, the end-to-end argument contends that lower levels should avoid taking on any more responsibility than necessary. The “lessons” section from Google’s Bigtable paper echoes some of these same sentiments:

Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used. For example, we initially planned to support general-purpose transactions in our API. Because we did not have an immediate use for them, however, we did not implement them. Now that we have many real applications running on Bigtable, we have been able to examine their actual needs, and have discovered that most applications require only single-row transactions. Where people have requested distributed transactions, the most important use is for maintaining secondary indices, and we plan to add a specialized mechanism to satisfy this need. The new mechanism will be less general than distributed transactions, but will be more efficient (especially for updates that span hundreds of rows or more) and will also interact better with our scheme for optimistic cross-datacenter replication.

We’ll see the end-to-end argument as a common theme throughout the remainder of this piece.

Whose Guarantee Is It Anyway?

Generally, we rely on robust algorithms, transaction managers, and coordination services to maintain consistency and application correctness. The problem with these is twofold: they are often unreliable and they impose a massive performance bottleneck.

Distributed coordination algorithms are difficult to get right. Even tried-and-true protocols like two-phase commit are susceptible to crash failures and network partitions. Protocols which are more fault tolerant like Paxos and Raft generally don’t scale well beyond small clusters or across wide-area networks. Consensus systems like ZooKeeper own your availability, meaning if you depend on one and it goes down, you’re up a creek. Since quorums are often kept small for performance reasons, this might be less rare than you think.

Coordination systems become a fragile and complex piece of your infrastructure, which seems ironic considering they are usually employed to reduce fragility. On the other hand, message-oriented middleware largely use coordination to provide developers with strong guarantees: exactly-once, ordered, transactional delivery and the like.

From transmission protocols to enterprise message brokers, relying on delivery guarantees is an anti-pattern in distributed system design. Delivery semantics are a tricky business. As such, when it comes to distributed messaging, what you want is often not what you need. It’s important to look at the trade-offs involved, how they impact system design (and UX!), and how we can cope with them to make better decisions.

Subtle and not-so-subtle failure modes make providing strong guarantees exceedingly difficult. In fact, some guarantees, like exactly-once delivery, aren’t even really possible to achieve when we consider things like the Two Generals Problem and the FLP result. When we try to provide semantics like guaranteed, exactly-once, and ordered message delivery, we usually end up with something that’s over-engineered, difficult to deploy and operate, fragile, and slow. What is the upside to all of this? Something that makes your life easier as a developer when things go perfectly well, but the reality is things don’t go perfectly well most of the time. Instead, you end up getting paged at 1 a.m. trying to figure out why RabbitMQ told your monitoring everything is awesome while proceeding to take a dump in your front yard.

If you have something that relies on these types of guarantees in production, know that this will happen to you at least once sooner or later (and probably much more than that). Eventually, a guarantee is going to break down. It might be inconsequential, it might not. Not only is this a precarious way to go about designing things, but if you operate at a large scale, care about throughput, or have sensitive SLAs, it’s probably a nonstarter.

The performance implications of distributed transactions are obvious. Coordination is expensive because processes can’t make progress independently, which in turn limits throughput, availability, and scalability. Peter Bailis gave an excellent talk called Silence is Golden: Coordination-Avoiding Systems Design which explains this in great detail and how coordination can be avoided. In it, he explains how distributed transactions can result in nearly a 400x decrease in throughput in certain situations.

Avoiding coordination enables infinite scale-out while drastically improving throughput and availability, but in some cases coordination is unavoidable. In Coordination Avoidance in Database Systems, Bailis et al. answer a key question: when is coordination necessary for correctness? They present a property, invariant confluence (I-confluence), which is necessary and sufficient for safe, coordination-free, available, and convergent execution. I-confluence essentially works by pushing invariants up into the business layer where we specify correctness in terms of application semantics rather than low-level database operations.

Without knowledge of what “correctness” means to your app (e.g., the invariants used in I-confluence), the best you can do to preserve correctness under a read/write model is serializability.

I-confluence can be determined given a set of transactions and a merge function used to reconcile divergent states. If I-confluence holds, there exists a coordination-free execution strategy that preserves invariants. If it doesn’t hold, no such strategy exists—coordination is required. I-confluence allows us to identify when we can and can’t give up coordination, and by pushing invariants up, we remove a lot of potential bottlenecks from areas which don’t require it.

If we recall, “synchrony” within the context of distributed computing is really just making assumptions about time, so synchronization is basically two or more processes coordinating around time. As we saw, a system which performs no coordination will have optimal performance and availability since everyone can proceed independently. However, a distributed system which performs zero coordination isn’t particularly useful or possible as I-confluence shows. Christopher Meiklejohn’s Strange Loop talk, Distributed, Eventually Consistent Computations, provides an interesting take on coordination with the parable of the car. A car requires friction to drive, but that friction is limited to very small contact points. Any other friction on the car causes problems or inefficiencies. If we think about physical time as friction, we know we can’t eliminate it altogether because it’s essential to the problem, but we want to reduce the use of it in our systems as much as possible. We can typically avoid relying on physical time by instead using logical time, for example, with the use of Lamport clocks or other conflict-resolution techniques. Lamport’s Time, Clocks, and the Ordering of Events in a Distributed System is the classical introduction to this idea.

Often, systems simply forgo coordination altogether for latency-sensitive operations, a perfectly reasonable thing to do provided the trade-off is explicit and well-documented. Sadly, this is frequently not the case. But we can do better. I-confluence provides a useful framework for avoiding coordination, but there’s a seemingly larger lesson to be learned here. What it really advocates is reexamining how we design systems, which seems in some ways to closely parallel our end-to-end argument.

When we think low level, we pay the upfront cost of entry—serializable transactions, linearizable reads and writes, coordination. This seems contradictory to the end-to-end principle. Our application doesn’t really care about atomicity or isolation levels or linearizability. It cares about two users sharing the same ID or two reservations booking the same room or a negative balance in a bank account, but the database doesn’t know that. Sometimes these rules don’t even require any expensive coordination.

If all we do is code our business rules and constraints into the language our infrastructure understands, we end up with a few problems. First, we have to know how to translate our application semantics into these low-level operations while avoiding any impedance mismatch. In the context of messaging, guaranteed delivery doesn’t really mean anything to our application which cares about what’s done with the messages. Second, we preclude ourselves from using a lot of generalized solutions and, in some cases, we end up having to engineer specialized ones ourselves. It’s not clear how well this scales in practice. Third, we pay a performance penalty that could otherwise be avoided (as I-confluence shows). Lastly, we put ourselves at the mercy of our infrastructure and hope it makes good on its promises—it often doesn’t.

Working on a messaging platform team, I’ve had countless conversations which resemble the following exchange:

Developer: “We need fast messaging.”
Me: “Is it okay if messages get dropped occasionally?”
Developer: “What? Of course not! We need it to be reliable.”
Me: “Okay, we’ll add a delivery ack, but what happens if your application crashes before it processes the message?”
Developer: “We’ll ack after processing.”
Me: “What happens if you crash after processing but before acking?”
Developer: “We’ll just retry.”
Me: “So duplicate delivery is okay?”
Developer: “Well, it should really be exactly-once.”
Me: “But you want it to be fast?”
Developer: “Yep. Oh, and it should maintain message ordering.”
Me: “Here’s TCP.”

If, instead, we reevaluate the interactions between our systems, their APIs, their semantics, and move some of that responsibility off of our infrastructure and onto our applications, then maybe we can start to build more robust, resilient, and performant systems. With messaging, does our infrastructure really need to enforce FIFO ordering? Preserving order with distributed messaging in the presence of failure while trying to simultaneously maintain high availability is difficult and expensive. Why rely on it when it can be avoided with commutativity? Likewise, transactional delivery requires coordination which is slow and brittle while still not providing application guarantees. Why rely on it when it can be avoided with idempotence and retries? If you need application-level guarantees, build them into the application level. The infrastructure can’t provide it.

I really like Gregor Hohpe’s “Your Coffee Shop Doesn’t Use Two-Phase Commit” because it shows how simple solutions can be if we just model them off of the real world. It gives me hope we can design better systems, sometimes by just turning things on their head. There’s usually a reason things work the way they do, and it often doesn’t even involve the use of computers or complicated algorithms.

Rather than try to hide complexities by using flaky and heavy abstractions, we should engage directly by recognizing them in our design decisions and thinking end to end. It may be a long and winding path to distributed systems zen, but the best place to start is from the beginning.

I’d like to thank Tom Santero for reviewing an early draft of this writing. Any inaccuracies or opinions expressed are mine alone.

Breaking and Entering: Lose the Lock While Embracing Concurrency

This article originally appeared on Workiva’s engineering blog as a two-part series.

Providing robust message routing was a priority for us at Workiva when building our distributed messaging infrastructure. This encompassed directed messaging, which allows us to route messages to specific endpoints based on service or client identifiers, but also topic fan-out with support for wildcards and pattern matching.

Existing message-oriented middleware, such as RabbitMQ, provide varying levels of support for these but don’t offer the rich features needed to power Wdesk. This includes transport fallback with graceful degradation, tunable qualities of service, support for client-side messaging, and pluggable authentication middleware. As such, we set out to build a new system, not by reinventing the wheel, but by repurposing it.

Eventually, we settled on Apache Kafka as our wheel, or perhaps more accurately, our log. Kafka demonstrates a telling story of speed, scalability, and fault tolerance—each a requisite component of any reliable messaging system—but it’s only half the story. Pub/sub is a critical messaging pattern for us and underpins a wide range of use cases, but Kafka’s topic model isn’t designed for this purpose. One of the key engineering challenges we faced was building a practical routing mechanism by which messages are matched to interested subscribers. On the surface, this problem appears fairly trivial and is far from novel, but it becomes quite interesting as we dig deeper.

Back to Basics

Topic routing works by matching a published message with interested subscribers. A consumer might subscribe to the topic “foo.bar.baz,” in which any message published to this topic would be delivered to them. We also must support * and # wildcards, which match exactly one word and zero or more words, respectively. In this sense, we follow the AMQP spec:

The routing key used for a topic exchange MUST consist of zero or more words delimited by dots. Each word may contain the letters A–Z and a–z and digits 0–9. The routing pattern follows the same rules as the routing key with the addition that * matches a single word, and # matches zero or more words. Thus the routing pattern *.stock.# matches the routing keys usd.stock and eur.stock.db but not stock.nasdaq.

This problem can be modeled using a trie structure. RabbitMQ went with this approach after exploring other options, like caching topics and indexing the patterns or using a deterministic finite automaton. The latter options have greater time and space complexities. The former requires backtracking the tree for wildcard lookups.

The subscription trie looks something like this:

subscription_trie

Even in spite of the backtracking required for wildcards, the trie ends up being a more performant solution due to its logarithmic complexity and tendency to fit CPU cache lines. Most tries have hot paths, particularly closer to the root, so caching becomes indispensable. The trie approach is also vastly easier to implement.

In almost all cases, this subscription trie needs to be thread-safe as clients are concurrently subscribing, unsubscribing, and publishing. We could serialize access to it with a reader-writer lock. For some, this would be the end of the story, but for high-throughput systems, locking is a major bottleneck. We can do better.

Breaking the Lock

We considered lock-free techniques that could be applied. Lock-free concurrency means that while a particular thread of execution may be blocked, all CPUs are able to continue processing other work. For example, imagine a program that protects access to some resource using a mutex. If a thread acquires this mutex and is subsequently preempted, no other thread can proceed until this thread is rescheduled by the OS. If the scheduler is adversarial, it may never resume execution of the thread, and the program would be effectively deadlocked. A key point, however, is that the mere lack of a lock does not guarantee a program is lock-free. In this context, “lock” really refers to deadlock, livelock, or the misdeeds of a malevolent scheduler.

In practice, what lock-free concurrency buys us is increased system throughput at the expense of increased tail latencies. Looking at a transactional system, lock-freedom allows us to process many concurrent transactions, any of which may block, while guaranteeing systemwide progress. Depending on the access patterns, when a transaction does block, there are always other transactions which can be processed—a CPU never idles. For high-throughput databases, this is essential.

Concurrent Timelines and Linearizability

Lock-freedom can be achieved using a number of techniques, but it ultimately reduces to a small handful of fundamental patterns. In order to fully comprehend these patterns, it’s important to grasp the concept of linearizability.

It takes approximately 100 nanoseconds for data to move from the CPU to main memory. This means that the laws of physics govern the unavoidable discrepancy between when you perceive an operation to have occurred and when it actually occurred. There is the time from when an operation is invoked to when some state change physically occurs (call it tinv), and there is the time from when that state change occurs to when we actually observe the operation as completed (call it tcom). Operations are not instantaneous, which means the wall-clock history of operations is uncertain. tinv and tcom vary for every operation. This is more easily visualized using a timeline diagram like the one below:

timeline

This timeline shows several reads and writes happening concurrently on some state. Physical time moves from left to right. This illustrates that even if a write is invoked before another concurrent write in real time, the later write could be applied first. If there are multiple threads performing operations on shared state, the notion of physical time is meaningless.

We use a linearizable consistency model to allow some semblance of a timeline by providing a total order of all state updates. Linearizability requires that each operation appears to occur atomically at some point between its invocation and completion. This point is called the linearization point. When an operation completes, it’s guaranteed to be observable by all threads because, by definition, the operation occurred before its completion time. After this point, reads will only see this value or a later one—never anything before. This gives us a proper sequencing of operations which can be reasoned about. Linearizability is a correctness condition for concurrent objects.

Of course, linearizability comes at a cost. This is why most memory models aren’t linearizable by default. Going back to our subscription trie, we could make operations on it appear atomic by applying a global lock. This kills throughput, but it ensures linearization.

lock trie

In reality, the trie operations do not occur at a specific instant in time as the illustration above depicts. However, mutual exclusion gives it the appearance and, as a result, linearizability holds at the expense of systemwide progress. Acquiring and releasing the lock appear instantaneous in the timeline because they are backed by atomic hardware operations like test-and-set. Linearizability is a composable property, meaning if an object is composed of linearizable objects, it is also linearizable. This allows us to construct abstractions from linearizable hardware instructions to data structures, all the way up to linearizable distributed systems.

Read-Modify-Write and CAS

Locks are expensive, not just due to contention but because they completely preclude parallelism. As we saw, if a thread which acquires a lock is preempted, any other threads waiting for the lock will continue to block.

Read-modify-write operations like compare-and-swap offer a lock-free approach to ensuring linearizable consistency. Such techniques loosen the bottleneck by guaranteeing systemwide throughput even if one or more threads are blocked. The typical pattern is to perform some speculative work then attempt to publish the changes with a CAS. If the CAS fails, then another thread performed a concurrent operation, and the transaction needs to be retried. If it succeeds, the operation was committed and is now visible, preserving linearizability. The CAS loop is a pattern used in many lock-free data structures and proves to be a useful primitive for our subscription trie.

CAS is susceptible to the ABA problem. These operations work by comparing values at a memory address. If the value is the same, it’s assumed that nothing has changed. However, this can be problematic if another thread modifies the shared memory and changes it back before the first thread resumes execution. The ABA problem is represented by the following sequence of events:

  1. Thread T1 reads shared-memory value A
  2. T1 is preempted, and T2 is scheduled
  3. T2 changes A to B then back to A
  4. T2 is preempted, and T1 is scheduled
  5. T1 sees the shared-memory value is A and continues

In this situation, T1 assumes nothing has changed when, in fact, an invariant may have been violated. We’ll see how this problem is addressed later.

At this point, we’ve explored the subscription-matching problem space, demonstrated why it’s an area of high contention, and examined why locks pose a serious problem to throughput. Linearizability provides an important foundation of understanding for lock-freedom, and we’ve looked at the most fundamental pattern for building lock-free data structures, compare-and-swap. Next, we will take a deep dive on applying lock-free techniques in practice by building on this knowledge. We’ll continue our narrative of how we applied these same techniques to our subscription engine and provide some further motivation for them.

Lock-Free Applied

Let’s revisit our subscription trie from earlier. Our naive approach to making it linearizable was to protect it with a lock. This proved easy, but as we observed, severely limited throughput. For a message broker, access to this trie is a critical path, and we usually have multiple threads performing inserts, removals, and lookups on it concurrently. Intuition tells us we can implement these operations without coarse-grained locking by relying on a CAS to perform mutations on the trie.

If we recall, read-modify-write is typically applied by copying a shared variable to a local variable, performing some speculative work on it, and attempting to publish the changes with a CAS. When inserting into the trie, our speculative work is creating an updated copy of a node. We commit the new node by updating the parent’s reference with a CAS. For example, if we want to add a subscriber to a node, we would copy the node, add the new subscriber, and CAS the pointer to it in the parent.

This approach is broken, however. To see why, imagine if a thread inserts a subscription on a node while another thread concurrently inserts a subscription as a child of that node. The second insert could be lost due to the sequencing of the reference updates. The diagram below illustrates this problem. Dotted lines represent a reference updated with a CAS.

trie cas add

The orphaned nodes containing “x” and “z” mean the subscription to “foo.bar” was lost. The trie is in an inconsistent state.

We looked to existing research in the field of non-blocking data structures to help illuminate a path. “Concurrent Tries with Efficient Non-Blocking Snapshots” by Prokopec et al. introduces the Ctrie, a non-blocking, concurrent hash trie based on shared-memory, single-word CAS instructions.

A hash array mapped trie (HAMT) is an implementation of an associative array which, unlike a hashmap, is dynamically allocated. Memory consumption is always proportional to the number of keys in the trie. A HAMT works by hashing keys and using the resulting bits in the hash code to determine which branches to follow down the trie. Each node contains a table with a fixed number of branch slots. Typically, the number of branch slots is 32. On a 64-bit machine, this would mean it takes 256 bytes (32 branches x 8-byte pointers) to store the branch table of a node.

The size of L1-L3 cache lines is 64 bytes on most modern processors. We can’t fit the branch table in a CPU cache line, let alone the entire node. Instead of allocating space for all branches, we use a bitmap to indicate the presence of a branch at a particular slot. This reduces the size of an empty node from roughly 264 bytes to 12 bytes. We can safely fit a node with up to six branches in a single cache line.

The Ctrie is a concurrent, lock-free version of the HAMT which ensures progress and linearizability. It solves the CAS problem described above by introducing indirection nodes, or I-nodes, which remain present in the trie even as nodes above and below change. This invariant ensures correctness on inserts by applying the CAS operation on the I-node instead of the internal node array.

An I-node may point to a Ctrie node, or C-node, which is an internal node containing a bitmap and array of references to branches. A branch is either an I-node or a singleton node (S-node) containing a key-value pair. The S-node is a leaf in the Ctrie. A newly initialized Ctrie starts with a root pointer to an I-node which points to an empty C-node. The diagram below illustrates a sequence of inserts on a Ctrie.

ctrie insert

An insert starts by atomically reading the I-node’s reference. Next, we copy the C-node and add the new key, recursively insert on an I-node, or extend the Ctrie with a new I-node. The new C-node is then published by performing a CAS on the parent I-node. A failed CAS indicates another thread has mutated the I-node. We re-linearize by atomically reading the I-node’s reference again, which gives us the current state of the Ctrie according to its linearizable history. We then retry the operation until the CAS succeeds. In this case, the linearization point is a successful CAS. The following figure shows why the presence of I-nodes ensures consistency.

ctrie insert correctness

In the above diagram, (k4,v4) is inserted into a Ctrie containing (k1,v1), (k2,v2), and (k3,v3). The new key-value pair is added to node C1 by creating a copy, C1, with the new entry. A CAS is then performed on the pointer at I1, indicated by the dotted line. Since C1 continues pointing to I2, any concurrent updates which occur below it will remain present in the trie. C1 is then garbage collected once no more threads are accessing it. Because of this, Ctries are much easier to implement in a garbage-collected language. It turns out that this deferred reclamation also solves the ABA problem described earlier by ensuring memory addresses are recycled only when it’s safe to do so.

The I-node invariant is enough to guarantee correctness for inserts and lookups, but removals require some additional invariants in order to avoid update loss. Insertions extend the Ctrie with additional levels, while removals eliminate the need for some of these levels. This is because we want to keep the Ctrie as compact as possible while still remaining correct. For example, a remove operation could result in a C-node with a single S-node below it. This state is valid, but the Ctrie could be made more compact and lookups on the lone S-node more efficient if it were moved up into the C-node above. This would allow the I-node and C-node to be removed.

The problem with this approach is it will cause insertions to be lost. If we move the S-node up and replace the dangling I-node reference with it, another thread could perform a concurrent insert on that I-node just before the compression occurs. The insert would be lost because the pointer to the I-node would be removed.

This issue is solved by introducing a new type of node called the tomb node (T-node) and an associated invariant. The T-node is used to ensure proper ordering during removals. The invariant is as follows: if an I-node points to a T-node at some time t0, then for all times greater than t0, the I-node points to the same T-node. More concisely, a T-node is the last value assigned to an I-node. This ensures that no insertions occur at an I-node if it is being compressed. We call such an I-node a tombed I-node.

If a removal results in a non-root-level C-node with a single S-node below it, the C-node is replaced with a T-node wrapping the S-node. This guarantees that every I-node except the root points to a C-node with at least one branch. This diagram depicts the result of removing (k2,v2) from a Ctrie:

ctrie removal

Removing (k2,v2) results in a C-node with a single branch, so it’s subsequently replaced with a T-node. The T-node provides a sequencing mechanism by effectively acting as a marker. While it solves the problem of lost updates, it doesn’t give us a compacted trie. If two keys have long matching hash code prefixes, removing one of the keys would result in a long chain of C-nodes followed by a single T-node at the end.

An invariant was introduced which says once an I-node points to a T-node, it will always point to that T-node. This means we can’t change a tombed I-node’s pointer, so instead we replace the I-node with its resurrection. The resurrection of a tombed I-node is the S-node wrapped in its T-node. When a T-node is produced during a removal, we ensure that it’s still reachable, and if it is, resurrect its tombed I-node in the C-node above. If it’s not reachable, another thread has already performed the compression. To ensure lock-freedom, all operations which read a T-node must help compress it instead of waiting for the removing thread to complete. Compression on the Ctrie from the previous diagram is illustrated below.

ctrie compression

The resurrection of the tombed I-node ensures the Ctrie is optimally compressed for arbitrarily long chains while maintaining integrity.

With a 32-bit hash code space, collisions are rare but still nontrivial. To deal with this, we introduce one final node, the list node (L-node). An L-node is essentially a persistent linked list. If there is a collision between the hash codes of two different keys, they are placed in an L-node. This is analogous to a hash table using separate chaining to resolve collisions.

One interesting property of the Ctrie is support for lock-free, linearizable, constant-time snapshots. Most concurrent data structures do not support snapshots, instead opting for locks or requiring a quiescent state. This allows Ctries to have O(1) iterator creation, clear, and size retrieval (amortized).

Constant-time snapshots are implemented by writing the Ctrie as a persistent data structure and assigning a generation count to each I-node. A persistent hash trie is updated by rewriting the path from the root of the trie down to the leaf the key belongs to while leaving the rest of the trie intact. The generation demarcates Ctrie snapshots. To create a new snapshot, we copy the root I-node and assign it a new generation. When an operation detects that an I-node’s generation is older than the root’s generation, it copies the I-node to the new generation and updates the parent. The path from the root to some node is only updated the first time it’s accessed, making the snapshot a O(1) operation.

The final piece needed for snapshots is a special type of CAS operation. There is a race condition between the thread creating a snapshot and the threads which have already read the root I-node with the previous generation. The linearization point for an insert is a successful CAS on an I-node, but we need to ensure that both the I-node has not been modified and its generation matches that of the root. This could be accomplished with a double compare-and-swap, but most architectures do not support such an operation.

The alternative is to use a RDCSS double-compare-single-swap originally described by Harris et al. We implement an operation with similar semantics to RDCSS called GCAS, or generation compare-and-swap. The GCAS allows us to atomically compare both the I-node pointer and its generation to the expected values before committing an update.

After researching the Ctrie, we wrote a Go implementation in order to gain a deeper understanding of the applied techniques. These same ideas would hopefully be adaptable to our problem domain.

Generalizing the Ctrie

The subscription trie shares some similarities to the hash array mapped trie but there are some key differences. First, values are not strictly stored at the leaves but can be on internal nodes as well. Second, the decomposed topic is used to determine how the trie is descended rather than a hash code. Wildcards complicate lookups further by requiring backtracking. Lastly, the number of branches on a node is not a fixed size. Applying the Ctrie techniques to the subscription trie, we end up with something like this:

matchbox

Much of the same logic applies. The main distinctions are the branch traversal based on topic words and rules around wildcards. Each branch is associated with a word and set of subscribers and may or may not point to an I-node. The I-nodes still ensure correctness on inserts. The behavior of T-nodes changes slightly. With the Ctrie, a T-node is created from a C-node with a single branch and then compressed. With the subscription trie, we don’t introduce a T-node until all branches are removed. A branch is pruned if it has no subscribers and points to nowhere or it has no subscribers and points to a tombed I-node. The GCAS and snapshotting remain unchanged.

We implemented this Ctrie derivative in order to build our concurrent pattern-matching engine, matchbox. This library provides an exceptionally simple API which allows a client to subscribe to a topic, unsubscribe from a topic, and lookup a topic’s subscribers. Snapshotting is also leveraged to retrieve the global subscription tree and the topics to which clients are currently subscribed. These are useful to see who currently has subscriptions and for what.

In Practice

Matchbox has been pretty extensively benchmarked, but to see how it behaves, it’s critical to observe its performance under contention. Many messaging systems opt for a mutex which tends to result in a lot of lock contention. It’s important to know what the access patterns look like in practice, but for our purposes, it’s heavily parallel. We don’t want to waste CPU cycles if we can help it.

To see how matchbox compares to lock-based subscription structures, I benchmarked it against gnatsd, a popular high-performance messaging system also written in Go. Gnatsd uses a tree-like structure protected by a mutex to manage subscriptions and offers similar wildcard semantics.

The benchmarks consist of one or more insertion goroutines and one or more lookup goroutines. Each insertion goroutine inserts 1000 subscriptions, and each lookup goroutine looks up 1000 subscriptions. We scale these goroutines up to see how the systems behave under contention.

The first benchmark is a 1:1 concurrent insert-to-lookup workload. A lookup corresponds to a message being published and matched to interested subscribers, while an insert occurs when a client subscribes to a topic. In practice, lookups are much more frequent than inserts, so the second benchmark is a 1:3 concurrent insert-to-lookup workload to help simulate this. The timings correspond to the complete insert and lookup workload. GOMAXPROCS was set to 8, which controls the number of operating system threads that can execute simultaneously. The benchmarks were run on a machine with a 2.6 GHz Intel Core i7 processor.

matchbox_bench_1_1

matchbox_bench_1_3

It’s quite clear that the lock-free approach scales a lot better under contention. This follows our intuition because lock-freedom allows system-wide progress even when a thread is blocked. If one goroutine is blocked on an insert or lookup operation, other operations may proceed. With a mutex, this isn’t possible.

Matchbox performs well, particularly in multithreaded environments, but there are still more optimizations to be made. This includes improvements both in memory consumption and runtime performance. Applying the Ctrie techniques to this type of trie results in a fairly non-compact structure. There may be ways to roll up branches—either eagerly or after removals—and expand them lazily as necessary. Other optimizations might include placing a cache or Bloom filter in front of the trie to avoid descending it. The main difficulty with these will be managing support for wildcards.

Conclusion

To summarize, we’ve seen why subscription matching is often a major concern for message-oriented middleware and why it’s frequently a bottleneck. Concurrency is crucial for high-performance systems, and we’ve looked at how we can achieve concurrency without relying on locks while framing it within the context of linearizability. Compare-and-swap is a fundamental tool used to implement lock-free data structures, but it’s important to be conscious of the pitfalls. We introduce invariants to protect data consistency. The Ctrie is a great example of how to do this and was foundational in our lock-free subscription-matching implementation. Finally, we validated our work by showing that lock-free data structures scale dramatically better with multithreaded workloads under contention.

My thanks to Steven Osborne and Dustin Hiatt for reviewing this article.

Infrastructure Engineering in the 21st Century

Infrastructure engineering is an inherently treacherous problem space because it’s core to so many things. Systems today are increasingly distributed and increasingly complex but are built on unreliable components and will continue to be. This includes unreliable networks and faulty hardware. The 21st century engineer understands failure is routine.

Naturally, application developers would rather not have to think about low-level failure modes so they can focus on solving the problem at hand. Infrastructure engineers are then tasked with competing goals: provide enough abstraction to make application development tractable and provide enough reliability to make subsystems useful. The second goal often comes with an additional proviso in that there must be sufficient reliability without sacrificing performance to the point of no longer being useful. Anyone who has worked on enterprise messaging systems can tell you that these goals are often contradictory. The result is a wall of sand intended to keep the developer’s feet dry from the incoming tide. The 21st century engineer understands that in order to play in the sand, we all need to be comfortable getting our feet a little wet from time to time.

With the deluge of technology becoming available today, it’s tempting to introduce it all into your stack. Likewise, engineers are never happy. Left unchecked, we will hyper optimize and iterate into oblivion. It’s a problem I call “innovating to a fault.” Relying on “it’s done when it’s done” is a great way to ship vaporware. Have tangible objectives, make them high-level, and realize things change and evolve over time. Frame the concrete things you’re doing today within the context of those objectives. There’s a difference between Agile micromanagey roadmaps and having a clear vision. Determine when to innovate and when not to. Not Invented Here syndrome can be a deadly disease. Take inventory of what’s being built, make sure it ties back to your objectives, and avoid falling prey to tech pop culture. Optimize for the right problems. The 21st century engineer understands that you are not defined by your tools, you are defined by what you produce at the end of the day.

The prevalence of microservice architecture has made production tooling and instrumentation more important than ever. Teams should take ownership of their systems. If you’re not willing to stand by your work, don’t ship it. However, just because something falls outside of your system’s boundaries doesn’t mean it’s not your problem. If you rely on it, own it. Don’t be afraid to roll up your sleeves and dive into someone else’s code. The 21st century engineer understands that they live and die by the code they have in production, and if they don’t run anything in production, they aren’t really an engineer at all.

The way in which we design systems today is different from the way we designed them in the 20th century and the way we will design them in the future. There is a vast amount of research that has gone into computer science and related fields dating back to the invention of the modern computer. Research from the 50’s, 60’s, all the way up to today shows that system design always is an evolving process. Compiling this body of knowledge together provides an invaluable foundation from which we can build. The 21st century engineer understands that without a deeper understanding of that foundation or with a blind trust, we are only as good as our sand castle.

It’s our responsibility as software engineers, as system designers, as programmers to use this knowledge. Our job is not to build systems or write code, our job is to solve problems, of which code is often a byproduct. No one cares about the code you write, they care about the problems you solve. More specifically, they care about the business problems you solve. The 21st century engineer understands that if we’re not thinking about our solutions end to end, we’re not really doing our job.

Engage to Assuage

Abstraction is important. It’s how humans deal with complexity. You shouldn’t have to understand every little intricate detail behind how your system works. It would take years to do so. But abstraction comes at a cost. You agree to the abstraction’s interface, you place your trust in it, and then you remove it from your mind. That is, until it fails—and abstractions of sufficient complexity will fail. After all, we are building atop unreliable components. Also, a layer of abstraction doesn’t provide any guarantees in higher levels above it, which often results in some false assumptions.

We cannot understand how everything will work, but we should have enough understanding of how it will not work. More plainly, we should understand the cost of the abstractions we use so that we can pay for them with confidence. This doesn’t mean giving up on abstraction but engaging with the complexity that it manages.

I’ve written before about how distributed systems are a UX problem. They’re also a design problem. And a development problem. And an ops problem. And a business problem. The point is they are everyone’s problem because they are complex, and things that are sufficiently complex eventually leak. There is no airtight abstraction in this world. Without understanding limitations and trade-offs, without using the knowledge and research that has come before us, without thinking end to end, we set ourselves up for failure. If we’re going to call ourselves engineers, let’s start acting like it. Nothing is a black box to the 21st century engineer.

Everything You Know About Latency Is Wrong

Okay, maybe not everything you know about latency is wrong. But now that I have your attention, we can talk about why the tools and methodologies you use to measure and reason about latency are likely horribly flawed. In fact, they’re not just flawed, they’re probably lying to your face.

When I went to Strange Loop in September, I attended a workshop called “Understanding Latency and Application Responsiveness” by Gil Tene. Gil is the CTO of Azul Systems, which is most renowned for its C4 pauseless garbage collector and associated Zing Java runtime. While the workshop was four and a half hours long, Gil also gave a 40-minute talk called “How NOT to Measure Latency” which was basically an abbreviated, less interactive version of the workshop. If you ever get the opportunity to see Gil speak or attend his workshop, I recommend you do. At the very least, do yourself a favor and watch one of his recorded talks or find his slide decks online.

The remainder of this post is primarily a summarization of that talk. You may not get anything out of it that you wouldn’t get out of the talk, but I think it can be helpful to absorb some of these ideas in written form. Plus, for my own benefit, writing about them helps solidify it in my head.

What is Latency?

Latency is defined as the time it took one operation to happen. This means every operation has its own latency—with one million operations there are one million latencies. As a result, latency cannot be measured as work units / time. What we’re interested in is how latency behaves. To do this meaningfully, we must describe the complete distribution of latencies. Latency almost never follows a normal, Gaussian, or Poisson distribution, so looking at averages, medians, and even standard deviations is useless.

Latency tends to be heavily multi-modal, and part of this is attributed to “hiccups” in response time. Hiccups resemble periodic freezes and can be due to any number of reasons—GC pauses, hypervisor pauses, context switches, interrupts, database reindexing, cache buffer flushes to disk, etc. These hiccups never resemble normal distributions and the shift between modes is often rapid and eclectic.

Screen Shot 2015-10-04 at 4.32.24 PM

How do we meaningfully describe the distribution of latencies? We have to look at percentiles, but it’s even more nuanced than this. A trap that many people fall into is fixating on “the common case.” The problem with this is that there is a lot more to latency behavior than the common case. Not only that, but the “common” case is likely not as common as you think.

This is partly a tooling problem. Many of the tools we use do not do a good job of capturing and representing this data. For example, the majority of latency graphs produced by Grafana, such as the one below, are basically worthless. We like to look at pretty charts, and by plotting what’s convenient we get a nice colorful graph which is quite readable. Only looking at the 95th percentile is what you do when you want to hide all the bad stuff. As Gil describes, it’s a “marketing system.” Whether it’s the CTO, potential customers, or engineers—someone’s getting duped. Furthermore, averaging percentiles is mathematically absurd. To conserve space, we often keep the summaries and throw away the data, but the “average of the 95th percentile” is a meaningless statement. You cannot average percentiles, yet note the labels in most of your Grafana charts. Unfortunately, it only gets worse from here.

graph_logbase10_ms

Gil says, “The number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal. The rest of it is noise.” To this point, someone in the workshop naturally responded with “But what if the max is just something like a VM restarting? That doesn’t describe the behavior of the system. It’s just an unfortunate, unlikely occurrence.” By ignoring the maximum, you’re effectively saying “this doesn’t happen.” If you can identify the cause as noise, you’re okay, but if you’re not capturing that data, you have no idea of what’s actually happening.

How Many Nines?

But how many “nines” do I really need to look at? The 99th percentile, by definition, is the latency below which 99% of the observations may be found. Is the 99th percentile rare? If we have a single search engine node, a single key-value store node, a single database node, or a single CDN node, what is the chance we actually hit the 99th percentile?

Gil describes some real-world data he collected which shows how many of the web pages we go to actually experience the 99th percentile, displayed in table below. The second column counts the number of HTTP requests generated by a single access of the web page. The third column shows the likelihood of one access experiencing the 99th percentile. With the exception of google.com, every page has a probability of 50% or higher of seeing the 99th percentile.

Screen Shot 2015-10-04 at 6.15.24 PM

The point Gil makes is that the 99th percentile is what most of your web pages will see. It’s not “rare.”

What metric is more representative of user experience? We know it’s not the average or the median. 95th percentile? 99.9th percentile? Gil walks through a simple, hypothetical example: a typical user session involves five page loads, averaging 40 resources per page. How many users will not experience something worse than the 95th percentile? 0.003%. By looking at the 95th percentile, you’re looking at a number which is relevant to 0.003% of your users. This means 99.997% of your users are going to see worse than this number, so why are you even looking at it?

On the flip side, 18% of your users are going to experience a response time worse than the 99.9th percentile, meaning 82% of users will experience the 99.9th percentile or better. Going further, more than 95% of users will experience the 99.97th percentile and more than 99% of users will experience the 99.995th percentile.

The median is the number that 99.9999999999% of response times will be worse than. This is why median latency is irrelevant. People often describe “typical” response time using a median, but the median just describes what everything will be worse than. It’s also the most commonly used metric.

If it’s so critical that we look at a lot of nines (and it is), why do most monitoring systems stop at the 95th or 99th percentile? The answer is simply because “it’s hard!” The data collected by most monitoring systems is usually summarized in small, five or ten second windows. This, combined with the fact that we can’t average percentiles or derive five nines from a bunch of small samples of percentiles means there’s no way to know what the 99.999th percentile for the minute or hour was. We end up throwing away a lot of good data and losing fidelity.

A Coordinated Conspiracy

Benchmarking is hard. Almost all latency benchmarks are broken because almost all benchmarking tools are broken. The number one cause of problems in benchmarks is something called “coordinated omission,” which Gil refers to as “a conspiracy we’re all a part of” because it’s everywhere. Almost all load generators have this problem.

We can look at a common load-testing example to see how this problem manifests. With this type of test, a client generally issues requests at a certain rate, measures the response time for each request, and puts them in buckets from which we can study percentiles later.

The problem is what if the thing being measured took longer than the time it would have taken before sending the next thing? What if you’re sending something every second, but this particular thing took 1.5 seconds? You wait before you send the next one, but by doing this, you avoided measuring something when the system was problematic. You’ve coordinated with it by backing off and not measuring when things were bad. To remain accurate, this method of measuring only works if all responses fit within an expected interval.

Coordinated omission also occurs in monitoring code. The way we typically measure something is by recording the time before, running the thing, then recording the time after and looking at the delta. We put the deltas in stats buckets and calculate percentiles from that. The code below is taken from a Cassandra benchmark.

Screen Shot 2015-10-04 at 7.29.09 PM

However, if the system experiences one of the “hiccups” described earlier, you will only have one bad operation and 10,000 other operations waiting in line. When those 10,000 other things go through, they will look really good when in reality the experience was really bad. Long operations only get measured once, and delays outside the timing window don’t get measured at all.

In both of these examples, we’re omitting data that looks bad on a very selective basis, but just how much of an impact can this have on benchmark results? It turns out the impact is huge.

Screen Shot 2015-10-04 at 7.27.43 PM

Imagine a “perfect” system which processes 100 requests/second at exactly 1 ms per request. Now consider what happens when we freeze the system (for example, using CTRL+Z) after 100 seconds of perfect operation for 100 seconds and repeat. We can intuitively characterize this system:

  • The average over the first 100 seconds is 1 ms.
  • The average over the next 100 seconds is 50 seconds.
  • The average over the 200 seconds is 25 seconds.
  • The 50th percentile is 1 ms.
  • The 75th percentile is 50 seconds.
  • The 99.99th percentile is 100 seconds.

Screen Shot 2015-10-04 at 7.49.10 PM

Now we try measuring the system using a load generator. Before freezing, we run 100 seconds at 100 requests/second for a total of 10,000 requests at 1 ms each. After the stall, we get one result of 100 seconds. This is the entirety of our data, and when we do the math, we get these results:

  • The average over the 200 seconds is 10.9 ms (should be 25 seconds).
  • The 50th percentile is 1 ms.
  • The 75th percentile is 1 ms (should be 50 seconds).
  • The 99.99th percentile is 1 ms (should be 100 seconds).

Screen Shot 2015-10-04 at 7.57.23 PM

Basically, your load generator and monitoring code tell you the system is ready for production, when in fact it’s lying to you! A simple “CTRL+Z” test can catch coordinated omission, but people rarely do it. It’s critical to calibrate your system this way. If you find it giving you these kind of results, throw away all the numbers—they’re worthless.

You have to measure at random or “fair” rates. If you measure 10,000 things in the first 100 seconds, you have to measure 10,000 things in the second 100 seconds during the stall. If you do this, you’ll get the correct numbers, but they won’t be as pretty. Coordinated omission is the simple act of erasing, ignoring, or missing all the “bad” stuff, but the data is good.

Surely this data can still be useful though, even if it doesn’t accurately represent the system? For example, we can still use it to identify performance regressions or validate improvements, right? Sadly, this couldn’t be further from the truth. To see why, imagine we improve our system. Instead of pausing for 100 seconds after 100 seconds of perfect operation, it handles all requests at 5 ms each after 100 seconds. Doing the math, we get the following:

  • The 50th percentile is 1 ms
  • The 75th percentile is 2.5 ms (stall showed 1 ms)
  • The 99.99th percentile is 5 ms (stall showed 1 ms)

This data tells us we hurt the four nines and made the system 5x worse! This would tell us to revert the change and go back to the way it was before, which is clearly the wrong decision. With bad data, better can look worse. This shows that you cannot have any intuition based on any of these numbers. The data is garbage.

With many load generators, the situation is actually much worse than this. These systems work by generating a constant load. If our test is generating 100 requests/second, we run 10,000 requests in the first 100 seconds. When we stall, we process just one request. After the stall, the load generator sees that it’s 9,999 requests behind and issues those requests to catch back up. Not only did it get rid of the bad requests, it replaced them with good requests. Now the data is twice as wrong as just dropping the bad requests.

What coordinated omission is really showing you is service time, not response time. If we imagine a cashier ringing up customers, the service time is the time it takes the cashier to do the work. The response time is the time a customer waits before they reach the register. If the rate of arrival is higher than the service rate, the response time will continue to grow. Because hiccups and other phenomena happen, response times often bounce around. However, coordinated omission lies to you about response time by actually telling you the service time and hiding the fact that things stalled or waited in line.

Measuring Latency

Latency doesn’t live in a vacuum. Measuring response time is important, but you need to look at it in the context of load. But how do we properly measure this? When you’re nearly idle, things are nearly perfect, so obviously that’s not very useful. When you’re pedal to the metal, things fall apart. This is somewhat useful because it tells us how “fast” we can go before we start getting angry phone calls.

However, studying the behavior of latency at saturation is like looking at the shape of your car’s bumper after wrapping it around a pole. The only thing that matters when you hit the pole is that you hit the pole. There’s no point in trying to engineer a better bumper, but we can engineer for the speed at which we lose control. Everything is going to suck at saturation, so it’s not super useful to look at beyond determining your operating range.

What’s more important is testing the speeds in between idle and hitting the pole. Define your SLAs and plot those requirements, then run different scenarios using different loads and different configurations. This tells us if we’re meeting our SLAs but also how many machines we need to provision to do so. If you don’t do this, you don’t know how many machines you need.

How do we capture this data? In an ideal world, we could store information for every request, but this usually isn’t practical. HdrHistogram is a tool which allows you to capture latency and retain high resolution. It also includes facilities for correcting coordinated omission and plotting latency distributions. The original version of HdrHistogram was written in Java, but there are versions for many other languages.

Screen Shot 2015-10-05 at 12.00.04 AM

To Summarize

To understand latency, you have to consider the entire distribution. Do this by plotting the latency distribution curve. Simply looking at the 95th or even 99th percentile is not sufficient. Tail latency matters. Worse yet, the median is not representative of the “common” case, the average even less so. There is no single metric which defines the behavior of latency. Be conscious of your monitoring and benchmarking tools and the data they report. You can’t average percentiles.

Remember that latency is not service time. If you plot your data with coordinated omission, there’s often a quick, high rise in the curve. Run a “CTRL+Z” test to see if you have this problem. A non-omitted test has a much smoother curve. Very few tools actually correct for coordinated omission.

Latency needs to be measured in the context of load, but constantly running your car into a pole in every test is not useful. This isn’t how you’re running in production, and if it is, you probably need to provision more machines. Use it to establish your limits and test the sustainable throughputs in between to determine if you’re meeting your SLAs. There are a lot of flawed tools out there, but HdrHistogram is one of the few that isn’t. It’s useful for benchmarking and, since histograms are additive and HdrHistogram uses log buckets, it can also be useful for capturing high-volume data in production.

You Own Your Availability

There’s been a lot of discussion around “availability” lately. It’s often trumpeted with phrases like “you own your availability,” meaning there is no buck-passing when it comes to service uptime. The AWS outage earlier this week served as a stark reminder that, while owning your availability is a commendable ambition, for many it’s still largely owned by Amazon and the like.

In order to “own” your availability, it’s important to first understand what “availability” really means. Within the context of distributed-systems theory, availability is usually discussed in relation to the CAP theorem. Formally, CAP defines availability as a liveness property: “every request received by a non-failing node in the system must result in a response.” This is a weak definition for two reasons. First, the proviso “every request received by a non-failing node” means that a system in which all nodes have failed is trivially available.  Second, Gilbert and Lynch stipulate no upper bound on latency, only that operations eventually return a response. This means an operation could take weeks to complete and availability would not be violated.

Martin Kleppmann points out these issues in his recent paper “A Critique of the CAP Theorem.” I don’t think there is necessarily a problem with the formalizations made by CAP, just a matter of engineering practicality. Kleppmann’s critique recalls a pertinent quote from Leslie Lamport on the topic of liveness:

Liveness properties are inherently problematic. The question of whether a real system satisfies a liveness property is meaningless; it can be answered only by observing the system for an infinite length of time, and real systems don’t run forever. Liveness is always an approximation to the property we really care about. We want a program to terminate within 100 years, but proving that it does would require the addition of distracting timing assumptions. So, we prove the weaker condition that the program eventually terminates. This doesn’t prove that the program will terminate within our lifetimes, but it does demonstrate the absence of infinite loops.

Despite the pop culture surrounding it, CAP is not meant to neatly classify systems. It’s meant to serve as a jumping-off point from which we can reason from the ground up about distributed systems and the inherent limitations associated with them. It’s a reality check.

Practically speaking, availability is typically described in terms of “uptime” or the proportion of time which requests are successfully served. Brewer refers to this as “yield,” which is the probability of completing a request. This is the metric that is normally measured in “nines,” such as “five-nines availability.”

In the presence of faults there is typically a tradeoff between providing no answer (reducing yield) and providing an imperfect answer (maintaining yield, but reducing harvest).

However, this definition is only marginally more useful than CAP’s since it still doesn’t provide an upper bound on computation.

CAP is better used as a starting point for system design and understanding trade-offs than as a tool for reasoning about availability because it doesn’t really account for real availability. “Harvest” and “yield” show that availability is really a probabilistic property and that the trade with consistency is usually a gradient. But availability is much more nuanced than CAP’s “are we serving requests?” and harvest/yield’s “how many requests?” In practice, availability equates to SLAs. How many requests are we serving? At what rate? At what latency? At what percentiles? These things can’t really be formalized into a theorem like CAP because they are empirically observed, not properties of an algorithm.

Availability is specified by an SLA but observed by outside users. Unlike consistency, which is a property of the system and maintained by algorithm invariants, availability is determined by the client. For example, one user’s requests are served but another user’s are not. To the first user, the system is completely available.

To truly own your availability, you have to own every piece of infrastructure from the client to you, in addition to the infrastructure your system uses. Therefore, you can’t own your availability anymore than you can own Comcast’s fiber or Verizon’s 4G network. This is obviously impractical, if not impossible, but it might also be taking “own your availability” a bit too literally.

What “you own your availability” actually means is “you own your decisions.” Plain and simple. You own the decision to use AWS. You own the decision to use DynamoDB. You own the decision to not use multiple vendors. Owning your availability means making informed decisions about technology and vendors. “What is the risk/reward for using this database?” “Does using a PaaS/IaaS incur vendor lock-in? What happens when that service goes down?” It also means making informed decisions about the business. “What is the cost of our providers not meeting their SLAs? Is it cost-effective to have redundant providers?”

An SLA is not an insurance policy or a hedge against the business impact of an outage, it’s merely a refund policy. Use them to set expectations and make intelligent decisions, but don’t bank the business on them. Availability is not a timeshare. It’s not at will. You can’t just pawn it off, just like you can’t redirect your tech support to Amazon or Google.

It’s impossible to own your availability because there are too many things left to probability, too many unknowns, and too many variables outside of our control. Own as much as you can predict, as much as you can control, and as much as you can afford. The rest comes down to making informed decisions, hoping for the best, and planning for the worst.

What You Want Is What You Don’t: Understanding Trade-Offs in Distributed Messaging

If there’s one unifying theme of this blog, it’s that distributed systems are riddled with trade-offs. Specifically, with distributed messaging, you cannot have exactly-once delivery. However, messaging trade-offs don’t stop at delivery semantics. I want to talk about what I mean by this and explain why many developers often have the wrong mindset when it comes to building distributed applications.

The natural tendency is to build distributed systems as if they aren’t distributed at all—assuming data consistency, reliable messaging, and predictability. It’s much easier to reason about, but it’s also blatantly misleading.

The only thing guaranteed in messaging—and distributed systems in general—is that sooner or later, your guarantees are going to break down. If you assume these guarantees as axiomatic, everything built on them becomes unsound. Depending on the situation, this can range from mildly annoying to utterly catastrophic.

I recently ran across a comment from Apcera CEO Derek Collison on this topic which resonated with me:

On systems that do claim some form of guarantee, it’s best to look at what level that guarantee really runs out. Especially around persistence, exactly once delivery semantics, etc. I spent much of my career designing and building messaging systems that have those guarantees, and in turn developed many systems utilizing some of those features. For me, I found that depending on these guarantees was a bad pattern in distributed system design…

You should know how your system behaves when you reach the breaking point, but what’s less obvious is that providing these types of strong guarantees is usually very expensive. What price are we willing to pay, what level do our guarantees hold to, and what happens when they give out? In this sense, a “guarantee” is really no different from a SLA, yet stronger guarantees allow for stronger assumptions.

This all sounds quite vague, so let’s look at a specific example. With messaging, we’re often concerned with delivery reliability. In a perfect world, message delivery would be guaranteed and exactly once. Of course, I’ve talked at length why this is impossible, so let’s anchor ourselves in reality. We can look to TCP/IP for how this works.

IP is an unreliable delivery system which runs on unreliable network infrastructure. Packets can be delivered in order, out of order, or not at all. There are no acknowledgements, so the sender has no way of knowing if what they sent was received. TCP builds on IP by effectively making the transmission stateful and adding a layer of control. Through added complexity and performance costs, we achieve reliable delivery over an unreliable stack.

The key takeaway here is that we start with something primitive, like moving bits from point A to point B, and layer on abstractions to build stronger guarantees.  These abstractions almost always come at a price, tangible or not, which is why it’s important to push the costs up into the layers above. If not every use case demands reliable delivery, why force the cost onto everyone?

Exactly-once delivery is the Holy Grail of distributed messaging, and guaranteed delivery is the unicorn. The irony is that even if they were attainable, you likely wouldn’t want them. These types of strong guarantees demand expensive infrastructure which perform expensive coordination which require expensive administration. But what does all this expensive stuff really buy you at the end of the day?

A key problem is that there is a huge difference between message delivery and message processing. Sure, TCP can more or less ensure that your packet was either delivered or not, but what good is that actually in practice? How does the sender know that its message was successfully processed or that the receiver did what it needed to do? The only way to truly know is for the receiver to send a business-level acknowledgement. The low-level transport protocol doesn’t know about the application semantics, so the only way to go, really, is up. And if we assume that any guarantees will eventually give out, we have to account for that at the business level. To quote from a related article, “if reliability is important on the business level, do it on the business level.” It’s important not to conflate the transport protocol with the business-transaction protocol.

This is why systems like Akka don’t provide a notion of guaranteed delivery—because what does “guaranteed delivery” actually mean? Does it mean the message was handed to the transport layer? Does it mean the remote machine received the message? Does it mean the message was enqueued in the recipient’s mailbox?  Does it mean the recipient has started processing it? Does it mean the recipient has finished processing it? Each of these things has a very different set of requirements, constraints, and costs. Also, what does it even mean for a message to be “processed”? It depends on the business context. As such, it usually doesn’t make sense for the underlying infrastructure to make these decisions because the decisions usually impact the layers above significantly.

By providing only basic guarantees those use cases which do not need stricter guarantees do not pay the cost of their implementation; it is always possible to add stricter guarantees on top of basic ones, but it is not possible to retro-actively remove guarantees in order to gain more performance.

Distributed computation is inherently asynchronous and the network is inherently unreliable, so it’s better to embrace this asynchrony than to build on leaky abstractions. Rather than hide these inconveniences, make them explicit and force users to design around them. What you end up with is a more robust, more reliable, and often more performant system. This trade-off is highlighted in the paper “Exactly-once semantics in a replicated messaging system” by Huang et al. while studying the problem of exactly-once delivery:

Thus, server-centric algorithms cannot achieve exactly-once semantics. Instead, we will strive to achieve a weaker notion of correctness.

By relaxing our requirements, we end up with a solution that has less performance overhead and less complexity. Why bother pursuing the impossible? You’re paying a huge premium for something which is probably less reliable than you think while performing poorly. In many cases, it’s better to let the pendulum swing the other direction.

The network is not reliable, which means message delivery is never truly guaranteed—it can only be best-effort. The Two Generals’ Problem shows that it’s provenly impossible for two remote processes to safely agree on a decision. Similarly, the FLP impossibility result shows that, in an asynchronous environment, reliable failure detection is impossible. That is, there’s no way to tell if a process has crashed or is simply taking a long time to respond. Therefore, if it’s possible for a process to crash, it’s impossible for a set of processes to come to an agreement.

If message delivery is not guaranteed and consensus is impossible, is message ordering really that important? Some use cases might actually demand it, but I suspect, more often than not, it’s an artificial constraint. The fact that the network is unreliable, processes are faulty, and distributed communication is asynchronous makes reliable, in-order delivery surprisingly expensive. But doesn’t TCP solve this problem? At the transport level, yes, but that only gets you so far as I’ve been trying to demonstrate.

So you use TCP and process messages with a single thread. Most of the time, it just works. But what happens under heavy load? What happens when message delivery fails? What happens when you need to scale? If you are queuing messages or you have a dead-letter queue or you have network partitions or a crash-recovery model, you’re probably going to encounter duplicate, dropped, or out-of-order messages. Even if the infrastructure provides ordered delivery, these problems will likely manifest themselves at the application level.

If you’re distributed, forget about ordering and start thinking about commutativity. Forget about guaranteed delivery and start thinking about idempotence. Stop thinking about the messaging platform and start thinking about the messaging patterns and business semantics. A pattern which is commutative and idempotent will be far less brittle and more efficient than a system which is totally ordered and “guaranteed.” This is why CRDTs are becoming increasingly popular in the distributed space. Never write code which assumes messages will arrive in order when you can’t write code that will assume they arrive at all.

In the end, think carefully about the business case and what your requirements really are. Can you satisfy them without relying on costly and leaky abstractions or deceptive guarantees? If you can’t, what happens when those guarantees go out the window? This is very similar to understanding what happens when a SLA is not met. Are the performance and complexity trade-offs worth it? What about the operations and business overheads? In my experience, it’s better to confront the intricacies of distributed systems head-on than to sweep them under the rug. Sooner or later, they will rear their ugly heads.

Designed to Fail

When it comes to reliability engineering, people often talk about things like fault injection, monitoring, and operations runbooks. These are all critical pieces for building systems which can withstand failure, but what’s less talked about is the need to design systems which deliberately fail.

Reliability design has a natural progression which closely follows that of architectural design. With monolithic systems, we care more about preventing failure from occurring. With service-oriented architectures, controlling failure becomes less manageable, so instead we learn to anticipate it. With highly distributed microservice architectures where failure is all but guaranteed, we embrace it.

What does it mean to embrace failure? Anticipating failure is understanding the behavior when things go wrong, building systems to be resilient to it, and having a game plan for when it happens, either manual or automated. Embracing failure means making a conscious decision to purposely fail, and it’s essential for building highly available large-scale systems.

A microservice architecture typically means a complex web of service dependencies. One of SOA’s goals is to isolate failure and allow for graceful degradation. The key to being highly available is learning to be partially available. Frequently, one of the requirements for partial availability is telling the client “no.” Outright rejecting service requests is often better than allowing them to back up because, when dealing with distributed services, the latter usually results in cascading failure across dependent systems.

While designing our distributed messaging service at Workiva, we made explicit decisions to drop messages on the floor if we detect the system is becoming overloaded. As queues become backed up, incoming messages are discarded, a statsd counter is incremented, and a backpressure notification is sent to the client. Upon receiving this notification, the client can respond accordingly by failing fast, exponentially backing off, or using some other flow-control strategy. By bounding resource utilization, we maintain predictable performance, predictable (and measurable) lossiness, and impede cascading failure.

Other techniques include building kill switches into service calls and routers. If an overloaded service is not essential to core business, we fail fast on calls to it to prevent availability or latency problems upstream. For example, a spam-detection service is not essential to an email system, so if it’s unavailable or overwhelmed, we can simply bypass it. Netflix’s Hystrix has a set of really nice patterns for handling this.

If we’re not careful, we can often be our own worst enemy. Many times, it’s our own internal services which cause the biggest DoS attacks on ourselves. By isolating and controlling it, we can prevent failure from becoming widespread and unpredictable. By building in backpressure mechanisms and other types of intentional “failure” modes, we can ensure better availability and reliability for our systems through graceful degradation. Sometimes it’s better to fight fire with fire and failure with failure.

Service-Disoriented Architecture

“You can have a second computer once you’ve shown you know how to use the first one.” -Paul Barham

The first rule of distributed systems is don’t distribute your system until you have an observable reason to. Teams break this rule on the regular. People have been talking about service-oriented architecture for a long time, but only recently have microservices been receiving the hype.

The problem, as Martin Fowler observes, is that teams are becoming too eager to adopt a microservice architecture without first understanding the inherent overheads. A contributing factor, I think, is you only hear the success stories from companies who did it right, like Netflix. However, what folks often fail to realize is that these companies—in almost all cases—didn’t start out that way. There was a long and winding path which led them to where they are today. The inverse of this, which some refer to as microservice envy, is causing teams to rush into microservice hell. I call this service-disoriented architecture (or sometimes disservice-oriented architecture when the architecture is DOA).

The term “monolith” has a very negative connotation—unscalable, unmaintainable, unresilient. These things are not intrinsically tied to each other, however, and there’s no reason a single system can’t be modular, maintainable, and fault tolerant at reasonable scale. It’s just less sexy. Refactoring modular code is much easier than refactoring architecture, and refactoring across service boundaries is equally difficult. Fowler describes this as monolith-first, and I think it’s the right approach (with some exceptions, of course).

Don’t even consider microservices unless you have a system that’s too complex to manage as a monolith. The majority of software systems should be built as a single monolithic application. Do pay attention to good modularity within that monolith, but don’t try to separate it into separate services.

Service-oriented architecture is about organizational complexity and system complexity. If you have both, you have a case to distribute. If you have one of the two, you might have a case (although if you have organizational complexity without system complexity, you’ve probably scaled your organization improperly). If you have neither, you do not have a case to distribute. State, specifically distributed state, is hell, and some pundits argue SOA is satan—perhaps a necessary evil.

There are a lot of motivations for microservices: anti-fragility, fault tolerance, independent deployment and scaling, architectural abstraction, and technology isolation. When services are loosely coupled, the system as a whole tends to be less fragile. When instances are disposable and stateless, services tend to be more fault tolerant because we can spin them up and down, balance traffic, and failover. When responsibility is divided across domain boundaries, services can be independently developed, deployed, and scaled while allowing the right tools to be used for each.

We also need to acknowledge the disadvantages. Adopting a microservice architecture does not automatically buy you anti-fragility. Distributed systems are incredibly precarious. We have to be aware of things like asynchrony, network partitions, node failures, and the trade-off between availability and data consistency. We have to think about resiliency but also the business and UX implications. We have to consider the boundaries of distributed systems like CAP and exactly-once delivery.

When distributing, the emphasis should be on resilience engineering and adopting loosely coupled, stateless components—not microservices for microservices’ sake. We need to view eventual consistency as a tool, not a side effect. The problem I see is that teams often end up with what is essentially a complex, distributed monolith. Now you have two problems. If you’re building a microservice which doesn’t make sense outside the context of another system or isn’t useful on its own, stop and re-evaluate. If you’re designing something to be fast and correct, realize that distributing it will frequently take away both.

Like anti-fragility, microservices do not automatically buy you better maintainability or even scalability. Adopting them requires the proper infrastructure and organization to be in place. Without these, you are bound to fail. In theory, they are intended to increase development velocity, but in many cases the microservice premium ends up slowing it down while creating organizational dependencies and bottlenecks.

There are some key things which must be in place in order for a microservice architecture to be successful: a proper continuous-delivery pipeline, competent DevOps and Ops teams, and prudent service boundaries, to name a few. Good monitoring is essential. It’s also important we have a thorough testing and integration story. This isn’t even considering the fundamental development complexities associated with SOA mentioned earlier.

The better strategy is a bottom-up approach. Start with a monolith or small set of coarse-grained services and work your way up. Make sure you have the data model right. Break out new, finer-grained services as you need to and as you become more confident in your ability to maintain and deploy discrete services. It’s largely about organizational momentum. A young company jumping straight to a microservice architecture is like a golf cart getting on the freeway.

Microservices offer a number of advantages, but for many companies they are a bit of a Holy Grail. Developers are always looking for a silver bullet, but there is always a cost. What we need to do is minimize this cost, and with microservices, this typically means easing our way into it rather than diving into the deep end. Team autonomy and rapid iteration are noble goals, but if we’re not careful, we can end up creating an impedance. Microservices require organization and system maturity. Otherwise, they end up being a premature architectural optimization with a lot of baggage. They end up creating a service-disoriented architecture.

Distributed Systems Are a UX Problem

Distributed systems are not strictly an engineering problem. It’s far too easy to assume a “backend” development concern, but the reality is there are implications at every point in the stack. Often the trade-offs we make lower in the stack in order to buy responsiveness bubble up to the top—so much, in fact, that it rarely doesn’t impact the application in some way. Distributed systems affect the user. We need to shift the focus from system properties and guarantees to business rules and application behavior. We need to understand the limitations and trade-offs at each level in the stack and why they exist. We need to assume failure and plan for recovery. We need to start thinking of distributed systems as a UX problem.

The Truth is Prohibitively Expensive

Stop relying on strong consistency. Coordination and distributed transactions are slow and inhibit availability. The cost of knowing the “truth” is prohibitively expensive for many applications. For that matter, what you think is the truth is likely just a partial or outdated version of it.

Instead, choose availability over consistency by making local decisions with the knowledge at hand and design the UX accordingly. By making this trade-off, we can dramatically improve the user’s experience—most of the time.

Failure Is an Option

There are a lot of problems with simultaneity in distributed computing. As Justin Sheehy describes it, there is no “now” when it comes to distributed systems—that article, by the way, is a must-read for every engineer, regardless of where they work in the stack.

While some things about computers are “virtual,” they still must operate in the physical world and cannot ignore the challenges of that world.

Even though computers operate in the real world, they are disconnected from it. Imagine an inventory system. It may place orders to its artificial heart’s desire, but if the warehouse burns down, there’s no fulfilling them. Even if the system is perfect, its state may be impossible. But the system is typically not perfect because the truth is prohibitively expensive. And not only do warehouses catch fire or forklifts break down, as rare as this may be, but computers fail and networks partition—and that’s far less rare.

The point is, stop trying to build perfect systems because one of two things will happen:

1. You have a false sense of security because you think the system is perfect, and it’s not.

or

2. You will never ship because perfection is out of reach or exorbitantly expensive.

Either case can be catastrophic, depending on the situation. With systems, failure is not only an option, it’s an inevitability, so let’s plan for it as such. We have a lot to gain by embracing failure. Eric Brewer articulated this idea in a recent interview:

So the general answer is you allow things to be inconsistent and then you find ways to compensate for mistakes, versus trying to prevent mistakes altogether. In fact, the financial system is actually not based on consistency, it’s based on auditing and compensation. They didn’t know anything about the CAP theorem, that was just the decision they made in figuring out what they wanted, and that’s actually, I think, the right decision.

We can look to ATMs, and banks in general, as the canonical example for how this works. When you withdraw money, the bank could choose to first coordinate your account, calculating your available balance at that moment in time, before issuing the withdrawal. But what happens when the ATM is temporarily disconnected from the bank? The bank loses out on revenue.

Instead, they make a calculated risk. They choose availability and compensate the risk of overdraft with interest and charges. Likewise, banks use double-entry bookkeeping to provide an audit trail. Every credit has a corresponding debit. Mistakes happen—accounts are debited twice, an account is credited without another being debited—the failure modes are virtually endless. But we audit and compensate, detect and recover. Banks are loosely coupled systems. Accountants don’t use erasers. Why should programmers?

When you find yourself saying “this is important data or people’s money, it has to be correct,” consider how the problem was solved before computers. Building on Quicksand by Dave Campbell and Pat Helland is a great read on this topic:

Whenever the authors struggle with explaining how to implement loosely-coupled solutions, we look to how things were done before computers. In almost every case, we can find inspiration in paper forms, pneumatic tubes, and forms filed in triplicate.

Consider the lost request and its idempotent execution. In the past, a form would have multiple carbon copies with a printed serial number on top of them. When a purchase-order request was submitted, a copy was kept in the file of the submitter and placed in a folder with the expected date of the response. If the form and its work were not completed by the expected date, the submitter would initiate an inquiry and ask to locate the purchase-order form in question. Even if the work was lost, the purchase-order would be resubmitted without modification to ensure a lack of confusion in the processing of the work. You wouldn’t change the number of items being ordered as that may cause confusion. The unique serial number on the top would act as a mechanism to ensure the work was not performed twice.

Computers allow us to greatly improve the user experience, but many of the same fail-safes still exist, just slightly rethought.

The idea of compensation is actually a common theme within distributed systems. The Saga pattern is a great example of this. Large-scale systems often have to coordinate resources across disparate services.  Traditionally, we might solve this problem using distributed transactions like two-phase commit. The problem with this approach is it doesn’t scale very well, it’s slow, and it’s not particularly fault tolerant. With 2PC, we have deadlock problems and even 3PC is still susceptible to network partitions.

Sagas split a long-lived transaction into individual, interleaved sub-transactions. Each sub-transaction in the sequence has a corresponding compensating transaction which reverses its effects. The compensating transactions must be idempotent so they can be safely retried. In the event of a partial execution, the compensating transactions are run and the Saga is effectively rolled back.

The commonly used example for Sagas is booking a trip. We need to ensure flight, car rental, and hotel are all booked or none are booked. If booking the flight fails, we cancel the hotel and car, etc. Sagas trade off atomicity for availability while still allowing us to manage failure, a common occurrence in distributed systems.

Compensation has a lot of applications as a UX principle because it’s really the only way to build loosely coupled, highly available services.

Calculated Recovery

Pat Helland describes computing as nothing more than “memories, guesses, and apologies.” Computers always have partial knowledge. Either there is a disconnect with the real world (warehouse is on fire) or there is a disconnect between systems (System A sold a Foo Widget but, unbeknownst to it, System B just sold the last one in inventory—oops!). Systems don’t make decisions, they make guesses. The guess might be good or it might be bad, but rarely is there certainty. We can wait to collect as much information as possible before making a guess, but it means progress can’t be made until the system is confident enough to do so.

Computers have memory. This means they remember facts they have learned and guesses they have made. Memories help systems make better guesses in the future, and they can share those memories with other systems to help in their guesses. We can store more memories at the cost of more money, and we can survey other systems’ memories at the cost of more latency.

It is a business decision how much money, latency, and energy should be spent on reducing forgetfulness. To make this decision, the costs of the increased probability of remembering should be weighed against the costs of occasionally forgetting stuff.

Generally speaking, the more forgetfulness we can tolerate, the more responsive our systems will be, provided we know how to handle the situations where something is forgotten.

Sooner or later, a system guesses wrong. It sucks. It might mean we lose out on revenue; the business isn’t happy. It might mean the user loses out on what they want; the customer isn’t happy. But we calculate the impact of these wrong guesses, we determine when the trade-offs do and don’t make sense, we compensate, and—when shit hits the fan—we apologize.

Business realities force apologies.  To cope with these difficult realities, we need code and, frequently, we need human beings to apologize. It is essential that businesses have both code and people to manage these apologies.

Distributed systems are as much about failure modes and recovery as they are about being operationally correct. It’s critical that we can recover gracefully when something goes wrong, and often that affects the UX.

We could choose to spend extraordinary amounts of money and countless man-hours laboring over a system which provides the reliability we want. We could construct a data center. We could deploy big, expensive machines. We could install redundant fiber and switches. We could drudge over infallible code. Or we could stop, think for a moment, and realize maybe “sorry” is a more effective alternative. Knowing when to make that distinction can be the difference between a successful business and a failed one. The implications of distributed systems may be wider reaching than you thought.