What You Want Is What You Don’t: Understanding Trade-Offs in Distributed Messaging

If there’s one unifying theme of this blog, it’s that distributed systems are riddled with trade-offs. Specifically, with distributed messaging, you cannot have exactly-once delivery. However, messaging trade-offs don’t stop at delivery semantics. I want to talk about what I mean by this and explain why many developers often have the wrong mindset when it comes to building distributed applications.

The natural tendency is to build distributed systems as if they aren’t distributed at all—assuming data consistency, reliable messaging, and predictability. It’s much easier to reason about, but it’s also blatantly misleading.

The only thing guaranteed in messaging—and distributed systems in general—is that sooner or later, your guarantees are going to break down. If you assume these guarantees as axiomatic, everything built on them becomes unsound. Depending on the situation, this can range from mildly annoying to utterly catastrophic.

I recently ran across a comment from Apcera CEO Derek Collison on this topic which resonated with me:

On systems that do claim some form of guarantee, it’s best to look at what level that guarantee really runs out. Especially around persistence, exactly once delivery semantics, etc. I spent much of my career designing and building messaging systems that have those guarantees, and in turn developed many systems utilizing some of those features. For me, I found that depending on these guarantees was a bad pattern in distributed system design…

You should know how your system behaves when you reach the breaking point, but what’s less obvious is that providing these types of strong guarantees is usually very expensive. What price are we willing to pay, what level do our guarantees hold to, and what happens when they give out? In this sense, a “guarantee” is really no different from a SLA, yet stronger guarantees allow for stronger assumptions.

This all sounds quite vague, so let’s look at a specific example. With messaging, we’re often concerned with delivery reliability. In a perfect world, message delivery would be guaranteed and exactly once. Of course, I’ve talked at length why this is impossible, so let’s anchor ourselves in reality. We can look to TCP/IP for how this works.

IP is an unreliable delivery system which runs on unreliable network infrastructure. Packets can be delivered in order, out of order, or not at all. There are no acknowledgements, so the sender has no way of knowing if what they sent was received. TCP builds on IP by effectively making the transmission stateful and adding a layer of control. Through added complexity and performance costs, we achieve reliable delivery over an unreliable stack.

The key takeaway here is that we start with something primitive, like moving bits from point A to point B, and layer on abstractions to build stronger guarantees.  These abstractions almost always come at a price, tangible or not, which is why it’s important to push the costs up into the layers above. If not every use case demands reliable delivery, why force the cost onto everyone?

Exactly-once delivery is the Holy Grail of distributed messaging, and guaranteed delivery is the unicorn. The irony is that even if they were attainable, you likely wouldn’t want them. These types of strong guarantees demand expensive infrastructure which perform expensive coordination which require expensive administration. But what does all this expensive stuff really buy you at the end of the day?

A key problem is that there is a huge difference between message delivery and message processing. Sure, TCP can more or less ensure that your packet was either delivered or not, but what good is that actually in practice? How does the sender know that its message was successfully processed or that the receiver did what it needed to do? The only way to truly know is for the receiver to send a business-level acknowledgement. The low-level transport protocol doesn’t know about the application semantics, so the only way to go, really, is up. And if we assume that any guarantees will eventually give out, we have to account for that at the business level. To quote from a related article, “if reliability is important on the business level, do it on the business level.” It’s important not to conflate the transport protocol with the business-transaction protocol.

This is why systems like Akka don’t provide a notion of guaranteed delivery—because what does “guaranteed delivery” actually mean? Does it mean the message was handed to the transport layer? Does it mean the remote machine received the message? Does it mean the message was enqueued in the recipient’s mailbox?  Does it mean the recipient has started processing it? Does it mean the recipient has finished processing it? Each of these things has a very different set of requirements, constraints, and costs. Also, what does it even mean for a message to be “processed”? It depends on the business context. As such, it usually doesn’t make sense for the underlying infrastructure to make these decisions because the decisions usually impact the layers above significantly.

By providing only basic guarantees those use cases which do not need stricter guarantees do not pay the cost of their implementation; it is always possible to add stricter guarantees on top of basic ones, but it is not possible to retro-actively remove guarantees in order to gain more performance.

Distributed computation is inherently asynchronous and the network is inherently unreliable, so it’s better to embrace this asynchrony than to build on leaky abstractions. Rather than hide these inconveniences, make them explicit and force users to design around them. What you end up with is a more robust, more reliable, and often more performant system. This trade-off is highlighted in the paper “Exactly-once semantics in a replicated messaging system” by Huang et al. while studying the problem of exactly-once delivery:

Thus, server-centric algorithms cannot achieve exactly-once semantics. Instead, we will strive to achieve a weaker notion of correctness.

By relaxing our requirements, we end up with a solution that has less performance overhead and less complexity. Why bother pursuing the impossible? You’re paying a huge premium for something which is probably less reliable than you think while performing poorly. In many cases, it’s better to let the pendulum swing the other direction.

The network is not reliable, which means message delivery is never truly guaranteed—it can only be best-effort. The Two Generals’ Problem shows that it’s provenly impossible for two remote processes to safely agree on a decision. Similarly, the FLP impossibility result shows that, in an asynchronous environment, reliable failure detection is impossible. That is, there’s no way to tell if a process has crashed or is simply taking a long time to respond. Therefore, if it’s possible for a process to crash, it’s impossible for a set of processes to come to an agreement.

If message delivery is not guaranteed and consensus is impossible, is message ordering really that important? Some use cases might actually demand it, but I suspect, more often than not, it’s an artificial constraint. The fact that the network is unreliable, processes are faulty, and distributed communication is asynchronous makes reliable, in-order delivery surprisingly expensive. But doesn’t TCP solve this problem? At the transport level, yes, but that only gets you so far as I’ve been trying to demonstrate.

So you use TCP and process messages with a single thread. Most of the time, it just works. But what happens under heavy load? What happens when message delivery fails? What happens when you need to scale? If you are queuing messages or you have a dead-letter queue or you have network partitions or a crash-recovery model, you’re probably going to encounter duplicate, dropped, or out-of-order messages. Even if the infrastructure provides ordered delivery, these problems will likely manifest themselves at the application level.

If you’re distributed, forget about ordering and start thinking about commutativity. Forget about guaranteed delivery and start thinking about idempotence. Stop thinking about the messaging platform and start thinking about the messaging patterns and business semantics. A pattern which is commutative and idempotent will be far less brittle and more efficient than a system which is totally ordered and “guaranteed.” This is why CRDTs are becoming increasingly popular in the distributed space. Never write code which assumes messages will arrive in order when you can’t write code that will assume they arrive at all.

In the end, think carefully about the business case and what your requirements really are. Can you satisfy them without relying on costly and leaky abstractions or deceptive guarantees? If you can’t, what happens when those guarantees go out the window? This is very similar to understanding what happens when a SLA is not met. Are the performance and complexity trade-offs worth it? What about the operations and business overheads? In my experience, it’s better to confront the intricacies of distributed systems head-on than to sweep them under the rug. Sooner or later, they will rear their ugly heads.