The CAP theorem is widely discussed and often misunderstood within the world of distributed systems. It states that any networked, shared-data system can, at most, guarantee two of three properties: consistency, availability, and partition tolerance. I won’t go into detail on CAP since the literature is abundant, but the notion of “two of three”—while conceptually accessible—is utterly misleading. Brewer has indicated this, echoed by many more, but there still seems to be a lot of confusion when the topic is brought up. The bottom line is you can’t sacrifice partition tolerance, but it seems CAP is a bit more nuanced than that.
On the surface, CAP presents three categories of systems. CA implies a system which maintains consistency and availability given a perfectly reliable network. CP provides consistency and partition tolerance at the expense of availability, and AP gives us availability and partition tolerance without linearizability. Clearly, CA suggests that the system guarantees consistency and availability only when there are no network partitions. However, to say that there will never be network partitions is blatantly dishonest. This is where much of the confusion lies.
Partitions happen. They happen for countless reasons. Switches fail, NICs fail, link layers fail, servers fail, processes fail. Partitions happen even when nothing fails outright, due to GC pauses or prolonged I/O latency, for example. Let’s accept this as fact and move on. What this means is that a “CA” system is CA only until it’s not. Once that partition happens, all your assumptions and all your guarantees hit the fan in spectacular fashion. Where does this leave us?
At its core, CAP is about trade-offs, but it’s an exclusion principle. It tells us what our systems cannot do given the nature of reality. The distinction here is that not all systems fit nicely into these archetypes. If Jepsen has taught us anything, it’s that the majority of systems don’t fit into any of these categories, even when the designers state otherwise. CAP isn’t as black and white as people paint it.
There’s a really nice series on CAP written recently by Nicolas Liochon. It does an excellent job of explaining the terminology (far better than I could), which is often overloaded and misused, and it makes some interesting points. Nicolas suggests that CA should really be thought of as a specification for an operating range, while CP and AP are descriptions of behavior. I would tend to agree, but my concern is that this glosses over the trade-off that must be made.
We know that we cannot avoid network partitions. What if we specify our application like this: “this application does not handle network partitions. If one happens, the application will be partly unavailable, the data may be corrupted, and you may have to fix the data manually.” In other words, we’re really asking to be CA here, but if a partition occurs we may be CP, or, if we are unlucky, both not available and not consistent.
As an operating range, CA basically means when a partition occurs, the system throws up its hands and says, “welp, see ya later!” If we specify that the system does not work well under network partitions, we’re saying partitions are outside its operating range. What good is a specification for a spaceship designed to fly the upper atmosphere of planet Terah when we’re down here on Earth? We live in a world where partitions are the norm, so surely we need to include them in our operating range. CA does specify an operating range, but it’s not one you can put in an SLA and hand to a customer. Colloquially, it’s just a mode of “undefined behavior”—the system is consistent and available—until it’s not.
CAP isn’t a perfect metaphor, but in my mind, it does a decent job of highlighting the fundamental trade-offs involved in building distributed systems. Either we have linearizable writes or we don’t. If we do, we can’t guarantee availability. It’s true that CAP seems to imply a binary choice between consistency and availability in the face of partitions. In fact, it’s not a binary choice. You have AP, CP, or None of the Above. The problem with None of the Above is that it’s difficult to reason about and even more difficult to define. Ultimately, it ends up being more an illusion of choice since we cannot sacrifice partition tolerance.
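To make the trade-off concrete, here is a minimal sketch of what the choice looks like once a partition exists. It is not any particular system; names like `Replica`, `write_cp`, and `write_ap` are hypothetical. A quorum write refuses to acknowledge without a majority and gives up availability; a local write always succeeds and gives up linearizability.

```python
# Minimal sketch of the partition-time trade-off. All names are hypothetical.

class Unavailable(Exception):
    """Raised when a write cannot reach a quorum."""

class Replica:
    def __init__(self, name):
        self.name = name
        self.partition_group = 0   # replicas in different groups cannot talk
        self.peers = []
        self.log = []

    def reachable(self, peer):
        # Stand-in for a network call; fails across the partition.
        return self.partition_group == peer.partition_group

    def write_cp(self, value):
        # CP: acknowledge only if a majority of the cluster accepts the write.
        # On the minority side of a partition, this sacrifices availability.
        reachable_peers = [p for p in self.peers if self.reachable(p)]
        if len(reachable_peers) + 1 <= (len(self.peers) + 1) // 2:
            raise Unavailable("no quorum; refusing the write to stay linearizable")
        self.log.append(value)
        for p in reachable_peers:
            p.log.append(value)

    def write_ap(self, value):
        # AP: always accept locally and replicate to whoever is reachable.
        # The system stays available, but reads elsewhere may be stale.
        self.log.append(value)
        for p in self.peers:
            if self.reachable(p):
                p.log.append(value)

# Three replicas, with r3 cut off by a partition.
r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
r1.peers, r2.peers, r3.peers = [r2, r3], [r1, r3], [r1, r2]
r3.partition_group = 1

r1.write_cp("x=1")      # succeeds: r1 and r2 form a majority of three
r1.write_ap("x=2")      # succeeds, but r3 never sees it
try:
    r3.write_cp("x=3")  # fails: the minority side gives up availability
except Unavailable as e:
    print(e)
```

The point isn’t the code itself; it’s that `write_cp` and `write_ap` can’t both hold at the same time once the partition exists. One of them has to give something up.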
Great post. I thought again about a good way of communicating this “operating range” / “unspecified behavior” effectively:
CA means consistency, availability and partition intolerance.
Partition intolerance should be understood like gluten intolerance: you *should not* eat gluten, but it *can* happen. If it does, that’s bad, and you must do something about it. And you should make it clear that you’re gluten intolerant if you go to a restaurant.
That’s a good analogy. :)
Hi Tyler,
– Nice site!
Long before Brewer presented his CAP theorem, Unix people working with distributed systems dismissed the idea of consistency, availability and persistence. They talked about “global versus local” state and set up “session handlers” to assure global state where needed, broadcasting semaphores.
The local entity was modelled as a state-machine: When a transaction was performed locally, it logged the transaction (in case of local failure), opened a session, broadcasted the transaction and suspended execution until a “session completion” message returned from each of the remote entities. Once the count of remote flags was complete, it closed the session and wrote to the local log. If any jobs were pending, they had to wait until the semaphore was pulled off the object (or account) where a session was active. Conflicts caused by crossing transactions were handled with a chosen logic. A negative transaction could only be done after a positive had completed, etc.
This was banking in the 1970s, with semaphores and remote sessions. Today the same problem persists even within NUMA architectures, where thread access to in-memory objects is polled cross-socket via remote cores. I have spent the last 2.5 years rediscovering what people did all the way back in the ’60s, and people like Brewer are just causing us to re-invent the wheel.
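A rough sketch of the flow described above, with purely hypothetical names and in-process stand-ins for the remote entities: the transaction is logged, the semaphore on the object is taken, the transaction is broadcast, execution waits for a “session completion” from each remote entity, and only then is the local commit made.

```python
import threading

class RemoteEntity:
    """Stand-in for a remote ledger: acknowledges each transaction it receives."""
    def __init__(self):
        self.log = []

    def handle(self, txn):
        self.log.append(txn)
        return "session completion"

class LocalEntity:
    def __init__(self, remotes):
        self.remotes = remotes
        self.log = []
        self.balance = 0
        self.semaphore = threading.Lock()   # the semaphore on this object/account

    def apply(self, txn):
        with self.semaphore:                           # pending jobs wait here
            self.log.append(("intent", txn))           # log first, in case of local failure
            acks = [r.handle(txn) for r in self.remotes]   # broadcast and suspend until each replies
            if acks.count("session completion") == len(self.remotes):
                self.balance += txn                    # session complete: commit locally
                self.log.append(("committed", txn))

# A deposit must complete before the withdrawal that depends on it can run.
account = LocalEntity([RemoteEntity(), RemoteEntity()])
account.apply(+100)
account.apply(-40)
print(account.balance)   # 60
```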
My best hint:
#1 Ignore Brewer
#2 Stick to message passing and set object affinity at runtime – even on NUMA – because it just works.
” broadcasted the transaction and suspended execution until a “session completion” message returned from each of the remote entities”
So basically your system can become unavailable under high load. This is just CP and fits into the CAP theorem.
I think you misunderstand: Let me try more formally:
A system X is in state S, when an event E arrives to the system.
The event is caught by a session handler H, which isolates the event and the propagation path through X that the event will cause, using semaphores.
This isolated part “I” reduces X to X’ which remains in state S.
X’ is still consistent, available and may be in any number of partitions, all in state S.
I is also consistent, available and in any number of partitions as it is reserved to H through semaphores.
If another event arrives which should affect the same parts of X as E did, then it is _not_ sent to X but to H, as the semaphores indicate that the session handler H has operational precedence for I.
At this time both X’ and I are available. The key is that separating the topology of X into X’ and I assures that each subsystem is consistent and available, in two major partitions governing any number of sub-partitions of the system X.
Once E (and any E’, E”, …) has been processed in “I” the session handler H pulls the semaphores, whereby X’ and I may merge into X in another persistent state S’.
Hereby the system as a whole – as Brewer describes it – only exists when nothing happens. Whenever some event occurs, the system as a whole is always in transition.
The point, then, is that as long as an event is knocking on the door of the system and you don’t have a session handler, there is no “global state”. There are only “local” partitions, and they will all be consistent and available. The many local partitions are never unavailable, because the session handler extends the topology of the system while it is in transition.
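For what it’s worth, here is a minimal sketch of one way to read that description, with hypothetical names and a plain dict standing in for X: objects touched by an in-flight event are reserved to the session handler H via a semaphore, further events on those objects are deferred to H, everything else keeps operating on X’, and when H pulls the semaphore, I merges back into X in state S’.

```python
class SessionHandler:
    def __init__(self, system):
        self.system = system      # the whole system X (object id -> state)
        self.isolated = {}        # I: objects reserved while a session is open
        self.deferred = []        # events that arrived for objects reserved to H

    def open_session(self, key):
        # Set the semaphore: carve the affected object (part of I) out of X, leaving X'.
        self.isolated[key] = self.system.pop(key)

    def on_event(self, key, delta):
        if key in self.isolated:
            # Semaphore is set: the handler H has operational precedence for I.
            self.deferred.append((key, delta))
        elif key in self.system:
            # X' stays consistent and available for everything outside I.
            self.system[key] += delta

    def close_session(self, key, delta):
        # Process E inside I, pull the semaphore, and merge I back into X (state S').
        self.isolated[key] += delta
        self.system[key] = self.isolated.pop(key)
        for k, d in [e for e in self.deferred if e[0] == key]:
            self.deferred.remove((k, d))
            self.system[k] += d

# acct:2 stays available while acct:1 is reserved to the handler.
system = {"acct:1": 100, "acct:2": 50}
h = SessionHandler(system)
h.open_session("acct:1")
h.on_event("acct:2", +10)    # handled by X' immediately
h.on_event("acct:1", -30)    # deferred: acct:1 is reserved to H
h.close_session("acct:1", +5)
print(system)                # {'acct:1': 75, 'acct:2': 60}
```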
Better explained now?
Not really. Do you mean something similar to what’s described under vector clocks here?
No. Knowledge of Vector clocks is not required to understand the difference between the global and local state in a distributed system.
I didn’t necessarily mean vector clocks themselves, just the generic mechanism of keeping distinct, divergent versions. Can you point me to a more detailed online post describing the mechanism you’re trying to explain? I’m just curious about anything distributed.