You Own Your Availability

There’s been a lot of discussion around “availability” lately. It’s often trumpeted with phrases like “you own your availability,” meaning there is no buck-passing when it comes to service uptime. The AWS outage earlier this week served as a stark reminder that, while owning your availability is a commendable ambition, for many it’s still largely owned by Amazon and the like.

In order to “own” your availability, it’s important to first understand what “availability” really means. Within the context of distributed-systems theory, availability is usually discussed in relation to the CAP theorem. Formally, CAP defines availability as a liveness property: “every request received by a non-failing node in the system must result in a response.” This is a weak definition for two reasons. First, the proviso “every request received by a non-failing node” means that a system in which all nodes have failed is trivially available. Second, Gilbert and Lynch, who formalized and proved CAP, stipulate no upper bound on latency, only that operations eventually return a response. This means an operation could take weeks to complete and availability would not be violated.

Martin Kleppmann points out these issues in his recent paper “A Critique of the CAP Theorem.” I don’t think there is necessarily a problem with the formalizations made by CAP, just a matter of engineering practicality. Kleppmann’s critique recalls a pertinent quote from Leslie Lamport on the topic of liveness:

Liveness properties are inherently problematic. The question of whether a real system satisfies a liveness property is meaningless; it can be answered only by observing the system for an infinite length of time, and real systems don’t run forever. Liveness is always an approximation to the property we really care about. We want a program to terminate within 100 years, but proving that it does would require the addition of distracting timing assumptions. So, we prove the weaker condition that the program eventually terminates. This doesn’t prove that the program will terminate within our lifetimes, but it does demonstrate the absence of infinite loops.

Despite the pop culture surrounding it, CAP is not meant to neatly classify systems. It’s meant to serve as a jumping-off point from which we can reason from the ground up about distributed systems and the inherent limitations associated with them. It’s a reality check.

Practically speaking, availability is typically described in terms of “uptime,” or the proportion of time during which requests are successfully served. Brewer refers to this as “yield,” which is the probability of completing a request. This is the metric normally measured in “nines,” such as “five-nines availability.”

In the presence of faults there is typically a tradeoff between providing no answer (reducing yield) and providing an imperfect answer (maintaining yield, but reducing harvest).

However, this definition is only marginally more useful than CAP’s since it still doesn’t provide an upper bound on computation.

CAP is better used as a starting point for system design and understanding trade-offs than as a tool for reasoning about availability because it doesn’t really account for real availability. “Harvest” and “yield” show that availability is really a probabilistic property and that the trade with consistency is usually a gradient. But availability is much more nuanced than CAP’s “are we serving requests?” and harvest/yield’s “how many requests?” In practice, availability equates to SLAs. How many requests are we serving? At what rate? At what latency? At what percentiles? These things can’t really be formalized into a theorem like CAP because they are empirically observed, not properties of an algorithm.

Availability is specified by an SLA but observed by outside users. Unlike consistency, which is a property of the system maintained by algorithmic invariants, availability is determined by the client. For example, if one user’s requests are served but another user’s are not, the system is completely available to the first user and completely unavailable to the second.

To truly own your availability, you have to own every piece of infrastructure from the client to you, in addition to the infrastructure your system uses. Therefore, you can’t own your availability any more than you can own Comcast’s fiber or Verizon’s 4G network. This is obviously impractical, if not impossible, but it might also be taking “own your availability” a bit too literally.

What “you own your availability” actually means is “you own your decisions.” Plain and simple. You own the decision to use AWS. You own the decision to use DynamoDB. You own the decision to not use multiple vendors. Owning your availability means making informed decisions about technology and vendors. “What is the risk/reward for using this database?” “Does using a PaaS/IaaS incur vendor lock-in? What happens when that service goes down?” It also means making informed decisions about the business. “What is the cost of our providers not meeting their SLAs? Is it cost-effective to have redundant providers?”

An SLA is not an insurance policy or a hedge against the business impact of an outage; it’s merely a refund policy. Use SLAs to set expectations and make intelligent decisions, but don’t bet the business on them. Availability is not a timeshare you can pawn off at will, just like you can’t redirect your tech support to Amazon or Google.

It’s impossible to own your availability because there are too many things left to probability, too many unknowns, and too many variables outside of our control. Own as much as you can predict, as much as you can control, and as much as you can afford. The rest comes down to making informed decisions, hoping for the best, and planning for the worst.

Designed to Fail

When it comes to reliability engineering, people often talk about things like fault injection, monitoring, and operations runbooks. These are all critical pieces for building systems which can withstand failure, but what’s less talked about is the need to design systems which deliberately fail.

Reliability design has a natural progression which closely follows that of architectural design. With monolithic systems, we care more about preventing failure from occurring. With service-oriented architectures, controlling failure becomes less manageable, so instead we learn to anticipate it. With highly distributed microservice architectures where failure is all but guaranteed, we embrace it.

What does it mean to embrace failure? Anticipating failure is understanding the behavior when things go wrong, building systems to be resilient to it, and having a game plan for when it happens, either manual or automated. Embracing failure means making a conscious decision to purposely fail, and it’s essential for building highly available large-scale systems.

A microservice architecture typically means a complex web of service dependencies. One of SOA’s goals is to isolate failure and allow for graceful degradation. The key to being highly available is learning to be partially available. Frequently, one of the requirements for partial availability is telling the client “no.” Outright rejecting service requests is often better than allowing them to back up because, when dealing with distributed services, the latter usually results in cascading failure across dependent systems.

While designing our distributed messaging service at Workiva, we made the explicit decision to drop messages on the floor if the system detects it’s becoming overloaded. As queues become backed up, incoming messages are discarded, a statsd counter is incremented, and a backpressure notification is sent to the client. Upon receiving this notification, the client can respond accordingly by failing fast, exponentially backing off, or using some other flow-control strategy. By bounding resource utilization, we maintain predictable performance, predictable (and measurable) lossiness, and impede cascading failure.
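A minimal sketch of the idea in Go (not our actual implementation; the types and drop counter are stand-ins):

```go
package queue

import (
	"errors"
	"sync/atomic"
)

// ErrOverloaded is the backpressure notification returned to producers.
var ErrOverloaded = errors.New("queue overloaded: back off")

type Message []byte

type Queue struct {
	messages chan Message // bounded buffer caps resource utilization
	dropped  uint64       // stands in for a statsd counter
}

func New(capacity int) *Queue {
	return &Queue{messages: make(chan Message, capacity)}
}

// Enqueue admits a message if there is room; otherwise it drops the message
// on the floor, increments the drop counter, and tells the caller to back off.
func (q *Queue) Enqueue(m Message) error {
	select {
	case q.messages <- m:
		return nil
	default:
		atomic.AddUint64(&q.dropped, 1)
		return ErrOverloaded
	}
}
```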

Other techniques include building kill switches into service calls and routers. If an overloaded service is not essential to core business, we fail fast on calls to it to prevent availability or latency problems upstream. For example, a spam-detection service is not essential to an email system, so if it’s unavailable or overwhelmed, we can simply bypass it. Netflix’s Hystrix has a set of really nice patterns for handling this.
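A rough sketch of a kill switch in Go (not Hystrix itself; the spam-checker interface and the timeout value are assumptions):

```go
package email

import (
	"context"
	"sync/atomic"
	"time"
)

// spamKillSwitch lets operators bypass the spam service entirely when it is
// overloaded or misbehaving.
var spamKillSwitch atomic.Bool

type SpamChecker interface {
	IsSpam(ctx context.Context, msg []byte) (bool, error)
}

// checkSpam treats spam detection as a non-essential dependency: if the kill
// switch is thrown, the call fails, or it exceeds a short deadline, we fail
// fast and deliver the message rather than stall the critical path.
func checkSpam(ctx context.Context, sc SpamChecker, msg []byte) bool {
	if spamKillSwitch.Load() {
		return false // bypass the service entirely
	}
	ctx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	spam, err := sc.IsSpam(ctx, msg)
	if err != nil {
		return false // degrade gracefully instead of propagating the failure
	}
	return spam
}
```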

If we’re not careful, we can be our own worst enemy. Many times, it’s our own internal services that mount the biggest DoS attacks on us. By isolating and controlling that load, we can prevent failure from becoming widespread and unpredictable. By building in backpressure mechanisms and other types of intentional “failure” modes, we can ensure better availability and reliability for our systems through graceful degradation. Sometimes it’s better to fight fire with fire and failure with failure.

Service-Disoriented Architecture

“You can have a second computer once you’ve shown you know how to use the first one.” -Paul Barham

The first rule of distributed systems is don’t distribute your system until you have an observable reason to. Teams break this rule on the regular. People have been talking about service-oriented architecture for a long time, but only recently have microservices been receiving the hype.

The problem, as Martin Fowler observes, is that teams are becoming too eager to adopt a microservice architecture without first understanding the inherent overheads. A contributing factor, I think, is you only hear the success stories from companies who did it right, like Netflix. However, what folks often fail to realize is that these companies—in almost all cases—didn’t start out that way. There was a long and winding path which led them to where they are today. The inverse of this, which some refer to as microservice envy, is causing teams to rush into microservice hell. I call this service-disoriented architecture (or sometimes disservice-oriented architecture when the architecture is DOA).

The term “monolith” has a very negative connotation—unscalable, unmaintainable, unresilient. These things are not intrinsically tied to each other, however, and there’s no reason a single system can’t be modular, maintainable, and fault tolerant at reasonable scale. It’s just less sexy. Refactoring modular code is much easier than refactoring architecture, and refactoring across service boundaries is equally difficult. Fowler describes this as monolith-first, and I think it’s the right approach (with some exceptions, of course).

Don’t even consider microservices unless you have a system that’s too complex to manage as a monolith. The majority of software systems should be built as a single monolithic application. Do pay attention to good modularity within that monolith, but don’t try to separate it into separate services.

Service-oriented architecture is about organizational complexity and system complexity. If you have both, you have a case to distribute. If you have one of the two, you might have a case (although if you have organizational complexity without system complexity, you’ve probably scaled your organization improperly). If you have neither, you do not have a case to distribute. State, specifically distributed state, is hell, and some pundits argue SOA is satan—perhaps a necessary evil.

There are a lot of motivations for microservices: anti-fragility, fault tolerance, independent deployment and scaling, architectural abstraction, and technology isolation. When services are loosely coupled, the system as a whole tends to be less fragile. When instances are disposable and stateless, services tend to be more fault tolerant because we can spin them up and down, balance traffic, and failover. When responsibility is divided across domain boundaries, services can be independently developed, deployed, and scaled while allowing the right tools to be used for each.

We also need to acknowledge the disadvantages. Adopting a microservice architecture does not automatically buy you anti-fragility. Distributed systems are incredibly precarious. We have to be aware of things like asynchrony, network partitions, node failures, and the trade-off between availability and data consistency. We have to think about resiliency but also the business and UX implications. We have to consider the boundaries of distributed systems like CAP and exactly-once delivery.

When distributing, the emphasis should be on resilience engineering and adopting loosely coupled, stateless components—not microservices for microservices’ sake. We need to view eventual consistency as a tool, not a side effect. The problem I see is that teams often end up with what is essentially a complex, distributed monolith. Now you have two problems. If you’re building a microservice which doesn’t make sense outside the context of another system or isn’t useful on its own, stop and re-evaluate. If you’re designing something to be fast and correct, realize that distributing it will frequently take away both.

Like anti-fragility, microservices do not automatically buy you better maintainability or even scalability. Adopting them requires the proper infrastructure and organization to be in place. Without these, you are bound to fail. In theory, they are intended to increase development velocity, but in many cases the microservice premium ends up slowing it down while creating organizational dependencies and bottlenecks.

There are some key things which must be in place in order for a microservice architecture to be successful: a proper continuous-delivery pipeline, competent DevOps and Ops teams, and prudent service boundaries, to name a few. Good monitoring is essential. It’s also important we have a thorough testing and integration story. This isn’t even considering the fundamental development complexities associated with SOA mentioned earlier.

The better strategy is a bottom-up approach. Start with a monolith or small set of coarse-grained services and work your way up. Make sure you have the data model right. Break out new, finer-grained services as you need to and as you become more confident in your ability to maintain and deploy discrete services. It’s largely about organizational momentum. A young company jumping straight to a microservice architecture is like a golf cart getting on the freeway.

Microservices offer a number of advantages, but for many companies they are a bit of a Holy Grail. Developers are always looking for a silver bullet, but there is always a cost. What we need to do is minimize this cost, and with microservices, this typically means easing our way into it rather than diving into the deep end. Team autonomy and rapid iteration are noble goals, but if we’re not careful, we can end up creating an impediment. Microservices require organizational and system maturity. Otherwise, they end up being a premature architectural optimization with a lot of baggage. They end up creating a service-disoriented architecture.

Distributed Systems Are a UX Problem

Distributed systems are not strictly an engineering problem. It’s far too easy to write them off as a “backend” development concern, but the reality is there are implications at every point in the stack. The trade-offs we make lower in the stack to buy responsiveness bubble up to the top—so much so that they almost always impact the application in some way. Distributed systems affect the user. We need to shift the focus from system properties and guarantees to business rules and application behavior. We need to understand the limitations and trade-offs at each level in the stack and why they exist. We need to assume failure and plan for recovery. We need to start thinking of distributed systems as a UX problem.

The Truth is Prohibitively Expensive

Stop relying on strong consistency. Coordination and distributed transactions are slow and inhibit availability. The cost of knowing the “truth” is prohibitively expensive for many applications. For that matter, what you think is the truth is likely just a partial or outdated version of it.

Instead, choose availability over consistency by making local decisions with the knowledge at hand and design the UX accordingly. By making this trade-off, we can dramatically improve the user’s experience—most of the time.

Failure Is an Option

There are a lot of problems with simultaneity in distributed computing. As Justin Sheehy describes it, there is no “now” when it comes to distributed systems—that article, by the way, is a must-read for every engineer, regardless of where they work in the stack.

While some things about computers are “virtual,” they still must operate in the physical world and cannot ignore the challenges of that world.

Even though computers operate in the real world, they are disconnected from it. Imagine an inventory system. It may place orders to its artificial heart’s desire, but if the warehouse burns down, there’s no fulfilling them. Even if the system is perfect, its state may be impossible. But the system is typically not perfect because the truth is prohibitively expensive. And not only do warehouses catch fire or forklifts break down, as rare as this may be, but computers fail and networks partition—and that’s far less rare.

The point is, stop trying to build perfect systems because one of two things will happen:

1. You have a false sense of security because you think the system is perfect, and it’s not.

or

2. You will never ship because perfection is out of reach or exorbitantly expensive.

Either case can be catastrophic, depending on the situation. With systems, failure is not only an option, it’s an inevitability, so let’s plan for it as such. We have a lot to gain by embracing failure. Eric Brewer articulated this idea in a recent interview:

So the general answer is you allow things to be inconsistent and then you find ways to compensate for mistakes, versus trying to prevent mistakes altogether. In fact, the financial system is actually not based on consistency, it’s based on auditing and compensation. They didn’t know anything about the CAP theorem, that was just the decision they made in figuring out what they wanted, and that’s actually, I think, the right decision.

We can look to ATMs, and banks in general, as the canonical example of how this works. When you withdraw money, the bank could choose to first coordinate the transaction, calculating your available balance at that moment, before issuing the withdrawal. But what happens when the ATM is temporarily disconnected from the bank? If it refuses to dispense cash until it can coordinate, the bank loses out on revenue.

Instead, they take a calculated risk. They choose availability and offset the risk of overdraft with fees and interest. Likewise, banks use double-entry bookkeeping to provide an audit trail. Every credit has a corresponding debit. Mistakes happen—accounts are debited twice, an account is credited without another being debited—the failure modes are virtually endless. But we audit and compensate, detect and recover. Banks are loosely coupled systems. Accountants don’t use erasers. Why should programmers?

When you find yourself saying “this is important data or people’s money, it has to be correct,” consider how the problem was solved before computers. Building on Quicksand by Dave Campbell and Pat Helland is a great read on this topic:

Whenever the authors struggle with explaining how to implement loosely-coupled solutions, we look to how things were done before computers. In almost every case, we can find inspiration in paper forms, pneumatic tubes, and forms filed in triplicate.

Consider the lost request and its idempotent execution. In the past, a form would have multiple carbon copies with a printed serial number on top of them. When a purchase-order request was submitted, a copy was kept in the file of the submitter and placed in a folder with the expected date of the response. If the form and its work were not completed by the expected date, the submitter would initiate an inquiry and ask to locate the purchase-order form in question. Even if the work was lost, the purchase-order would be resubmitted without modification to ensure a lack of confusion in the processing of the work. You wouldn’t change the number of items being ordered as that may cause confusion. The unique serial number on the top would act as a mechanism to ensure the work was not performed twice.

Computers allow us to greatly improve the user experience, but many of the same fail-safes still exist, just slightly rethought.
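For instance, a hypothetical order processor can apply the same serial-number trick in code (a minimal sketch; a real system would persist the dedupe state durably):

```go
package orders

import "sync"

// Order carries a client-generated serial number that is reused verbatim
// whenever the request is resubmitted, like the printed number on a carbon form.
type Order struct {
	Serial string
	Items  map[string]int
}

type Processor struct {
	mu   sync.Mutex
	seen map[string]bool // stands in for a durable record of processed serials
}

func NewProcessor() *Processor {
	return &Processor{seen: make(map[string]bool)}
}

// Submit files the order exactly once. A resubmission with the same serial
// number is acknowledged but the work is not performed twice.
func (p *Processor) Submit(o Order) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.seen[o.Serial] {
		return
	}
	p.seen[o.Serial] = true
	p.fulfill(o)
}

func (p *Processor) fulfill(o Order) { /* place the purchase order */ }
```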

The idea of compensation is actually a common theme within distributed systems. The Saga pattern is a great example of this. Large-scale systems often have to coordinate resources across disparate services. Traditionally, we might solve this problem using distributed transactions like two-phase commit (2PC). The problem with this approach is that it doesn’t scale very well, it’s slow, and it’s not particularly fault tolerant. 2PC blocks when the coordinator fails, and even 3PC is still susceptible to network partitions.

Sagas split a long-lived transaction into individual, interleaved sub-transactions. Each sub-transaction in the sequence has a corresponding compensating transaction which reverses its effects. The compensating transactions must be idempotent so they can be safely retried. In the event of a partial execution, the compensating transactions are run and the Saga is effectively rolled back.

The commonly used example for Sagas is booking a trip. We need to ensure flight, car rental, and hotel are all booked or none are booked. If booking the flight fails, we cancel the hotel and car, etc. Sagas trade off atomicity for availability while still allowing us to manage failure, a common occurrence in distributed systems.
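A bare-bones sketch of the pattern in Go (the booking and cancellation functions are hypothetical stubs): each step pairs an action with an idempotent compensating transaction, and a failure rolls back the completed steps in reverse order.

```go
package trip

type step struct {
	action     func() error
	compensate func() error // must be idempotent so it can be safely retried
}

// runSaga executes each sub-transaction in order. If one fails, the
// compensating transactions for the completed steps run in reverse,
// effectively rolling the saga back.
func runSaga(steps []step) error {
	var done []step
	for _, s := range steps {
		if err := s.action(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				done[i].compensate() // a real system would retry or queue this
			}
			return err
		}
		done = append(done, s)
	}
	return nil
}

// bookTrip books the flight, hotel, and car, or none of them.
func bookTrip() error {
	return runSaga([]step{
		{action: bookFlight, compensate: cancelFlight},
		{action: bookHotel, compensate: cancelHotel},
		{action: bookCar, compensate: cancelCar},
	})
}

// Stubs standing in for calls to real booking services.
func bookFlight() error   { return nil }
func cancelFlight() error { return nil }
func bookHotel() error    { return nil }
func cancelHotel() error  { return nil }
func bookCar() error      { return nil }
func cancelCar() error    { return nil }
```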

Compensation has a lot of applications as a UX principle because it’s really the only way to build loosely coupled, highly available services.

Calculated Recovery

Pat Helland describes computing as nothing more than “memories, guesses, and apologies.” Computers always have partial knowledge. Either there is a disconnect with the real world (warehouse is on fire) or there is a disconnect between systems (System A sold a Foo Widget but, unbeknownst to it, System B just sold the last one in inventory—oops!). Systems don’t make decisions, they make guesses. The guess might be good or it might be bad, but rarely is there certainty. We can wait to collect as much information as possible before making a guess, but it means progress can’t be made until the system is confident enough to do so.

Computers have memory. This means they remember facts they have learned and guesses they have made. Memories help systems make better guesses in the future, and they can share those memories with other systems to help in their guesses. We can store more memories at the cost of more money, and we can survey other systems’ memories at the cost of more latency.

It is a business decision how much money, latency, and energy should be spent on reducing forgetfulness. To make this decision, the costs of the increased probability of remembering should be weighed against the costs of occasionally forgetting stuff.

Generally speaking, the more forgetfulness we can tolerate, the more responsive our systems will be, provided we know how to handle the situations where something is forgotten.

Sooner or later, a system guesses wrong. It sucks. It might mean we lose out on revenue; the business isn’t happy. It might mean the user loses out on what they want; the customer isn’t happy. But we calculate the impact of these wrong guesses, we determine when the trade-offs do and don’t make sense, we compensate, and—when shit hits the fan—we apologize.

Business realities force apologies.  To cope with these difficult realities, we need code and, frequently, we need human beings to apologize. It is essential that businesses have both code and people to manage these apologies.

Distributed systems are as much about failure modes and recovery as they are about being operationally correct. It’s critical that we can recover gracefully when something goes wrong, and often that affects the UX.

We could choose to spend extraordinary amounts of money and countless man-hours laboring over a system which provides the reliability we want. We could construct a data center. We could deploy big, expensive machines. We could install redundant fiber and switches. We could drudge over infallible code. Or we could stop, think for a moment, and realize maybe “sorry” is a more effective alternative. Knowing when to make that distinction can be the difference between a successful business and a failed one. The implications of distributed systems may be wider reaching than you thought.

Product Development is a Trust Fall

A couple weeks ago, Marty Cagan gave an outstanding talk at CraftConf on why products fail despite having great engineering teams. In it, he calls out many of the common mistakes made by teams, and I think there is an underlying theme: trust.

Product development is a trust fall. In order to be successful, a chain of trust must be established from the business all the way down to the engineers. If any point in that chain is compromised, the integrity of the product—and specifically its success—is put in jeopardy.

Engineers will innovate. Trust them. Engineers will discover requirements. Trust them. Engineers will identify risks. Trust them.

This trust must be symmetrical. The business must trust its product managers, who must trust the business. Product managers must trust the engineers, who must trust the product managers. Each level in this hierarchy must act as its own trust anchor. Trust is assumed, not derived. To say the opposite would imply your system of hiring and firing is fundamentally flawed. If you do not trust your teams, your teams will not trust you.

Engineers will prioritize. Trust them. Engineers will deliver. Trust them. Engineers will fail. Let them.

Product development is a trust fall. When you let go, your team will catch you. And when they don’t, they’ll pick you up, dust you off, and say, “we’ll make an adjustment.” Fail fast but recover faster. The more times you fall and hit the ground, the more adjustments you make. The quicker you repeat this process, the less time you spend on the ground. Shame on the teams that spend days, weeks, months planning their fall, ensuring everything is in place, only to find the ground has moved.

It’s inexcusable to say you fail fast when it’s really a slow, prolonged death in product, in technology, or in execution, yet it’s surprisingly common. In order to innovate, you have to fail first. In order to build an effective team, you have to fail first. In order to produce a successful product, you have to fail first. The number one fatal mistake teams make is not recognizing when they’ve failed or being too proud to admit it. This is what Agile is actually about. It’s not about roadmaps or requirements gathering or user stories or stand-ups. It’s about failing and adjusting, failing and adjusting, failing and adjusting. Agile is micro failure on a macro level. As Cagan remarks, the biggest visible distinguisher of a great team is no roadmap.

A roadmap is essentially dooming your team to get out a small number of things that will almost certainly—most of them—not work.

Developers need to be part of the ideation process from day one. The customer is usually wrong. They often don’t know what they want or what’s possible. Developers are invested in the technology and understand its capabilities and limitations. Trust them.

If you’re just using your developers to code, you’re only getting about half their value.

I’ve said it before: without a focused vision, a product will fail. Without embracing new ideas and technology, a company will become irrelevant. Developers must be closely involved with both aspects in order to be successful. Innovate, fail, adjust, deliver. Repeat.

Product development is a trust fall. The key is letting go.

Writing Good Code

There’s no shortage of people preaching the importance of good code. Indeed, many make a career of it. The resources available are equally endless, but lately I’ve been wondering how to extract the essence of building high-quality systems into a shorter, more concise narrative. This is actually something I’ve thought about for a while, but I’m just now starting to formulate some ideas into a blog post. The ideas aren’t fully developed, but my hope is to flesh them out further in the future. You can talk about design patterns, abstraction, encapsulation, and cohesion until you’re blue in the face, but what is the essence of good code?

Like any other engineering discipline, quality control is a huge part of building software. This isn’t just ensuring that it “works”—it’s ensuring it works under the complete range of operating conditions, ensuring it’s usable, ensuring it’s maintainable, ensuring it performs well, and ensuring a number of other characteristics. Verifying it “works” is just a small part of a much larger picture. Anybody can write code that works, but there’s more to it than that. Software is more malleable than most other things. Not only does it require longevity, it requires giving in to that malleability. If it doesn’t, you end up with something that’s brittle and broken. Because of this, it’s vital we test for correctness and measure for quality.

SCRAP for Quality

Quality is a very subjective thing. How can one possibly measure it? Code complexity and static analysis tooling come to mind, and these are deservedly valued, but they really just scratch the surface. How do we narrow an immensely broad topic like “quality” into a set of tangible, quantifiable goals? This is really the crux of the problem, but we can start by identifying a sort of checklist or set of guidelines for writing software. This breaks that larger problem into smaller, more digestible pieces. The checklist I’ve come up with is called SCRAP, an acronym defined below. It’s unlikely to be comprehensive, but I think it covers most, if not all, of the key areas.

Scalability: Plan for growth
Complexity: Plan for humans
Resiliency: Plan for failure
API: Plan for integration
Performance: Plan for execution

Each of these items is itself a blog post, so this is only a brief explanation. There is definitely overlap between some of these facets, and there are also multiple dimensions to some.

Scalability is a plan for growth—in code, in organization, in architecture, and in workload. Without it, you reach a point where your system falls over, whether it’s because of a growing userbase, a growing codebase, or any number of other reasons. It’s also worth pointing out that without the ‘S’, all you have is CRAP. This also helps illustrate some of the overlap between these areas of focus as it leads into Complexity, which is a plan for humans. Scalability is about technology scale and demand scale, but it’s also about people scale. As your team grows or as your company grows, how do you manage that growth at the code level?

Planning for people doesn’t just mean managing growth, it also means managing complexity. If code is overly complex, it’s difficult to maintain, it’s difficult to extend, and it’s difficult to fix. If systems are overly complex, they’re difficult to deploy, difficult to manage, and difficult to monitor. Plan for humans, not machines.

Resiliency is a strategy for fault tolerance. It’s a plan for failure. What happens when you crash? What happens when a service you depend on crashes? What happens when the database is unavailable? What happens when the network is unreliable? Systems of all kinds need to be designed with the expectation of failure. If you’re not thinking about failure at the code level, you’re not thinking about it enough.

One thing you should be noticing is that “people” is a cross-cutting concern. After all, it’s people who design the systems, and it’s people who write the code. While API is a plan for integration, it’s people who integrate the pieces. This is about making your API a first-class citizen. It doesn’t matter if it’s an internal API, a library API, or a RESTful API. It doesn’t matter if it’s for first parties or third parties. As a programmer, your API is your user interface. It needs to be clean. It needs to be sensible. It needs to be well-documented. If those integration points aren’t properly thought out, the integration will be more difficult than it needs to be.

The last item on the checklist is Performance. I originally defined this as a plan for speed, but I realized there’s a lot more to performance than doing things fast. It’s about doing things well, which is why I call Performance a plan for execution. Again, this has some overlap with Resiliency and Scalability, but it’s also about measurement. It’s about benchmarking and profiling. It’s about testing at scale and under failure because testing in a vacuum doesn’t mean much. It’s about optimization.

This brings about the oft-asked question: how do I know when and where to optimize? While premature optimization might be the root of all evil, it’s not a universal law. Optimize along the critical path and outward from there only as necessary. The further you get from that critical path, the more wasted effort it’s going to end up being. It depreciates quickly, so don’t lose sight of your optimization ROI. This will enable you to ship quickly and ship quality code. But once you ship, you’re not done measuring! It’s more important than ever that you continue to measure in production. Use performance and usage-pattern data to drive intelligent decisions and intelligent iteration. The payoff is that this doesn’t just apply to code decisions, it applies to all decisions. This is where the real value of measuring comes through. Decisions that aren’t backed by data aren’t decisions, they’re impulses. Don’t be impulsive, be empirical.

Going Forward

There is work to be done with respect to quantifying the items on this checklist. However, I strongly suspect even just thinking about them, formally or informally, will improve the overall quality of your code by an equally unmeasurable order of magnitude. If your code doesn’t pass this checklist, it’s tech debt. Sometimes that’s okay, but remember that tech debt has compounding interest. If you don’t pay it off, you will eventually go bankrupt.

It’s not about being a 10x developer. It’s about being a 1x developer who writes 10x code. By that I mean the quality of your code is far more important than its quantity. Quality will outlast and outperform quantity. These guidelines tend to have a ripple effect. Legacy code often breeds legacy-like code. Instilling these rules in your developer culture helps to make engineers cognizant of when they should break the mold, introduce new patterns, or improve existing ones. Bad code begets bad code, and bad code is the atrophy of good developers.

If State Is Hell, SOA Is Satan

More and more companies are describing their success stories regarding the switch to a service-oriented architecture. As with any technological upswing, there’s a clear and palpable hype factor involved (Big Data™ or The Cloud™ anyone?), but obviously it’s not just puff.

While microservices and SOA have seen a staggering rate of adoption in recent years, the mindset of developers often seems to be stuck in the past. I think this is, at least in part, because we seek a mental model we can reason about. It’s why we build abstractions in the first place. In a sense, I would argue there’s a comparison to be made between the explosion of OOP in the early 90’s and today’s SOA trend. After all, SOA is as much about people scale as it is about workload scale, so it makes sense from an organizational perspective.

The Perils of Good Abstractions

While systems are becoming more and more distributed, abstractions are attempting to make them less and less complex. Mesosphere is a perfect example of this, attempting to provide the “datacenter operating system.” Apache Mesos allows you to “program against your datacenter like it’s a single pool of resources.” It’s an appealing proposition to say the least. PaaS like Google App Engine and Heroku offer similar abstractions—write your code without thinking about scale. The problem is you absolutely have to think about scale or you’re bound to run into problems down the road. And while these abstractions are nice, they can be dangerous just the same. Welcome to the perils of good abstractions.

I like to talk about App Engine because I have firsthand experience with it. It’s an easy sell for startups. It handles spinning up instances when you need them, turning them down when you don’t. It’s your app server, database, caching, job scheduler, task queue all in one, and it does it at scale. There’s vendor lock-in, sure, yet it means no ops, no sysadmins, no overhead. Push to deploy. But it’s a leaky abstraction. It has to be. App Engine scales because it’s distributed, but it allows—no, encourages—you to write your system as a monolith. The datastore, memcache, and task queue accesses are masked as RPCs. This is great for our developer mental model, but it will bite you if you’re not careful. App Engine imposes certain limitations to encourage good design; for instance, front-end requests and datastore calls are limited to 60 seconds (it used to be much less), but the leakiness goes beyond that.

RPC is consistently at odds with distributed systems. I would go so far as to say it’s an anti-pattern in many cases. RPC encourages writing synchronous code, but distributed systems are inherently asynchronous. The network is not reliable. The network is not fast. The network is not your friend. Developers who either don’t understand this or don’t realize what’s happening when they make an RPC will write code as if they were calling a function. It will sure as hell look like just calling a function. When we think synchronously, we end up with systems that are slow, fault intolerant, and generally not scalable. To be quite honest, however, this is perfectly acceptable for 90% of startups as they are getting off the ground because they don’t have workloads at meaningful scale.
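To make the mindset shift concrete, here is a small sketch in Go (the user-service client is hypothetical): the remote call carries an explicit deadline and a failure path instead of being treated like a plain function call.

```go
package profile

import (
	"context"
	"time"
)

type User struct{ Name string }

// UserClient is a stand-in for whatever RPC client the service exposes.
type UserClient interface {
	GetUser(ctx context.Context, id string) (*User, error)
}

// loadName looks like a local call but crosses the network, so it carries a
// deadline and a fallback instead of assuming the RPC is fast and reliable.
func loadName(ctx context.Context, users UserClient, id string) string {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	u, err := users.GetUser(ctx, id)
	if err != nil {
		return "unknown" // degrade rather than block or crash
	}
	return u.Name
}
```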

There’s certainly some irony here. One of the selling points of App Engine is its ability to scale to large amounts of traffic, yet the vast majority of startups would be perfectly suited to scaling up rather than out, perhaps with some failover in place for good measure. Stack Overflow is the poster child of scale-up architecture. In truth, your architecture should be a function of your access patterns, not the other way around (and App Engine is very much tailored to a specific set of access patterns). Nonetheless, it shows that vertical scaling can work. I would bet a lot of startups could sufficiently run on a large, adequately specced machine or maybe a small handful of them.

The cruel irony is that once you hit a certain scale with App Engine, both in terms of your development organization and user base, you’ve reached a point where you have to migrate off it. And if your data model isn’t properly thought out, you will without a doubt hit scale problems. It’s to the point where you need someone with deep knowledge of how App Engine works in order to build quality systems on it. Good luck hiring a team of engineers who understand it. GAE is great at accelerating you to 100 mph, but you better have some nice airbags for the brick wall it launches you into. In fairness, this is a problem every org hits—Conway’s law is very much a reality and every startup has growing pains. To be clear, this isn’t a jab at GAE, which is actually very effective at accelerating a product using little capital and can sustain long-term success given the right use case. Instead, I use it to illustrate a point.

Peering Through the Abstraction

Eventually SOA makes sense, but our abstractions can cause problems if we don’t understand what’s going on behind the curtain (hence the leakiness). Partial failure is all but guaranteed, and latency, partitioning, and other network pressure happens all the time.

Ken Arnold is famous for once saying “state is hell” in reference to designing distributed systems. In the past, I’ve written about how scaling shared data is hard, but with SOA it’s practically a requirement. Ken is right though—state is hell, and SOA is fundamentally competing with consistency. The FLP Impossibility result and the CAP theorem can prove it formally, but really this should be intuitively obvious if we accept the laws of physics.

On the other hand, if you store information that I can’t reconstruct, then a whole host of questions suddenly surface. One question is, “Are you now a single point of failure?” I have to talk to you now. I can’t talk to anyone else. So what happens if you go down?

To deal with that, you could be replicated. But now you have to worry about replication strategies. What if I talk to one replicant and modify some data, then I talk to another? Is that modification guaranteed to have already arrived there? What is the replication strategy? What kind of consistency do you need—tight or loose? What happens if the network gets partitioned and the replicants can’t talk to each other? Can anybody proceed?

Essentially, the more stateful your system is, the harder it’s going to be to scale it because distributing that state introduces a rich tapestry of problems. In practice, we often can’t eliminate state wholesale, but basically everything that can be stateless should be stateless.

Making servers disposable allows you a great deal of flexibility. Former Netflix Cloud Architect Adrian Cockcroft articulates this idea well:

You want to think of servers like cattle, not pets. If you have a machine in production that performs a specialized function, and you know it by name, and everyone gets sad when it goes down, it’s a pet. Instead you should think of your servers like a herd of cows. What you care about is how many gallons of milk you get. If one day you notice you’re getting less milk than usual, you find out which cows aren’t producing well and replace them.

This is effectively how App Engine achieves its scalability. With lightweight, stateless, and disposable instances, it can spin them up and down on the fly without worrying about being in an invalid state.

App Engine also relies on eventual consistency as the default model for datastore interactions. This makes queries fast and highly available, while snapshot isolation can be achieved using entity-group transactions if necessary. The latter, of course, can result in a lot of contention and latency. Yet, people seem to have a hard time grappling with the reality of eventual consistency in distributed systems. State is hell, but calling SOA “satan” is clearly hyperbole. It is a tough problem nevertheless.

A State of Mind

In the situations where we need state, we have to reconcile with the realities of distributed systems. This means understanding the limitations and accepting the complexities, not papering over them. It doesn’t mean throwing away abstractions. Fortunately, distributed computing is the focus of a lot of great research, so there are primitives with which we can build: immutability, causal ordering, eventual consistency, CRDTs, and other ideas.
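To give a flavor of these primitives, here is a minimal grow-only counter (G-Counter) CRDT sketch in Go: each replica increments only its own slot and merges by taking element-wise maximums, so replicas can exchange state in any order and still converge.

```go
package crdt

// GCounter is a grow-only counter. Each replica increments its own entry;
// concurrent updates merge deterministically without coordination.
type GCounter struct {
	node   string
	counts map[string]uint64
}

func NewGCounter(node string) *GCounter {
	return &GCounter{node: node, counts: make(map[string]uint64)}
}

func (c *GCounter) Increment() { c.counts[c.node]++ }

// Value is the sum of every replica's contribution.
func (c *GCounter) Value() uint64 {
	var total uint64
	for _, n := range c.counts {
		total += n
	}
	return total
}

// Merge folds another replica's state in by taking element-wise maximums.
// Merge is commutative, associative, and idempotent, so replicas converge
// regardless of how often or in what order they gossip.
func (c *GCounter) Merge(other *GCounter) {
	for node, n := range other.counts {
		if n > c.counts[node] {
			c.counts[node] = n
		}
	}
}
```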

As long as we recognize the trade-offs, we can design around them. The crux is knowing they exist in the first place. We can’t have ACID semantics while remaining highly available, but we can use Highly Available Transactions to provide strong-enough guarantees. At the same time, not all operations require coordination or concurrency control. The sooner we view eventual consistency as a solution and not a consequence, the sooner we can let go of this existential crisis. Other interesting research includes BOOM, which seeks to provide a high-level, declarative approach to distributed programming.

State might be hell, but it’s a hell we have to live with. I don’t advocate an all-out microservice architecture for a company just getting its start. The complications far outweigh any benefits to be gained, but it becomes a necessity at a certain point. The key is having an exit strategy. PaaS providers make this difficult due to vendor lock-in and architectural constraints. Weigh their advantages carefully.

Once you do transition to a SOA, make as many of those services, or the pieces backing them, as stateless as possible. For those which aren’t stateless, know that the problem typically isn’t novel. These problems have been solved or are continuing to be solved in new and interesting ways. Academic research is naturally at the bleeding edge with industry often lagging behind. OOP concepts date back to as early as the 60’s but didn’t gain widespread adoption until several decades later. Distributed computing is no different. SOA is just a state of mind.

Not Invented Here

Engineers love engineering things. The reason is self-evident (and maybe self-fulfilling—why else would you be an engineer?). We like to think we’re pretty good at solving problems. Unfortunately, this mindset can, on occasion, yield undesirable consequences which might not be immediately apparent but all the while damaging.

Developers are all in tune with the idea of “don’t reinvent the wheel,” but it seems to be eschewed sometimes, deliberately or otherwise. People don’t generally write their own merge sort, so why would they write their own consensus protocol? Anecdotally speaking, they do.

Not-Invented-Here Syndrome is a very real thing. In many cases, consciously or not, it’s a cultural problem. In others, it’s an engineering one. Camille Fournier’s blog post on ZooKeeper helps to illustrate this point and provide some context. In it, she describes why some distributed systems choose to rely on external services, such as ZooKeeper, for distributed coordination, while others build in their own coordination logic.

A parallel can be drawn between distributed systems and traditional RDBMSs, which typically implement their own file system and other low-level facilities. Why? Because it’s their competitive advantage. SQL databases sell because they offer finely tuned performance, and in order to do that, they need to control these things that the OS otherwise provides. Distributed databases like Riak sell because they own the coordination logic, which helps promote their competitive advantage. This follows what Joel Spolsky says about NIH Syndrome: “if it’s a core business function—do it yourself, no matter what.”

If you’re developing a computer game where the plot is your competitive advantage, it’s OK to use a third party 3D library. But if cool 3D effects are going to be your distinguishing feature, you had better roll your own.

This makes a lot of sense. My sorting algorithm is unlikely to provide me with a competitive edge, but something else might, even if it’s not particularly novel.

So in some situations, homegrown is justifiable, but that’s not always the case. Redis’ competitive advantage is its predictably low latencies and data structures. Does it make sense for it to implement its own clustering and leader election protocols? Maybe, but this is where NIH can bite you. If what you’re doing is important and there’s precedent, lean on existing research and solutions. Most would argue write safety is important, and there is certainly precedent for leader election. Why not leverage that work? Things like Raft, Paxos, and Zab provide solutions which are proven using formal methods and are peer reviewed. That doesn’t mean new solutions can’t be developed, but they generally require model checking and further scrutiny to ensure correctness. Otherwise, you’ll inevitably run into problems. Implementing our own solutions can provide valuable insight, but leave them at home if they’re not rigorously approached. Rolling your own and calling it “good enough” is dishonest to your users if it’s not properly communicated.

Elasticsearch is another interesting case to look at. You might say Elasticsearch’s competitive advantage is its full-text search engine, but it’s not. Like Solr, it’s built on Lucene. Elasticsearch was designed from the ground-up to be distributed. This is what gives it a leg up over Solr and other similar search servers where horizontal scaling and fault tolerance were essentially tacked on. In a way, this resembles what happened with Redis, where failover and clustering were introduced as an afterthought. However, unlike Redis, which chose to implement its own failover coordination and cluster-membership protocol, Solr opted to use ZooKeeper as an external coordinator.

We see that Elasticsearch’s core advantage is its distributed nature. Following that notion, it makes sense for it to own that coordination, which is why its designers chose to implement their own internal cluster membership, ZenDisco. But it turns out writing cluster-membership protocols is really fucking hard, and unless you’ve written proofs for it, you probably shouldn’t do it at all. The analogy here would be writing your own encryption algorithm—there’s tons of institutional knowledge which has laid the groundwork for solutions which are well-researched and well-understood. That knowledge should be embraced in situations like this.

I don’t mean to pick on Redis and Elasticsearch. They’re both excellent systems, but they serve as good examples for this discussion. The problem is that users of these systems tend to overlook the issues exposed by this mentality. Frankly, few people would know these problems exist unless they are clearly documented by vendors (and not salespeople), and even then, how many people actually read the docs cover to cover? It’s essential that we know a system’s shortcomings and edge cases so we can recognize the situations in which to apply it and, more importantly, those in which we should not.

You don’t have to rely on an existing third-party library or service. Believe it or not, this isn’t a sales pitch for ZooKeeper. If it’s a core business function, it probably makes sense to build it yourself as Joel describes. What doesn’t make sense, however, is to build out whatever that is without being cognizant of conventional wisdom. I’m amazed at how often people are willing to throw away institutional knowledge, either because they don’t seek it out or they think they can do better (without formal verification). If I have seen further, it is by standing on the shoulders of giants.

Sometimes Kill -9 Isn’t Enough

If there’s one thing to know about distributed systems, it’s that they have to be designed with the expectation of failure. It’s also safe to say that most software these days is, in some form, distributed—whether it’s a database, mobile app, or enterprise SaaS. If you have two different processes talking to each other, you have a distributed system, and it doesn’t matter if those processes are local or intergalactically displaced.

Marc Hedlund recently had a great post on Stripe’s game-day exercises where they block off an afternoon, take a blunt instrument to their servers, and see what happens. We’re talking like abruptly killing instances here—kill -9, ec2-terminate-instances, yanking on the damn power cord—that sort of thing. Everyone should be doing this type of stuff. You really don’t know how your system behaves until you see it under failure conditions.

Netflix uses Chaos Monkey to randomly terminate instances, and they do it in production. That takes some balls, but you know you have a pretty solid system when you’re comfortable killing live production servers. At Workiva, we have a middleware we use to inject datastore and other RPC errors into Google App Engine. Building resilient systems is an objective concern, but we still have a ways to go.

We need to be pessimists and design for failure, but injecting failure isn’t enough. Sure, every so often shit hits the proverbial fan, and we need to be tolerant of that. But more often than not, that fan is just a strong headwind.

Simulating failure is a necessary element for building reliable distributed systems, but system behavior isn’t black and white, it’s a continuum. We build our system in a vacuum and (hopefully) test it under failure, but we should also be observing it in this gray area. How does it perform with unreliable network connections? Low bandwidth? High latency? Dropped packets? Out-of-order packets? Duplicate packets? Not only do our systems need to be fault-tolerant, they need to be pressure-tolerant.

Simulating Pressure

There are a lot of options for doing these types of “pressure” simulations. On Linux, we can use iptables to accomplish this.
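For example, something along these lines (a representative sketch; iptables’ statistic module does the random dropping):

```bash
# Drop incoming and outgoing packets with a 10% probability.
iptables -A INPUT  -m statistic --mode random --probability 0.1 -j DROP
iptables -A OUTPUT -m statistic --mode random --probability 0.1 -j DROP
```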

This will drop incoming and outgoing packets with a 10% probability. Alternatively, we can use tc to simulate network latency, limited bandwidth, and packet loss.
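Something like the following (netem’s rate option assumes a reasonably recent kernel; older setups pair netem with tbf for the bandwidth cap, and eth0 is just a placeholder interface):

```bash
# Add 250ms of latency, 10% packet loss, and a 1Mbps bandwidth limit.
tc qdisc add dev eth0 root netem delay 250ms loss 10% rate 1mbit
```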

The above adds an additional 250ms of latency with 10% packet loss and a bandwidth limit of 1Mbps. Likewise, on OSX and BSD we can use ipfw or pfctl.
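Roughly as follows using dummynet (newer versions of OS X replace ipfw with dnctl and pfctl, but the pipe configuration looks much the same):

```bash
# Shape all traffic through a dummynet pipe with 500ms delay, 1Mbps of
# bandwidth, and a 10% packet loss rate.
ipfw add pipe 1 ip from any to any
ipfw pipe 1 config delay 500ms bw 1Mbit/s plr 0.1
```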

Here we inject 500ms of latency while limiting bandwidth to 1Mbps and dropping 10% of packets.

These are just some very simple traffic-shaping examples. Several of these tools allow you to perform even more advanced testing, like adding variation and correlation values. This would allow you to emulate burst packet loss and other situations we often encounter. For instance, with tc, we can add jitter to the network latency.
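```bash
# Add 50ms of latency with ±20ms of uniformly distributed jitter.
tc qdisc add dev eth0 root netem delay 50ms 20ms
```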

This adds 50±20ms of latency. Since network latency typically isn’t uniform, we can apply a normal distribution to achieve a more realistic simulation.
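```bash
# Use a normal distribution for the jitter rather than a uniform one.
tc qdisc change dev eth0 root netem delay 50ms 20ms distribution normal
```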

Now we get a nice bell curve which is probably more representative of what we see in practice. We can also use tc to re-order, duplicate, and corrupt packets.
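```bash
# Reorder 25% of packets (with 50% correlation), duplicate 1%, and corrupt
# 0.1%. Reordering only takes effect when some delay is present.
tc qdisc change dev eth0 root netem delay 10ms reorder 25% 50% duplicate 1% corrupt 0.1%
```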

I’ve been working on an open-source tool which attempts to wrap these controls up so you don’t have to memorize the options or worry about portability. It’s pretty primitive and doesn’t support much yet, but it provides a thin layer of abstraction.

Conclusion

Injecting failure is crucial to understanding systems and building confidence, but like good test coverage, it’s important to examine suboptimal-but-operating scenarios. This isn’t even 99th-percentile stuff—this is the type of shit your users deal with every single day. If you can’t handle sustained latency and sporadic network partitions, who cares if you tolerate instance failure? The tools are at our disposal, they just need to be leveraged.

Iris Decentralized Cloud Messaging

A couple weeks ago, I published a rather extensive analysis of numerous message queues, both brokered and brokerless. Brokerless messaging is really just another name for peer-to-peer communication. As we saw, the difference in message latency and throughput between peer-to-peer systems and brokered ones is several orders of magnitude. ZeroMQ and nanomsg are able to reliably transmit millions of messages per second at the expense of guaranteed delivery.

Peer-to-peer messaging is decentralized, scalable, and fast, but it brings with it an inherent complexity. There is a dichotomy between how brokerless messaging is conceptualized and how distributed systems are actually built. Distributed systems are composed of services like applications, databases, caches, etc. Services are composed of instances or nodes—individually addressable hosts, either physical or virtual. The key observation is that, conceptually, the unit of interaction lies at the service level, not the instance level. We don’t care which database server we talk to, we just want to talk to a database server (or perhaps several). We’re concerned with logical groups of nodes.

While traditional socket-queuing systems like ZeroMQ solve the problem of scaling, they bring about a certain coupling between components. System designers are forced to build applications which communicate with nodes, not services. We can introduce load balancers like HAProxy, but we’re still addressing specific locations while creating potential single points of failure. With lightweight VMs and the pervasiveness of elastic clouds, IP addresses are becoming less and less static—they come and go. The canonical way of dealing with this problem is to use distributed coordination and service discovery via ZooKeeper, et al., but this introduces more configuration, more moving parts, and more headaches.

The reality is that distributed systems are not built with the instance as the smallest unit of composition in mind, they’re built with services in mind. As discussed earlier, a service is simply a logical grouping of nodes. This abstraction is what we attempt to mimic with things like etcd, ZooKeeper and HAProxy. These assemblies are proven, but there are alternative solutions that offer zero configuration, minimal network management, and overall less complexity. One such solution that I want to explore is a distributed messaging framework called Iris.

Decentralized Messaging with Iris

Iris is posited as a decentralized approach to backend messaging middleware. It looks to address several of the fundamental issues with traditional brokerless systems, like tight coupling and security.

In order to avoid the problem of addressing instances, Iris considers clusters to be the smallest logical blocks of which systems are composed. A cluster is a collection of zero or more nodes which are responsible for a certain service sub-task. Clusters are then assembled into services such that they can communicate with each other without any regard as to which instance is servicing their requests or where it’s located. Lastly, services are composed into federations, which allow them to communicate across different clouds.

[Figure: instances vs. clusters]

This form of composition allows Iris to use semantic or logical addressing instead of the standard physical addressing. Nodes specify the name of the cluster they wish to participate in, while Iris handles the intricacies of routing and balancing. For example, you might have three database servers which belong to a single cluster called “databases.” The cluster is reached by its name and requests are distributed across the three nodes. Iris also takes care of service discovery, detecting new clusters as they are created on the same cloud.

[Figure: physical vs. semantic addressing]

With libraries like ZeroMQ, security tends to be an afterthought. Iris has been built from the ground-up with security in mind, and it provides a security model that is simple and fast.

Iris uses a relaxed security model that provides perfect secrecy whilst at the same time requiring effectively zero configuration. This is achieved through the observation that if a node of a service is compromised, the whole system is considered undermined. Hence, the unit of security is a service – opposed to individual instances – where any successfully authenticated node is trusted by all. This enables full data protection whilst maintaining the loosely coupled nature of the system.

In practice, what this means is that each cluster uses a single private key. This encryption scheme not only makes deployment trivial, it minimizes the effect security has on speed.

[Figure: authentication and encryption]

Like ZeroMQ and nanomsg, Iris offers a few different messaging patterns. It provides the standard request-reply and publish-subscribe schemes, but it’s important to remember that the smallest addressable unit is the cluster, not the node. As such, requests are targeted at a cluster and subsequently relayed on to a member in a load-balanced fashion. Publish-subscribe, on the other hand, is not targeted at a single cluster. It allows members of any cluster to subscribe and publish to a topic.

Iris also implements two patterns called “broadcast” and “tunnel.” While request-reply forwards a message to one member of a cluster, broadcast forwards it to all members. The caveat is that there is no way to listen for responses to a broadcast.

Tunnel is designed to address the problem of stateful or streaming transactions where a communication between two endpoints may consist of multiple data exchanges which need to occur as an atomic operation. It provides the guarantee of in-order and throttled message delivery by establishing a channel between a client and a node.

[Figure: messaging schemes]

Performance Characteristics

According to its author, Iris is still in a “feature phase” and hasn’t been optimized for speed. Since it’s written in Go, I’ve compared its pub/sub benchmark performance with other Go messaging libraries, NATS and NSQ. As before, these benchmarks shouldn’t be taken as gospel, the code is available here, and pull requests are welcome.

We can see that Iris is comparable to NSQ on the sending side and about 4x on the receiving side, at least out of the box.

Conclusion

Brokerless systems like ZeroMQ and nanomsg offer considerably higher throughput and less latency than classical message-oriented middleware but require greater orchestration of network topologies. They offer high scalability but can lead to tighter coupling between components. Traditional brokered message queues, like those of the AMQP variety, tend to be slower while providing more guarantees and reduced coupling. However, they are also more prone to scale problems like availability and partitioning.

In terms of its qualities, Iris appears to be a reasonable compromise between the decentralized nature of brokerless systems and the minimal configuration and management of brokered ones. Its intrinsic value lies in its ability to hide the complexities of the infrastructure underlying distributed systems. Rather than forcing designers to think in terms of instances, Iris lends itself to building large-scale systems the way we conceptualize and reason about them—with services as the building blocks, not instances.