Engineering Empathy

This was a talk I gave at an internal R&D conference my last week at Workiva. I got a lot of positive feedback on the talk, so I figured I would share it with a wider audience. Be warned: it’s long. Feel free to read each section separately, though they largely tie together.

Why do you work where you work? For many in tech, the answer is probably culture. When you tell a friend about your job, the culture is probably the first thing you describe. It’s culture that can be a company’s biggest asset—and its biggest downfall. But what is it?

Culture isn’t a list of values or a mission statement. It’s not a casual dress code or a beer fridge. Culture is what you reward and what you don’t. More importantly, it’s not just what you reward but what you punish. That’s an important distinction because when you don’t punish behavior that’s inconsistent with your culture, you send a message: you don’t actually care about it.

So culture is what you live day in and day out. It’s not what you say, it’s what you do. Put yourself in the shoes of a new hire at your company. A new hire doesn’t walk in automatically knowing your culture. They walk in—filled with anxiety—hoping for success, fearing failure, and they look around. They observe their environment. They see who is succeeding, and they try to emulate that behavior. They see who is failing, and they try to avoid that behavior. They ask the question: what makes someone successful here?

Jim Rohn was credited with saying we are the average of the five people we spend the most time with, and I think there’s a lot of truth to this idea. The people around you shape who you are. They shape your behavior, your habits, your thoughts, your opinions, your worldview. Culture is really a feedback loop, and we all contribute to it. It’s not mandated or dictated down (or up through grassroots programs), it’s emergent, but we have to be deliberate about our behaviors because that’s what shapes our culture (note that this doesn’t mean culture doesn’t start with leadership—it most certainly does).

In my opinion, a strong engineering culture comprises three parts: the right people, the right processes, and the right priorities. The right people are people who align with and protect your values. Processes are how you execute—how you communicate, develop, deliver, etc. And priorities are your values—the skills or behaviors you value in fellow employees. They’re also your vision. These help you make decisions—what gets done today and what goes to the bottom of the list. For example, many companies claim a customer-first principle, but how many of them actually use it to drive their day-to-day decisions? This is the difference between a list of values and a culture.

What about technology? Experience? Customer relationships? These are all important competitive advantages, but I think they largely emerge from having the right people, processes, and priorities. The three are deeply intertwined. A culture is the unique combination of processes and values within an organization, and it’s those processes and values that enable you to replicate your success. I’m a big fan of Reed Hastings’ Netflix culture slide deck, but there are some things with which I fundamentally disagree. Hastings says, “The more talent density you have the less process you need. The more process you create the less talent you retain.” This is wrong on a number of levels, which I will talk about later.

Empathy is the common thread throughout each of these three areas—the people, the processes, and the priorities—and we’ll see how it applies to each.

The Ultimate Complex System: People

We largely think about software development as a purely technical feat, one which requires skill and creativity and ingenuity. I think for many, it’s why we became engineers in the first place. We like solving problems. But when all is said and done, computers do what you tell them to do. Computers don’t have opinions or biases or agendas or egos. The technical challenges are really a small part of any sufficiently complicated piece of software. Having been an individual contributor, tech lead, and manager, I’ve come to the realization that it’s people that is the ultimate complex system.

There’s a quote from The Five Dysfunctions of a Team which I’ve referenced before on this blog that I think captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” That teamwork is powerful is intuitive, but that teamwork is rare is more profound. Teamwork is a competitive advantage because it’s rare—that’s a pretty strong statement. Let’s unpack it a bit.

It’s the difference between technical architecture and social architecture. We tend to focus on the former while neglecting the latter, but software engineering is more about collaboration than code. It wasn’t until I became a manager that I realized good managers are force multipliers, but social architecture is everyone’s responsibility. Remember, culture is a feedback loop which we all contribute to, so everyone is a social architect.

A key part of social architecture is communication empathy. Back in the 90’s, an evolutionary psychologist by the name of Robin Dunbar proposed the idea that humans can only maintain about 150 stable social relationships. The limit is referred to as Dunbar’s number. He drew a correlation between primate brain size and the average size of cohesive social groups. Informally, Dunbar describes this as “the number of people you would not feel embarrassed about joining uninvited for a drink if you happened to bump into them in a bar.” This number includes all of the relationships in your life, both personal and professional, past and present.

What’s interesting about Dunbar’s number is how it applies to our jobs as software engineers and the interplay with cognitive biases we have as humans. When you don’t understand what someone else does, you’re automatically biased against them. In your head, you’re king. You understand exactly what you do, what your job is, the value that you add, but outside your head—outside your Dunbar’s number—those people are all a mystery. And humans have this funny tendency to mock what they don’t understand. We have lots of these cognitive biases.

Building those types of stable relationships is really hard when you can’t just walk over to someone’s desk and talk to them face-to-face. This was something we struggled with a lot at Workiva, being a company of a few hundred engineers spread out across 11 or so offices. Not only is split-brain an inevitability, but so is a general lack of rapport. That stage is super hard for a lot of companies to go through—going from dozens of engineers to hundreds in a few short years. The cruel irony is that no matter how agile a company claims to be, the culture—of which structure and processes are a part—is usually the slowest to adjust. No longer is decision making done by standing up and telling a roomful of people, “I’m going to do this. Please tell me now if that’s a bad idea.” And no longer are decisions made unanimously and rapidly.

Building stable relationships is much harder without the random hallway error correction. The right people aren’t always bumping into each other at the right time, whereas before a lot of decisions could be made just-in-time. Instead, communication has to be deliberate rather than organic. Decisions get made by the Jeff Bezos philosophy of disagree and commit. Nevertheless, building empathy is hard without face-to-face communication, and you miss out on a lot of its nuance.

Again, communication is highly nuanced, and nuance is hard to convey over HipChat. The role of emotions plays a big part. Imagine the situation where you’re walking down a sidewalk or a hallway and someone accidentally walks in front of you. You might sidestep or do a little dance to get around each other, smile or nod, and get on with your day. This minor inconvenience becomes almost this tiny, pleasant interaction between two people. Now, take the same scenario but between two drivers, and you probably have some kind of road rage type situation. The only real difference is the steel cage surrounding the drivers blocking out the verbal and non-verbal communication. How many times has a HipChat conversation gone completely off the rails only to be resolved by a quick Google Hangout or in-person conversation? It’s the exact same thing.

One last note with respect to cognitive biases: once again, humans have a funny tendency where, in a vacuum of information, people will create their own. We will manufacture information just to fill the void, and often it’s not just creating information but taking information we’ve heard somewhere else and applying it to our own—and often the wrong—context. The extreme example of this is “We heard microservices worked well for Netflix, so we should use them at our growing startup” or “Google does monorepo, so we should too.” You’re ignoring all of the context—the path those companies followed to reach those points, the trade-offs made, the organizational differences and competencies, the pain points. When the cognitive biases and opinions we have as humans are added in, the problems amplify and compound. You get a frustrating game of telephone.

You can’t kill The Grapevine any more than you can change human nature, so you have to address it head-on where and when you can. Don’t allow the vacuum of information to form in the first place, and be cognizant of when you’re applying information you’ve heard from someone else. Is the problem the same? Do you have the same amount of information? Are your constraints the same? What is the full context?

I tend to look at communication through two different lenses: push and pull. In order to be an empathetic communicator, I think it’s important to look at these while thinking about the cognitive biases discussed earlier, starting with push.

I’ve been in meetings where someone called out someone else as a “blocker”, and there was visible wincing in the room. I think for some, the word probably triggers some sort of PTSD. When you’re depending on something that’s outside of your direct control to get something else done, it’s hard not to drop the occasional B-word. It happens out of frustration. It happens because you’re just trying to ship. It also happens because you don’t want to look bad. And it might seem innocuous in the moment, but it has impact. By invoking it, you are tempting fate.

There’s a really good book that was written way back in 1944 called The Unwritten Laws of Engineering. It was published by the American Society of Mechanical Engineers and the language in the book is quite dated (parts of it read like a crotchety old guy yelling at kids to get off his lawn), but the ideas in the book still apply very much today and even to software engineering. It’s really about people, and fundamentally, people don’t change. One of the ideas in the book is this notion which I call communication impact. I’ve taken a quote from the book which I think highlights this idea (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved. A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

I see this a lot. Not just in emails (née “memos”) but in meetings or reviews, someone will—inadvertently or not—throw someone else under the proverbial bus, i.e. “broadcasting memorandum containing damaging or embarrassing statements”, “stepping too heavily upon someone’s toes”, or “revealing a serious shortcoming on somebody’s part.” The problem with this is, by doing it, you immediately put the other party on the defensive and also create a cognitive bias for everyone else in the room. You create a negative predisposition, which may or may not be warranted, toward them. Similarly, I liken “if it concerns manufacturing or customer difficulties” to production postmortems. This is why they need to be blameless. Why is it that retros on production issues are blameless while, at the same time, the development process is full of blame-assigning? It might seem innocuous, but your communication has impact. Push with respect and under the assumption the other person is probably doing the right thing. Don’t be willing to throw anyone under that bus. Likewise, be quick to take responsibility but slow to assign it. Don’t be willing to practice Cover Your Ass Engineering.

I’ve been in meetings where someone would get called out as a blocker, literally articulated in that way, and the person wasn’t even aware they were blocking anything. I’ve seen people create JIRA tickets on another team’s board and then immediately call them blockers. It’s important to call out dependencies ahead of time, and when someone is “blocking” your progress, speak to them about it individually and before it reaches a critical point. No one should be getting caught off guard by these things. Be careful about how and where you articulate these types of problems.

On the same topic of communication impact, I’ve seen engineers develop detailed and extravagant plans like “We’re going to move the entire company to a monorepo while simultaneously switching from Git to Mercurial” or “We’re going to build our own stream-processing framework from the ground up”, and then distribute them widely to the organization (“wide distribution” as referenced in The Unwritten Laws of Engineering passage above). The proposals are usually well-intentioned and maybe even compelling sometimes, but it’s the way in which they are communicated that is problematic. Recall The Grapevine: people see it, assume it’s reality, and then spread misinformation. “Did you know the company is switching to Mercurial?”

An effective way to build rapport between teams is genuinely celebrating the successes of other teams, even the small ones. I think for many organizations, it’s common to celebrate victories within a team—happy hour for shipping a new feature or a team outing for signing a major account—but celebrating another team’s win is more rare, especially when a company grows in size. The operative word is “genuine” though. Don’t just do it for the sake of doing it, be genuine about it. This is a compelling way to build the stable relationships needed to unlock the rarity of teamwork described earlier.

Equally important to understanding communication impact is understanding decision impact. I’ve already written about this, so I’ll keep it brief: your decisions impact others. How does adopting X affect Operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? Doing something might save you time, but does it create work or slow others down?

Teams operate in a way that minimizes the amount of pain they feel. It’s a natural instinct and a phenomenon I call pain displacement. Pain-driven development is following the path of least resistance. By doing this, we end up moving the pain somewhere else or deferring it until later (i.e. tech debt). Where the problems start to happen is when multiple teams or functions are involved. This is when the political and other organizational issues start to seep in. Patrick Lencioni, author of The Five Dysfunctions of a Team, has a book that touches on this subject called Silos, Politics, and Turf Wars. 

I believe the solution is multifaceted. First, teams need to think holistically, widening their vision beyond the deliverable immediately in front of them. They need to have a sense of organizational awareness. Second, teams—and especially leaders—need to be able to take off their job’s “hat” periodically in order to solve a shared problem. Lencioni observes that much of what causes organizational dysfunction is siloing, and this typically stems from strong intra-team loyalties. For example, within an engineering organization you might have development, operations, QA, security, and other functional teams. Empathy is being able to look at something through someone else’s perspective, and this requires removing your functional hat from time to time. Lastly, teams need to be able to rally around a common cause. This is a shared, compelling vision that motivates and mobilizes people and helps break down the silos. A shared vision aligns teams and enables them to work more autonomously. This is how decisions get made.

Pull communication is pretty much just how to ask questions without making people hate you, a skill that is essential to being an effective and empathetic communicator.

The single most common communication issue I see in engineering organizations is The XY Problem. It’s when someone focuses on a particular solution to their problem instead of describing the problem itself.

  • User wants to do X.
  • User doesn’t know how to do X, but thinks they can fumble their way to a solution if they can just manage to do Y.
  • User doesn’t know how to do Y either.
  • User asks for help with Y.
  • Others try to help user with Y, but are confused because Y seems like a strange problem to want to solve.
  • After much interaction and wasted time, it finally becomes clear that the user really wants help with X, and that Y wasn’t even a suitable solution for X.

The problem occurs when people get stuck on what they believe is the solution and are unable to step back and explain the issue in full. The solution to The XY Problem is simple: always provide the full context of what you’re trying to do. Describe the problem, don’t just prescribe the solution.

Part of being an effective communicator is being able to extract information from people and getting help without being a mental and emotional drain. This is especially true when it comes to debugging. I often see this “murder-mystery debugging” where someone basically tries to push off the blame for something that’s wrong with their code onto someone or something else. This flies in the face of the principle discussed earlier—be quick to take responsibility and slow to assign it. The first step when it comes to debugging anything is assume it’s your fault by default. When you run some code you’re writing and the compiler complains, you don’t blame the compiler, you assume you screwed up. It’s just taking this same mindset and applying it to everything else that we do.

And when you do need to seek help from others—just like with The XY Problem—provide as much context as possible. So much of what I see is this sort of information trickle, where the person seeking help drips information to the people trying to provide it. Don’t make it an interrogation. Lastly, provide a minimal working example that reproduces the problem. Don’t make people build a massive project with 20 dependencies just to reproduce your bug. It’s such a common problem for Stack Overflow that they actually have a name for it: MCVE—Minimal, Complete, Verifiable Example. Do your due diligence before taking time out of someone else’s day because the only thing worse than a bug report is a poorly described, hastily written accusation.

Another thing I see often is swoop-and-poop engineering. This is when someone comes to you with something they need help with—maybe a bug in a library you own, a feature request, something along these lines (this is especially true in open source). They have a sense of urgency; they say it either explicitly or just give off that vibe. You offer to set up a meeting to get more information or work through the problem with them only to find they aren’t available or willing to set aside some time with you. They’re heads down on “something more important,” yet their manager is ready to bite your head off weeks later, often without any documentation or warning. They’ve effectively dumped this on you, said the world’s on fire, and left as quickly as they came. You’re left confused and disoriented. You scratch your head and forget about it, then days or weeks later, they return, horrified that the world is still burning. I call these drive-by questions.

First, it’s important to have an appropriate sense of urgency. If you’re not willing to hop on a Hangout to work through a problem or provide additional information, it’s probably not that important, especially if you can’t even take the time to follow up. With few exceptions, it’s not fair to expect a team to drop everything they’re doing to help you at a moment’s notice, but if they do, you need to meet them halfway. It’s essential to realize that if you’re piling onto a team, others probably are too. If you submit a ticket with another team and then turn around and immediately call it a blocker, that just means you failed to plan accordingly. Having empathy is being cognizant that every team has its own set of priorities, commitments, and work that it’s juggling. By creating that ticket and calling it a blocker, you’re basically saying none of that stuff matters as much. Empathy is understanding that shit rolls downhill. For those who find themselves facing drive-by questions: document everything and be proactive about communicating.

There’s a really good essay by Eric Raymond called How To Ask Questions The Smart Way. It’s something I think every engineer should read and take to heart. My number one pet peeve is Help Vampires. These are people who refuse to take the time to ask coherent, specific questions and really aren’t interested in having their questions answered so much as getting someone else to do their work. They ask the same, tired questions over and over again without really retaining information or thinking critically. It’s question, answer, question, answer, question, answer, ad infinitum.

This is often a hard-earned lesson for junior engineers, but it’s an important one: when you ask a question, you’re not entitled to an answer, you earn the answer. Hasty sounding questions get hasty answers. As engineers, we should not operate like a tech support hotline that Grandma calls when her internet stops working. We need to put in a higher level of effort. We need to apply our technical and problem-solving aptitude as engineers. This is the only way you can scale this kind of support structure within an engineering organization. If teams are just constantly bombarding each other with low-effort questions, nothing will get done and people will get burnt out.

Avoid being a Help Vampire. Before asking a question, do your due diligence. Think carefully about where to ask your question. If it’s on HipChat, what is the appropriate room in which to ask? Also be mindful of doing things like @all or @here in a large room. Doing that is like walking into a crowded room, throwing your hands up in the air, and shouting at everyone to look at you. Be precise and informative about your problem, but also keep in mind that volume is not precision. Just dumping a bunch of log messages is noise. Don’t rush to claim that you’ve found a bug. As a first step, take responsibility. And just like with The XY Problem, describe the goal, not the step you took—describe vs. prescribe. Lastly, follow up on the solution. Everyone has been in this situation: you’ve found someone that asked the exact same question as you only to find they never followed up with how they fixed it. Even if it’s just in HipChat or Slack, drop a note indicating the issue was resolved and what the fix was so others can find it. This also helps close the loop when you’ve asked a question to a team and they are actively investigating it. Don’t leave them hanging.

In many ways, being an empathetic communicator just comes down to having self-awareness.

Codifying Values and Priorities: Processes

“Process” has a lot of negative connotations associated with it because it usually becomes this thing done on ceremony. But “process” should be a means of documenting and codifying your values. This is why I disagree with the Reed Hastings quote about process from earlier. Process is about repeatability and error correction. Camille Fournier’s new book on engineering management, The Manager’s Path, has a great section on “bootstrapping culture.” I particularly like the way she frames organizational structure and process:

When talking about structure and process with skeptics, I try to reframe the discussion. Instead of talking about structure, I talk about learning. Instead of talking about process, I talk about transparency. We don’t set up systems because structure and process have inherent value. We do it because we want to learn from our successes and our mistakes, and to share those successes and encode the lessons we learn from failures in a transparent way. This learning and sharing is how organizations become more stable and more scalable over time.

When a process “feels” wrong, it’s probably because it doesn’t reflect your organization’s values. For example, if a process feels heavy, it’s because you value velocity. If a process feels rigid, it’s because you value agility. If a process feels risky, it’s because you value safety. We have a hard time articulating this so instead it becomes “process is bad.”

Somewhere along the line, someone decides to document how stuff gets done. Things get standardized. Tools get made. Processes get established. But process becomes dogma when it’s interpreted as documentation of how rather than an explanation of why. Processes should tell the story of an organization: here’s what we value, here’s why we value it, and here’s how we protect and scale those values. The story is constantly evolving, so processes should be flexible. They shouldn’t be set in stone.

Michael Lopp’s book Managing Humans also provides a useful perspective on culture:

It’s entirely possible that too much process or the wrong process is developed during this build-out, but when this inevitable debate occurs, it should not be about the process. It’s a debate about values. The first question isn’t, “Is this a good, bad, or efficient process?” The first question is, “How does this process reflect our values?”

Processes should be traceable back to values. Each process should have a value or set of values associated with it. Understanding the why helps to develop empathy. It’s the difference between “here’s how we do things” and “here’s why we do things.” It’s much harder to develop a sense of empathy with just the how.

What We Value: Priorities

As engineers, we need to be curious. We need to have a “let’s go see!” attitude. When someone comes to you with a question—and hopefully it’s a well-formulated question based on the earlier discussion—your first reaction should be, “let’s go see!” Use it as an opportunity for both of you to learn. Even if you know the answer, sometimes it’s better to show, not tell, and as the person asking the question, you should be eager to learn. This is the reason I love Julia Evans’ blog so much. It’s oozing with wonder, curiosity, and intrigue. Being an engineer should mean having an innate curiosity. It’s not throwing up your hands at the first sign of an API boundary and saying, “not my problem!” It’s a willingness to roll up your sleeves and dig into a problem but also a capacity for knowing how and when to involve others. Figure out what you don’t know and push beyond it.

Be humble. There’s a book written in the ’70s called The Psychology of Computer Programming, and it’s interesting because it focuses on the human elements of software development rather than the purely technical ones we normally think about. The book presents The 10 Commandments of Egoless Programming, which I think contain a powerful set of guiding principles for software engineers:

  1. Understand and accept that you will make mistakes.
  2. You are not your code.
  3. No matter how much “karate” you know, someone else will always know more.
  4. Don’t rewrite code without consultation.
  5. Treat people who know less than you with respect, deference, and patience.
  6. The only constant in the world is change. Be open to it and accept it with a smile.
  7. The only true authority stems from knowledge, not from position.
  8. Fight for what you believe, but gracefully accept defeat.
  9. Don’t be “the coder in the corner.”
  10. Critique code instead of people—be kind to the coder, not to the code.

Part of being a humble engineer is giving away all the credit. This is especially true for leaders or managers. A manager I once had put it this way: “As a manager, you should never say ‘I’ during a review unless shit went wrong and you’re in the process of taking responsibility for it.” A good leader gives away all the credit and takes all of the blame.

Be engaged. Coding is actually a very small part of our job as software engineers. Our job is to be engaged with the organization. Engage with stakeholder meetings and reviews. Engage with cross-trainings and workshops. Engage with your company’s engineering blog. Engage with other teams. Engage with recruiting and company outreach through conferences or meetups. People dramatically underestimate the value of developing their network, both to their employer and to themselves. You don’t have to do all of these things, but my point is engineers get overly fixated on coding and deliverables. Code is just the byproduct. We’re not paid to write code, we’re paid to add value to the business, and a big part of that is being engaged with the organization.

And of course I’d be remiss not to talk about empathy. Empathy is having a deep understanding of what problems someone is trying to solve. John Allspaw puts it best:

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration into your own work.

Changing your perspective is a powerful way to deepen your relationships. Once again, it comes back to Dunbar’s number: we have a limited number of stable relationships, but developing and maintaining those relationships is the key to figuring out the rarity of teamwork.

A former coworker of mine passed away late last year. In going through some of his old files, we came across some notes he had on leadership. There was one quote in particular that I thought really captured the essence of this post nicely:

All music is made from the same 12 notes. All culture is made from the same five components: behaviors, relationships, attitudes, values, and environment. It’s the way those notes or components are put together that makes things sing.

This is what it takes to build a strong engineering culture and really just a healthy culture in general. The technology and everything else is secondary. It really starts with the people.

The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premises to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model should instead treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provides APIs for its infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west style of Operations needs to die. Ops is commonly the gatekeeper, and it views itself as such. Old-school Ops builds in as much process as possible, slowing down development so that by the time things reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!

Pain-Driven Development: Why Greedy Algorithms Are Bad for Engineering Orgs

I recently wrote about the importance of understanding decision impact and why it’s important for building an empathetic engineering culture. I presented the distinction between pain displacement and pain deferral, and this was something I wanted to expand on a bit.

When you distill it down, I think what’s at the heart of a lot of engineering orgs is this idea of “pain-driven development.” When a company grows to a certain size, it develops limbs, and each of these limbs has its own pain receptors. This is when empathy becomes important because it becomes harder and less natural. These limbs of course are teams or, more generally speaking, silos. Teams have a natural tendency to operate in a way that minimizes the amount of pain they feel.

It’s time for some game theory: pain is a zero-sum game. By always following the path of least resistance, we end up displacing pain instead of feeling it. This is literally just instinct. In other words, by making locally optimal choices, we run the risk of losing out on a globally optimal solution. Sometimes this is an explicit business decision, but many times it’s not.

Tech debt is a common example of when pain displacement is a deliberate business decision. It’s pain deferral—there’s pain we need to feel, but we can choose to feel it later and in the meantime provide incremental value to the business. This is usually a team choosing to apply a bandaid and coming back to fix it later. “We have this large batch job that has a five-minute timeout, and we’re sporadically seeing this timeout getting hit. Why don’t we just bump up the timeout to 10 minutes?” This is a bandaid, and a particularly poor one at that because, by Parkinson’s law, as soon as you bump up the timeout to 10 minutes, you’ll start seeing 11-minute jobs, and we’ll be having the same discussion over again. I see the exact same types of discussions happening with resource provisioning: “we’re hitting memory limits—can we just provision our instances with more RAM?” “We’re pegging CPU. Obviously we just need more cores.” Throwing hardware at the problem is the path of least resistance for the developers. They have a deliverable in front of them, they have a lot of pressure to ship, this is how they do it. It’s a greedy algorithm. It minimizes pain.

Where things become really problematic is when the pain displacement involves multiple teams. This is why understanding decision impact is so key. Pain displacement doesn’t just involve engineering teams, it also involves customers and other stakeholders in the organization. This is something I see quite a bit: displacing pain away from customers onto various teams within the org by setting unrealistic expectations up front.

For example, we build a product MVP and run it on a single, high-memory instance, and we don’t actually write data out to disk to keep it fast. We then put this product in front of sales folks, marketing, or even customers and say “hey, look at this cool thing we built.” Then the customers say “wow, this is great! I don’t feel any pain at all using this!” That’s because the pain has been moved elsewhere.

This MVP isn’t fault tolerant because it’s running on a single machine. This MVP isn’t horizontally scalable because we keep all the state in memory on one instance. This MVP isn’t safe because the data isn’t durably stored to disk. The problem is we weren’t testing at scale, so we never felt any pain until it was too late. So we start working backward to address these issues after the fact. We need to run multiple instances so we can have failover. But wait, now we need stateful request routing to maintain our performance expectations. Does our infrastructure support that? We need a mechanism to split and merge units of work that plays nicely with our autoscaling system to give us a better scale story, avoid hot instances, and reduce excess capacity. But wait, how long will that take to build? We need to attach persistent disks so we can durably store data and keep things fast. But wait, does our cluster provisioning allow for that? Does that even meet our compliance requirements?

The only way you reach this point is by making local decisions without thinking about the trade-offs involved or the fact that what you’ve actually done is simply displaced the pain.

If someone doesn’t feel pain, they have a harder time developing a sense of empathy. For instance, the goal of any good operations team is to effectively put itself out of a job by empowering developers to self-service through tooling and automation. One example of this is infrastructure as code, so an ops team adds a process requiring developers to provision their own infrastructure using CloudFormation scripts. For the ops folks, this is a boon—now they no longer have to labor through countless UIs and AWS consoles to provision databases, queues, and the like for each environment. Developers, on the other hand, were never exposed to that pain, so to them, writing CloudFormation scripts is a new hoop to jump through—setting up infrastructure is ops’ job! They might feel pain now, but they don’t necessarily see the immediate payoff.

A coworker of mine recently posed an interesting question: why do product teams often overlook the need for tools required to support their product in production until after they’ve deployed to production? And while the answer he posits is good, and one I very much agree with—solving a problem and solving the problem of solving problems are two very different problems—my answer is this: pain-driven development. In this case, you’re deferring the pain by hooking up debuggers or SSHing into the box and poking about instead of relying on instrumentation which is what we’re limited to in the field. As long as you’re cognizant of this and know that at some point you will have to feel some pain, it can be okay. But if you’re just displacing pain thinking it’s actually disappearing, you’ll be in for a rude awakening. Remember, it’s a zero-sum game.

I’m looking at this through an infrastructure or operations lens, but this applies everywhere and it cuts both ways. Understanding the why behind something rather than just the how is critical to building empathy. It’s being able to look at a problem through someone else’s perspective and applying that to your own work. Changing your perspective is a powerful way to deepen your relationships. Pain-driven development is intoxicating because it allows us to move fast. It’s a greedy algorithm, but it provides a poor global approximation for large engineering organizations. Thinking holistically is important.

Decision Impact

I think a critical part of building an empathetic engineering culture is understanding decision impact. This is a blind spot I see a lot: the lack of a deliberate effort to understand the effects caused by a decision. How does adopting X affect operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? A particular decision might save you time, but does it create work or slow others down? Are we just displacing pain somewhere else?

What’s needed is a broad understanding of the net effects. Pain displacement is an indication that we’re not thinking beyond the path of least resistance. The problem is if we lack a certain empathy, we aren’t aware the pain displacement is occurring in the first place. It’s important we widen our vision beyond the deliverable in front of us. We have to think holistically—like a systems person—and think deeply about the interactions between decisions. Part of this is having an organizational awareness.

Tech debt is the one exception to this because it’s pain displacement we feel ourselves—it’s pain deferral. This is usually a decision we can make ourselves, but when we’re dealing with pain displacement involving multiple teams, that’s when problems start happening. And that’s where empathy becomes critical because software engineering is more about collaboration than code and shit has this natural tendency to roll downhill.

The first sentence of The Five Dysfunctions of a Team captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” The powerful part is obvious, but the bit about rarity is interesting when we think about teams holistically. The cause, I think, is deeply rooted in the silos or fiefdoms that naturally form around teams. The difficulty comes as an organization scales. What I see happening frequently are goals that diverge or conflict. The fix is rallying teams around a shared cause—a single, compelling vision. Likewise, it’s thinking holistically and having empathy. Understanding decision impact and pain displacement is one step to developing that empathy. This is what unlocks the rarity part of teamwork.

Fast Topic Matching

A common problem in messaging middleware is that of efficiently matching message topics with interested subscribers. For example, assume we have a set of subscribers, numbered 1 to 3:

Subscriber   Match Request
1            forex.usd
2            forex.*
3            stock.nasdaq.msft

And we have a stream of messages, numbered 1 to N:

Message   Topic
1         forex.gbp
2         stock.nyse.ibm
3         stock.nyse.ge
4         forex.eur
5         forex.usd
N         stock.nasdaq.msft

We are then tasked with routing messages whose topics match the respective subscriber requests, where a “*” wildcard matches any word. This is frequently a bottleneck for message-oriented middleware like ZeroMQ, RabbitMQ, ActiveMQ, TIBCO EMS, et al. Because of this, there are a number of well-known solutions to the problem. In this post, I’ll describe some of these solutions, as well as a novel one, and attempt to quantify them through benchmarking. As usual, the code is available on GitHub.

The Naive Solution

The naive solution is pretty simple: use a hashmap that maps topics to subscribers. Subscribing involves adding a new entry to the map (or appending to a list if it already exists). Matching a message to subscribers involves scanning through every entry in the map, checking if the match request matches the message topic, and returning the subscribers for those that do.

Inserts are approximately O(1) and lookups approximately O(n*m) where n is the number of subscriptions and m is the number of words in a topic. This means the performance of this solution is heavily dependent upon how many subscriptions exist in the map and also the access patterns (rate of reads vs. writes). Since most use cases are heavily biased towards searches rather than updates, the naive solution—unsurprisingly—is not a great option.
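To make this concrete, here’s a minimal sketch of the naive matcher in Go. The types and names are illustrative rather than the benchmarked implementation from the GitHub repo, and synchronization is omitted for brevity. It assumes dot-delimited topics where “*” matches exactly one word.

```go
package match

import "strings"

// Subscriber identifies an interested party; the concrete type doesn't
// matter for this sketch.
type Subscriber string

// NaiveMatcher maps match requests (e.g. "forex.*") to their subscribers.
type NaiveMatcher struct {
	subscriptions map[string][]Subscriber
}

func NewNaiveMatcher() *NaiveMatcher {
	return &NaiveMatcher{subscriptions: make(map[string][]Subscriber)}
}

// Subscribe adds a new entry to the map, appending if the request exists.
func (m *NaiveMatcher) Subscribe(request string, sub Subscriber) {
	m.subscriptions[request] = append(m.subscriptions[request], sub)
}

// Lookup scans every entry, checking each request word-by-word against the
// message topic, with "*" matching any single word.
func (m *NaiveMatcher) Lookup(topic string) []Subscriber {
	words := strings.Split(topic, ".")
	var matched []Subscriber
	for request, subs := range m.subscriptions {
		if matches(strings.Split(request, "."), words) {
			matched = append(matched, subs...)
		}
	}
	return matched
}

func matches(request, topic []string) bool {
	if len(request) != len(topic) {
		return false
	}
	for i, word := range request {
		if word != "*" && word != topic[i] {
			return false
		}
	}
	return true
}
```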

The microbenchmark below compares the performance of subscribe, unsubscribe, and lookup (matching) operations, first using an empty hashmap (what we call cold) and then with one containing 1,000 randomly generated 5-word topic subscriptions (what we call hot). With the populated subscription map, lookups are about three orders of magnitude slower, which is why we have to use a log scale in the chart below.

       subscribe   unsubscribe   lookup
cold   172ns       51.2ns        787ns
hot    221ns       55ns          815,787ns
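For reference, the shape of these microbenchmarks is roughly the following, sketched with Go’s testing package against the NaiveMatcher above. The randomTopic helper is a stand-in for the random topic generation, and the “cold” variant simply skips the setup loop.

```go
package match

import (
	"math/rand"
	"strconv"
	"strings"
	"testing"
)

// randomTopic generates an n-word topic like "w12.w3.w44.w9.w27".
func randomTopic(n int) string {
	words := make([]string, n)
	for i := range words {
		words[i] = "w" + strconv.Itoa(rand.Intn(50))
	}
	return strings.Join(words, ".")
}

// BenchmarkNaiveLookupHot measures matching against a map pre-populated
// with 1,000 five-word subscriptions.
func BenchmarkNaiveLookupHot(b *testing.B) {
	m := NewNaiveMatcher()
	for i := 0; i < 1000; i++ {
		m.Subscribe(randomTopic(5), Subscriber(strconv.Itoa(i)))
	}
	topic := randomTopic(5)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		m.Lookup(topic)
	}
}
```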


Inverted Bitmap

The inverted bitmap technique builds on the observation that lookups are more frequent than updates and assumes that the search space is finite. Consequently, it shifts some of the cost from the read path to the write path. It works by storing a set of bitmaps, one per topic, or criteria, in the search space. Subscriptions are then assigned an increasing number starting at 0. We analyze each subscription to determine the matching criteria and set the corresponding bits in the criteria bitmaps to 1. For example, assume our search space consists of the following set of topics:

  • forex.usd
  • forex.gbp
  • forex.jpy
  • forex.eur
  • stock.nasdaq
  • stock.nyse

We then have the following subscriptions:

  • 0 = forex.* (matches forex.usd, forex.gbp, forex.jpy, and forex.eur)
  • 1 = stock.nyse (matches stock.nyse)
  • 2 = *.* (matches everything)
  • 3 = stock.* (matches stock.nasdaq and stock.nyse)

When we index the subscriptions above, we get the following set of bitmaps:

Criteria       0   1   2   3
forex.usd      1   0   1   0
forex.gbp      1   0   1   0
forex.jpy      1   0   1   0
forex.eur      1   0   1   0
stock.nasdaq   0   0   1   1
stock.nyse     0   1   1   1

When we match a message, we simply need to look up the corresponding bitmap and check the set bits. As we see below, subscribe and unsubscribe are quite expensive with respect to the naive solution, but lookups now fall well below half a microsecond, which is pretty good (the fact that the chart below doesn’t use a log scale like the one above should be an indictment of the naive hashmap-based solution).
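Here’s a rough sketch of the write and read paths, continuing the Go sketch from before (it reuses the Subscriber type and matches helper; a production implementation would use compressed bitmaps and proper locking).

```go
// InvertedBitmap indexes a search space of topics known ahead of time.
// Each topic gets a bitmap in which bit i is set if subscription i matches.
type InvertedBitmap struct {
	bitmaps map[string][]uint64 // topic -> bitmap over subscription ids
	subs    []Subscriber        // subscription id -> subscriber
}

func NewInvertedBitmap(topics []string) *InvertedBitmap {
	bitmaps := make(map[string][]uint64, len(topics))
	for _, topic := range topics {
		bitmaps[topic] = nil
	}
	return &InvertedBitmap{bitmaps: bitmaps}
}

// Subscribe assigns the next subscription id and sets the corresponding bit
// in the bitmap of every topic in the search space the request matches.
func (m *InvertedBitmap) Subscribe(request string, sub Subscriber) {
	id := len(m.subs)
	m.subs = append(m.subs, sub)
	req := strings.Split(request, ".")
	for topic, bitmap := range m.bitmaps {
		if !matches(req, strings.Split(topic, ".")) {
			continue
		}
		for len(bitmap) <= id/64 {
			bitmap = append(bitmap, 0)
		}
		bitmap[id/64] |= 1 << uint(id%64)
		m.bitmaps[topic] = bitmap
	}
}

// Lookup is a single map access followed by a scan of the set bits.
func (m *InvertedBitmap) Lookup(topic string) []Subscriber {
	var matched []Subscriber
	for i, chunk := range m.bitmaps[topic] {
		for bit := 0; bit < 64; bit++ {
			if chunk&(1<<uint(bit)) != 0 {
				matched = append(matched, m.subs[i*64+bit])
			}
		}
	}
	return matched
}
```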

       subscribe   unsubscribe   lookup
cold   3,795ns     198ns         380ns
hot    3,863ns     198ns         395ns

The inverted bitmap is a better option than the hashmap when we have a read-heavy workload. One limitation is it requires us to know the search space ahead of time or otherwise requires reindexing which, frankly, is prohibitively expensive.

Optimized Inverted Bitmap

The inverted bitmap technique works well enough, but only if the topic space is fairly static. It also falls over pretty quickly when the topic space and number of subscriptions are large, say, millions of topics and thousands of subscribers. The main benefit of topic-based routing is it allows for faster matching algorithms in contrast to content-based routing, which can be exponentially slower. The truth is, to be useful, your topics probably consist of stock.nyse.ibm, stock.nyse.ge, stock.nasdaq.msft, stock.nasdaq.aapl, etc., not stock.nyse and stock.nasdaq. We could end up with an explosion of topics and, even with efficient bitmaps, the memory consumption tends to be too high despite the fact that most of the bitmaps are quite sparse.

Fortunately, we can reduce the amount of memory we consume using a fairly straightforward optimization. Rather than requiring the entire search space a priori, we simply require the max topic size, in terms of words, e.g. stock.nyse.ibm has a size of 3. We can handle topics of the max size or less, e.g. stock.nyse.bac, stock.nasdaq.txn, forex.usd, index, etc. If we see a message with more words than the max, we can safely assume there are no matching subscriptions.

The optimized inverted bitmap works by splitting topics into their constituent parts. Each constituent position has a set of bitmaps, and we use a technique similar to the one described above on each part. We end up with a bitmap for each constituent which we perform a logical AND on to give a resulting bitmap. Each 1 in the resulting bitmap corresponds to a subscription. This means if the max topic size is n, we only AND at most n bitmaps. Furthermore, if we come across any empty bitmaps, we can stop early since we know there are no matching subscribers.

Let’s say our max topic size is 2 and we have the following subscriptions:

  • 0 = forex.*
  • 1 = stock.nyse
  • 2 = index
  • 3 = stock.*

The inverted bitmap for the first constituent looks like the following:

        forex.*   stock.nyse   index   stock.*
null    0         0            0       0
forex   1         0            0       0
stock   0         1            0       1
index   0         0            1       0
other   0         0            0       0

And the second constituent bitmap:

        forex.*   stock.nyse   index   stock.*
null    0         0            1       0
nyse    0         1            0       0
other   1         0            0       1

The “null” and “other” rows are worth pointing out. “Null” simply means the topic has no corresponding constituent.  For example, “index” has no second constituent, so “null” is marked. “Other” allows us to limit the number of rows needed such that we only need the ones that appear in subscriptions.  For example, if messages are published on forex.eur, forex.usd, and forex.gbp but I merely subscribe to forex.*, there’s no need to index eur, usd, or gbp. Instead, we just mark the “other” row which will match all of them.

Let’s look at an example using the above bitmaps. Imagine we want to route a message published on forex.eur. We split the topic into its constituents: “forex” and “eur.” We get the row corresponding to “forex” from the first constituent bitmap and, since “eur” isn’t indexed, the “other” row from the second, and then AND the rows.

            forex.*   stock.nyse   index   stock.*
1 = forex   1         0            0       0
2 = other   1         0            0       1
AND         1         0            0       0

The forex.* subscription matches.

Let’s try one more example: a message published on stock.nyse.

            forex.*   stock.nyse   index   stock.*
1 = stock   0         1            0       1
2 = nyse    0         1            0       1
AND         0         1            0       1

In this case, we also need to OR the “other” row for the second constituent. This gives us a match for stock.nyse and stock.*.
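In code, the read path boils down to a series of row lookups and ANDs with early termination. The sketch below (continuing the Go sketches, with hypothetical or, and, and isEmpty helpers over the same uint64-chunk bitmaps; building the per-constituent maps is the write-path work described above) mirrors the two examples we just walked through. Each 1 bit left in the final result corresponds to a matching subscription id.

```go
// lookupOptimized matches a topic against per-constituent bitmaps.
// constituents[i] maps the word at position i (or "null"/"other") to a
// bitmap over subscription ids.
func lookupOptimized(constituents []map[string][]uint64, topic string) []uint64 {
	words := strings.Split(topic, ".")
	if len(words) > len(constituents) {
		return nil // longer than the max topic size: no possible match
	}
	var result []uint64
	for i, rows := range constituents {
		var row []uint64
		if i < len(words) {
			// OR the literal word's row with the "other" row so that
			// unindexed words still match wildcard subscriptions.
			row = or(rows[words[i]], rows["other"])
		} else {
			// The topic ran out of words, so only "null" rows can match.
			row = rows["null"]
		}
		if i == 0 {
			result = row
		} else {
			result = and(result, row)
		}
		if isEmpty(result) {
			return nil // short-circuit: no matching subscribers
		}
	}
	return result
}

func or(a, b []uint64) []uint64 {
	if len(b) > len(a) {
		a, b = b, a
	}
	out := append([]uint64(nil), a...)
	for i := range b {
		out[i] |= b[i]
	}
	return out
}

func and(a, b []uint64) []uint64 {
	if len(b) < len(a) {
		a = a[:len(b)]
	}
	out := make([]uint64, len(a))
	for i := range out {
		out[i] = a[i] & b[i]
	}
	return out
}

func isEmpty(a []uint64) bool {
	for _, chunk := range a {
		if chunk != 0 {
			return false
		}
	}
	return true
}
```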

Subscribe operations are significantly faster with the space-optimized inverted bitmap compared to the normal inverted bitmap, but lookups are much slower. However, the optimized version consumes roughly 4.5x less memory for every subscription. The increased flexibility and improved scalability make the optimized version a better choice for all but the most latency-sensitive use cases.

       subscribe   unsubscribe   lookup
cold   1,053ns     330ns         2,724ns
hot    1,076ns     371ns         3,337ns

Trie

The optimized inverted bitmap improves space complexity, but it does so at the cost of lookup efficiency. Is there a way we can reconcile both time and space complexity? While inverted bitmaps allow for efficient lookups, they are quite wasteful for sparse sets, even when using highly compressed bitmaps like Roaring bitmaps.

Tries can often be more space efficient in these circumstances. When we add a subscription, we descend the trie, adding nodes along the way as necessary, until we run out of words in the topic. Finally, we add some metadata containing the subscription information to the last node in the chain. To match a message topic, we perform a similar traversal. If a node doesn’t exist in the chain, we know there are no subscribers. One downside of this method is, in order to support wildcards, we must backtrack on a literal match and check the “*” branch as well.
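A minimal Go sketch of this trie, again illustrative rather than the benchmarked implementation, including the backtrack into the “*” branch on lookup:

```go
// trieNode is a node in the subscription trie; children are keyed by word,
// with "*" stored as an ordinary child.
type trieNode struct {
	children    map[string]*trieNode
	subscribers []Subscriber
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[string]*trieNode)}
}

// Subscribe descends the trie, adding nodes as needed, and attaches the
// subscriber to the node for the last word.
func (n *trieNode) Subscribe(request string, sub Subscriber) {
	node := n
	for _, word := range strings.Split(request, ".") {
		child, ok := node.children[word]
		if !ok {
			child = newTrieNode()
			node.children[word] = child
		}
		node = child
	}
	node.subscribers = append(node.subscribers, sub)
}

// Lookup walks the topic words, backtracking into the "*" branch at each
// level so wildcard subscriptions are found alongside literal ones.
func (n *trieNode) Lookup(topic string) []Subscriber {
	return n.lookup(strings.Split(topic, "."))
}

func (n *trieNode) lookup(words []string) []Subscriber {
	if len(words) == 0 {
		return n.subscribers
	}
	var matched []Subscriber
	if child, ok := n.children[words[0]]; ok {
		matched = append(matched, child.lookup(words[1:])...)
	}
	if star, ok := n.children["*"]; ok {
		matched = append(matched, star.lookup(words[1:])...)
	}
	return matched
}
```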

For the given set of subscriptions, the trie indexes each word as a node, branching at each level, with the subscribers attached to the final node of each chain:

  • forex.*
  • stock.nyse
  • index
  • stock.*

You might be tempted to ask: “Why do we even need the ‘*’ nodes? When someone subscribes to stock.*, just follow all branches after ‘stock’ and add the subscriber.” This would indeed move the backtracking cost from the read path to the write path, but—like the first inverted bitmap we looked at—it only works if the search space is known ahead of time. It would also largely negate the memory-usage benefits we’re looking for since it would require pre-indexing all topics while requiring a finite search space.

It turns out, this trie technique is how systems like ZeroMQ and RabbitMQ implement their topic matching due to its balance between space and time complexity and overall performance predictability.

       subscribe   unsubscribe   lookup
cold   406ns       221ns         2,145ns
hot    443ns       257ns         2,278ns

We can see that, compared to the optimized inverted bitmap, the trie performs much more predictably with respect to the number of subscriptions held.

Concurrent Subscription Trie

One thing we haven’t paid much attention to so far is concurrency. Indeed, message-oriented middleware is typically highly concurrent since it has to deal with heavy IO (reading messages from the wire, writing messages to the wire, reading messages from disk, writing messages to disk, etc.) and CPU operations (like topic matching and routing). Subscribe, unsubscribe, and lookups are usually all happening in different threads of execution. This is especially important when we want to take advantage of multi-core processors.

It wasn’t shown, but all of the preceding algorithms used global locks to ensure thread safety between read and write operations, making the data structures safe for concurrent use. However, the microbenchmarks don’t really show the impact of this, which we will see momentarily.

Lock-freedom, which I’ve written about, allows us to increase throughput at the expense of increased tail latency.

Lock-free concurrency means that while a particular thread of execution may be blocked, all CPUs are able to continue processing other work. For example, imagine a program that protects access to some resource using a mutex. If a thread acquires this mutex and is subsequently preempted, no other thread can proceed until this thread is rescheduled by the OS. If the scheduler is adversarial, it may never resume execution of the thread, and the program would be effectively deadlocked. A key point, however, is that the mere lack of a lock does not guarantee a program is lock-free. In this context, “lock” really refers to deadlock, livelock, or the misdeeds of a malevolent scheduler.

The concurrent subscription trie, or CS-trie, is a new take on the trie-based solution described earlier. It combines the idea of the topic-matching trie with that of a Ctrie, or concurrent trie, which is a non-blocking concurrent hash trie.

The fundamental problem with the trie, as it relates to concurrency, is it requires a global lock, which severely limits throughput. To address this, the CS-trie uses indirection nodes, or I-nodes, which remain present in the trie even as the nodes above and below change. Subscriptions are then added or removed by creating a copy of the respective node, and performing a CAS on its parent I-node. This allows us to add, remove, and lookup subscriptions concurrently and in a lock-free, linearizable manner.
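A minimal sketch of that copy-on-write update in Go might look like the following: the I-node holds an atomically swappable pointer to an immutable node, and writers retry the CAS if they lose a race. The names and structure are illustrative assumptions rather than the actual implementation.

package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// node is an immutable snapshot of one level of the trie: its children and
// the subscribers registered at this level.
type node struct {
	children    map[string]*iNode
	subscribers map[string]struct{}
}

// iNode is an indirection node. It remains present in the trie even as the
// node it points to is replaced.
type iNode struct {
	main unsafe.Pointer // *node
}

func (i *iNode) load() *node {
	return (*node)(atomic.LoadPointer(&i.main))
}

func (i *iNode) cas(old, new *node) bool {
	return atomic.CompareAndSwapPointer(&i.main, unsafe.Pointer(old), unsafe.Pointer(new))
}

// addSubscriber copies the current node, adds the subscriber to the copy, and
// CASes the I-node to point at the copy. If another writer wins the race, we
// retry against the new state; readers are never blocked.
func addSubscriber(in *iNode, sub string) {
	for {
		old := in.load()
		copied := &node{
			children:    old.children, // unchanged children can be shared
			subscribers: make(map[string]struct{}, len(old.subscribers)+1),
		}
		for s := range old.subscribers {
			copied.subscribers[s] = struct{}{}
		}
		copied.subscribers[sub] = struct{}{}
		if in.cas(old, copied) {
			return
		}
	}
}

func main() {
	root := &iNode{main: unsafe.Pointer(&node{
		children:    map[string]*iNode{},
		subscribers: map[string]struct{}{},
	})}
	addSubscriber(root, "sub-1")
	addSubscriber(root, "sub-2")
	fmt.Println(len(root.load().subscribers)) // 2
}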

For the given set of subscribers, labeled x, y, and z, the CS-trie would look something like the following:

  • x = foo, bar, bar.baz
  • y = foo, bar.qux
  • z = bar.*

Lookups on the CS-trie perform, on average, better than the standard trie, and the CS-trie scales better with respect to concurrent operations.

       subscribe   unsubscribe   lookup
cold   412ns       245ns         1,615ns
hot    471ns       280ns         1,637ns

Latency Comparison

The chart below shows the topic-matching operation latencies for all of the algorithms side-by-side. First, we look at the performance of a cold start (no subscriptions) and then the performance of a hot start (1,000 subscriptions).

Throughput Comparison

So far, we’ve looked at the latency of individual topic-matching operations. Next, we look at overall throughput of each of the algorithms and their memory footprint.

algorithm                   msg/sec
naive                       4,053.48
inverted bitmap             1,052,315.02
optimized inverted bitmap   130,705.98
trie                        248,762.10
cs-trie                     340,910.64

On the surface, the inverted bitmap looks like the clear winner, clocking in at over 1 million matches per second. However, we know the inverted bitmap does not scale and, indeed, this becomes clear when we look at memory consumption, underscored by the fact that the below chart uses a log scale.

Scalability with Respect to Concurrency

Lastly, we’ll look at how each of these algorithms scales with respect to concurrency. We do this by performing concurrent operations and varying the level of concurrency and number of operations. We start with a 50-50 split between reads and writes. We vary the number of goroutines from 2 to 16 (the benchmark was run using a 2.6 GHz Intel Core i7 processor with 8 logical cores). Each goroutine performs 1,000 reads or 1,000 writes. For example, the 2-goroutine benchmark performs 1,000 reads and 1,000 writes, the 4-goroutine benchmark performs 2,000 reads and 2,000 writes, etc. We then measure the total amount of time needed to complete the workload.

We can see that the tries hardly even register on the scale above, so we’ll plot them separately.

The tries are clearly much more efficient than the other solutions, but the CS-trie in particular scales well to the increased workload and concurrency.

Since most workloads are heavily biased towards reads over writes, we’ll run a separate benchmark that uses a 90-10 split reads and writes. This should hopefully provide a more realistic result.

The results look, more or less, like what we would expect, with the reduced writes improving the inverted bitmap performance. The CS-trie still scales quite well in comparison to the global-lock trie.

Conclusion

As we’ve seen, there are several approaches to consider to implement fast topic matching. There are also several aspects to look at: read/write access patterns, time complexity, space complexity, throughput, and latency.

The naive hashmap solution is generally a poor choice due to its prohibitively expensive lookup time. Inverted bitmaps offer a better solution. The standard implementation is reasonable if the search space is finite, small, and known a priori, especially if read latency is critical. The space-optimized version is a better choice for scalability, offering a good balance between read and write performance while keeping a small memory footprint. The trie is an even better choice, providing lower latency than the optimized inverted bitmap and consuming less memory. It’s particularly good if the subscription tree is sparse and topics are not known a priori. Lastly, the concurrent subscription trie is the best option if there is high concurrency and throughput matters. It offers similar performance to the trie but scales better. The only downside is an increase in implementation complexity.

Take It to the Limit: Considerations for Building Reliable Systems

Complex systems usually operate in failure mode. This is because a complex system typically consists of many discrete pieces, each of which can fail in isolation (or in concert). In a microservice architecture where a given function potentially comprises several independent service calls, high availability hinges on the ability to be partially available. This is a core tenet behind resilience engineering. If a function depends on three services, each with a reliability of 90%, 95%, and 99%, respectively, partial availability could be the difference between 99.995% reliability and 84% reliability (requiring all three to be up yields 0.90 × 0.95 × 0.99 ≈ 84.6%, assuming failures are independent). Resilience engineering means designing with failure as the norm.

Anticipating failure is the first step to resilience zen, but the second is embracing it. Telling the client “no” and failing on purpose is better than failing in unpredictable or unexpected ways. Backpressure is another critical resilience engineering pattern. Fundamentally, it’s about enforcing limits. This comes in the form of queue lengths, bandwidth throttling, traffic shaping, message rate limits, max payload sizes, etc. Prescribing these restrictions makes the limits explicit when they would otherwise be implicit (eventually your server will exhaust its memory, but since the limit is implicit, it’s unclear exactly when or what the consequences might be). Relying on unbounded queues and other implicit limits is like someone saying they know when to stop drinking because they eventually pass out.

Rate limiting is important not just to prevent bad actors from DoSing your system, but also yourself. Queue limits and message size limits are especially interesting because they seem to confuse and frustrate developers who haven’t fully internalized the motivation behind them. But really, these are just another form of rate limiting or, more generally, backpressure. Let’s look at max message size as a case study.

Imagine we have a system of distributed actors. An actor can send messages to other actors who, in turn, process the messages and may choose to send messages themselves. Now, as any good software engineer knows, the eighth fallacy of distributed computing is “the network is homogeneous.” This means not all actors are using the same hardware, software, or network configuration. We have servers with 128GB RAM running Ubuntu, laptops with 16GB RAM running macOS, mobile clients with 2GB RAM running Android, IoT edge devices with 512MB RAM, and everything in between, all running a hodgepodge of software and network interfaces.

When we choose not to put an upper bound on message sizes, we are making an implicit assumption (recall the discussion on implicit/explicit limits from earlier). Put another way, you and everyone you interact with (likely unknowingly) enter an unspoken contract from which neither party can opt out. This is because any actor may send a message of arbitrary size. This means any downstream consumers of this message, either directly or indirectly, must also support arbitrarily large messages.

How can we test something that is arbitrary? We can’t. We have two options: either we make the limit explicit or we keep this implicit, arbitrarily binding contract. The former allows us to define our operating boundaries and gives us something to test. The latter requires us to test at some undefined production-level scale. The second option is literally gambling reliability for convenience. The limit is still there, it’s just hidden. When we don’t make it explicit, we make it easy to DoS ourselves in production. Limits become even more important when dealing with cloud infrastructure due to their multitenant nature. They prevent a bad actor (or yourself) from bringing down services or dominating infrastructure and system resources.

In our heterogeneous actor system, we have messages bound for mobile devices and web browsers, which are often single-threaded or memory-constrained consumers. Without an explicit limit on message size, a client could easily doom itself by requesting too much data or simply receiving data outside of its control—this is why the contract is unspoken but binding.
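As a trivial illustration of making such a limit explicit, the publish path can enforce it at the boundary and fail fast. The 1MB figure and the names below are assumptions chosen for the example, not a prescription.

package main

import (
	"errors"
	"fmt"
)

// An explicit, documented limit: anything larger is rejected up front rather
// than being allowed to ripple through downstream consumers.
const maxMessageSize = 1 << 20 // 1MB, chosen for illustration

var errMessageTooLarge = errors.New("message exceeds maximum allowed size")

// publish tells the caller "no" explicitly instead of failing in some
// unpredictable way further down the line.
func publish(topic string, msg []byte) error {
	if len(msg) > maxMessageSize {
		return fmt.Errorf("%w: %d bytes (limit %d)", errMessageTooLarge, len(msg), maxMessageSize)
	}
	// ... hand the message off to the transport here ...
	return nil
}

func main() {
	err := publish("orders", make([]byte, 2<<20))
	fmt.Println(err) // message exceeds maximum allowed size: 2097152 bytes (limit 1048576)
}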

Let’s look at this from a different kind of engineering perspective. Consider another type of system: the US National Highway System. The US Department of Transportation uses the Federal Bridge Gross Weight Formula as a means to prevent heavy vehicles from damaging roads and bridges. It’s really the same engineering problem, just a different discipline and a different type of infrastructure.

The August 2007 collapse of the Interstate 35W Mississippi River bridge in Minneapolis brought renewed attention to the issue of truck weights and their relation to bridge stress. In November 2008, the National Transportation Safety Board determined there had been several reasons for the bridge’s collapse, including (but not limited to): faulty gusset plates, inadequate inspections, and the extra weight of heavy construction equipment combined with the weight of rush hour traffic.

The DOT relies on weigh stations to ensure trucks comply with federal weight regulations, fining those that exceed restrictions without an overweight permit.

The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country’s highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.

Weight limits need to be enforced so civil engineers have a defined operating range for the roads, bridges, and other infrastructure they build. Computers are no different. This is the reason many systems enforce these types of limits. For example, Amazon clearly publishes the limits for its Simple Queue Service—the max in-flight messages for standard queues is 120,000 messages and 20,000 messages for FIFO queues. Messages are limited to 256KB in size. Amazon Kinesis, Apache Kafka, NATS, and Google App Engine pull queues all limit messages to 1MB in size. These limits allow the system designers to optimize their infrastructure and ameliorate some of the risks of multitenancy—not to mention it makes capacity planning much easier.

Unbounded anything—whether it’s queues, message sizes, queries, or traffic—is a resilience engineering anti-pattern. Without explicit limits, things fail in unexpected and unpredictable ways. Remember, the limits exist, they’re just hidden. By making them explicit, we restrict the failure domain, giving us more predictability, longer mean time between failures, and shorter mean time to recovery at the cost of more upfront work or slightly more complexity.

It’s better to be explicit and handle these limits upfront than to punt on the problem and allow systems to fail in unexpected ways. The latter might seem like less work at first but will lead to more problems long term. By requiring developers to deal with these limitations directly, they will think through their APIs and business logic more thoroughly and design better interactions with respect to stability, scalability, and performance.

Benchmarking Commit Logs

In this article, we look at Apache Kafka and NATS Streaming, two messaging systems based on the idea of a commit log. We’ll compare some of the features of both but spend less time talking about Kafka since by now it’s quite well known. Similar to previous studies, we’ll attempt to quantify their general performance characteristics through careful benchmarking.

The purpose of this benchmark is to test drive the newly released NATS Streaming system, which was made generally available just in the last few months. NATS Streaming doesn’t yet support clustering, so we try to put its performance into context by looking at a similar configuration of Kafka.

Unlike conventional message queues, a commit log is an append-only data structure. This results in several nice properties like total ordering of messages, at-least-once delivery, and message-replay semantics. Jay Kreps’ blog post The Log is a great introduction to the concept and particularly why it’s so useful in the context of distributed systems and stream processing (his book I Heart Logs is an extended version of the blog post and is a quick read).

Kafka, which originated at LinkedIn, is by far the most popular and most mature implementation of the commit log (AWS offers their own flavor of it called Kinesis, and imitation is the sincerest form of flattery). It’s billed as a “distributed streaming platform for building real-time data pipelines and streaming apps.” The much newer NATS Streaming is actually a data-streaming layer built on top of Apcera’s high-performance publish-subscribe system NATS. It’s billed as “real-time streaming for Big Data, IoT, Mobile, and Cloud Native Applications.” Both have some similarities as well as some key differences.

Fundamental to the notion of a log is a way to globally order events. Neither NATS Streaming nor Kafka are actually a single log but many logs, each totally ordered using a sequence number or offset, respectively.

In Kafka, topics are partitioned into multiple logs which are then replicated across a number of servers for fault tolerance, making it a distributed commit log. Each partition has a server that acts as the leader. Cluster membership and leader election is managed by ZooKeeper.

NATS Streaming’s topics are called “channels” which are globally ordered. Unlike Kafka, NATS Streaming does not support replication or partitioning of channels, though my understanding is clustering support is slated for Q1 2017. Its message store is pluggable, so it can provide durability using a file-backed implementation, like Kafka, or simply an in-memory store.

NATS Streaming is closer to a hybrid of traditional message queues and the commit log. Like Kafka, it allows replaying the log from a specific offset, the beginning of time, or the newest offset, but it also exposes an API for reading from the log at a specific physical time offset, e.g. all messages from the last 30 seconds. Kafka, on the other hand, only has a notion of logical offsets (correction: Kafka added support for offset lookup by timestamp in 0.10.1.0). Generally, relying on physical time is an anti-pattern in distributed systems due to clock drift and the fact that clocks are not always monotonic. For example, imagine a situation where a NATS Streaming server is restarted and the clock is changed. Messages are still ordered by their sequence numbers but their timestamps might not reflect that. Developers would need to be aware of this while implementing their business logic.
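For illustration, here is roughly what reading from the log at a physical time offset looks like with the NATS Streaming Go client of that era (the cluster ID, client ID, and channel name are placeholders):

package main

import (
	"log"
	"time"

	stan "github.com/nats-io/go-nats-streaming"
)

func main() {
	sc, err := stan.Connect("test-cluster", "replay-example")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Replay everything published on "orders" during the last 30 seconds,
	// then continue receiving new messages as they arrive.
	sub, err := sc.Subscribe("orders", func(m *stan.Msg) {
		log.Printf("seq=%d data=%s", m.Sequence, m.Data)
	}, stan.StartAtTimeDelta(30*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	select {} // block so the subscription stays active
}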

With Kafka, it’s strictly on consumers to track their offset into the log (or the high-level consumer which stores offsets in ZooKeeper (correction: Kafka itself can now store offsets which is used by the new Consumer API, meaning clients do not have to manage offsets directly or rely on ZooKeeper)). NATS Streaming allows clients to either track their sequence number or use a durable subscription, which causes the server to track the last acknowledged message for a client. If the client restarts, the server will resume delivery starting at the earliest unacknowledged message. This is closer to what you would expect from a traditional message-oriented middleware like RabbitMQ.

Lastly, NATS Streaming supports publisher and subscriber rate limiting. This works by configuring the maximum number of in-flight (unacknowledged) messages either from the publisher to the server or from the server to the subscriber. Starting in version 0.9, Kafka supports a similar rate limiting feature that allows producer and consumer byte-rate thresholds to be defined for groups of clients with its Quotas protocol.

Kafka was designed to avoid tracking any client state on the server for performance and scalability reasons. Throughput and storage capacity scale linearly with the number of nodes. NATS Streaming provides some additional features over Kafka at the cost of some added state on the server. Since clustering isn’t supported, there isn’t really any scale or HA story yet, so it’s unclear how that will play out. That said, once replication is supported, a lot of work will need to go into verifying its correctness, and that maturity is a major advantage Kafka already has.

Benchmarks

Since NATS Streaming does not support replication at this time (0.3.1), we’ll compare running a single instance of it with file-backed persistence to running a single instance of Kafka (0.10.1.0). We’ll look at both latency and throughput running on commodity hardware (m4.xlarge EC2 instances) with load generation and consumption each running on separate instances. In all of these benchmarks, the systems under test have not been tuned at all and are essentially in their “off-the-shelf” configurations.

We’ll first look at latency by publishing messages of various sizes, ranging from 256 bytes to 1MB, at a fixed rate of 50 messages/second for 30 seconds. Message contents are randomized to account for compression. We then plot the latency distribution by percentile on a logarithmic scale from the 0th percentile to the 99.9999th percentile. Benchmarks are run several times in an attempt to produce a “normalized” result. The benchmark code used is open source.

First, to establish a baseline and later get a feel for the overhead added by the file system, we’ll benchmark NATS Streaming with in-memory storage, meaning messages are not written to disk.

Unsurprisingly, the 1MB configuration has much higher latencies than the other configurations, but everything falls within single-digit-millisecond latencies.

nats_mem

NATS Streaming 0.3.1 (in-memory persistence)

Size   99%         99.9%      99.99%     99.999%    99.9999%
256B   0.3750ms    1.0367ms   1.1257ms   1.1257ms   1.1257ms
1KB    0.38064ms   0.8321ms   1.3260ms   1.3260ms   1.3260ms
5KB    0.4408ms    1.7569ms   2.1465ms   2.1465ms   2.1465ms
1MB    6.6337ms    8.8097ms   9.5263ms   9.5263ms   9.5263ms

Next, we look at NATS Streaming with file-backed persistence. This provides the same durability guarantees as Kafka running with a replication factor of 1. By default, Kafka stores logs under /tmp. Many Unix distributions mount /tmp to tmpfs which appears as a mounted file system but is actually stored in volatile memory. To account for this and provide as level a playing field as possible, we configure NATS Streaming to also store its logs in /tmp.

As expected, latencies increase by about an order of magnitude once we start going to disk.

nats_file_fsync

NATS Streaming 0.3.1 (file-backed persistence)

Size   99%         99.9%        99.99%       99.999%      99.9999%
256B   21.7051ms   25.0369ms    27.0524ms    27.0524ms    27.0524ms
1KB    20.6090ms   23.8858ms    24.7124ms    24.7124ms    24.7124ms
5KB    22.1692ms   35.7394ms    40.5612ms    40.5612ms    40.5612ms
1MB    45.2490ms   130.3972ms   141.1564ms   141.1564ms   141.1564ms

Since we will be looking at Kafka, there is an important thing to consider relating to fsync behavior. As of version 0.8, Kafka does not call fsync directly and instead relies entirely on the background flush performed by the OS. This is clearly indicated by their documentation:

We recommend using the default flush settings which disable application fsync entirely. This means relying on the background flush done by the OS and Kafka’s own background flush. This provides the best of all worlds for most uses: no knobs to tune, great throughput and latency, and full recovery guarantees. We generally feel that the guarantees provided by replication are stronger than sync to local disk, however the paranoid still may prefer having both and application level fsync policies are still supported.

However, NATS Streaming calls fsync every time a batch is written to disk by default. This can be disabled through the use of the --file_sync flag. By setting this flag to false, we put NATS Streaming’s persistence behavior closer in line with Kafka’s (again assuming a replication factor of 1).

As an aside, the comparison between NATS Streaming and Kafka still isn’t completely “fair”. Jay Kreps points out that Kafka relies on replication as the primary means of durability.

Kafka leaves [fsync] off by default because it relies on replication not fsync for durability, which is generally faster. If you don’t have replication I think you probably need fsync and maybe some kind of high integrity file system.

I don’t think we can provide a truly fair comparison until NATS Streaming supports replication, at which point we will revisit this.

To no one’s surprise, setting --file_sync=false has a significant impact on latency, shown in the distribution below.

nats_file_no_fsync

In fact, it’s now in line with the in-memory performance as before for 256B, 1KB, and 5KB messages, shown in the comparison below.

nats_file_mem

For a reason I have yet to figure out, the latency for 1MB messages is roughly an order of magnitude lower with fsync enabled beyond the 95th percentile, which seems counterintuitive. If anyone has an explanation, I would love to hear it. I’m sure there’s a good debug story there. The distribution below shows the 1MB configuration for NATS Streaming with and without fsync enabled and just how big the difference is at the 95th percentile and beyond.

nats_file_mem_1mb

NATS Streaming 0.3.1 (file-backed persistence, --file_sync=false)

Size   99%           99.9%         99.99%        99.999%       99.9999%
256B   0.4304ms      0.8577ms      1.0706ms      1.0706ms      1.0706ms
1KB    0.4372ms      1.5987ms      1.8651ms      1.8651ms      1.8651ms
5KB    0.4939ms      2.0828ms      2.2540ms      2.2540ms      2.2540ms
1MB    1296.1464ms   1556.1441ms   1596.1457ms   1596.1457ms   1596.1457ms

Kafka with replication factor 1 tends to have higher latencies than NATS Streaming with --file_sync=false. There is one potential caveat here, which Ivan Kozlovic pointed out to me: NATS Streaming uses a caching optimization for reads that may give it an advantage.

Now, there is one side where NATS Streaming *may* be looking better and not fair to Kafka. By default, the file store keeps everything in memory once stored. This means look-ups will be fast. There is only a all-or-nothing mode right now, which means either cache everything or nothing. With caching disabled (--file_cache=false), every lookup will result in disk access (which when you have 1 to many subscribers will be bad). I am working on changing that. But if you do notice that in Kafka, consuming results in a disk read (given the other default behavior described above, they actually may not ;-), then you could disable NATS Streaming file caching.

Fortunately, we can verify if Kafka is actually going to disk to read messages back from the log during the benchmark using iostat. We see something like this for the majority of the benchmark duration:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.53    0.00   11.28    0.00    0.00   75.19

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda      0.00         0.00         0.00          0          0

Specifically, we’re interested in Blk_read, which indicates the total number of blocks read. It appears that Kafka does indeed make heavy use of the operating system’s page cache as Blk_wrtn and Blk_read rarely show any activity throughout the entire benchmark. As such, it seems fair to leave NATS Streaming’s --file_cache=true, which is the default.

One interesting point is Kafka offloads much of its caching to the page cache and outside of the JVM heap, clearly in an effort to minimize GC pauses. I’m not clear if the cache Ivan refers to in NATS Streaming is off-heap or not (NATS Streaming is written in Go which, like Java, is a garbage-collected language).

Below is the distribution of latencies for 256B, 1KB, and 5KB configurations in Kafka.

kafka

Similar to NATS Streaming, 1MB message latencies tend to be orders of magnitude worse after about the 80th percentile. The distribution below compares the 1MB configuration for NATS Streaming and Kafka.

nats_kafka_1mb

Kafka 0.10.1.0 (replication factor 1)

Size   99%           99.9%         99.99%        99.999%       99.9999%
256B   0.9230ms      1.4575ms      1.6596ms      1.6596ms      1.6596ms
1KB    0.5942ms      1.3123ms      17.6556ms     17.6556ms     17.6556ms
5KB    0.7203ms      5.7236ms      18.9334ms     18.9334ms     18.9334ms
1MB    5337.3174ms   5597.3315ms   5617.3199ms   5617.3199ms   5617.3199ms

The percentile distributions below compare NATS Streaming and Kafka for the 256B, 1KB, and 5KB configurations, respectively.

nats_kafka_256b

nats_kafka_1kb

nats_kafka_5kb

Next, we’ll look at overall throughput for the two systems. This is done by publishing 100,000 messages using the same range of sizes as before and measuring the elapsed time. Specifically, we measure throughput at the publisher and the subscriber.

Despite using an asynchronous publisher in both the NATS Streaming and Kafka benchmarks, we do not consider the publisher “complete” until it has received acks for all published messages from the server. In Kafka, we do this by setting request.required.acks to 1, meaning the leader replica must acknowledge receipt of the data, and by consuming the acks as they are received. This is important because the default value is 0, which means the producer never waits for an ack from the broker. In NATS Streaming, we provide an ack callback on every publish. We use the same benchmark configuration as the latency tests, separating load generation and consumption on different EC2 instances. Note the log scale in the following charts.
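For reference, here is a condensed sketch of what waiting on producer acks looks like with the Sarama client; the broker address, topic, message count, and payload size are placeholders rather than the actual benchmark code.

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// Wait for the leader replica to ack each message (request.required.acks = 1).
	config.Producer.RequiredAcks = sarama.WaitForLocal
	// Deliver acks back on the Successes channel so we can count them.
	config.Producer.Return.Successes = true

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	const n = 100000
	go func() {
		for i := 0; i < n; i++ {
			producer.Input() <- &sarama.ProducerMessage{
				Topic: "bench",
				Value: sarama.ByteEncoder(make([]byte, 256)),
			}
		}
	}()

	// The publisher isn't "complete" until every message has been acked.
	acked := 0
	for acked < n {
		select {
		case <-producer.Successes():
			acked++
		case err := <-producer.Errors():
			log.Fatal(err)
		}
	}
}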

Once again, we’ll start by looking at NATS Streaming using in-memory persistence. The truncated 1MB send and receive throughputs are 93.01 messages/second.

nats_mem_throughput

For comparison, we now look at NATS Streaming with file persistence and --file_sync=false. As before, this provides the closest behavior to Kafka’s default flush behavior. The second chart shows a side-by-side comparison between NATS Streaming with in-memory and file persistence.

nats_file_throughput

nats_compare_throughput

Lastly, we look at Kafka with replication factor 1. Throughput significantly deteriorates when we set request.required.acks = 1 since the producer must wait for all acks from the server. This is important though because, by default, the client does not require an ack from the server. If this were the case, the producer would have no idea how much data actually reached the server once it finished—it could simply be buffered in the client, in flight over the wire, or in the server but not yet on disk. Running the benchmark with request.required.acks = 0 yields much higher throughput on the sender but is basically an exercise in how fast you can write to a channel using the Sarama Go client—slightly misleading.

kafka_throughput

Looking at some comparisons of Kafka and NATS Streaming, we can see that NATS Streaming has higher throughput in all but a few cases.

nats_kafka_throughput

nats_kafka_send_throughput

I want to repeat the disclaimer from before: the purpose of this benchmark is to test drive the newly released NATS Streaming system (which as mentioned earlier, doesn’t yet support clustering), and put its performance into context by looking at a similar configuration of Kafka.

Kafka generally scales very well, so measuring the throughput of a single broker with a single producer and single consumer isn’t particularly meaningful. In reality, we’d be running a cluster with several brokers and partitioning our topics across them.

For as young as it is, NATS Streaming has solid performance (which shouldn’t come as much of a surprise considering the history of NATS itself), and I imagine it will only get better with time as the NATS team continues to optimize. In some ways, NATS Streaming bridges the gap between the commit log as made popular by Kafka and the conventional message queue as made popular by protocols like JMS, AMQP, STOMP, and the like.

The bigger question at this point is how NATS Streaming will tackle scaling and replication (a requirement for true production-readiness in my opinion). Kafka was designed from the ground up for high scalability and availability through the use of external coordination (read: ZooKeeper). Naturally, there is a lot of complexity and cost that comes with that. NATS Streaming attempts to keep NATS’ spirit of simplicity, but it’s yet to be seen how it will reconcile that with the complex nature of distributed systems. I’m excited to see where Apcera takes NATS Streaming and generally the NATS ecosystem in the future since the team has a lot of experience in this area.

You Are Not Paid to Write Code

“Taco Bell Programming” is the idea that we can solve many of the problems we face as software engineers with clever reconfigurations of the same basic Unix tools. The name comes from the fact that every item on the menu at Taco Bell, a company which generates almost $2 billion in revenue annually, is simply a different configuration of roughly eight ingredients.

Many people grumble or reject the notion of using proven tools or techniques. It’s boring. It requires investing time to learn at the expense of shipping code.  It doesn’t do this one thing that we need it to do. It won’t work for us. For some reason—and I continue to be completely baffled by this—everyone sees their situation as a unique snowflake despite the fact that a million other people have probably done the same thing. It’s a weird form of tunnel vision, and I see it at every level in the organization. I catch myself doing it on occasion too. I think it’s just human nature.

I was able to come to terms with this once I internalized something a colleague once said: you are not paid to write code. You have never been paid to write code. In fact, code is a nasty byproduct of being a software engineer.

Every time you write code or introduce third-party services, you are introducing the possibility of failure into your system.

I think the idea of Taco Bell Programming can be generalized further and has broader implications based on what I see in industry. There are a lot of parallels to be drawn from The Systems Bible by John Gall, which provides valuable commentary on general systems theory. Gall’s Fundamental Theorem of Systems is that new systems mean new problems. I think the same can safely be said of code—more code, more problems. Do it without a new system if you can.

Systems are seductive and engineers in particular seem to have a predisposition for them. They promise to do a job faster, better, and more easily than you could do it by yourself or with a less specialized system. But when you introduce a new system, you introduce new variables, new failure points, and new problems.

But if you set up a system, you are likely to find your time and effort now being consumed in the care and feeding of the system itself. New problems are created by its very presence. Once set up, it won’t go away, it grows and encroaches. It begins to do strange and wonderful things. Breaks down in ways you never thought possible. It kicks back, gets in the way, and opposes its own proper function. Your own perspective becomes distorted by being in the system. You become anxious and push on it to make it work. Eventually you come to believe that the misbegotten product it so grudgingly delivers is what you really wanted all the time. At that point encroachment has become complete. You have become absorbed. You are now a systems person.

The last systems principle we look at is one I find particularly poignant: almost anything is easier to get into than out of. When we introduce new systems, new tools, new lines of code, we’re with them for the long haul. It’s like a baby that doesn’t grow up.

We’re not paid to write code, we’re paid to add value (or reduce cost) to the business. Yet I often see people measuring their worth in code, in systems, in tools—all of the output that’s easy to measure. I see it come at the expense of attending meetings. I see it at the expense of supporting other teams. I see it at the expense of cross-training and personal/professional development. It’s like full-bore coding has become the norm and we’ve given up everything else.

Another area I see this manifest is with the siloing of responsibilities. Product, Platform, Infrastructure, Operations, DevOps, QA—whatever the silos, it’s created a sort of responsibility lethargy. “I’m paid to write software, not tests” or “I’m paid to write features, not deploy and monitor them.” Things of that nature.

I think this is only addressed by stewarding a strong engineering culture and instilling the right values and expectations. For example, engineers should understand that they are not defined by their tools but rather the problems they solve and ultimately the value they add. But it’s important to spell out that this goes beyond things like commits, PRs, and other vanity metrics. We should embrace the principles of systems theory and Taco Bell Programming. New systems or more code should be the last resort, not the first step. Further, we should embody what it really means to be an engineer rather than measuring raw output. You are not paid to write code.

Shit Rolls Downhill

Building software of significant complexity is tough because a lot of pieces have to come together and a lot of teams have to work in concert to be successful. It can be extraordinarily difficult to get everyone on the same page and moving in tandem toward a common goal. Product development is largely an exercise in trust (or perhaps more accurately, hiring), but even if you have the “right” people—people you can trust and depend on to get things done—you’re only halfway there.

Trust is an important quality to screen for, difficult though it may be. However, a person’s trustworthiness or dependability doesn’t really tell you much about that person as an engineer. The engineering culture is something that must be cultivated. Etsy’s CTO, John Allspaw, said it best in a recent interview:

Post-mortem debriefings every day are littered with the artifacts of people insisting, the second before an outage, that “I don’t have to care about that.”

If “abstracting away” is nothing for you but a euphemism for “Not my job,” “I don’t care about that,” or “I’m not interested in that,” I think Etsy might not be the place for you. Because when things break, when things don’t behave the way they’re expected to, you can’t hold up your arms and say “Not my problem.” That’s what I could call “covering your ass” engineering, and it may work at other companies, but it doesn’t work here.

Allspaw calls this the distinction between hiring software developers and software engineers. This perception often results in heated debate, but I couldn’t agree with it more. There is a very real distinction to be made. Abstraction is not about boundaries of concern, it’s about boundaries of focus. Engineers need to have an intimate understanding of this.

Engineering, as a discipline and as an activity, is multi-disciplinary. It’s just messy. And that’s actually the best part of engineering. It’s not about everyone knowing everything. It’s about paying attention to the shared, mutual understanding.

But engineering is more than just technical aptitude and a willingness to “dig in” to the guts of something. It’s about having an acute awareness of the delicate structure upon which software is built. More succinctly, it’s about having empathy. It’s recognizing the fact that shit rolls downhill.


For things to work, the entire structure has to hold, and no one point is any more or less important than the others. It almost always starts off with good intentions at the top, but the shit starts to compound and accelerate as it rolls effortlessly and with abandon toward the bottom. There are a few aspects to this I want to explore.

Understand the Relationships

This isn’t to say that folks near the top are less susceptible to shit. Everyone has to shovel it, but the way it manifests is different depending on where you find yourself on the hill. The key point is that the people above you are effectively your customers, either directly or indirectly, and if you’re toward the top, maybe literally.

And, as all customers do, they make demands. This is a very normal thing and is to be expected. Some of these demands are reasonable, others not so much. Again, this is normal, but what do we make of these demands?

There are some interesting insights we can take from The Innovator’s Dilemma (which, by the way, is an essential read for anyone looking to build, run, or otherwise contribute to a successful business), which are especially relevant toward the top of the hill. Mainly, we should not merely take the customer’s word as gospel. When it comes to products, feature requests, and “the way things should be done,” the customer tends to have a very narrow and predisposed view. I find the following passage to be particularly poignant:

Indeed, the power and influence of leading customers is a major reason why companies’ product development trajectories overshoot demands of mainstream markets.

Essentially, too much emphasis can be placed on the current or perceived needs of the customer, resulting in a failure to meet their unstated or future needs (or if we’re talking about internal customers, the current or future needs of the business). Furthermore, we can spend too much time focusing on the customer’s needs—often perceived needs—culminating in a paralysis to ship. This is very anti-continuous-delivery. Get things out fast, see where they land, and make appropriate adjustments on the fly.

Giving in to customer demands is a judgement game, but depending on the demand, it can have profound impact on the people further down the hill. Thus, these decisions should be made accordingly and in a way that involves a cross section of the hill. If someone near the top is calling all the shots, things are not going to work out, and in all likelihood, someone else is going to end up getting covered in shit.

An interesting corollary is the relationship between leadership and engineers. Even a single, seemingly innocuous question asked in passing by a senior manager can change the entire course of a development team. The manager may simply be trying to gain information, but the team interprets the question as a statement suggesting “this thing needs to be done.” It’s important to recognize this interaction for what it is.

Set Appropriate Expectations

In truth, the relationship between teams is not equivalent to the relationship between actual customers and the business. You may depend on another team in order to provide a certain feature or to build a certain product. If the business is lagging, the customer might take their money elsewhere. If the team you depend on is lagging, you might not have the same liberty. This leads to the dangerous “us versus them” trap teams fall into as an organization grows. The larger a company gets, the more fingers get pointed because “they’re no longer us, they’re them.” There are more teams, they are more isolated, and there are more dependencies. It doesn’t matter how great your culture is, changing human nature is hard. And when pressure builds from above, the finger-pointing only intensifies.

Therefore, it’s critical to align yourself with the teams you depend on. Likewise, align yourself with the teams that depend on you, don’t alienate them. In part, this means have a realistic sense of urgency, have realistic expectations, and plan accordingly. It’s not reasonable to submit a work item to another team and turn around and call it a blocker. Doing so means you failed to plan, but now to outside observers, it’s the other team which is the problem. As we prioritize the work precipitated by our customers, so do the rest of our teams. With few exceptions, you cannot expect a team to drop everything they’re doing to focus on your needs. This is the aforementioned “us versus them” mentality. Instead, align. Speak with the team you depend on, understand where your needs fit within their current priorities, and if it’s a risk, be willing to roll up your sleeves and help out. This is exactly what Allspaw was getting at when he described what a “software engineer” is.

Setting realistic expectations is vital. Just as products ship with bugs, so does everything else in the stack. Granted, some bugs are worse than others, but no amount of QA will fully prevent them from going to production. Bugs will only get worked out if the code actually gets used. You cannot wait until something is perfect before adopting it. You will wait forever. Remember that Agile is micro failure on a macro level. Adopt quickly, deploy quickly, fail quickly, adjust quickly. As Jay Kreps once said, “The only way to really know if a system design works in the real world is to build it, deploy it for real applications, and see where it falls short.”

While it’s important to set appropriate expectations downward, it’s also important to communicate upward. Ensure that the teams relying on you have the correct expectations. Establish what the team’s short-term and long-term goals are and make them publicly available. Enable those teams to plan accordingly, and empower them so that they can help out when needed. Provide adequate documentation such that another engineer can jump in at any time with minimal handoff.

Be Curious

This largely gets back to the quote by John Allspaw. The point is that we want to hire and develop software engineers, not programmers. Being an engineer should mean having an innate curiosity. Figure out what you don’t know and push beyond it.

Understand, at least on some level, the things that you depend on. Own everything. Similarly, if you built it and it’s running in production, it’s on you to support it. Throwing code over the wall is no longer acceptable. When there’s a problem with something you depend on, don’t just throw up your hands and say “not my problem.” Investigate it. If you’re certain it’s a problem in someone else’s system, bring it to them and help root cause it. Provide context. When did it start happening? What were the related events? What were the effects? Don’t just send an error message from the logs.

This is the engineering culture that gets you the rest of the way there. The people are important, especially early on, but it’s the core values and practices that will carry you. The Innovator’s Dilemma again provides further intuition:

In the start-up stages of an organization, much of what gets done is attributable to resources—people, in particular. The addition or departure of a few key people can profoundly influence its success. Over time, however, the locus of the organization’s capabilities shifts toward its processes and values. As people address recurrent tasks, processes become defined. And as the business model takes shape and it becomes clear which types of business need to be accorded highest priority, values coalesce. In fact, one reason that many soaring young companies flame out after an IPO based on a single hot product is that their initial success is grounded in resources—often the founding engineers—and they fail to develop processes that can create a sequence of hot products.

Summary

There will always be gravity. As such, shit will always roll downhill. It’s important to embrace this structure, to understand the relationships, and to set appropriate expectations. Equally important is fostering an engineering culture—a culture of curiosity, ownership, and mutual understanding. Having the right people is essential, but it’s only half the problem. The other half is instilling the right values and practices. Shit rolls downhill, but if you have the right people, values, and practices in place, that manure might just grow something amazing.

Abstraction Considered Harmful

“Abstraction is sometimes harmful,” he proclaims to the sound of anxious whooping and subdued applause from the audience. Peter Alvaro’s 2015 Strange Loop keynote, I See What You Mean, remains one of my favorite talks—not just because of its keen insight on distributed computing and language design, but because of a more fundamental, almost primordial, understanding of systems thinking.

Abstraction is what we use to manage complexity. We build something of significant complexity, we mask its inner workings, and we expose what we think is necessary for interacting with it.

Programmers are lazy, and abstractions help us be lazy. The builders of abstractions need not think about how their abstractions will be used—this would be far too much effort. Likewise, the users of abstractions need not think about how their abstractions work—this would be far too much effort. And now we have a nice, neatly wrapped package we can use and reuse to build all kinds of applications—after all, duplicating it would be far too much effort and goes against everything we consider sacred as programmers.

It usually works like this: in order to solve a problem, a programmer first needs to solve a sub-problem. This sub-problem is significant enough in complexity or occurs frequently enough in practice that the programmer doesn’t want to solve it for the specific case—an abstraction is born. Now, this can go one of two ways. Either the abstraction is rock solid and the programmer never has to think about the mundane details again (think writing loops instead of writing a bunch of jmp statements)—success—or the abstraction is leaky because the underlying problem is sufficiently complex (think distributed database transactions). Infinite sadness.

It’s kind of a cruel irony. The programmer complains that there’s not enough abstraction for a hard sub-problem. Indeed, the programmer doesn’t care about solving the sub-problem. They are focused on solving the greater problem at hand. So, as any good programmer would do, we build an abstraction for the hard sub-problem, mask its inner workings, and expose what we think is necessary for interacting with it. But then we discover that the abstraction leaks and complain that it isn’t perfect. It turns out, hard problems are hard. The programmer then simply does away with the abstraction and solves the sub-problem for their specific case, handling the complexity in a way that makes sense for their application.

Abstraction doesn’t magically make things less hard. It just attempts to hide that fact from you. Just because the semantics are simple doesn’t mean the solution is. In fact, it’s often the opposite, yet this seems to be a frequently implied assumption.

Duplication is far cheaper than the wrong abstraction. Just deciding which little details we need to expose on our abstractions can be difficult, particularly when we don’t know how they will be used. The truth is, we can’t know how they will be used because some of the uses haven’t even been conceived yet. Abstraction is a delicate balance of precision and granularity. To quote Dijkstra:

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

But, as we know, requirements are fluid. Too precise and we lose granularity, hindering our ability to adapt in the future. Too granular and we weaken the abstraction. But a strong abstraction for a hard problem isn’t really strong at all when it leaks.

The key takeaway is that abstractions leak, and we have to deal with that. There is never a silver bullet for problems of sufficient complexity. Peter ends his talk on a polemic against the way we currently view abstraction:

[Let’s] not make concrete, static abstractions. Trust ourselves to let ourselves peer below the facade. There’s a lot of complexity down there, but we need to engage with that complexity. We need tools that help us engage with the complexity, not a fire blanket. Abstractions are going to leak, so make the abstractions fluid.

Abstraction, in and of itself, is not harmful. On the contrary, it’s necessary for progress. What’s harmful is relying on impenetrable barriers to protect our precious programmers from hard problems. After all, the 21st century engineer understands that in order to play in the sand, we all need to be comfortable getting our feet a little wet from time to time.