Brave New Geek

June 30, 2017

You Cannot Have Exactly-Once Delivery Redux

A couple years ago I wrote You Cannot Have Exactly-Once Delivery. It stirred up quite a bit of discussion and was even referenced in a book, which I found rather surprising considering I’m not exactly an academic. Recently, the topic of exactly-once delivery has again become a popular point of discussion, particularly with the release of Kafka 0.11, which introduces support for idempotent producers, transactional writes across multiple partitions, and—wait for it—exactly-once semantics.

Naturally, when this hit Hacker News, I received a lot of messages from people asking me, “what gives?” There’s literally a TechCrunch headline titled, Confluent achieves holy grail of “exactly once” delivery on Kafka messaging service (Jay assures me, they don’t write the headlines). The myth has been disproved!

First, let me say what Confluent has accomplished with Kafka is an impressive achievement and one worth celebrating. They made a monumental effort to implement these semantics, and it paid off. The intention of this post is not to minimize any of that work but to try to clarify a few key points and hopefully cut down on some of the misinformation and noise.

“Exactly-once delivery” is a poor term. The word “delivery” is overloaded. Frankly, I think it’s a marketing word. The better term is “exactly-once processing.” Some call the distinction pedantic, but I think it’s important and there is some nuance. Kafka did not solve the Two Generals Problem. Exactly-once delivery, at the transport level, is impossible. It doesn’t exist in any meaningful way and isn’t all that interesting to talk about. “We have a word for infinite packet delay—outage,” as Jay puts it. That’s why TCP exists, but TCP doesn’t care about your application semantics. And in the end, that’s what’s interesting—application semantics. My problem with “exactly-once delivery” is it assumes too much, which causes a lot of folks to make bad assumptions. “Delivery” is a transport semantic. “Processing” is an application semantic.

All is not lost, however. We can still get correct results by having our application cooperate with the processing pipeline. This is essentially what Kafka does, exactly-once processing, and Confluent makes note of that in the blog post towards the end. What does this mean?

To achieve exactly-once processing semantics, we must have a closed system with end-to-end support for modeling input, output, and processor state as a single, atomic operation. Kafka supports this by providing a new transaction API and idempotent producers. Any state changes in your application need to be made atomically in conjunction with Kafka. You must commit your state changes and offsets together. It requires architecting your application in a specific way. State changes in external systems must be part of the Kafka transaction. Confluent’s goal is to make this as easy as possible by providing the platform around Kafka with its streams and connector APIs. The point here is it’s not just a switch you flip and, magically, messages are delivered exactly once. It requires careful construction, application logic coordination, isolating state change and non-determinism, and maintaining a closed system around Kafka. Applications that use the consumer API still have to do this themselves. As Neha puts it in the post, it’s not “magical pixie dust.” This is the most important part of the post and, if it were up to me, would be at the very top.

Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well. If you are using the consumer API, this means ensuring that you commit changes to your application state concordant with your offsets as described here.

Side effects into downstream systems with no support for idempotency or distributed transactions make this really difficult in practice I suspect. The argument is that most people are using relational databases that support transactions, but I think there’s still a reasonably large, non-obvious assumption here. Making your event processing atomic might not be easy in all cases. Moreover, every part in your system needs to participate to ensure end-to-end, exactly-once semantics.

Several other messaging systems like TIBCO EMS and Azure Service Bus have provided similar transactional processing guarantees. Kafka, as I understand it, attempts to make it easier and with less performance overhead. That’s a great accomplishment.

What’s really worth drawing attention to is the effort made by Confluent to deliver a correct solution. Achieving exactly-once processing, in and of itself, is relatively “easy” (I use that word loosely). What’s hard is dealing with the range of failures. The announcement shows they’ve done extensive testing, likely much more than most other systems, and have shown that it works and with minimal performance impact.

Kafka provides exactly-once processing semantics because it’s a closed system. There is still a lot of difficulty in ensuring those semantics are maintained across external services, but Confluent attempts to ameliorate this through APIs and tooling. But that’s just it: it’s not exactly-once semantics in a building block that’s the hard thing, it’s building loosely coupled systems that agree on the state of the world. Nevertheless, there is no holy grail here, just some good ole’ fashioned hard work.

Special thanks to Jay Kreps and Sean T. Allen for their feedback on an early draft of this post. Any inaccuracies or opinions are mine alone.

Follow @tyler_treat

June 29, 2017June 30, 2017

Smart Endpoints, Dumb Pipes

I read an interesting article recently called How do you cut a monolith in half? There are a lot of thoughts in the article that resonate with me and some that I disagree with, prompting this response.

The overall message of the article is don’t use a message broker to break apart a monolith because it’s like a cross between a load balancer and a database, with the disadvantages of both and the advantages of neither. The author argues that message brokers are a popular way to pull apart components over a network because they have low setup cost and provide easy service discovery, but they come at a high operational cost. My response to that is the same advice the author puts forward: it depends.

I think it’s important not to conflate “message broker” and “message queue.” The article uses them interchangeably, but it’s really talking about the latter, which I see as a subset of the former. Queues provide, well, queuing semantics. They try to ensure delivery of a message or, more generally, distribution of work. As the author puts it: “In practice, a message broker is a service that transforms network errors and machine failures into filled disks.” Replace “broker” with “queue” and I agree with this statement. This is really describing systems like RabbitMQ, Amazon SQS, TIBCO EMS, IronMQ, and maybe even Kafka fits into that category.

People are easily seduced by “fat” middleware—systems with more features, more capabilities, more responsibilities—because they think it makes their lives easier, and it might at first. Pushing off more responsibility onto the infrastructure makes the application simpler, but it also makes the infrastructure more complex, more fragile, and more slow. Take exactly-once message delivery, for example. Lots of people want it, but the pursuit of it introduces a host of complexity, overhead (in terms of development, operations, and performance), and risk. The end result is something that, in addition to these things, requires all downstream systems to not introduce duplicates and be mindful about their side effects. That is, everything in the processing pipeline must be exactly-once or nothing is. So typically what you end up with is an approximation of exactly-once delivery. You make big investments to lower the likelihood of duplicates, but you still have to deal with the problem. This might make sense if the cost of having duplicates is high, but that doesn’t seem like the common case. My advice is to always opt for the simple solution. We tend to think of engineering challenges as technical problems when, in reality, they’re often just mindset problems. Usually the technical problems have already been solved if we can just adjust our mindset.

There are a couple things to keep in mind here. The first thing to consider is simply capability lock-in. As you push more and more logic off onto more and more specialized middleware, you make it harder to move off it or change things. The second is what we already hinted at. Even with smart middleware, problems still leak out and you have to handle them at the edge—you’re now being taxed twice. This is essentially the end-to-end argument. Push responsibility to the edges, smart endpoints, dumb pipes, etc. It’s the idea that if you need business-level guarantees, build them into the business layer because the infrastructure doesn’t care about them.

The article suggests for short-lived tasks, use a load balancer because with a queue, you’ll end up building a load balancer along with an ad-hoc RPC system, with extra latency. For long-lived tasks, use a database because with a queue, you’ll be building a lock manager, a database, and a scheduler.

A persistent message queue is not bad in itself, but relying on it for recovery, and by extension, correct behaviour, is fraught with peril.

So why the distinction between message brokers and message queues? The point is not all message brokers need to be large, complicated pieces of infrastructure like most message queues tend to be. This was the reason I gravitated towards NATS while architecting Workiva’s messaging platform and why last month I joined Apcera to work on NATS full time.

When Derek Collison originally wrote NATS it was largely for the reasons stated in the article and for the reasons I talk about frequently. It was out of frustration with the current state of the art. In my opinion, NATS was the first system in the space that really turned the way we did messaging on its head (outside of maybe ZeroMQ). It didn’t provide any strong delivery guarantees, transactions, message persistence, or other responsibilities usually assumed by message brokers (there is a layer that provides some of these things, but it’s not baked into the core technology). Instead, NATS prioritized availability, simplicity, and performance over everything else. A simple technology in a vast sea of complexity (my marketing game is strong).

NATS is no-frills pub/sub. It solves the problem of service discovery and work assignment, assumes no other responsibilities, and gets out of your way. It’s designed to be easy to use, easy to operate, and add minimal latency even at scale so that, unlike many other brokers, it is a good way to integrate your microservices. What makes NATS interesting is what it doesn’t do and what it gains by not doing them. Simplicity is a feature—the ultimate sophistication, according to da Vinci. I call it looking at the negative space.

The article reads:

A protocol is the rules and expectations of participants in a system, and how they are beholden to each other. A protocol defines who takes responsibility for failure.

The problem with message brokers, and queues, is that no-one does.

NATS plays to the strengths of the end-to-end principle. It’s a dumb pipe. Handle failures and retries at the client and NATS will do everything it can to remain available and fast. Don’t rely on fragile guarantees or semantics. Instead, face complexity head-on. The author states what you really want is request/reply, which is one point I disagree on. RPC is a bad abstraction for building distributed systems. Use simple, versatile primitives and embrace asynchrony and messaging.

So yes, be careful about relying on message brokers. How smart should the pipes really be? More to the point, be careful about relying on strong semantics because experience shows few things are guaranteed when working with distributed systems at scale. Err to the side of simple. Make few assumptions of your middleware. Push work out of your infrastructure and to the edges if you care about performance and scalability because nothing is harder to scale (or operate) than slow infrastructure that tries to do too much.

Follow @tyler_treat

June 3, 2017April 26, 2018

Engineering Empathy

This was a talk I gave at an internal R&D conference my last week at Workiva. I got a lot of positive feedback on the talk, so I figured I would share it with a wider audience. Be warned: it’s long. Feel free to read each section separately, though they largely tie together.

Why do you work where you work? For many in tech, the answer is probably culture. When you tell a friend about your job, the culture is probably the first thing you describe. It’s culture that can be a company’s biggest asset—and its biggest downfall. But what is it?

Culture isn’t a list of values or a mission statement. It’s not a casual dress code or a beer fridge. Culture is what you reward and what you don’t. More importantly, it’s what you reward and what you punish. That’s an important distinction to make because when you don’t punish behavior that’s inconsistent with your culture, you send a message: you don’t care about it.

So culture is what you live day in and day out. It’s not what you say, it’s what you do. Put yourself in the shoes of a new hire at your company. A new hire doesn’t walk in automatically knowing your culture. They walk in—filled with anxiety—hoping for success, fearing failure, and they look around. They observe their environment. They see who is succeeding, and they try to emulate that behavior. They see who is failing, and they try to avoid that behavior. They ask the question: what makes someone successful here?

Jim Rohn was credited with saying we are the average of the five people we spend the most time with, and I think there’s a lot of truth to this idea. The people around you shape who you are. They shape your behavior, your habits, your thoughts, your opinions, your worldview. Culture is really a feedback loop, and we all contribute to it. It’s not mandated or dictated down (or up through grassroots programs), it’s emergent, but we have to be deliberate about our behaviors because that’s what shapes our culture (note that this doesn’t mean culture doesn’t start with leadership—it most certainly does).

In my opinion, a strong engineering culture is comprised of three parts: the right people, the right processes, and the right priorities. The right people means people that align with and protect your values. Processes are how you execute—how you communicate, develop, deliver, etc. And priorities are your values—the skills or behaviors you value in fellow employees. It’s also your vision. These help you to make decisions—what gets done today and what goes to the bottom of the list. For example, many companies claim a customer-first principle, but how many of them actually use it to drive their day-to-day decisions? This is the difference between a list of values and a culture.

What about technology? Experience? Customer relationships? These are all important competitive advantages, but I think they largely emerge from having the right people, processes, and priorities. The three are deeply intertwined. A culture is the unique combination of processes and values within an organization, and it’s those processes and values that enable you to replicate your success. I’m a big fan of Reed Hastings’ Netflix culture slide deck, but there are some things with which I fundamentally disagree. Hastings says, “The more talent density you have the less process you need. The more process you create the less talent you retain.” This is wrong on a number of levels, which I will talk about later.

Empathy is the common thread throughout each of these three areas—the people, the processes, and the priorities—and we’ll see how it applies to each.

The Ultimate Complex System: People

We largely think about software development as a purely technical feat, one which requires skill and creativity and ingenuity. I think for many, it’s why we became engineers in the first place. We like solving problems. But when all is said and done, computers do what you tell them to do. Computers don’t have opinions or biases or agendas or egos. The technical challenges are really a small part of any sufficiently complicated piece of software. Having been an individual contributor, tech lead, and manager, I’ve come to the realization that it’s people that is the ultimate complex system.

There’s a quote from The Five Dysfunctions of a Team which I’ve referenced before on this blog that I think captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” Teamwork is powerful is intuitive, but teamwork is rare is more profound. Teamwork is a competitive advantage because it’s rare—that’s a pretty strong statement. Let’s unpack it a bit.

It’s the difference between technical architecture and social architecture. We tend to focus on the former while neglecting the latter, but software engineering is more about collaboration than code. It wasn’t until I became a manager that I realized good managers are force multipliers, but social architecture is everyone’s responsibility. Remember, culture is a feedback loop which we all contribute to, so everyone is a social architect.

A key part of social architecture is communication empathy. Back in the 90’s, an evolutionary psychologist by the name of Robin Dunbar proposed the idea that humans can only maintain about 150 stable social relationships. The limit is referred to as Dunbar’s number. He drew a correlation between primate brain size and the average size of cohesive social groups. Informally, Dunbar describes this as “the number of people you would not feel embarrassed about joining uninvited for a drink if you happened to bump into them in a bar.” This number includes all of the relationships in your life, both personal and professional, past and present.

What’s interesting about Dunbar’s number is how it applies to our jobs as software engineers and the interplay with cognitive biases we have as humans. When you don’t understand what someone else does, you’re automatically biased against them. In your head, you’re king. You understand exactly what you do, what your job is, the value that you add, but outside your head—outside your Dunbar’s number—those people are all a mystery. And humans have this funny tendency to mock what they don’t understand. We have lots of these cognitive biases.

Building those types of stable relationships is really hard when you can’t just walk over to someone’s desk and talk to them face-to-face. This was something we struggled a lot with at Workiva, being a company of a few hundred engineers spread out across 11 or so offices. Not only is split-brain an inevitability, but just a general lack of rapport. That stage is super hard to go through for a lot of companies—going from dozens of engineers to hundreds in a few short years. The cruel irony is no matter how agile a company claims to be, the culture—of which structure and processes are a part of—is usually the slowest to adjust. No longer is decision making done by standing up and telling a roomful of people, “I’m going to do this. Please tell me now if that’s a bad idea.” And no longer are decisions often made unanimously and rapidly.

Building stable relationships is much harder without the random hallway error correction. The right people aren’t always bumping into each other at the right time, whereas before a lot of decisions could be made just-in-time. Instead, communication has to be more deliberate and no longer organic. Decisions are made by the Jeff Bezos philosophy of disagree and commit. But nevertheless, building empathy is hard without face-to-face communication, and you miss out on a lot of the nuance of communication.

Again, communication is highly nuanced, and nuance is hard to convey over HipChat. The role of emotions plays a big part. Imagine the situation where you’re walking down a sidewalk or a hallway and someone accidentally walks in front of you. You might sidestep or do a little dance to get around each other, smile or nod, and get on with your day. This minor inconvenience becomes almost this tiny, pleasant interaction between two people. Now, take the same scenario but between two drivers, and you probably have some kind of road rage type situation. The only real difference is the steel cage surrounding the drivers blocking out the verbal and non-verbal communication. How many times has a HipChat conversation gone completely off the rails only to be resolved by a quick Google Hangout or in-person conversation? It’s the exact same thing.

One last note with respect to cognitive biases: once again, humans have a funny tendency where, in a vacuum of information, people will create their own. We will manufacture information just to fill the void, and often it’s not just creating information but taking information we’ve heard somewhere else and applying it to our own—and often the wrong—context. The extreme example of this is “We heard microservices worked well for Netflix, so we should use them at our growing startup” or “Google does monorepo, so we should too.” You’re ignoring all of the context—the path those companies followed to reach those points, the trade-offs made, the organizational differences and competencies, the pain points. When the cognitive biases and opinions we have as humans are added in, the problems amplify and compound. You get a frustrating game of telephone.

You can’t kill The Grapevine anymore than you can change human nature, so you have to address it head-on where and when you can. Don’t allow the vacuum of information to form in the first place, and be cognizant of when you’re applying information you’ve heard from someone else. Is the problem the same? Do you have the same amount of information? Are your constraints the same? What is the full context?

I tend to look at communication through two different lenses: push and pull. In order to be an empathetic communicator, I think it’s important to look at these while thinking about the cognitive biases discussed earlier, starting with push.

I’ve been in meetings where someone called out someone else as a “blocker”, and there was visible wincing in the room. I think for some, the word probably triggers some sort of PTSD. When you’re depending on something that’s outside of your direct control to get something else done, it’s hard not to drop the occasional B-word. It happens out of frustration. It happens because you’re just trying to ship. It also happens because you don’t want to look bad. And it might seem innocuous in the moment, but it has impact. By invoking it, you are tempting fate.

There’s a really good book that was written way back in 1944 called The Unwritten Laws of Engineering. It was published by the American Society of Mechanical Engineers and the language in the book is quite dated (parts of it read like a crotchety old guy yelling at kids to get off his lawn), but the ideas in the book still apply very much today and even to software engineering. It’s really about people, and fundamentally, people don’t change. One of the ideas in the book is this notion which I call communication impact. I’ve taken a quote from the book which I think highlights this idea (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved. A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

I see this a lot. Not just in emails (née “memos”) but in meetings or reviews, someone will—inadvertently or not—throw someone else under the proverbial bus, i.e. “broadcasting memorandum containing damaging or embarrassing statements”, “stepping too heavily upon someone’s toes”, or “revealing a serious shortcoming on somebody’s part.” The problem with this is, by doing it, you immediately put the other party on the defensive and also create a cognitive bias for everyone else in the room. You create a negative predisposition, which may or may not be warranted, toward them. Similarly, I liken “if it concerns manufacturing or customer difficulties” to production postmortems. This is why they need to be blameless. Why is it that retros on production issues are blameless while, at the same time, the development process is full of blame-assigning? It might seem innocuous, but your communication has impact. Push with respect and under the assumption the other person is probably doing the right thing. Don’t be willing to throw anyone under that bus. Likewise, be quick to take responsibility but slow to assign it. Don’t be willing to practice Cover Your Ass Engineering.

I’ve been in meetings where someone would get called out as a blocker, literally articulated in that way, and the person wasn’t even aware they were blocking anything. I’ve seen people create JIRA tickets on another team’s board and then immediately call them blockers. It’s important to call out dependencies ahead of time, and when someone is “blocking” your progress, speak to them about it individually and before it reaches a critical point. No one should be getting caught off guard by these things. Be careful about how and where you articulate these types of problems.

On the same topic of communication impact, I’ve seen engineers develop detailed and extravagant plans like “We’re going to move the entire company to a monorepo while simultaneously switching from Git to Mercurial” or “We’re going to build our own stream-processing framework from the ground up”, and then distribute them widely to the organization (“wide distribution” as referenced in The Unwritten Laws of Engineering passage above). The proposals are usually well-intentioned and maybe even compelling sometimes, but it’s the way in which they are communicated that is problematic. Recall The Grapevine: people see it, assume it’s reality, and then spread misinformation. “Did you know the company is switching to Mercurial?”

An effective way to build rapport between teams is genuinely celebrating the successes of other teams, even the small ones. I think for many organizations, it’s common to celebrate victories within a team—happy hour for shipping a new feature or a team outing for signing a major account—but celebrating another team’s win is more rare, especially when a company grows in size. The operative word is “genuine” though. Don’t just do it for the sake of doing it, be genuine about it. This is a compelling way to build the stable relationships needed to unlock the rarity of teamwork described earlier.

Equally important to understanding communication impact is understanding decision impact. I’ve already written about this, so I’ll keep it brief: your decisions impact others. How does adopting X affect Operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? Doing something might save you time, but does it create work or slow others down?

Teams operate in a way that minimizes the amount of pain they feel. It’s a natural instinct and a phenomenon I call pain displacement. Pain-driven development is following the path of least resistance. By doing this, we end up moving the pain somewhere else or deferring it until later (i.e. tech debt). Where the problems start to happen is when multiple teams or functions are involved. This is when the political and other organizational issues start to seep in. Patrick Lencioni, author of The Five Dysfunctions of a Team, has a book that touches on this subject called Silos, Politics, and Turf Wars.

I believe the solution is multifaceted. First, teams need to think holistically, widening their vision beyond the deliverable immediately in front of them. They need to have a sense of organizational awareness. Second, teams—and especially leaders—need to be able to take off their job’s “hat” periodically in order to solve a shared problem. Lencioni observes that much of what causes organizational dysfunction is siloing, and this typically stems from strong intra-team loyalties. For example, within an engineering organization you might have development, operations, QA, security, and other functional teams. Empathy is being able to look at something through someone else’s perspective, and this requires removing your functional hat from time to time. Lastly, teams need to be able to rally around a common cause. This is a shared, compelling vision that motivates and mobilizes people and helps break down the silos. A shared vision aligns teams and enables them to work more autonomously. This is how decisions get made.

Pull communication is pretty much just how to ask questions without making people hate you, a skill that is very important to be an effective and empathetic communicator.

The single most common communication issue I see in engineering organizations is The XY Problem. It’s when someone focuses on a particular solution to their problem instead of describing the problem itself.

User wants to do X.
User doesn’t know how to do X, but thinks they can fumble their way to a solution if they can just manage to do Y.
User doesn’t know how to do Y either.
User asks for help with Y.
Others try to help user with Y, but are confused because Y seems like a strange problem to want to solve.
After much interaction and wasted time, it finally becomes clear that the user really wants help with X, and that Y wasn’t even a suitable solution for X.

The problem occurs when people get stuck on what they believe is the solution and are unable to step back and explain the issue in full. The solution to The XY Problem is simple: always provide the full context of what you’re trying to do. Describe the problem, don’t just prescribe the solution.

Part of being an effective communicator is being able to extract information from people and getting help without being a mental and emotional drain. This is especially true when it comes to debugging. I often see this “murder-mystery debugging” where someone basically tries to push off the blame for something that’s wrong with their code onto someone or something else. This flies in the face of the principle discussed earlier—be quick to take responsibility and slow to assign it. The first step when it comes to debugging anything is assume it’s your fault by default. When you run some code you’re writing and the compiler complains, you don’t blame the compiler, you assume you screwed up. It’s just taking this same mindset and applying it to everything else that we do.

And when you do need to seek help from others—just like with The XY Problem—provide as much context as possible. So much of what I see is this sort of information trickle, where the person seeking help drips information to the people trying to provide it. Don’t make it an interrogation. Lastly, provide a minimal working example that reproduces the problem. Don’t make people build a massive project with 20 dependencies just to reproduce your bug. It’s such a common problem for Stack Overflow that they actually have a name for it: MCVE—Minimal, Complete, Verifiable Example. Do your due diligence before taking time out of someone else’s day because the only thing worse than a bug report is a poorly described, hastily written accusation.

Another thing I see often is swoop-and-poop engineering. This is when someone comes to you with something they need help with—maybe a bug in a library you own, a feature request, something along these lines (this is especially true in open source). They have a sense of urgency; they say it either explicitly or just give off that vibe. You offer to setup a meeting to get more information or work through the problem with them only to find they aren’t available or willing to set aside some time with you. They’re heads down on “something more important,” yet their manager is ready to bite your head off weeks later, often without any documentation or warning. They’ve effectively dumped this on you, said the world’s on fire, and left as quickly as they came. You’re left confused and disoriented. You scratch your head and forget about it, then days or weeks later, they return, horrified that the world is still burning. I call these drive-by questions.

First, it’s important to have an appropriate sense of urgency. If you’re not willing to hop on a Hangout to work through a problem or provide additional information, it’s probably not that important, especially if you can’t even take the time to follow up. With few exceptions, it’s not fair to expect a team to drop everything they’re doing to help you at a moment’s notice, but if they do, you need to meet them halfway. It’s essential to realize that if you’re piling onto a team, others probably are too. If you submit a ticket with another team and then turn around and immediately call it a blocker, that just means you failed to plan accordingly. Having empathy is being cognizant that every team has its own set of priorities, commitments, and work that it’s juggling. By creating that ticket and calling it a blocker, you’re basically saying none of that stuff matters as much. Empathy is understanding that shit rolls downhill. For those who find themselves facing drive-by questions: document everything and be proactive about communicating.

There’s a really good essay by Eric Raymond called How To Ask Questions The Smart Way. It’s something I think every engineer should read and take to heart. My number one pet peeve is Help Vampires. These are people who refuse to take the time to ask coherent, specific questions and really aren’t interested in having their questions answered so much as getting someone else to do their work. They ask the same, tired questions over and over again without really retaining information or thinking critically. It’s question, answer, question, answer, question, answer, ad infinitum.

This is often a hard-earned lesson for junior engineers, but it’s an important one: when you ask a question, you’re not entitled to an answer, you earn the answer. Hasty sounding questions get hasty answers. As engineers, we should not operate like a tech support hotline that people call when their internet stops working. We need to put in a higher level of effort. We need to apply our technical and problem-solving aptitude as engineers. This is the only way you can scale this kind of support structure within an engineering organization. If teams are just constantly bombarding each other with low-effort questions, nothing will get done and people will get burnt out.

Avoid being a Help Vampire. Before asking a question, do your due diligence. Think carefully about where to ask your question. If it’s on HipChat, what is the appropriate room in which to ask? Also be mindful of doing things like @all or @here in a large room. Doing that is like walking into a crowded room, throwing your hands up in the air, and shouting at everyone to look at you. Be precise and informative about your problem, but also keep in mind that volume is not precision. Just dumping a bunch of log messages is noise. Don’t rush to claim that you’ve found a bug. As a first step, take responsibility. And just like with The XY Problem, describe the goal, not the step you took—describe vs. prescribe. Lastly, follow up on the solution. Everyone has been in this situation: you’ve found someone that asked the exact same question as you only to find they never followed up with how they fixed it. Even if it’s just in HipChat or Slack, drop a note indicating the issue was resolved and what the fix was so others can find it. This also helps close the loop when you’ve asked a question to a team and they are actively investigating it. Don’t leave them hanging.

In many ways, being an empathetic communicator just comes down to having self-awareness.

Codifying Values and Priorities: Processes

“Process” has a lot of negative connotations associated with it because it usually becomes this thing done on ceremony. But “process” should be a means of documenting and codifying your values. This is why I disagree with the Reed Hastings quote about process from earlier. Process is about repeatability and error correction. Camille Fournier’s new book on engineering management, The Manager’s Path, has a great section on “bootstrapping culture.” I particularly like the way she frames organizational structure and process:

When talking about structure and process with skeptics, I try to reframe the discussion. Instead of talking about structure, I talk about learning. Instead of talking about process, I talk about transparency. We don’t set up systems because structure and process have inherent value. We do it because we want to learn from our successes and our mistakes, and to share those successes and encode the lessons we learn from failures in a transparent way. This learning and sharing is how organizations become more stable and more scalable over time.

When a process “feels” wrong, it’s probably because it doesn’t reflect your organization’s values. For example, if a process feels heavy, it’s because you value velocity. If a process feels rigid, it’s because you value agility. If a process feels risky, it’s because you value safety. We have a hard time articulating this so instead it becomes “process is bad.”

Somewhere along the line, someone decides to document how stuff gets done. Things get standardized. Tools get made. Processes get established. But process becomes dogma when it’s interpreted as documentation of how rather than an explanation of why. Processes should tell the story of an organization: here’s what we value, here’s why we value it, and here’s how we protect and scale those values. The story is constantly evolving, so processes should be flexible. They shouldn’t be set in stone.

Michael Lopp’s book Managing Humans also provides a useful perspective on culture:

It’s entirely possible that too much process or the wrong process is developed during this build-out, but when this inevitable debate occurs, it should not be about the process. It’s a debate about values. The first question isn’t, “Is this a good, bad, or efficient process?” The first question is, “How does this process reflect our values?”

Processes should be traceable back to values. Each process should have a value or set of values associated with it. Understanding the why helps to develop empathy. It’s the difference between “here’s how we do things” and “here’s why we do things.” It’s much harder to develop a sense of empathy with just the how.

What We Value: Priorities

As engineers, we need to be curious. We need to have a “let’s go see!” attitude. When someone comes to you with a question—and hopefully it’s a well-formulated question based on the earlier discussion—your first reaction should be, “let’s go see!” Use it as an opportunity for both of you to learn. Even if you know the answer, sometimes it’s better to show, not tell, and as the person asking the question, you should be eager to learn. This is the reason I love Julia Evans’ blog so much. It’s oozing with wonder, curiosity, and intrigue. Being an engineer should mean having an innate curiosity. It’s not throwing up your hands at the first sign of an API boundary and saying, “not my problem!” It’s a willingness to roll up your sleeves and dig in to a problem but also a capacity for knowing how and when to involve others. Figure out what you don’t know and push beyond it.

Be humble. There’s a book that was written in the 70’s called The Psychology of Computer Programming, and it’s interesting because it focuses on the human elements of software development rather than the purely technical ones that we normally think about. In the book, it presents The 10 Commandments of Egoless Programming, which I think contain a powerful set of guiding principles for software engineers:

Understand and accept that you will make mistakes.
You are not your code.
No matter how much “karate” you know, someone else will always know more.
Don’t rewrite code without consultation.
Treat people who know less than you with respect, deference, and patience.
The only constant in the world is change. Be open to it and accept it with a smile.
The only true authority stems from knowledge, not from position.
Fight for what you believe, but gracefully accept defeat.
Don’t be “the coder in the corner.”
Critique code instead of people—be kind to the coder, not to the code.

Part of being a humble engineer is giving away all the credit. This is especially true for leaders or managers. A manager I once had put it this way: “As a manager, you should never say ‘I’ during a review unless shit went wrong and you’re in the process of taking responsibility for it.” A good leader gives away all the credit and takes all of the blame.

Be engaged. Coding is actually a very small part of our job as software engineers. Our job is to be engaged with the organization. Engage with stakeholder meetings and reviews. Engage with cross-trainings and workshops. Engage with your company’s engineering blog. Engage with other teams. Engage with recruiting and company outreach through conferences or meetups. People dramatically underestimate the value of developing their network, both to their employer and to themselves. You don’t have to do all of these things, but my point is engineers get overly fixated on coding and deliverables. Code is just the byproduct. We’re not paid to write code, we’re paid to add value to the business, and a big part of that is being engaged with the organization.

And of course I’d be remiss not to talk about empathy. Empathy is having a deep understanding of what problems someone is trying to solve. John Allspaw puts it best:

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration into your own work.

Changing your perspective is a powerful way to deepen your relationships. Once again, it comes back to Dunbar’s number: we have a limited number of stable relationships, but developing and maintaining those relationships is the key to figuring out the rarity of teamwork.

A former coworker of mine passed away late last year. In going through some of his old files, we came across some notes he had on leadership. There was one quote in particular that I thought really captured the essence of this post nicely:

All music is made from the same 12 notes. All culture is made from the same five components: behaviors, relationships, attitudes, values, and environment. It’s the way those notes or components are put together that makes things sing.

This is what it takes to build a strong engineering culture and really just a healthy culture in general. The technology and everything else is secondary. It really starts with the people.

Follow @tyler_treat

May 3, 2017April 17, 2018

The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premise to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, of which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

Did you notice that QA roles seem to quietly start going away? So many dev organizations I worked with or know seem to do fine without QA.

— Gwen (Chen) Shapira (@gwenshap) April 6, 2017

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

The model of ops as service providers (cluster admins) is terminal and outmoded. Devs will always out-demand their capacity to supply.

— ✕✕✕✕✕ (@peterbourgon) April 5, 2017

The correct model is ops as force multipliers: building automation to let d provision their own clusters and base infrastructure.

— ✕✕✕✕✕ (@peterbourgon) April 5, 2017

Dev: My cluster is broken!
Op: OK, this is my problem now—please wait while I fix it.
—The wrong model :(

— ✕✕✕✕✕ (@peterbourgon) April 5, 2017

Dev: My cluster is broken!
Op: OK, I’m your domain expert, to help you fix it yourself; or, you can use this tooling to reprovision it.
— :)

— ✕✕✕✕✕ (@peterbourgon) April 5, 2017

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model instead should essentially treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provide APIs for their infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

@peterbourgon I have many thoughts on this subject, the tweet version is: ops as we know it is dead, anyone doing infra has 5 years to move to product.

— Dαve Cheney (@davecheney) April 5, 2017

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west-style of Operations needs to die. Ops is commonly the gatekeeper, and they view themselves as such. Old-school Ops is building in as much process as possible, slowing down development so that when they reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Ops lock-in: When your organization cannot innovate faster than your ops team will allow or willing to support.

— Kelsey Hightower (@kelseyhightower) April 4, 2017

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!

Follow @tyler_treat

April 7, 2017April 10, 2017

Pain-Driven Development: Why Greedy Algorithms Are Bad for Engineering Orgs

I recently wrote about the importance of understanding decision impact and why it’s important for building an empathetic engineering culture. I presented the distinction between pain displacement and pain deferral, and this was something I wanted to expand on a bit.

When you distill it down, I think what’s at the heart of a lot of engineering orgs is this idea of “pain-driven development.” When a company grows to a certain size, it develops limbs, and each of these limbs has its own pain receptors. This is when empathy becomes important because it becomes harder and less natural. These limbs of course are teams or, more generally speaking, silos. Teams have a natural tendency to operate in a way that minimizes the amount of pain they feel.

It’s time for some game theory: pain is a zero-sum game. By always following the path of least resistance, we end up displacing pain instead of feeling it. This is literally just instinct. In other words, by making locally optimal choices, we run the risk of losing out on a globally optimal solution. Sometimes this is an explicit business decision, but many times it’s not.

Tech debt is a common example of when pain displacement is a deliberate business decision. It’s pain deferral—there’s pain we need to feel, but we can choose to feel it later and in the meantime provide incremental value to the business. This is usually a team choosing to apply a bandaid and coming back to fix it later. “We have this large batch job that has a five-minute timeout, and we’re sporadically seeing this timeout getting hit. Why don’t we just bump up the timeout to 10 minutes?” This is a bandaid, and a particularly poor one at that because, by Parkinson’s law, as soon as you bump up the timeout to 10 minutes, you’ll start seeing 11-minute jobs, and we’ll be having the same discussion over again. I see the exact same types of discussions happening with resource provisioning: “we’re hitting memory limits—can we just provision our instances with more RAM?” “We’re pegging CPU. Obviously we just need more cores.” Throwing hardware at the problem is the path of least resistance for the developers. They have a deliverable in front of them, they have a lot of pressure to ship, this is how they do it. It’s a greedy algorithm. It minimizes pain.

Where things become really problematic is when the pain displacement involves multiple teams. This is why understanding decision impact is so key. Pain displacement doesn’t just involve engineering teams, it also involves customers and other stakeholders in the organization. This is something I see quite a bit: displacing pain away from customers onto various teams within the org by setting unrealistic expectations up front.

For example, we build a product MVP and run it on a single, high-memory instance, and we don’t actually write data out to disk to keep it fast. We then put this product in front of sales folks, marketing, or even customers and say “hey, look at this cool thing we built.” Then the customers say “wow, this is great! I don’t feel any pain at all using this!” That’s because the pain has been moved elsewhere.

This MVP isn’t fault tolerant because it’s running on a single machine. This MVP isn’t horizontally scalable because we keep all the state in memory on one instance. This MVP isn’t safe because the data isn’t durably stored to disk. The problem is we weren’t testing at scale, so we never felt any pain until it was too late. So we start working backward to address these issues after the fact. We need to run multiple instances so we can have failover. But wait, now we need stateful request routing to maintain our performance expectations. Does our infrastructure support that? We need a mechanism to split and merge units of work that plays nicely with our autoscaling system to give us a better scale story, avoid hot instances, and reduce excess capacity. But wait, how long will that take to build? We need to attach persistent disks so we can durably store data and keep things fast. But wait, does our cluster provisioning allow for that? Does that even meet our compliance requirements?

The only way you reach this point is by making local decisions without thinking about the trade-offs involved or the fact that what you’ve actually done is simply displaced the pain.

If someone doesn’t feel pain, they have a harder time developing a sense of empathy. For instance, the goal of any good operations team is to effectively put itself out of a job by empowering developers to self-service through tooling and automation. One example of this is infrastructure as code, so an ops team adds a process requiring developers to provision their own infrastructure using CloudFormation scripts. For the ops folks, this is a boon—now they no longer have to labor through countless UIs and AWS consoles to provision databases, queues, and the like for each environment. Developers, on the other hand, were never exposed to that pain, so to them, writing CloudFormation scripts is a new hoop to jump through—setting up infrastructure is ops’ job! They might feel pain now, but they don’t necessarily see the immediate payoff.

A coworker of mine recently posed an interesting question: why do product teams often overlook the need for tools required to support their product in production until after they’ve deployed to production? And while the answer he posits is good, and one I very much agree with—solving a problem and solving the problem of solving problems are two very different problems—my answer is this: pain-driven development. In this case, you’re deferring the pain by hooking up debuggers or SSHing into the box and poking about instead of relying on instrumentation which is what we’re limited to in the field. As long as you’re cognizant of this and know that at some point you will have to feel some pain, it can be okay. But if you’re just displacing pain thinking it’s actually disappearing, you’ll be in for a rude awakening. Remember, it’s a zero-sum game.

I’m looking at this through an infrastructure or operations lens, but this applies everywhere and it cuts both ways. Understanding the why behind something rather than just the how is critical to building empathy. It’s being able to look at a problem through someone else’s perspective and applying that to your own work. Changing your perspective is a powerful way to deepen your relationships. Pain-driven development is intoxicating because it allows us to move fast. It’s a greedy algorithm, but it provides a poor global approximation for large engineering organizations. Thinking holistically is important.

Follow @tyler_treat