Plant Trees Before You Need the Shade

Like humans, companies go through phases. There’s the early seed and development phase. Founders are so preoccupied with a problem they go crazy. They consider solutions and the feasibility of a business. There’s the startup phase, when a business is actually born, and it stumbles towards product/market fit. There’s the growth and scaling phase, as we try to close more and more deals while, at the same time, hiring the right people. If we’re lucky, we reach the later stages. There’s the expansion phase, as we attempt to land and expand or attack new verticals or geographies. This is when things get really interesting—and hard. Who are the right people to hire? What are the right products to build? The formula that got us here almost certainly won’t get us there. Lastly, there’s maturity, which is when the business has really hit its stride. Maybe there’s an exit, and very likely there’s new leadership involved.

Consistent in all of this are two things: culture and capabilities. Culture is the invisible hand inside your organization. It’s your company’s autopilot. Specifically, culture is the unique combination of processes and values within an organization. These processes and values are what enable us to replicate our success. They allow people to make decisions which are in alignment with the goals of the company without having to constantly coordinate with one another.

This also means your culture is derived from your capabilities, what your organization can and cannot do. Clayton Christensen groups these factors into three buckets: resources, processes, and values. Resources are the (mostly) tangible things a company has—people, capital, brands, intellectual property, relationships with customers, manufacturers, distributors, and so forth. Processes are what we do with resources to accomplish the organization’s goals, such as developing products, developing employees, hiring, firing, doing market research, and allocating resources. They take in resources and produce value. Processes help us protect and scale our values by providing a means of documenting and codifying them. These are predominantly intangible things. Finally, values define how a company makes decisions. What goes to the top of the list, and what gets ignored. What gets investment, and what doesn’t. These are our priorities that guide us.

There are a few problems with how leadership tends to view capabilities. There is typically an overemphasis on resources. This happens because in the startup phase, success is largely governed by resources. Notably, people. This is especially true of software startups. I quote this section from The Innovator’s Dilemma frequently:

In the start-up stages of an organization, much of what gets done is attributable to resources—people, in particular. The addition or departure of a few key people can profoundly influence its success. Over time, however, the locus of the organization’s capabilities shifts toward its processes and values. As people address recurrent tasks, processes become defined. And as the business model takes shape and it becomes clear which types of business need to be accorded highest priority, values coalesce. In fact, one reason that many soaring young companies flame out after an IPO based on a single hot product is that their initial success is grounded in resources—often the founding engineers—and they fail to develop processes that can create a sequence of hot products.

In the beginning stages, people drive success. Early engineers and founders (the two are not mutually exclusive) have an itch to scratch that they are so passionate about, they evangelize this crazy grand vision and people get excited. The company is small, focused, passionate, and everyone is working closely together to solve a customer problem they are obsessed with. And, sometimes, it pays off.

Next, leadership starts asking itself, “OK, we shipped this crazy successful product, now how do we grow?” This is where the wheels start to come off. There are three problems that occur.

First, the focus shifts away from solving a problem you’re obsessed with to finding the next big product. These companies fail to find a new product because they are searching for one without passion. They are seeking top-line revenue growth, not pursuing a vision. They are looking at markets through the lens of “here’s where we can make money” and not “here’s where we can solve problems,” and in doing so, they lose sight of the customer. At this stage, they basically have forgotten what got them here.

Second, their processes get in their own way. This seems contradictory given that processes, by their very nature, are meant to facilitate repeatability. If we have good processes in place, we should be able to apply them time and time again, even with different people, and end up with consistent results. But things break down when we apply the same processes to different problems. In essence, they try to find success using what brought them success to begin with, but with a warped perspective. You can look at every startup and they will all have wildly different stories about how they found success, but the one thing they will all have in common is that relentless itch. When we attack a new market, we need to do a reset on the processes and values, not a recycle. Christensen suggests if your company is so deeply entrenched in its processes that this isn’t viable, as is often the case with large enterprises, spin it out into a new venture. Processes are as dynamic as the company itself. They are not a one-size-fits-all deal. Christensen calls this the migration of capabilities.

The factors that define an organization’s capabilities and disabilities evolve over time—they start in resources; then move to visible, articulated processes and values; and migrate finally to culture. As long as the organization continues to face the same sorts of problems that its processes and values were designed to address, managing the organization can be straightforward. But because those factors also define what an organization cannot do, they constitute disabilities when the problems facing the company change fundamentally.

Third, they don’t fully appreciate the nature of dynamic priorities. Software startups in particular often mistakenly attribute their initial success to technology when, in reality, it’s because of people and timing. Aside from the passion, you need people early on who can ship and ship often. These might not be the most technically capable engineers, but they get things done when it matters. Later, you need people who can still ship but while cleaning things up. Lastly, you need maintainers—people who can refine without breaking anything. That’s not to say you have these archetypes exclusively at each stage—you want a balance of people—but it’s similar to how companies commonly need a different type of CTO at different phases or an Interim CTO.

While resources are an essential part of early success, it’s processes and values that will sustain you. However, we tend to overfit on resources because we become biased from that success—investing heavily in technology and innovation and grounding the company’s success in a few key individuals. We also overfit because resources are more visible and measurable. Being deliberate about establishing processes and values—which are derived from vision—helps to overcome this bias, but it needs to happen early and be continually reinforced. We also need to ensure our processes adapt to new problems. The larger and more complex an organization becomes, the harder this gets.

Moreover, we need to be conscientious about which processes matter to us the most. Often the most important capabilities aren’t reflected by the most visible processes—product development or customer service, for example—but in the less visible, background processes that support decisions about where to invest resources. These might include determining how market research is done, how financial and sales projections are drawn from this analysis, how products are conceived, how planning and budgets are negotiated, and so on. These processes are where many companies get their inability to cope with change.

And this is where the breakdown happens: a company has highly capable people—people who have helped shape its success as a startup—but arms them with the wrong processes and values. The result is often a boom followed by a bunch of fizzles as they try to catch lightning in a bottle once more. A compelling vision plants a seed. Strong processes and clear values help that seed to grow. But the shade produced by that seed—our capabilities—is stationary, so when we approach a new challenge, we need to recognize when to start tilling.

Software Is About Storytelling

Software engineering is more a practice in archeology than it is in building. As an industry, we undervalue storytelling and focus too much on artifacts and tools and deliverables. How many times have you been left scratching your head while looking at a piece of code, system, or process? It’s the story, the legacy left behind by that artifact, that is just as important—if not more—than the artifact itself.

And I don’t mean what’s in the version control history—that’s often useless. I mean the real, human story behind something. Artifacts, whether that’s code or tools or something else entirely, are not just snapshots in time. They’re the result of a series of decisions, discussions, mistakes, corrections, problems, constraints, and so on.  They’re the product of the engineering process, but the problem is they usually don’t capture that process in its entirety. They rarely capture it at all. They commonly end up being nothing but a snapshot in time.

It’s often the sign of an inexperienced engineer when someone looks at something and says, “this is stupid” or “why are they using X instead of Y?” They’re ignoring the context, the fact that circumstances may have been different. There is a story that led up to that point, a reason for why things are the way they are. If you’re lucky, the people involved are still around. Unfortunately, this is not typically the case. And so it’s not necessarily the poor engineer’s fault for wondering these things. Their predecessors haven’t done enough to make that story discoverable and share that context.

I worked at a company that built a homegrown container PaaS on ECS. Doing that today would be insane with the plethora of container solutions available now. “Why aren’t you using Kubernetes?” Well, four years ago when we started, Kubernetes didn’t exist. Even Docker was just in its infancy. And it’s not exactly a flick of a switch to move multiple production environments to a new container runtime, not to mention the politicking with leadership to convince them it’s worth it to not ship any new code for the next quarter as we rearchitect our entire platform. Oh, and now the people behind the original solution are no longer with the company. Good luck! And this is on the timescale of about five years. That’s maybe like one generation of engineers at the company at most—nothing compared to the decades or more software usually lives (an interesting observation is that timescale, I think, is proportional to the size of an organization). Don’t underestimate momentum, but also don’t underestimate changing circumstances, even on a small time horizon.

The point is, stop looking at technology in a vacuum. There are many facets to consider. Likewise, decisions are not made in a vacuum. Part of this is just being an empathetic engineer. The corollary to this is you don’t need to adopt every bleeding-edge tech that comes out to be successful, but the bigger point is software is about storytelling. The question you should be asking is how does your organization tell those stories? Are you deliberate or is it left to tribal knowledge and hearsay? Is it something you truly value and prioritize or simply a byproduct?

Documentation is good, but the trouble with documentation is it’s usually haphazard and stagnant. It’s also usually documentation of how and not why. Documenting intent can go a long way, and understanding the why is a good way to develop empathy. Code survives us. There’s a fantastic talk by Bryan Cantrill on oral tradition in software engineering where he talks about this. People care about intent. Specifically, when you write software, people care what you think. As Bryan puts it, future generations of programmers want to understand your intent so they can abide by it, so we need to tell them what our intent was. We need to broadcast it. Good code comments are an example of this. They give you a narrative of not only what’s going on, but why. When we write software, we write it for future generations, and that’s the most underestimated thing in all of software. Documenting intent also allows you to document your values, and that allows the people who come after you to continue to uphold them.

Storytelling in software is important. Without it, software archeology is simply the study of puzzles created by time and neglect. When an organization doesn’t record its history, it’s bound to repeat the same mistakes. A company’s memory is comprised of its people, but the fact is people churn. Knowing how you got here often helps you with getting to where you want to be. Storytelling is how we transcend generational gaps and the inevitable changing of the old guard to the new guard in a maturing engineering organization. The same is true when we expand that to the entire industry. We’re too memoryless—shipping code and not looking back, discovering everything old that is new again, and simply not appreciating our lineage.

Engineering Empathy

This was a talk I gave at an internal R&D conference my last week at Workiva. I got a lot of positive feedback on the talk, so I figured I would share it with a wider audience. Be warned: it’s long. Feel free to read each section separately, though they largely tie together.

Why do you work where you work? For many in tech, the answer is probably culture. When you tell a friend about your job, the culture is probably the first thing you describe. It’s culture that can be a company’s biggest asset—and its biggest downfall. But what is it?

Culture isn’t a list of values or a mission statement. It’s not a casual dress code or a beer fridge. Culture is what you reward and what you don’t. More importantly, it’s what you reward and what you punish. That’s an important distinction to make because when you don’t punish behavior that’s inconsistent with your culture, you send a message: you don’t care about it.

So culture is what you live day in and day out. It’s not what you say, it’s what you do. Put yourself in the shoes of a new hire at your company. A new hire doesn’t walk in automatically knowing your culture. They walk in—filled with anxiety—hoping for success, fearing failure, and they look around. They observe their environment. They see who is succeeding, and they try to emulate that behavior. They see who is failing, and they try to avoid that behavior. They ask the question: what makes someone successful here?

Jim Rohn was credited with saying we are the average of the five people we spend the most time with, and I think there’s a lot of truth to this idea. The people around you shape who you are. They shape your behavior, your habits, your thoughts, your opinions, your worldview. Culture is really a feedback loop, and we all contribute to it. It’s not mandated or dictated down (or up through grassroots programs), it’s emergent, but we have to be deliberate about our behaviors because that’s what shapes our culture (note that this doesn’t mean culture doesn’t start with leadership—it most certainly does).

In my opinion, a strong engineering culture is comprised of three parts: the right people, the right processes, and the right priorities. The right people means people that align with and protect your values. Processes are how you execute—how you communicate, develop, deliver, etc. And priorities are your values—the skills or behaviors you value in fellow employees. It’s also your vision. These help you to make decisions—what gets done today and what goes to the bottom of the list. For example, many companies claim a customer-first principle, but how many of them actually use it to drive their day-to-day decisions? This is the difference between a list of values and a culture.

What about technology? Experience? Customer relationships? These are all important competitive advantages, but I think they largely emerge from having the right people, processes, and priorities. The three are deeply intertwined. A culture is the unique combination of processes and values within an organization, and it’s those processes and values that enable you to replicate your success. I’m a big fan of Reed Hastings’ Netflix culture slide deck, but there are some things with which I fundamentally disagree. Hastings says, “The more talent density you have the less process you need. The more process you create the less talent you retain.” This is wrong on a number of levels, which I will talk about later.

Empathy is the common thread throughout each of these three areas—the people, the processes, and the priorities—and we’ll see how it applies to each.

The Ultimate Complex System: People

We largely think about software development as a purely technical feat, one which requires skill and creativity and ingenuity. I think for many, it’s why we became engineers in the first place. We like solving problems. But when all is said and done, computers do what you tell them to do. Computers don’t have opinions or biases or agendas or egos. The technical challenges are really a small part of any sufficiently complicated piece of software. Having been an individual contributor, tech lead, and manager, I’ve come to the realization that it’s people that is the ultimate complex system.

There’s a quote from The Five Dysfunctions of a Team which I’ve referenced before on this blog that I think captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” Teamwork is powerful is intuitive, but teamwork is rare is more profound. Teamwork is a competitive advantage because it’s rare—that’s a pretty strong statement. Let’s unpack it a bit.

It’s the difference between technical architecture and social architecture. We tend to focus on the former while neglecting the latter, but software engineering is more about collaboration than code. It wasn’t until I became a manager that I realized good managers are force multipliers, but social architecture is everyone’s responsibility. Remember, culture is a feedback loop which we all contribute to, so everyone is a social architect.

A key part of social architecture is communication empathy. Back in the 90’s, an evolutionary psychologist by the name of Robin Dunbar proposed the idea that humans can only maintain about 150 stable social relationships. The limit is referred to as Dunbar’s number. He drew a correlation between primate brain size and the average size of cohesive social groups. Informally, Dunbar describes this as “the number of people you would not feel embarrassed about joining uninvited for a drink if you happened to bump into them in a bar.” This number includes all of the relationships in your life, both personal and professional, past and present.

What’s interesting about Dunbar’s number is how it applies to our jobs as software engineers and the interplay with cognitive biases we have as humans. When you don’t understand what someone else does, you’re automatically biased against them. In your head, you’re king. You understand exactly what you do, what your job is, the value that you add, but outside your head—outside your Dunbar’s number—those people are all a mystery. And humans have this funny tendency to mock what they don’t understand. We have lots of these cognitive biases.

Building those types of stable relationships is really hard when you can’t just walk over to someone’s desk and talk to them face-to-face. This was something we struggled a lot with at Workiva, being a company of a few hundred engineers spread out across 11 or so offices. Not only is split-brain an inevitability, but just a general lack of rapport. That stage is super hard to go through for a lot of companies—going from dozens of engineers to hundreds in a few short years. The cruel irony is no matter how agile a company claims to be, the culture—of which structure and processes are a part of—is usually the slowest to adjust. No longer is decision making done by standing up and telling a roomful of people, “I’m going to do this. Please tell me now if that’s a bad idea.” And no longer are decisions often made unanimously and rapidly.

Building stable relationships is much harder without the random hallway error correction. The right people aren’t always bumping into each other at the right time, whereas before a lot of decisions could be made just-in-time. Instead, communication has to be more deliberate and no longer organic. Decisions are made by the Jeff Bezos philosophy of disagree and commit. But nevertheless, building empathy is hard without face-to-face communication, and you miss out on a lot of the nuance of communication.

Again, communication is highly nuanced, and nuance is hard to convey over HipChat. The role of emotions plays a big part. Imagine the situation where you’re walking down a sidewalk or a hallway and someone accidentally walks in front of you. You might sidestep or do a little dance to get around each other, smile or nod, and get on with your day. This minor inconvenience becomes almost this tiny, pleasant interaction between two people. Now, take the same scenario but between two drivers, and you probably have some kind of road rage type situation. The only real difference is the steel cage surrounding the drivers blocking out the verbal and non-verbal communication. How many times has a HipChat conversation gone completely off the rails only to be resolved by a quick Google Hangout or in-person conversation? It’s the exact same thing.

One last note with respect to cognitive biases: once again, humans have a funny tendency where, in a vacuum of information, people will create their own. We will manufacture information just to fill the void, and often it’s not just creating information but taking information we’ve heard somewhere else and applying it to our own—and often the wrong—context. The extreme example of this is “We heard microservices worked well for Netflix, so we should use them at our growing startup” or “Google does monorepo, so we should too.” You’re ignoring all of the context—the path those companies followed to reach those points, the trade-offs made, the organizational differences and competencies, the pain points. When the cognitive biases and opinions we have as humans are added in, the problems amplify and compound. You get a frustrating game of telephone.

You can’t kill The Grapevine anymore than you can change human nature, so you have to address it head-on where and when you can. Don’t allow the vacuum of information to form in the first place, and be cognizant of when you’re applying information you’ve heard from someone else. Is the problem the same? Do you have the same amount of information? Are your constraints the same? What is the full context?

I tend to look at communication through two different lenses: push and pull. In order to be an empathetic communicator, I think it’s important to look at these while thinking about the cognitive biases discussed earlier, starting with push.

I’ve been in meetings where someone called out someone else as a “blocker”, and there was visible wincing in the room. I think for some, the word probably triggers some sort of PTSD. When you’re depending on something that’s outside of your direct control to get something else done, it’s hard not to drop the occasional B-word. It happens out of frustration. It happens because you’re just trying to ship. It also happens because you don’t want to look bad. And it might seem innocuous in the moment, but it has impact. By invoking it, you are tempting fate.

There’s a really good book that was written way back in 1944 called The Unwritten Laws of Engineering. It was published by the American Society of Mechanical Engineers and the language in the book is quite dated (parts of it read like a crotchety old guy yelling at kids to get off his lawn), but the ideas in the book still apply very much today and even to software engineering. It’s really about people, and fundamentally, people don’t change. One of the ideas in the book is this notion which I call communication impact. I’ve taken a quote from the book which I think highlights this idea (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved. A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

I see this a lot. Not just in emails (née “memos”) but in meetings or reviews, someone will—inadvertently or not—throw someone else under the proverbial bus, i.e. “broadcasting memorandum containing damaging or embarrassing statements”, “stepping too heavily upon someone’s toes”, or “revealing a serious shortcoming on somebody’s part.” The problem with this is, by doing it, you immediately put the other party on the defensive and also create a cognitive bias for everyone else in the room. You create a negative predisposition, which may or may not be warranted, toward them. Similarly, I liken “if it concerns manufacturing or customer difficulties” to production postmortems. This is why they need to be blameless. Why is it that retros on production issues are blameless while, at the same time, the development process is full of blame-assigning? It might seem innocuous, but your communication has impact. Push with respect and under the assumption the other person is probably doing the right thing. Don’t be willing to throw anyone under that bus. Likewise, be quick to take responsibility but slow to assign it. Don’t be willing to practice Cover Your Ass Engineering.

I’ve been in meetings where someone would get called out as a blocker, literally articulated in that way, and the person wasn’t even aware they were blocking anything. I’ve seen people create JIRA tickets on another team’s board and then immediately call them blockers. It’s important to call out dependencies ahead of time, and when someone is “blocking” your progress, speak to them about it individually and before it reaches a critical point. No one should be getting caught off guard by these things. Be careful about how and where you articulate these types of problems.

On the same topic of communication impact, I’ve seen engineers develop detailed and extravagant plans like “We’re going to move the entire company to a monorepo while simultaneously switching from Git to Mercurial” or “We’re going to build our own stream-processing framework from the ground up”, and then distribute them widely to the organization (“wide distribution” as referenced in The Unwritten Laws of Engineering passage above). The proposals are usually well-intentioned and maybe even compelling sometimes, but it’s the way in which they are communicated that is problematic. Recall The Grapevine: people see it, assume it’s reality, and then spread misinformation. “Did you know the company is switching to Mercurial?”

An effective way to build rapport between teams is genuinely celebrating the successes of other teams, even the small ones. I think for many organizations, it’s common to celebrate victories within a team—happy hour for shipping a new feature or a team outing for signing a major account—but celebrating another team’s win is more rare, especially when a company grows in size. The operative word is “genuine” though. Don’t just do it for the sake of doing it, be genuine about it. This is a compelling way to build the stable relationships needed to unlock the rarity of teamwork described earlier.

Equally important to understanding communication impact is understanding decision impact. I’ve already written about this, so I’ll keep it brief: your decisions impact others. How does adopting X affect Operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? Doing something might save you time, but does it create work or slow others down?

Teams operate in a way that minimizes the amount of pain they feel. It’s a natural instinct and a phenomenon I call pain displacement. Pain-driven development is following the path of least resistance. By doing this, we end up moving the pain somewhere else or deferring it until later (i.e. tech debt). Where the problems start to happen is when multiple teams or functions are involved. This is when the political and other organizational issues start to seep in. Patrick Lencioni, author of The Five Dysfunctions of a Team, has a book that touches on this subject called Silos, Politics, and Turf Wars. 

I believe the solution is multifaceted. First, teams need to think holistically, widening their vision beyond the deliverable immediately in front of them. They need to have a sense of organizational awareness. Second, teams—and especially leaders—need to be able to take off their job’s “hat” periodically in order to solve a shared problem. Lencioni observes that much of what causes organizational dysfunction is siloing, and this typically stems from strong intra-team loyalties. For example, within an engineering organization you might have development, operations, QA, security, and other functional teams. Empathy is being able to look at something through someone else’s perspective, and this requires removing your functional hat from time to time. Lastly, teams need to be able to rally around a common cause. This is a shared, compelling vision that motivates and mobilizes people and helps break down the silos. A shared vision aligns teams and enables them to work more autonomously. This is how decisions get made.

Pull communication is pretty much just how to ask questions without making people hate you, a skill that is very important to be an effective and empathetic communicator.

The single most common communication issue I see in engineering organizations is The XY Problem. It’s when someone focuses on a particular solution to their problem instead of describing the problem itself.

  • User wants to do X.
  • User doesn’t know how to do X, but thinks they can fumble their way to a solution if they can just manage to do Y.
  • User doesn’t know how to do Y either.
  • User asks for help with Y.
  • Others try to help user with Y, but are confused because Y seems like a strange problem to want to solve.
  • After much interaction and wasted time, it finally becomes clear that the user really wants help with X, and that Y wasn’t even a suitable solution for X.

The problem occurs when people get stuck on what they believe is the solution and are unable to step back and explain the issue in full. The solution to The XY Problem is simple: always provide the full context of what you’re trying to do. Describe the problem, don’t just prescribe the solution.

Part of being an effective communicator is being able to extract information from people and getting help without being a mental and emotional drain. This is especially true when it comes to debugging. I often see this “murder-mystery debugging” where someone basically tries to push off the blame for something that’s wrong with their code onto someone or something else. This flies in the face of the principle discussed earlier—be quick to take responsibility and slow to assign it. The first step when it comes to debugging anything is assume it’s your fault by default. When you run some code you’re writing and the compiler complains, you don’t blame the compiler, you assume you screwed up. It’s just taking this same mindset and applying it to everything else that we do.

And when you do need to seek help from others—just like with The XY Problem—provide as much context as possible. So much of what I see is this sort of information trickle, where the person seeking help drips information to the people trying to provide it. Don’t make it an interrogation. Lastly, provide a minimal working example that reproduces the problem. Don’t make people build a massive project with 20 dependencies just to reproduce your bug. It’s such a common problem for Stack Overflow that they actually have a name for it: MCVE—Minimal, Complete, Verifiable Example. Do your due diligence before taking time out of someone else’s day because the only thing worse than a bug report is a poorly described, hastily written accusation.

Another thing I see often is swoop-and-poop engineering. This is when someone comes to you with something they need help with—maybe a bug in a library you own, a feature request, something along these lines (this is especially true in open source). They have a sense of urgency; they say it either explicitly or just give off that vibe. You offer to setup a meeting to get more information or work through the problem with them only to find they aren’t available or willing to set aside some time with you. They’re heads down on “something more important,” yet their manager is ready to bite your head off weeks later, often without any documentation or warning. They’ve effectively dumped this on you, said the world’s on fire, and left as quickly as they came. You’re left confused and disoriented. You scratch your head and forget about it, then days or weeks later, they return, horrified that the world is still burning. I call these drive-by questions.

First, it’s important to have an appropriate sense of urgency. If you’re not willing to hop on a Hangout to work through a problem or provide additional information, it’s probably not that important, especially if you can’t even take the time to follow up. With few exceptions, it’s not fair to expect a team to drop everything they’re doing to help you at a moment’s notice, but if they do, you need to meet them halfway. It’s essential to realize that if you’re piling onto a team, others probably are too. If you submit a ticket with another team and then turn around and immediately call it a blocker, that just means you failed to plan accordingly. Having empathy is being cognizant that every team has its own set of priorities, commitments, and work that it’s juggling. By creating that ticket and calling it a blocker, you’re basically saying none of that stuff matters as much. Empathy is understanding that shit rolls downhill. For those who find themselves facing drive-by questions: document everything and be proactive about communicating.

There’s a really good essay by Eric Raymond called How To Ask Questions The Smart Way. It’s something I think every engineer should read and take to heart. My number one pet peeve is Help Vampires. These are people who refuse to take the time to ask coherent, specific questions and really aren’t interested in having their questions answered so much as getting someone else to do their work. They ask the same, tired questions over and over again without really retaining information or thinking critically. It’s question, answer, question, answer, question, answer, ad infinitum.

This is often a hard-earned lesson for junior engineers, but it’s an important one: when you ask a question, you’re not entitled to an answer, you earn the answer. Hasty sounding questions get hasty answers. As engineers, we should not operate like a tech support hotline that people call when their internet stops working. We need to put in a higher level of effort. We need to apply our technical and problem-solving aptitude as engineers. This is the only way you can scale this kind of support structure within an engineering organization. If teams are just constantly bombarding each other with low-effort questions, nothing will get done and people will get burnt out.

Avoid being a Help Vampire. Before asking a question, do your due diligence. Think carefully about where to ask your question. If it’s on HipChat, what is the appropriate room in which to ask? Also be mindful of doing things like @all or @here in a large room. Doing that is like walking into a crowded room, throwing your hands up in the air, and shouting at everyone to look at you. Be precise and informative about your problem, but also keep in mind that volume is not precision. Just dumping a bunch of log messages is noise. Don’t rush to claim that you’ve found a bug. As a first step, take responsibility. And just like with The XY Problem, describe the goal, not the step you took—describe vs. prescribe. Lastly, follow up on the solution. Everyone has been in this situation: you’ve found someone that asked the exact same question as you only to find they never followed up with how they fixed it. Even if it’s just in HipChat or Slack, drop a note indicating the issue was resolved and what the fix was so others can find it. This also helps close the loop when you’ve asked a question to a team and they are actively investigating it. Don’t leave them hanging.

In many ways, being an empathetic communicator just comes down to having self-awareness.

Codifying Values and Priorities: Processes

“Process” has a lot of negative connotations associated with it because it usually becomes this thing done on ceremony. But “process” should be a means of documenting and codifying your values. This is why I disagree with the Reed Hastings quote about process from earlier. Process is about repeatability and error correction. Camille Fournier’s new book on engineering management, The Manager’s Path, has a great section on “bootstrapping culture.” I particularly like the way she frames organizational structure and process:

When talking about structure and process with skeptics, I try to reframe the discussion. Instead of talking about structure, I talk about learning. Instead of talking about process, I talk about transparency. We don’t set up systems because structure and process have inherent value. We do it because we want to learn from our successes and our mistakes, and to share those successes and encode the lessons we learn from failures in a transparent way. This learning and sharing is how organizations become more stable and more scalable over time.

When a process “feels” wrong, it’s probably because it doesn’t reflect your organization’s values. For example, if a process feels heavy, it’s because you value velocity. If a process feels rigid, it’s because you value agility. If a process feels risky, it’s because you value safety. We have a hard time articulating this so instead it becomes “process is bad.”

Somewhere along the line, someone decides to document how stuff gets done. Things get standardized. Tools get made. Processes get established. But process becomes dogma when it’s interpreted as documentation of how rather than an explanation of why. Processes should tell the story of an organization: here’s what we value, here’s why we value it, and here’s how we protect and scale those values. The story is constantly evolving, so processes should be flexible. They shouldn’t be set in stone.

Michael Lopp’s book Managing Humans also provides a useful perspective on culture:

It’s entirely possible that too much process or the wrong process is developed during this build-out, but when this inevitable debate occurs, it should not be about the process. It’s a debate about values. The first question isn’t, “Is this a good, bad, or efficient process?” The first question is, “How does this process reflect our values?”

Processes should be traceable back to values. Each process should have a value or set of values associated with it. Understanding the why helps to develop empathy. It’s the difference between “here’s how we do things” and “here’s why we do things.” It’s much harder to develop a sense of empathy with just the how.

What We Value: Priorities

As engineers, we need to be curious. We need to have a “let’s go see!” attitude. When someone comes to you with a question—and hopefully it’s a well-formulated question based on the earlier discussion—your first reaction should be, “let’s go see!” Use it as an opportunity for both of you to learn. Even if you know the answer, sometimes it’s better to show, not tell, and as the person asking the question, you should be eager to learn. This is the reason I love Julia Evans’ blog so much. It’s oozing with wonder, curiosity, and intrigue. Being an engineer should mean having an innate curiosity. It’s not throwing up your hands at the first sign of an API boundary and saying, “not my problem!” It’s a willingness to roll up your sleeves and dig in to a problem but also a capacity for knowing how and when to involve others. Figure out what you don’t know and push beyond it.

Be humble. There’s a book that was written in the 70’s called The Psychology of Computer Programming, and it’s interesting because it focuses on the human elements of software development rather than the purely technical ones that we normally think about. In the book, it presents The 10 Commandments of Egoless Programming, which I think contain a powerful set of guiding principles for software engineers:

  1. Understand and accept that you will make mistakes.
  2. You are not your code.
  3. No matter how much “karate” you know, someone else will always know more.
  4. Don’t rewrite code without consultation.
  5. Treat people who know less than you with respect, deference, and patience.
  6. The only constant in the world is change. Be open to it and accept it with a smile.
  7. The only true authority stems from knowledge, not from position.
  8. Fight for what you believe, but gracefully accept defeat.
  9. Don’t be “the coder in the corner.”
  10. Critique code instead of people—be kind to the coder, not to the code.

Part of being a humble engineer is giving away all the credit. This is especially true for leaders or managers. A manager I once had put it this way: “As a manager, you should never say ‘I’ during a review unless shit went wrong and you’re in the process of taking responsibility for it.” A good leader gives away all the credit and takes all of the blame.

Be engaged. Coding is actually a very small part of our job as software engineers. Our job is to be engaged with the organization. Engage with stakeholder meetings and reviews. Engage with cross-trainings and workshops. Engage with your company’s engineering blog. Engage with other teams. Engage with recruiting and company outreach through conferences or meetups. People dramatically underestimate the value of developing their network, both to their employer and to themselves. You don’t have to do all of these things, but my point is engineers get overly fixated on coding and deliverables. Code is just the byproduct. We’re not paid to write code, we’re paid to add value to the business, and a big part of that is being engaged with the organization.

And of course I’d be remiss not to talk about empathy. Empathy is having a deep understanding of what problems someone is trying to solve. John Allspaw puts it best:

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration into your own work.

Changing your perspective is a powerful way to deepen your relationships. Once again, it comes back to Dunbar’s number: we have a limited number of stable relationships, but developing and maintaining those relationships is the key to figuring out the rarity of teamwork.

A former coworker of mine passed away late last year. In going through some of his old files, we came across some notes he had on leadership. There was one quote in particular that I thought really captured the essence of this post nicely:

All music is made from the same 12 notes. All culture is made from the same five components: behaviors, relationships, attitudes, values, and environment. It’s the way those notes or components are put together that makes things sing.

This is what it takes to build a strong engineering culture and really just a healthy culture in general. The technology and everything else is secondary. It really starts with the people.

The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premise to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, of which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model instead should essentially treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provide APIs for their infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west-style of Operations needs to die. Ops is commonly the gatekeeper, and they view themselves as such. Old-school Ops is building in as much process as possible, slowing down development so that when they reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!

Pain-Driven Development: Why Greedy Algorithms Are Bad for Engineering Orgs

I recently wrote about the importance of understanding decision impact and why it’s important for building an empathetic engineering culture. I presented the distinction between pain displacement and pain deferral, and this was something I wanted to expand on a bit.

When you distill it down, I think what’s at the heart of a lot of engineering orgs is this idea of “pain-driven development.” When a company grows to a certain size, it develops limbs, and each of these limbs has its own pain receptors. This is when empathy becomes important because it becomes harder and less natural. These limbs of course are teams or, more generally speaking, silos. Teams have a natural tendency to operate in a way that minimizes the amount of pain they feel.

It’s time for some game theory: pain is a zero-sum game. By always following the path of least resistance, we end up displacing pain instead of feeling it. This is literally just instinct. In other words, by making locally optimal choices, we run the risk of losing out on a globally optimal solution. Sometimes this is an explicit business decision, but many times it’s not.

Tech debt is a common example of when pain displacement is a deliberate business decision. It’s pain deferral—there’s pain we need to feel, but we can choose to feel it later and in the meantime provide incremental value to the business. This is usually a team choosing to apply a bandaid and coming back to fix it later. “We have this large batch job that has a five-minute timeout, and we’re sporadically seeing this timeout getting hit. Why don’t we just bump up the timeout to 10 minutes?” This is a bandaid, and a particularly poor one at that because, by Parkinson’s law, as soon as you bump up the timeout to 10 minutes, you’ll start seeing 11-minute jobs, and we’ll be having the same discussion over again. I see the exact same types of discussions happening with resource provisioning: “we’re hitting memory limits—can we just provision our instances with more RAM?” “We’re pegging CPU. Obviously we just need more cores.” Throwing hardware at the problem is the path of least resistance for the developers. They have a deliverable in front of them, they have a lot of pressure to ship, this is how they do it. It’s a greedy algorithm. It minimizes pain.

Where things become really problematic is when the pain displacement involves multiple teams. This is why understanding decision impact is so key. Pain displacement doesn’t just involve engineering teams, it also involves customers and other stakeholders in the organization. This is something I see quite a bit: displacing pain away from customers onto various teams within the org by setting unrealistic expectations up front.

For example, we build a product MVP and run it on a single, high-memory instance, and we don’t actually write data out to disk to keep it fast. We then put this product in front of sales folks, marketing, or even customers and say “hey, look at this cool thing we built.” Then the customers say “wow, this is great! I don’t feel any pain at all using this!” That’s because the pain has been moved elsewhere.

This MVP isn’t fault tolerant because it’s running on a single machine. This MVP isn’t horizontally scalable because we keep all the state in memory on one instance. This MVP isn’t safe because the data isn’t durably stored to disk. The problem is we weren’t testing at scale, so we never felt any pain until it was too late. So we start working backward to address these issues after the fact. We need to run multiple instances so we can have failover. But wait, now we need stateful request routing to maintain our performance expectations. Does our infrastructure support that? We need a mechanism to split and merge units of work that plays nicely with our autoscaling system to give us a better scale story, avoid hot instances, and reduce excess capacity. But wait, how long will that take to build? We need to attach persistent disks so we can durably store data and keep things fast. But wait, does our cluster provisioning allow for that? Does that even meet our compliance requirements?

The only way you reach this point is by making local decisions without thinking about the trade-offs involved or the fact that what you’ve actually done is simply displaced the pain.

If someone doesn’t feel pain, they have a harder time developing a sense of empathy. For instance, the goal of any good operations team is to effectively put itself out of a job by empowering developers to self-service through tooling and automation. One example of this is infrastructure as code, so an ops team adds a process requiring developers to provision their own infrastructure using CloudFormation scripts. For the ops folks, this is a boon—now they no longer have to labor through countless UIs and AWS consoles to provision databases, queues, and the like for each environment. Developers, on the other hand, were never exposed to that pain, so to them, writing CloudFormation scripts is a new hoop to jump through—setting up infrastructure is ops’ job! They might feel pain now, but they don’t necessarily see the immediate payoff.

A coworker of mine recently posed an interesting question: why do product teams often overlook the need for tools required to support their product in production until after they’ve deployed to production? And while the answer he posits is good, and one I very much agree with—solving a problem and solving the problem of solving problems are two very different problems—my answer is this: pain-driven development. In this case, you’re deferring the pain by hooking up debuggers or SSHing into the box and poking about instead of relying on instrumentation which is what we’re limited to in the field. As long as you’re cognizant of this and know that at some point you will have to feel some pain, it can be okay. But if you’re just displacing pain thinking it’s actually disappearing, you’ll be in for a rude awakening. Remember, it’s a zero-sum game.

I’m looking at this through an infrastructure or operations lens, but this applies everywhere and it cuts both ways. Understanding the why behind something rather than just the how is critical to building empathy. It’s being able to look at a problem through someone else’s perspective and applying that to your own work. Changing your perspective is a powerful way to deepen your relationships. Pain-driven development is intoxicating because it allows us to move fast. It’s a greedy algorithm, but it provides a poor global approximation for large engineering organizations. Thinking holistically is important.