Scaling DevOps and the Revival of Operations

Operations is going through a renaissance right now. With the move to cloud, the increasing amount of automation, and the increasing importance of automation, Ops as we know it is reinventing itself out of necessity. Infrastructure is becoming more and more sophisticated—and commoditized—and practices are just now starting to grow up around that. So while some worry about robots taking our jobs, the reality is more about how automation will help augment us to build better software and focus on higher-value things. It’s not so much about the distant future—whatever that may hold—so much as it is about the next five to ten years, what Operations looks like in that timeframe, and why I think it has to retool.

When we think about traditional Operations, we probably think about hardware and servers, managing networks and databases, application servers and runtimes, disaster recovery, Nagios checks, as well as the business side—vendor management, procurement, and so on. Finally, we have applications built on top by development teams.

We have a nice, clean separation—developers focus on building features and products, and Ops focuses on making sure the lights stay on. Of course, we know the reality is this separation also creates a lot of problems, so DevOps was borne out of this as a way to bring these two groups into alignment by improving communication and feedback loops.

Now, with the move to cloud, many of these traditional Ops functions are effectively being outsourced to cloud providers, i.e. the idea of NoOps. We get unprecedented elasticity and on-demand compute with far less overhead than we ever had before—shrinking procurement time from days or weeks to seconds or minutes.

What this leaves is a thin but important slice between Google or Amazon and those products built by developers—the glue, essentially, between cloud and product. I call this NewOps (which I use facetiously in reference to NoSQL/NewSQL), and it’s the future of Ops. This encompasses infrastructure automation, deployment automation, configuration management, logging, monitoring, and many other things. When Marc Andreessen said software is eating the world, he really meant it. The future of Ops—and many other things—is software. It’s killing the boring, repetitive things we really don’t want to be doing anyway and letting us shift our focus elsewhere.

Certainly, automation is nothing new and is, I think, an important part of DevOps, so I’m going to explain what I mean by NewOps and why I’m distinguishing it. I also don’t want to mischaracterize by having these neatly delineated Ops models. The truth is, your company doesn’t just one day graduate and gets its DevOps diploma. Instead, it might evolve through various manifestations of these different models. DevOps is a journey, not a destination in and of itself.

I like to think of a DevOps scale of automation, from manual provisioning all the way to fully self-service. Next, I add a second dimension, org size, from the smallest startups to the biggest enterprises.

Scaling DevOps

Scaling a business is probably one of the hardest things a company has to go through. In particular, dealing with the problem of silos. They happen at every company as it grows, but why is it that silos form in the first place?

Many companies start with a “DevOps” approach, often out of necessity more than anything. As a small startup, we can’t afford to have dedicated developers, QA, Ops, and security people. We just have people, and those people wear many different hats. Developers might be pushing their own code to production. They might even be managing the infrastructure that code runs on. There’s probably not a lot of stability, probably a lot of risk, and probably not a whole lot of thought towards controlling costs.

But as the product scales, we specialize. And as the business scales, we add various safety checks, controls, and processes. Developers write code, Ops people run it, QA gets blamed for defects, security blocks everything, and management wonders why nothing gets shipped.

And so we end up in the top left-hand quadrant with Ops as gatekeepers. Ops is fighting for stability and, at the same time, devs are basically fighting for change. More or less, we have a stable, cost-controlled, risk-averse environment—hopefully. But we also have a significant delivery and innovation bottleneck.

Specialization is good! But misalignment is not good. The question is, then, how do we scale specialization? Cross-functional teams come to mind. After all, DevOps encourages cooperation! We add an Ops engineer to each team, and maybe a reliability engineer, and perhaps a few extra for on-call backup, and of course a QA engineer too. Problem solved, right?

But hold on. What if we have 40 development teams? And all those teams are doing microservices. And, of course, all of those microservices are special snowflakes each with their own stacks, infrastructure, databases, and so on. This quickly gets out of control, but moreover, that’s a lot of teams and specialized roles on those teams. That’s a lot of headcount which equates to a lot of hiring and a lot of time and money. If you’re Google and you can just throw money at the problem, this might work out okay. For the rest of us, it might not be such a realistic option.

We go back to the drawing board and again ask ourselves how do we scale specialization? My thought to how we do this is with vision and product.

A vision is simply a mental image of what the future could be like. It enables independent decision making and alignment. Vision allows all of those teams, and the people on those teams, to make decisions without having to constantly coordinate with each other. Without vision, you’re just iterating to nowhere fast.

But vision without execution is just hallucination. Products are how we scale execution. Specifically, this idea of Operations through the lens of product, which I’ll describe after showing the parallel with what’s happening in QA.

In a lot of engineering organizations, many QA roles have been quietly disappearing. I think what’s happening is this evolution of QA, particularly, this shift from being test-focused to tools-focused.

We can look at companies like Amazon and Microsoft who popularized the SDET (Software Development Engineer in Test) model. These companies recognized that having a separate QA and development group causes a lot of problems, just like how having a separate Ops group does. We end up with SDEs (Software Development Engineers) who still focus on the development aspects of building software and SDETs who focus on the quality aspects, but rather than having two wholly separate groups, we just have development teams with SDETs embedded in them.

More recently, Microsoft moved to what they call a “Combined Engineering” model—effectively combining the SDE and SDET roles into a single role called a Software Engineer. Software Engineers write the product code, test code, and tools code needed to deliver their service. They are responsible for everything. Quality is a core concern of software development anyway.

Software Engineers write the code, unit tests, and integration tests. Those tests run in CI. The code moves through a CD pipeline before finally going out to production in some fashion. QA teams are shrinking, but what’s growing are the teams building the tools—the CI environments, the CD pipelines, the automated testing frameworks, the production tooling and automation, etc. The same is becoming true of Ops.

This is what I mean by “Operations through the lens of product.” The build, release, deploy automation, configuration management, infrastructure automation, logging, monitoring—these are all products.

Constraints often make problems easier. At Workiva, as we were struggling through that scaling phase, we placed a constraint on ourselves. We capped our infrastructure engineering headcount at 15% of R&D. This forced us to solve the problem using technology, and technical problems tend to be easier than people problems. In effect, this required us to productize our infrastructure. In doing so, we scaled. We controlled costs. We kept our headcount in check. We reduced risk. We accelerated development. Ultimately, we delivered value to customers faster, going from about three to four releases per year to multiple releases per day. In the end, this is really the goal of DevOps—to deliver value to customers continuously and to do it rapidly and reliably.

Rethinking Ops

It’s time we start to rethink Operations because clearly this model of Ops as cluster or infrastructure admins does not scale. Developers will always out-demand their capacity to supply. Either your headcount is out of control or your ability to innovate and deliver is severely hamstrung. Operations becomes this interrupt-driven thing where we’re just fighting fires as they happen. Ops as masters of production usually devolves to Ops becoming human incident routers, trying to figure out what team or person can help resolve problems because, being responsible for everything, they don’t have the insight to fix it themselves.

Another path that many companies take is Platform as a Service. Workiva is an example of this. For a very long time, Workiva didn’t have a traditional Ops team because the Ops team was Google. The first product was built on Google App Engine. This helped immensely to deliver value to customers quickly. We could just focus on the product and not the surrounding operational aspects, but there is a very real innovation bottleneck that comes with this.

The idea of “Ops lock-in” can be a major problem, whether it’s a PaaS like App Engine locking you in or your own Ops team who just isn’t able to support the kind of innovation that you’re trying to do.

My vision for the future of Operations is taking Combined Engineering to its logical conclusion. Just like with QA, Ops capabilities should be embedded within development teams. The reality is you can’t be an effective software engineer today without some Ops skills, and I think every role should be working towards automating itself out of a job. Specifically, my vision is enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services.

The knee-jerk reaction to this idea is usually fully embracing Infrastructure as a Service, infrastructure as code, and giving developers freedom—and usually the consequences are dire. The point here is that the pendulum can swing too far in the other direction. This was a problem for a brief period of time at Workiva. As we were building new products off of App Engine, developers had this newfound freedom, so teams all went different directions introducing new tech, new infrastructure, new services, and so forth. It was a free-for-all, an explosion of stuff, and the cost explosion that comes with it.

There has to be some control around that, so we tweak the vision statement a bit: enabling developers to self-service through tooling and automation and empowering them to deploy and operate their services…with minimal Ops intervention. We have to have some checks and balances in place.

With this, Ops become force multipliers. We move away from the reactive, interrupt-driven model where Ops are masters of production responsible for everything. Instead, we make dev teams responsible for their services but provide the tools they need to actually own their systems end-to-end—from the code on their laptops to operating it in production.

Enabling developers to self-service through tooling and automation means treating Ops as a product team. The infrastructure automation, deployment automation, configuration management, logging, monitoring, and production tools—these are all products. It’s these products that allow teams to fully own their services. This leads to empowerment.

I have this theory that all engineering organizations operate in this fashion which I call pain-driven development. As a company grows, it starts to develop limbs—teams or silos. Each of these limbs has its own pain receptors. Teams operate in a way that minimizes the amount of pain that they feel, it’s human instinct. We make locally optimal decisions to minimize pain and end up following a path of least resistance.

Silos promote pain displacement, which results in a “bulkhead” effect. Product development feels the pain of building software, QA feels the pain of testing software, and Ops feels the pain of running software. This creates broken feedback loops. For instance, developers aren’t feeling the pain Ops is feeling trying to run their software. We just throw things over the wall and it becomes an empathy problem.

This leads to misaligned incentives because each team will optimize for the pain that they feel. How do you expect developers to care about quality if they’re not on the hook? Similarly, how do you expect them to care about operability if they’re not on the hook? Developers won’t build truly reliable software until they are on-call for it and directly responsible. However, responsibility requires empowerment. You can’t have one without the other. You can’t ask someone to care about something and fix it without also giving them the power to do so. Most Ops teams simply haven’t done enough to empower and offload responsibility onto development teams.

Products enable ownership. We move away from Ops as masters of production responsible for everything and push that responsibility onto dev teams. They are the experts for their services. They are best equipped to deal with problems that arise. But we provide the tools they need to diagnose and resolve those problems on their own.

Products maintain control through enablement—enabling teams to follow best practices for builds, testing, deploys, support, and compliance. Compliance and other SDLC requirements have to be encoded into the tools and processes. These are things developers won’t empathize with or simply won’t understand. Rather than giving them a long list of things they have to do, we take as many of those things as we can and bake them into our products. If you use these tools or follow these processes, you’ll get a lot of this stuff for free. This reduces risk and accelerates development.

Similarly, we can’t allow all of the special snowflakes to happen. We have to control that explosion of stuff. To do this, we use pain-driven development to our advantage by creating paths of least resistance. Using standardized patterns, application shapes, and infrastructure services, we can setup “paths” to both make it easier to reach production and meet the goals of the business. As a developer, if you follow this path, your life will be a lot easier and you’ll feel less pain. If you deviate from that path, things get much harder—and painful.

We end up with a set “menu” of standard application shapes and infrastructure. If teams want to deviate and go off-menu, it’s on them to make a case for it. For example, if I want to introduce Erlang into our stack, it’s on my team and me to present the case for that. Part of this might mean we help build and maintain the tools needed to support that. If there is a compelling enough case or enough teams are making similar asks, we can start to standardize new shapes.

Note that we aren’t necessarily mandating technologies, but we’re leveraging pain-driven development to work in our favor.

Products in Practice

Next, I’m going to look at this idea of Operations through the lens of product in a bit more detail. We’ll see what this might actually look like in practice, again using Workiva as a bit of a case study.

Below is the high-level flow that I think about, from code on laptop to code in production.

Starting with the Build and continuous integration stage, this workflow tends to look something like the following. A developer pushes a change to a branch in a code repository, e.g. GitHub. This triggers a few things to happen. First, the build process, which runs unit/integration tests and builds artifacts. This, in turn, might trigger a QA and/or compliance process. At the same time, we have code reviews happening. All of these processes provide feedback to the developer to quickly iterate.

Workiva has a lot of automated processes built into the developer workflow, some off-the-shelf and some built in-house. For example, when a PR is opened, a security scanner runs which does static analysis and looks for various security vulnerabilities. This can flag a security review when a closer look is needed. Likewise, there is code coverage, automated builds, unit tests, and integration tests, Docker image builds, and compliance checks. The screenshots below come from an open-source repo showing some of these products in practice.

For compliance reasons, Workiva requires at least one other person sign-off on code changes. GitHub provides pretty good support for this. Code reviewers provide their feedback, developers work through that feedback, and, once satisfied, reviewers give their “plus one.”

The screenshot below shows some of the automated processes Workiva relies on in the developer workflow: Travis CI, Codecov, Smithy (which is Workiva’s internal build system), Skynet (automated testing), Rosie (automated compliance controls, e.g. do you have code reviews, security reviews, other SDLC compliance requirements?), and Aviary (the security scanner). Once all of these have passed, the PR is automatically labeled with “Merge Requirements Met” and the change can be merged into master.

There are a couple things worth pointing out with this workflow. First, the build plan is part of the code and not baked into some build tool. This allows dev teams to fully control their builds. Second, you noticed that Workiva has very deep integration with GitHub. This has allowed them to build automated controls into the development process, which speeds up the developer’s workflow while reducing risk.

Next, we move on to the Release stage. This flow looks something like the following:

The developer tags a branch for release, which triggers a build process for creating the artifact. This may have a QA process which then promotes the artifact to a development artifact repository. As you may have noticed, Workiva has a lot of compliance requirements since they deal with companies’ pre-financial data, so there is typically a sign-off process at various stages involving different parties like Release Management, QA, Security, etc. Depending on your compliance controls, this might just be clicking a button to promote an artifact to a production repository. From there, it can actually be deployed to a production environment.

With this workflow, artifact tagging, building, and promotion is all automated. It’s also important we have processes around security. Container and machine image auditing is automated as well as security patching for OS updates, etc. For example, this workflow might use something like Packer to automate AMI building. Finally, the artifact sign-off is streamlined for the various parties involved, if not fully automated.

Now we’re ready to actually deploy our application. This is a key part of self-service and “owning” a product. This allows a team to configure their application and, ideally, deploy it themselves to production. Initially, this might be handled by a Release Management team who actually clicks the deploy button, but as you become more confident in your processes and your tools become more mature, more of this responsibility can be pushed onto the development teams.

This is also where control comes into play. For instance, I may be allowed to configure my application to use 1GB of RAM, but if I need 1TB, I may need to get additional sign-off.

Self-service deploys and self-service configuration—with guard rails—are an important part of continuous deployment. Additionally, infrastructure provisioning should be automated. No more submitting tickets for a nameless Ops person to provision and configure servers, VMs, or other resources—no ticket-driven development.

I’ve been deliberate about not prescribing particular solutions for some of these problems. You might be using Kubernetes or ECS to orchestrate containers, it doesn’t really matter. These should mostly be implementation details. What does matter, though, is having good abstractions around certain implementation details. For example, Workiva was meticulous about building some layers around workload scheduling. This allowed them at one point to switch from using Fleet to ECS to manage containers with virtually no impact to developers. With the amount of churn that happens in tech, it’s important not to tie yourself too heavily to any one implementation. Instead, think about the APIs you expose for your infrastructure and consider those the deliverable.

Finally, we need to operate our service in production, another important part of ownership. There are a lot of products here, so we’ll just look at a cross section.

Logging is arguably the most important part of how we figure out what is happening in our systems. For this reason, Workiva built structured logging and metrics specs and language libraries implementing these specs. As a developer, this made it easy to simply pull in the library for your language and get structured, contextual logging for free. The other half to this was building out a data pipeline. Basically all metadata at Workiva went into Amazon Kinesis, including logs, metrics, and traces. First, this allowed us to reuse the same infrastructure for all of this data, from the agents running on the machines to the pipeline itself. Second, it allowed us to fan this data out to different backend systems—Splunk, SumoLogic, Datadog, Stackdriver, BigQuery, as well as various internal tools. This is probably one of the most important things you can do with your infrastructure.

Other continuous operations tools include telemetry, tracing, health checks, alerting, and more sophisticated production tools like canary deploys, A/B testing, and traffic shadowing. Some might refer to these as tools for testing in production. Realistically, once you reach a certain scale, testing in production is the only real alternative to the proliferation of deployment environments.

It’s worth mentioning that you do not need to build all of these products yourself. In fact, you shouldn’t. Many off-the-shelf solutions just need glued together. However, I’ve also come to realize that it’s often the “glue” that is important. That is to say, taking some large, commercial off-the-shelf solution and introducing it into a company is frequently rife with headaches. It’s like Jira, a big Frankenstein product that attempts to solve everyone’s problems and, in doing so, solves none of them particularly well. This is why I tend to favor small, modular solutions that can be composed. But it also highlights why there is a cultural aspect to this.

If you think the solution to your ailments is some magical product—maybe a CI/CD pipeline or Kubernetes or something else—you’re misguided. If anything, most problems are cultural, not technical in nature. Technology will not fix your broken culture! The products are not the endgame, they are a means to an end. And the products need to fit the company, its culture, its architecture, and its constraints. It’s tempting to take something you see on Hacker News and introduce it into your stack, but you have to be careful.

Likewise, it’s tempting to dive straight into the deep-end, automate everything, and build out a highly sophisticated infrastructure. But it’s important to start small and evolve over time. My approach to this is get the workflow correct, start manual, then automate more and more over time.

Wrapping Up

Specialization leads to misalignment and broken feedback loops, but it’s an important part of scaling a business. The question is: how do we specialize?

We know the traditional Ops model does not scale—devs will always out-demand capacity in this reactive model. Not only this, the siloing creates an empathy problem. DevOps attempts to help with this by tightening feedback loops and building empathy. NewOps takes this further by empowering teams and providing autonomy. It’s not a replacement for DevOps, it’s an evolution of it. It’s applying a product mindset to the traditional Ops model.

The future of Ops is taking Combined Engineering to its logical conclusion. As such, Ops teams should be redefining their vision from being masters of production to enablers of production. Just like with QA, Ops capabilities need to be embedded within dev teams, but the caveat is they need to be enabled! This is the direction Operations is headed. Software is eating the world, which means both up and down the stack. NewOps treats Ops like a product team whose product, effectively, is infrastructure. It’s creating guard rails, not walls—taking SDLC and compliance controls and encoding them into products rather than giving devs a laundry list of things, having them run the gauntlet through a long, drawn-out development process, and having a gatekeeper at the end.

Offloading responsibility helps correct and scale feedback loops. In my opinion, this is how we scale specialization. Operations isn’t going away, it’s just getting a product manager.

Software Is About Storytelling

Software engineering is more a practice in archeology than it is in building. As an industry, we undervalue storytelling and focus too much on artifacts and tools and deliverables. How many times have you been left scratching your head while looking at a piece of code, system, or process? It’s the story, the legacy left behind by that artifact, that is just as important—if not more—than the artifact itself.

And I don’t mean what’s in the version control history—that’s often useless. I mean the real, human story behind something. Artifacts, whether that’s code or tools or something else entirely, are not just snapshots in time. They’re the result of a series of decisions, discussions, mistakes, corrections, problems, constraints, and so on.  They’re the product of the engineering process, but the problem is they usually don’t capture that process in its entirety. They rarely capture it at all. They commonly end up being nothing but a snapshot in time.

It’s often the sign of an inexperienced engineer when someone looks at something and says, “this is stupid” or “why are they using X instead of Y?” They’re ignoring the context, the fact that circumstances may have been different. There is a story that led up to that point, a reason for why things are the way they are. If you’re lucky, the people involved are still around. Unfortunately, this is not typically the case. And so it’s not necessarily the poor engineer’s fault for wondering these things. Their predecessors haven’t done enough to make that story discoverable and share that context.

I worked at a company that built a homegrown container PaaS on ECS. Doing that today would be insane with the plethora of container solutions available now. “Why aren’t you using Kubernetes?” Well, four years ago when we started, Kubernetes didn’t exist. Even Docker was just in its infancy. And it’s not exactly a flick of a switch to move multiple production environments to a new container runtime, not to mention the politicking with leadership to convince them it’s worth it to not ship any new code for the next quarter as we rearchitect our entire platform. Oh, and now the people behind the original solution are no longer with the company. Good luck! And this is on the timescale of about five years. That’s maybe like one generation of engineers at the company at most—nothing compared to the decades or more software usually lives (an interesting observation is that timescale, I think, is proportional to the size of an organization). Don’t underestimate momentum, but also don’t underestimate changing circumstances, even on a small time horizon.

The point is, stop looking at technology in a vacuum. There are many facets to consider. Likewise, decisions are not made in a vacuum. Part of this is just being an empathetic engineer. The corollary to this is you don’t need to adopt every bleeding-edge tech that comes out to be successful, but the bigger point is software is about storytelling. The question you should be asking is how does your organization tell those stories? Are you deliberate or is it left to tribal knowledge and hearsay? Is it something you truly value and prioritize or simply a byproduct?

Documentation is good, but the trouble with documentation is it’s usually haphazard and stagnant. It’s also usually documentation of how and not why. Documenting intent can go a long way, and understanding the why is a good way to develop empathy. Code survives us. There’s a fantastic talk by Bryan Cantrill on oral tradition in software engineering where he talks about this. People care about intent. Specifically, when you write software, people care what you think. As Bryan puts it, future generations of programmers want to understand your intent so they can abide by it, so we need to tell them what our intent was. We need to broadcast it. Good code comments are an example of this. They give you a narrative of not only what’s going on, but why. When we write software, we write it for future generations, and that’s the most underestimated thing in all of software. Documenting intent also allows you to document your values, and that allows the people who come after you to continue to uphold them.

Storytelling in software is important. Without it, software archeology is simply the study of puzzles created by time and neglect. When an organization doesn’t record its history, it’s bound to repeat the same mistakes. A company’s memory is comprised of its people, but the fact is people churn. Knowing how you got here often helps you with getting to where you want to be. Storytelling is how we transcend generational gaps and the inevitable changing of the old guard to the new guard in a maturing engineering organization. The same is true when we expand that to the entire industry. We’re too memoryless—shipping code and not looking back, discovering everything old that is new again, and simply not appreciating our lineage.

Engineering Empathy

This was a talk I gave at an internal R&D conference my last week at Workiva. I got a lot of positive feedback on the talk, so I figured I would share it with a wider audience. Be warned: it’s long. Feel free to read each section separately, though they largely tie together.

Why do you work where you work? For many in tech, the answer is probably culture. When you tell a friend about your job, the culture is probably the first thing you describe. It’s culture that can be a company’s biggest asset—and its biggest downfall. But what is it?

Culture isn’t a list of values or a mission statement. It’s not a casual dress code or a beer fridge. Culture is what you reward and what you don’t. More importantly, it’s what you reward and what you punish. That’s an important distinction to make because when you don’t punish behavior that’s inconsistent with your culture, you send a message: you don’t care about it.

So culture is what you live day in and day out. It’s not what you say, it’s what you do. Put yourself in the shoes of a new hire at your company. A new hire doesn’t walk in automatically knowing your culture. They walk in—filled with anxiety—hoping for success, fearing failure, and they look around. They observe their environment. They see who is succeeding, and they try to emulate that behavior. They see who is failing, and they try to avoid that behavior. They ask the question: what makes someone successful here?

Jim Rohn was credited with saying we are the average of the five people we spend the most time with, and I think there’s a lot of truth to this idea. The people around you shape who you are. They shape your behavior, your habits, your thoughts, your opinions, your worldview. Culture is really a feedback loop, and we all contribute to it. It’s not mandated or dictated down (or up through grassroots programs), it’s emergent, but we have to be deliberate about our behaviors because that’s what shapes our culture (note that this doesn’t mean culture doesn’t start with leadership—it most certainly does).

In my opinion, a strong engineering culture is comprised of three parts: the right people, the right processes, and the right priorities. The right people means people that align with and protect your values. Processes are how you execute—how you communicate, develop, deliver, etc. And priorities are your values—the skills or behaviors you value in fellow employees. It’s also your vision. These help you to make decisions—what gets done today and what goes to the bottom of the list. For example, many companies claim a customer-first principle, but how many of them actually use it to drive their day-to-day decisions? This is the difference between a list of values and a culture.

What about technology? Experience? Customer relationships? These are all important competitive advantages, but I think they largely emerge from having the right people, processes, and priorities. The three are deeply intertwined. A culture is the unique combination of processes and values within an organization, and it’s those processes and values that enable you to replicate your success. I’m a big fan of Reed Hastings’ Netflix culture slide deck, but there are some things with which I fundamentally disagree. Hastings says, “The more talent density you have the less process you need. The more process you create the less talent you retain.” This is wrong on a number of levels, which I will talk about later.

Empathy is the common thread throughout each of these three areas—the people, the processes, and the priorities—and we’ll see how it applies to each.

The Ultimate Complex System: People

We largely think about software development as a purely technical feat, one which requires skill and creativity and ingenuity. I think for many, it’s why we became engineers in the first place. We like solving problems. But when all is said and done, computers do what you tell them to do. Computers don’t have opinions or biases or agendas or egos. The technical challenges are really a small part of any sufficiently complicated piece of software. Having been an individual contributor, tech lead, and manager, I’ve come to the realization that it’s people that is the ultimate complex system.

There’s a quote from The Five Dysfunctions of a Team which I’ve referenced before on this blog that I think captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” Teamwork is powerful is intuitive, but teamwork is rare is more profound. Teamwork is a competitive advantage because it’s rare—that’s a pretty strong statement. Let’s unpack it a bit.

It’s the difference between technical architecture and social architecture. We tend to focus on the former while neglecting the latter, but software engineering is more about collaboration than code. It wasn’t until I became a manager that I realized good managers are force multipliers, but social architecture is everyone’s responsibility. Remember, culture is a feedback loop which we all contribute to, so everyone is a social architect.

A key part of social architecture is communication empathy. Back in the 90’s, an evolutionary psychologist by the name of Robin Dunbar proposed the idea that humans can only maintain about 150 stable social relationships. The limit is referred to as Dunbar’s number. He drew a correlation between primate brain size and the average size of cohesive social groups. Informally, Dunbar describes this as “the number of people you would not feel embarrassed about joining uninvited for a drink if you happened to bump into them in a bar.” This number includes all of the relationships in your life, both personal and professional, past and present.

What’s interesting about Dunbar’s number is how it applies to our jobs as software engineers and the interplay with cognitive biases we have as humans. When you don’t understand what someone else does, you’re automatically biased against them. In your head, you’re king. You understand exactly what you do, what your job is, the value that you add, but outside your head—outside your Dunbar’s number—those people are all a mystery. And humans have this funny tendency to mock what they don’t understand. We have lots of these cognitive biases.

Building those types of stable relationships is really hard when you can’t just walk over to someone’s desk and talk to them face-to-face. This was something we struggled a lot with at Workiva, being a company of a few hundred engineers spread out across 11 or so offices. Not only is split-brain an inevitability, but just a general lack of rapport. That stage is super hard to go through for a lot of companies—going from dozens of engineers to hundreds in a few short years. The cruel irony is no matter how agile a company claims to be, the culture—of which structure and processes are a part of—is usually the slowest to adjust. No longer is decision making done by standing up and telling a roomful of people, “I’m going to do this. Please tell me now if that’s a bad idea.” And no longer are decisions often made unanimously and rapidly.

Building stable relationships is much harder without the random hallway error correction. The right people aren’t always bumping into each other at the right time, whereas before a lot of decisions could be made just-in-time. Instead, communication has to be more deliberate and no longer organic. Decisions are made by the Jeff Bezos philosophy of disagree and commit. But nevertheless, building empathy is hard without face-to-face communication, and you miss out on a lot of the nuance of communication.

Again, communication is highly nuanced, and nuance is hard to convey over HipChat. The role of emotions plays a big part. Imagine the situation where you’re walking down a sidewalk or a hallway and someone accidentally walks in front of you. You might sidestep or do a little dance to get around each other, smile or nod, and get on with your day. This minor inconvenience becomes almost this tiny, pleasant interaction between two people. Now, take the same scenario but between two drivers, and you probably have some kind of road rage type situation. The only real difference is the steel cage surrounding the drivers blocking out the verbal and non-verbal communication. How many times has a HipChat conversation gone completely off the rails only to be resolved by a quick Google Hangout or in-person conversation? It’s the exact same thing.

One last note with respect to cognitive biases: once again, humans have a funny tendency where, in a vacuum of information, people will create their own. We will manufacture information just to fill the void, and often it’s not just creating information but taking information we’ve heard somewhere else and applying it to our own—and often the wrong—context. The extreme example of this is “We heard microservices worked well for Netflix, so we should use them at our growing startup” or “Google does monorepo, so we should too.” You’re ignoring all of the context—the path those companies followed to reach those points, the trade-offs made, the organizational differences and competencies, the pain points. When the cognitive biases and opinions we have as humans are added in, the problems amplify and compound. You get a frustrating game of telephone.

You can’t kill The Grapevine anymore than you can change human nature, so you have to address it head-on where and when you can. Don’t allow the vacuum of information to form in the first place, and be cognizant of when you’re applying information you’ve heard from someone else. Is the problem the same? Do you have the same amount of information? Are your constraints the same? What is the full context?

I tend to look at communication through two different lenses: push and pull. In order to be an empathetic communicator, I think it’s important to look at these while thinking about the cognitive biases discussed earlier, starting with push.

I’ve been in meetings where someone called out someone else as a “blocker”, and there was visible wincing in the room. I think for some, the word probably triggers some sort of PTSD. When you’re depending on something that’s outside of your direct control to get something else done, it’s hard not to drop the occasional B-word. It happens out of frustration. It happens because you’re just trying to ship. It also happens because you don’t want to look bad. And it might seem innocuous in the moment, but it has impact. By invoking it, you are tempting fate.

There’s a really good book that was written way back in 1944 called The Unwritten Laws of Engineering. It was published by the American Society of Mechanical Engineers and the language in the book is quite dated (parts of it read like a crotchety old guy yelling at kids to get off his lawn), but the ideas in the book still apply very much today and even to software engineering. It’s really about people, and fundamentally, people don’t change. One of the ideas in the book is this notion which I call communication impact. I’ve taken a quote from the book which I think highlights this idea (emphasis mine):

Be careful about whom you mark for copies of letters, memos, etc., when the interests of other departments are involved. A lot of mischief has been caused by young people broadcasting memorandum containing damaging or embarrassing statements. Of course it is sometimes difficult for a novice to recognize the “dynamite” in such a document but, in general, it is apt to cause trouble if it steps too heavily upon someone’s toes or reveals a serious shortcoming on anybody’s part. If it has wide distribution or if it concerns manufacturing or customer difficulties, you’d better get the boss to approve it before it goes out unless you’re very sure of your ground.

I see this a lot. Not just in emails (née “memos”) but in meetings or reviews, someone will—inadvertently or not—throw someone else under the proverbial bus, i.e. “broadcasting memorandum containing damaging or embarrassing statements”, “stepping too heavily upon someone’s toes”, or “revealing a serious shortcoming on somebody’s part.” The problem with this is, by doing it, you immediately put the other party on the defensive and also create a cognitive bias for everyone else in the room. You create a negative predisposition, which may or may not be warranted, toward them. Similarly, I liken “if it concerns manufacturing or customer difficulties” to production postmortems. This is why they need to be blameless. Why is it that retros on production issues are blameless while, at the same time, the development process is full of blame-assigning? It might seem innocuous, but your communication has impact. Push with respect and under the assumption the other person is probably doing the right thing. Don’t be willing to throw anyone under that bus. Likewise, be quick to take responsibility but slow to assign it. Don’t be willing to practice Cover Your Ass Engineering.

I’ve been in meetings where someone would get called out as a blocker, literally articulated in that way, and the person wasn’t even aware they were blocking anything. I’ve seen people create JIRA tickets on another team’s board and then immediately call them blockers. It’s important to call out dependencies ahead of time, and when someone is “blocking” your progress, speak to them about it individually and before it reaches a critical point. No one should be getting caught off guard by these things. Be careful about how and where you articulate these types of problems.

On the same topic of communication impact, I’ve seen engineers develop detailed and extravagant plans like “We’re going to move the entire company to a monorepo while simultaneously switching from Git to Mercurial” or “We’re going to build our own stream-processing framework from the ground up”, and then distribute them widely to the organization (“wide distribution” as referenced in The Unwritten Laws of Engineering passage above). The proposals are usually well-intentioned and maybe even compelling sometimes, but it’s the way in which they are communicated that is problematic. Recall The Grapevine: people see it, assume it’s reality, and then spread misinformation. “Did you know the company is switching to Mercurial?”

An effective way to build rapport between teams is genuinely celebrating the successes of other teams, even the small ones. I think for many organizations, it’s common to celebrate victories within a team—happy hour for shipping a new feature or a team outing for signing a major account—but celebrating another team’s win is more rare, especially when a company grows in size. The operative word is “genuine” though. Don’t just do it for the sake of doing it, be genuine about it. This is a compelling way to build the stable relationships needed to unlock the rarity of teamwork described earlier.

Equally important to understanding communication impact is understanding decision impact. I’ve already written about this, so I’ll keep it brief: your decisions impact others. How does adopting X affect Operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? Doing something might save you time, but does it create work or slow others down?

Teams operate in a way that minimizes the amount of pain they feel. It’s a natural instinct and a phenomenon I call pain displacement. Pain-driven development is following the path of least resistance. By doing this, we end up moving the pain somewhere else or deferring it until later (i.e. tech debt). Where the problems start to happen is when multiple teams or functions are involved. This is when the political and other organizational issues start to seep in. Patrick Lencioni, author of The Five Dysfunctions of a Team, has a book that touches on this subject called Silos, Politics, and Turf Wars. 

I believe the solution is multifaceted. First, teams need to think holistically, widening their vision beyond the deliverable immediately in front of them. They need to have a sense of organizational awareness. Second, teams—and especially leaders—need to be able to take off their job’s “hat” periodically in order to solve a shared problem. Lencioni observes that much of what causes organizational dysfunction is siloing, and this typically stems from strong intra-team loyalties. For example, within an engineering organization you might have development, operations, QA, security, and other functional teams. Empathy is being able to look at something through someone else’s perspective, and this requires removing your functional hat from time to time. Lastly, teams need to be able to rally around a common cause. This is a shared, compelling vision that motivates and mobilizes people and helps break down the silos. A shared vision aligns teams and enables them to work more autonomously. This is how decisions get made.

Pull communication is pretty much just how to ask questions without making people hate you, a skill that is very important to be an effective and empathetic communicator.

The single most common communication issue I see in engineering organizations is The XY Problem. It’s when someone focuses on a particular solution to their problem instead of describing the problem itself.

  • User wants to do X.
  • User doesn’t know how to do X, but thinks they can fumble their way to a solution if they can just manage to do Y.
  • User doesn’t know how to do Y either.
  • User asks for help with Y.
  • Others try to help user with Y, but are confused because Y seems like a strange problem to want to solve.
  • After much interaction and wasted time, it finally becomes clear that the user really wants help with X, and that Y wasn’t even a suitable solution for X.

The problem occurs when people get stuck on what they believe is the solution and are unable to step back and explain the issue in full. The solution to The XY Problem is simple: always provide the full context of what you’re trying to do. Describe the problem, don’t just prescribe the solution.

Part of being an effective communicator is being able to extract information from people and getting help without being a mental and emotional drain. This is especially true when it comes to debugging. I often see this “murder-mystery debugging” where someone basically tries to push off the blame for something that’s wrong with their code onto someone or something else. This flies in the face of the principle discussed earlier—be quick to take responsibility and slow to assign it. The first step when it comes to debugging anything is assume it’s your fault by default. When you run some code you’re writing and the compiler complains, you don’t blame the compiler, you assume you screwed up. It’s just taking this same mindset and applying it to everything else that we do.

And when you do need to seek help from others—just like with The XY Problem—provide as much context as possible. So much of what I see is this sort of information trickle, where the person seeking help drips information to the people trying to provide it. Don’t make it an interrogation. Lastly, provide a minimal working example that reproduces the problem. Don’t make people build a massive project with 20 dependencies just to reproduce your bug. It’s such a common problem for Stack Overflow that they actually have a name for it: MCVE—Minimal, Complete, Verifiable Example. Do your due diligence before taking time out of someone else’s day because the only thing worse than a bug report is a poorly described, hastily written accusation.

Another thing I see often is swoop-and-poop engineering. This is when someone comes to you with something they need help with—maybe a bug in a library you own, a feature request, something along these lines (this is especially true in open source). They have a sense of urgency; they say it either explicitly or just give off that vibe. You offer to setup a meeting to get more information or work through the problem with them only to find they aren’t available or willing to set aside some time with you. They’re heads down on “something more important,” yet their manager is ready to bite your head off weeks later, often without any documentation or warning. They’ve effectively dumped this on you, said the world’s on fire, and left as quickly as they came. You’re left confused and disoriented. You scratch your head and forget about it, then days or weeks later, they return, horrified that the world is still burning. I call these drive-by questions.

First, it’s important to have an appropriate sense of urgency. If you’re not willing to hop on a Hangout to work through a problem or provide additional information, it’s probably not that important, especially if you can’t even take the time to follow up. With few exceptions, it’s not fair to expect a team to drop everything they’re doing to help you at a moment’s notice, but if they do, you need to meet them halfway. It’s essential to realize that if you’re piling onto a team, others probably are too. If you submit a ticket with another team and then turn around and immediately call it a blocker, that just means you failed to plan accordingly. Having empathy is being cognizant that every team has its own set of priorities, commitments, and work that it’s juggling. By creating that ticket and calling it a blocker, you’re basically saying none of that stuff matters as much. Empathy is understanding that shit rolls downhill. For those who find themselves facing drive-by questions: document everything and be proactive about communicating.

There’s a really good essay by Eric Raymond called How To Ask Questions The Smart Way. It’s something I think every engineer should read and take to heart. My number one pet peeve is Help Vampires. These are people who refuse to take the time to ask coherent, specific questions and really aren’t interested in having their questions answered so much as getting someone else to do their work. They ask the same, tired questions over and over again without really retaining information or thinking critically. It’s question, answer, question, answer, question, answer, ad infinitum.

This is often a hard-earned lesson for junior engineers, but it’s an important one: when you ask a question, you’re not entitled to an answer, you earn the answer. Hasty sounding questions get hasty answers. As engineers, we should not operate like a tech support hotline that people call when their internet stops working. We need to put in a higher level of effort. We need to apply our technical and problem-solving aptitude as engineers. This is the only way you can scale this kind of support structure within an engineering organization. If teams are just constantly bombarding each other with low-effort questions, nothing will get done and people will get burnt out.

Avoid being a Help Vampire. Before asking a question, do your due diligence. Think carefully about where to ask your question. If it’s on HipChat, what is the appropriate room in which to ask? Also be mindful of doing things like @all or @here in a large room. Doing that is like walking into a crowded room, throwing your hands up in the air, and shouting at everyone to look at you. Be precise and informative about your problem, but also keep in mind that volume is not precision. Just dumping a bunch of log messages is noise. Don’t rush to claim that you’ve found a bug. As a first step, take responsibility. And just like with The XY Problem, describe the goal, not the step you took—describe vs. prescribe. Lastly, follow up on the solution. Everyone has been in this situation: you’ve found someone that asked the exact same question as you only to find they never followed up with how they fixed it. Even if it’s just in HipChat or Slack, drop a note indicating the issue was resolved and what the fix was so others can find it. This also helps close the loop when you’ve asked a question to a team and they are actively investigating it. Don’t leave them hanging.

In many ways, being an empathetic communicator just comes down to having self-awareness.

Codifying Values and Priorities: Processes

“Process” has a lot of negative connotations associated with it because it usually becomes this thing done on ceremony. But “process” should be a means of documenting and codifying your values. This is why I disagree with the Reed Hastings quote about process from earlier. Process is about repeatability and error correction. Camille Fournier’s new book on engineering management, The Manager’s Path, has a great section on “bootstrapping culture.” I particularly like the way she frames organizational structure and process:

When talking about structure and process with skeptics, I try to reframe the discussion. Instead of talking about structure, I talk about learning. Instead of talking about process, I talk about transparency. We don’t set up systems because structure and process have inherent value. We do it because we want to learn from our successes and our mistakes, and to share those successes and encode the lessons we learn from failures in a transparent way. This learning and sharing is how organizations become more stable and more scalable over time.

When a process “feels” wrong, it’s probably because it doesn’t reflect your organization’s values. For example, if a process feels heavy, it’s because you value velocity. If a process feels rigid, it’s because you value agility. If a process feels risky, it’s because you value safety. We have a hard time articulating this so instead it becomes “process is bad.”

Somewhere along the line, someone decides to document how stuff gets done. Things get standardized. Tools get made. Processes get established. But process becomes dogma when it’s interpreted as documentation of how rather than an explanation of why. Processes should tell the story of an organization: here’s what we value, here’s why we value it, and here’s how we protect and scale those values. The story is constantly evolving, so processes should be flexible. They shouldn’t be set in stone.

Michael Lopp’s book Managing Humans also provides a useful perspective on culture:

It’s entirely possible that too much process or the wrong process is developed during this build-out, but when this inevitable debate occurs, it should not be about the process. It’s a debate about values. The first question isn’t, “Is this a good, bad, or efficient process?” The first question is, “How does this process reflect our values?”

Processes should be traceable back to values. Each process should have a value or set of values associated with it. Understanding the why helps to develop empathy. It’s the difference between “here’s how we do things” and “here’s why we do things.” It’s much harder to develop a sense of empathy with just the how.

What We Value: Priorities

As engineers, we need to be curious. We need to have a “let’s go see!” attitude. When someone comes to you with a question—and hopefully it’s a well-formulated question based on the earlier discussion—your first reaction should be, “let’s go see!” Use it as an opportunity for both of you to learn. Even if you know the answer, sometimes it’s better to show, not tell, and as the person asking the question, you should be eager to learn. This is the reason I love Julia Evans’ blog so much. It’s oozing with wonder, curiosity, and intrigue. Being an engineer should mean having an innate curiosity. It’s not throwing up your hands at the first sign of an API boundary and saying, “not my problem!” It’s a willingness to roll up your sleeves and dig in to a problem but also a capacity for knowing how and when to involve others. Figure out what you don’t know and push beyond it.

Be humble. There’s a book that was written in the 70’s called The Psychology of Computer Programming, and it’s interesting because it focuses on the human elements of software development rather than the purely technical ones that we normally think about. In the book, it presents The 10 Commandments of Egoless Programming, which I think contain a powerful set of guiding principles for software engineers:

  1. Understand and accept that you will make mistakes.
  2. You are not your code.
  3. No matter how much “karate” you know, someone else will always know more.
  4. Don’t rewrite code without consultation.
  5. Treat people who know less than you with respect, deference, and patience.
  6. The only constant in the world is change. Be open to it and accept it with a smile.
  7. The only true authority stems from knowledge, not from position.
  8. Fight for what you believe, but gracefully accept defeat.
  9. Don’t be “the coder in the corner.”
  10. Critique code instead of people—be kind to the coder, not to the code.

Part of being a humble engineer is giving away all the credit. This is especially true for leaders or managers. A manager I once had put it this way: “As a manager, you should never say ‘I’ during a review unless shit went wrong and you’re in the process of taking responsibility for it.” A good leader gives away all the credit and takes all of the blame.

Be engaged. Coding is actually a very small part of our job as software engineers. Our job is to be engaged with the organization. Engage with stakeholder meetings and reviews. Engage with cross-trainings and workshops. Engage with your company’s engineering blog. Engage with other teams. Engage with recruiting and company outreach through conferences or meetups. People dramatically underestimate the value of developing their network, both to their employer and to themselves. You don’t have to do all of these things, but my point is engineers get overly fixated on coding and deliverables. Code is just the byproduct. We’re not paid to write code, we’re paid to add value to the business, and a big part of that is being engaged with the organization.

And of course I’d be remiss not to talk about empathy. Empathy is having a deep understanding of what problems someone is trying to solve. John Allspaw puts it best:

In complex projects, there are usually a number of stakeholders. In any project, the designers, product managers, operations engineers, developers, and business development folks all have goals and perspectives, and mature engineers realize that those goals and views may be different. They understand this so that they can navigate effectively in the work that they do. Being empathetic in this sense means having the ability to view the project from another person’s perspective and to take that into consideration into your own work.

Changing your perspective is a powerful way to deepen your relationships. Once again, it comes back to Dunbar’s number: we have a limited number of stable relationships, but developing and maintaining those relationships is the key to figuring out the rarity of teamwork.

A former coworker of mine passed away late last year. In going through some of his old files, we came across some notes he had on leadership. There was one quote in particular that I thought really captured the essence of this post nicely:

All music is made from the same 12 notes. All culture is made from the same five components: behaviors, relationships, attitudes, values, and environment. It’s the way those notes or components are put together that makes things sing.

This is what it takes to build a strong engineering culture and really just a healthy culture in general. The technology and everything else is secondary. It really starts with the people.

The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premise to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, of which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model instead should essentially treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provide APIs for their infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west-style of Operations needs to die. Ops is commonly the gatekeeper, and they view themselves as such. Old-school Ops is building in as much process as possible, slowing down development so that when they reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!

Pain-Driven Development: Why Greedy Algorithms Are Bad for Engineering Orgs

I recently wrote about the importance of understanding decision impact and why it’s important for building an empathetic engineering culture. I presented the distinction between pain displacement and pain deferral, and this was something I wanted to expand on a bit.

When you distill it down, I think what’s at the heart of a lot of engineering orgs is this idea of “pain-driven development.” When a company grows to a certain size, it develops limbs, and each of these limbs has its own pain receptors. This is when empathy becomes important because it becomes harder and less natural. These limbs of course are teams or, more generally speaking, silos. Teams have a natural tendency to operate in a way that minimizes the amount of pain they feel.

It’s time for some game theory: pain is a zero-sum game. By always following the path of least resistance, we end up displacing pain instead of feeling it. This is literally just instinct. In other words, by making locally optimal choices, we run the risk of losing out on a globally optimal solution. Sometimes this is an explicit business decision, but many times it’s not.

Tech debt is a common example of when pain displacement is a deliberate business decision. It’s pain deferral—there’s pain we need to feel, but we can choose to feel it later and in the meantime provide incremental value to the business. This is usually a team choosing to apply a bandaid and coming back to fix it later. “We have this large batch job that has a five-minute timeout, and we’re sporadically seeing this timeout getting hit. Why don’t we just bump up the timeout to 10 minutes?” This is a bandaid, and a particularly poor one at that because, by Parkinson’s law, as soon as you bump up the timeout to 10 minutes, you’ll start seeing 11-minute jobs, and we’ll be having the same discussion over again. I see the exact same types of discussions happening with resource provisioning: “we’re hitting memory limits—can we just provision our instances with more RAM?” “We’re pegging CPU. Obviously we just need more cores.” Throwing hardware at the problem is the path of least resistance for the developers. They have a deliverable in front of them, they have a lot of pressure to ship, this is how they do it. It’s a greedy algorithm. It minimizes pain.

Where things become really problematic is when the pain displacement involves multiple teams. This is why understanding decision impact is so key. Pain displacement doesn’t just involve engineering teams, it also involves customers and other stakeholders in the organization. This is something I see quite a bit: displacing pain away from customers onto various teams within the org by setting unrealistic expectations up front.

For example, we build a product MVP and run it on a single, high-memory instance, and we don’t actually write data out to disk to keep it fast. We then put this product in front of sales folks, marketing, or even customers and say “hey, look at this cool thing we built.” Then the customers say “wow, this is great! I don’t feel any pain at all using this!” That’s because the pain has been moved elsewhere.

This MVP isn’t fault tolerant because it’s running on a single machine. This MVP isn’t horizontally scalable because we keep all the state in memory on one instance. This MVP isn’t safe because the data isn’t durably stored to disk. The problem is we weren’t testing at scale, so we never felt any pain until it was too late. So we start working backward to address these issues after the fact. We need to run multiple instances so we can have failover. But wait, now we need stateful request routing to maintain our performance expectations. Does our infrastructure support that? We need a mechanism to split and merge units of work that plays nicely with our autoscaling system to give us a better scale story, avoid hot instances, and reduce excess capacity. But wait, how long will that take to build? We need to attach persistent disks so we can durably store data and keep things fast. But wait, does our cluster provisioning allow for that? Does that even meet our compliance requirements?

The only way you reach this point is by making local decisions without thinking about the trade-offs involved or the fact that what you’ve actually done is simply displaced the pain.

If someone doesn’t feel pain, they have a harder time developing a sense of empathy. For instance, the goal of any good operations team is to effectively put itself out of a job by empowering developers to self-service through tooling and automation. One example of this is infrastructure as code, so an ops team adds a process requiring developers to provision their own infrastructure using CloudFormation scripts. For the ops folks, this is a boon—now they no longer have to labor through countless UIs and AWS consoles to provision databases, queues, and the like for each environment. Developers, on the other hand, were never exposed to that pain, so to them, writing CloudFormation scripts is a new hoop to jump through—setting up infrastructure is ops’ job! They might feel pain now, but they don’t necessarily see the immediate payoff.

A coworker of mine recently posed an interesting question: why do product teams often overlook the need for tools required to support their product in production until after they’ve deployed to production? And while the answer he posits is good, and one I very much agree with—solving a problem and solving the problem of solving problems are two very different problems—my answer is this: pain-driven development. In this case, you’re deferring the pain by hooking up debuggers or SSHing into the box and poking about instead of relying on instrumentation which is what we’re limited to in the field. As long as you’re cognizant of this and know that at some point you will have to feel some pain, it can be okay. But if you’re just displacing pain thinking it’s actually disappearing, you’ll be in for a rude awakening. Remember, it’s a zero-sum game.

I’m looking at this through an infrastructure or operations lens, but this applies everywhere and it cuts both ways. Understanding the why behind something rather than just the how is critical to building empathy. It’s being able to look at a problem through someone else’s perspective and applying that to your own work. Changing your perspective is a powerful way to deepen your relationships. Pain-driven development is intoxicating because it allows us to move fast. It’s a greedy algorithm, but it provides a poor global approximation for large engineering organizations. Thinking holistically is important.