The Future of Ops

Traditional Operations isn’t going away, it’s just retooling. The move from on-premise to cloud means Ops, in the classical sense, is largely being outsourced to cloud providers. This is the buzzword-compliant NoOps movement, of which many call the “successor” to DevOps, though that word has become pretty diluted these days. What this leaves is a thin but crucial slice between Amazon and the products built by development teams, encompassing infrastructure automation, deployment automation, configuration management, log management, and monitoring and instrumentation.

The future of Operations is actually, in many ways, much like the future of QA. Traditional QA roles are shifting away from test-focused to tools-focused. Engineers write code, unit tests, and integration tests. The tests run in CI and the code moves to production through a CD pipeline and canary rollouts. QA teams are shrinking, but what’s growing are the teams building the tools—the test frameworks, the CI environments, the CD pipelines. QA capabilities are now embedded within development teams. The SDET (Software Development Engineer in Test) model, popularized by companies like Microsoft and Amazon, was the first step in this direction. In 2014, Microsoft moved to a Combined Engineering model, merging SDET and SDE (Software Development Engineer) into one role, Software Engineer, who is responsible for product code, test code, and tools code.

The same is quickly becoming true for Ops. In my time with Workiva’s Infrastructure and Reliability group, we combined our Operations and Infrastructure Engineering teams into a single team effectively consisting of Site Reliability Engineers. This team is responsible for building and maintaining infrastructure services, configuration management, log management, container management, monitoring, etc.

I am a big proponent of leadership through vision. A compelling vision is what enables alignment between teams, minimizes the effects of functional and organizational silos, and intrinsically motivates and mobilizes people. It enables highly aligned and loosely coupled teams. It enables decision making. My vision for the future of Operations as an organizational competency is essentially taking Combined Engineering to its logical conclusion. Just as with QA, Ops capabilities should be embedded within development teams. The fact is, you can’t be an effective software engineer in a modern organization without Ops skills. Ops teams, as they exist today, should be redefining their vision.

The future of Ops is enabling developers to self-service through tooling, automation, and processes and empowering them to deploy and operate their services with minimal Ops intervention. Every role should be working towards automating itself out of a job.

If you asked an old-school Ops person to draw out the entire stack, from bare metal to customer, and circle what they care about, they would draw a circle around the entire thing. Then they would complain about the shitty products dev teams are shipping for which they get paged in the middle of the night. This is broadly an outdated and broken way of thinking that leads to the self-loathing, chainsmoking Ops stereotype. It’s a cop out and a bitterness resulting from a lack of empathy. If a service is throwing out-of-memory exceptions at 2AM, does it make sense to alert the Ops folks who have no insight or power to fix the problem? Or should we alert the developers who are intimately familiar with the system? The latter seems obvious, but the key is they need to be empowered to be notified of the situation, debug it, and resolve it autonomously.

The NewOps model instead should essentially treat Ops like a product team whose product is the infrastructure. Much like the way developers provide APIs for their services, Ops provide APIs for their infrastructure in the form of tools, UIs, automation, infrastructure as code, observability and alerting, etc.

In many ways, DevOps was about getting developers to empathize with Ops. NewOps is the opposite. Overly martyrlike and self-righteous Ops teams simply haven’t done enough to empower and offload responsibility onto dev teams. With this new Combined Engineering approach, we force developers to apply systems thinking in a holistic fashion. It’s often said: the only way engineers will build truly reliable systems is when they are directly accountable for them—meaning they are on call, not some other operator.

With this move, the old-school, wild-west-style of Operations needs to die. Ops is commonly the gatekeeper, and they view themselves as such. Old-school Ops is building in as much process as possible, slowing down development so that when they reach production, the developers have a near-perfectly reliable system. Old-school Ops then takes responsibility for operating that system once it’s run the gauntlet and reached production through painstaking effort.

Old-school Ops are often hypocrites. They advocate for rigorous SDLC and then bypass the same SDLC when it comes to maintaining infrastructure. NewOps means infrastructure is code. Config changes are code. Neither of which are exempt from the same SDLC to which developers must adhere. We codify change requests. We use immutable infrastructure and AMIs. We don’t push changes to a live environment without going through the process. Similarly, we need to encode compliance and other SDLC requirements which developers will not empathize with into tooling and process. Processes document and codify values.

Old-school Ops is constantly at odds with the Lean mentality. It’s purely interrupt-driven—putting out fires and fixing one problem after another. At the same time, it’s important to have balance. Will enabling dev teams to SSH into boxes or attach debuggers to containers in integration environments discourage them from properly instrumenting their applications? Will it promote pain displacement? It’s imperative to balance the Ops mentality with the Dev mentality.

Development teams often hold Ops responsible for being an innovation or delivery bottleneck. There needs to be empathy in both directions. It’s easy to vilify Ops but oftentimes they are just trying to keep up. You can innovate without having to adopt every bleeding-edge technology that hits Hacker News. On the other hand, modern Ops organizations need to realize they will almost never be able to meet the demand placed upon them. The sustainable approach—and the approach that instills empathy—is to break down the silos and share the responsibility. This is the future of Ops. With the move to cloud, Ops needs to reinvent itself by empowering and entrusting development teams, not trying to protect them from themselves.

Ops is dead, long live Ops!

Pain-Driven Development: Why Greedy Algorithms Are Bad for Engineering Orgs

I recently wrote about the importance of understanding decision impact and why it’s important for building an empathetic engineering culture. I presented the distinction between pain displacement and pain deferral, and this was something I wanted to expand on a bit.

When you distill it down, I think what’s at the heart of a lot of engineering orgs is this idea of “pain-driven development.” When a company grows to a certain size, it develops limbs, and each of these limbs has its own pain receptors. This is when empathy becomes important because it becomes harder and less natural. These limbs of course are teams or, more generally speaking, silos. Teams have a natural tendency to operate in a way that minimizes the amount of pain they feel.

It’s time for some game theory: pain is a zero-sum game. By always following the path of least resistance, we end up displacing pain instead of feeling it. This is literally just instinct. In other words, by making locally optimal choices, we run the risk of losing out on a globally optimal solution. Sometimes this is an explicit business decision, but many times it’s not.

Tech debt is a common example of when pain displacement is a deliberate business decision. It’s pain deferral—there’s pain we need to feel, but we can choose to feel it later and in the meantime provide incremental value to the business. This is usually a team choosing to apply a bandaid and coming back to fix it later. “We have this large batch job that has a five-minute timeout, and we’re sporadically seeing this timeout getting hit. Why don’t we just bump up the timeout to 10 minutes?” This is a bandaid, and a particularly poor one at that because, by Parkinson’s law, as soon as you bump up the timeout to 10 minutes, you’ll start seeing 11-minute jobs, and we’ll be having the same discussion over again. I see the exact same types of discussions happening with resource provisioning: “we’re hitting memory limits—can we just provision our instances with more RAM?” “We’re pegging CPU. Obviously we just need more cores.” Throwing hardware at the problem is the path of least resistance for the developers. They have a deliverable in front of them, they have a lot of pressure to ship, this is how they do it. It’s a greedy algorithm. It minimizes pain.

Where things become really problematic is when the pain displacement involves multiple teams. This is why understanding decision impact is so key. Pain displacement doesn’t just involve engineering teams, it also involves customers and other stakeholders in the organization. This is something I see quite a bit: displacing pain away from customers onto various teams within the org by setting unrealistic expectations up front.

For example, we build a product MVP and run it on a single, high-memory instance, and we don’t actually write data out to disk to keep it fast. We then put this product in front of sales folks, marketing, or even customers and say “hey, look at this cool thing we built.” Then the customers say “wow, this is great! I don’t feel any pain at all using this!” That’s because the pain has been moved elsewhere.

This MVP isn’t fault tolerant because it’s running on a single machine. This MVP isn’t horizontally scalable because we keep all the state in memory on one instance. This MVP isn’t safe because the data isn’t durably stored to disk. The problem is we weren’t testing at scale, so we never felt any pain until it was too late. So we start working backward to address these issues after the fact. We need to run multiple instances so we can have failover. But wait, now we need stateful request routing to maintain our performance expectations. Does our infrastructure support that? We need a mechanism to split and merge units of work that plays nicely with our autoscaling system to give us a better scale story, avoid hot instances, and reduce excess capacity. But wait, how long will that take to build? We need to attach persistent disks so we can durably store data and keep things fast. But wait, does our cluster provisioning allow for that? Does that even meet our compliance requirements?

The only way you reach this point is by making local decisions without thinking about the trade-offs involved or the fact that what you’ve actually done is simply displaced the pain.

If someone doesn’t feel pain, they have a harder time developing a sense of empathy. For instance, the goal of any good operations team is to effectively put itself out of a job by empowering developers to self-service through tooling and automation. One example of this is infrastructure as code, so an ops team adds a process requiring developers to provision their own infrastructure using CloudFormation scripts. For the ops folks, this is a boon—now they no longer have to labor through countless UIs and AWS consoles to provision databases, queues, and the like for each environment. Developers, on the other hand, were never exposed to that pain, so to them, writing CloudFormation scripts is a new hoop to jump through—setting up infrastructure is ops’ job! They might feel pain now, but they don’t necessarily see the immediate payoff.

A coworker of mine recently posed an interesting question: why do product teams often overlook the need for tools required to support their product in production until after they’ve deployed to production? And while the answer he posits is good, and one I very much agree with—solving a problem and solving the problem of solving problems are two very different problems—my answer is this: pain-driven development. In this case, you’re deferring the pain by hooking up debuggers or SSHing into the box and poking about instead of relying on instrumentation which is what we’re limited to in the field. As long as you’re cognizant of this and know that at some point you will have to feel some pain, it can be okay. But if you’re just displacing pain thinking it’s actually disappearing, you’ll be in for a rude awakening. Remember, it’s a zero-sum game.

I’m looking at this through an infrastructure or operations lens, but this applies everywhere and it cuts both ways. Understanding the why behind something rather than just the how is critical to building empathy. It’s being able to look at a problem through someone else’s perspective and applying that to your own work. Changing your perspective is a powerful way to deepen your relationships. Pain-driven development is intoxicating because it allows us to move fast. It’s a greedy algorithm, but it provides a poor global approximation for large engineering organizations. Thinking holistically is important.

Decision Impact

I think a critical part of building an empathetic engineering culture is understanding decision impact. This is a blindspot that I see happening a lot: a deliberate effort to understand the effects caused by a decision. How does adopting X affect operations? Does our dev tooling support this? Is this architecture supported by our current infrastructure? What are the compliance or security implications of this? Will this scale in production? A particular decision might save you time, but does it create work or slow others down? Are we just displacing pain somewhere else?

What’s needed is a broad understanding of the net effects. Pain displacement is an indication that we’re not thinking beyond the path of least resistance. The problem is if we lack a certain empathy, we aren’t aware the pain displacement is occurring in the first place. It’s important we widen our vision beyond the deliverable in front of us. We have to think holistically—like a systems person—and think deeply about the interactions between decisions. Part of this is having an organizational awareness.

Tech debt is the one exception to this because it’s pain displacement we feel ourselves—it’s pain deferral. This is usually a decision we can make ourselves, but when we’re dealing with pain displacement involving multiple teams, that’s when problems start happening. And that’s where empathy becomes critical because software engineering is more about collaboration than code and shit has this natural tendency to roll downhill.

The first sentence of The Five Dysfunctions of a Team captures this idea really well: “Not finance. Not strategy. Not technology. It is teamwork that remains the ultimate competitive advantage, both because it is so powerful and so rare.” The powerful part is obvious, but the bit about rarity is interesting when we think about teams holistically. The cause, I think, is deeply rooted in the silos or fiefdoms that naturally form around teams. The difficulty comes as an organization scales. What I see happening frequently are goals that diverge or conflict. The fix is rallying teams around a shared cause—a single, compelling vision. Likewise, it’s thinking holistically and having empathy. Understanding decision impact and pain displacement is one step to developing that empathy. This is what unlocks the rarity part of teamwork.

Shit Rolls Downhill

Building software of significant complexity is tough because a lot of pieces have to come together and a lot of teams have to work in concert to be successful. It can be extraordinarily difficult to get everyone on the same page and moving in tandem toward a common goal. Product development is largely an exercise in trust (or perhaps more accurately, hiring), but even if you have the “right” people—people you can trust and depend on to get things done—you’re only halfway there.

Trust is an important quality to screen for, difficult though it may be. However, a person’s trustworthiness or dependability doesn’t really tell you much about that person as an engineer. The engineering culture is something that must be cultivated. Etsy’s CTO, John Allspaw, said it best in a recent interview:

Post-mortem debriefings every day are littered with the artifacts of people insisting, the second before an outage, that “I don’t have to care about that.”

If “abstracting away” is nothing for you but a euphemism for “Not my job,” “I don’t care about that,” or “I’m not interested in that,” I think Etsy might not be the place for you. Because when things break, when things don’t behave the way they’re expected to, you can’t hold up your arms and say “Not my problem.” That’s what I could call “covering your ass” engineering, and it may work at other companies, but it doesn’t work here.

Allspaw calls this the distinction between hiring software developers and software engineers. This perception often results in heated debate, but I couldn’t agree with it more. There is a very real distinction to be made. Abstraction is not about boundaries of concern, it’s about boundaries of focus. Engineers need to have an intimate understanding of this.

Engineering, as a discipline and as an activity, is multi-disciplinary. It’s just messy. And that’s actually the best part of engineering. It’s not about everyone knowing everything. It’s about paying attention to the shared, mutual understanding.

But engineering is more than just technical aptitude and a willingness to “dig in” to the guts of something. It’s about having an acute awareness of the delicate structure upon which software is built. More succinctly, it’s about having empathy. It’s recognizing the fact that shit rolls downhill.

Shit Rolls Downhill

For things to work, the entire structure has to hold, and no one point is any more or less important than the others. It almost always starts off with good intentions at the top, but the shit starts to compound and accelerate as it rolls effortlessly and with abandon toward the bottom. There are a few aspects to this I want to explore.

Understand the Relationships

This isn’t to say that folks near the top are less susceptible to shit. Everyone has to shovel it, but the way it manifests is different depending on where you find yourself on the hill. The key point is that the people above you are effectively your customers, either directly or indirectly, and if you’re toward the top, maybe literally.

And, as all customers do, they make demands. This is a very normal thing and is to be expected. Some of these demands are reasonable, others not so much. Again, this is normal, but what do we make of these demands?

There are some interesting insights we can take from The Innovator’s Dilemma (which, by the way, is an essential read for anyone looking to build, run, or otherwise contribute to a successful business), which are especially relevant toward the top of the hill. Mainly, we should not merely take the customer’s word as gospel. When it comes to products, feature requests, and “the way things should be done,” the customer tends to have a very narrow and predisposed view. I find the following passage to be particularly poignant:

Indeed, the power and influence of leading customers is a major reason why companies’ product development trajectories overshoot demands of mainstream markets.

Essentially, too much emphasis can be placed on the current or perceived needs of the customer, resulting in a failure to meet their unstated or future needs (or if we’re talking about internal customers, the current or future needs of the business). Furthermore, we can spend too much time focusing on the customer’s needs—often perceived needs—culminating in a paralysis to ship. This is very anti-continuous-delivery. Get things out fast, see where they land, and make appropriate adjustments on the fly.

Giving in to customer demands is a judgement game, but depending on the demand, it can have profound impact on the people further down the hill. Thus, these decisions should be made accordingly and in a way that involves a cross section of the hill. If someone near the top is calling all the shots, things are not going to work out, and in all likelihood, someone else is going to end up getting covered in shit.

An interesting corollary is the relationship between leadership and engineers. Even a single, seemingly innocuous question asked in passing by a senior manager can change the entire course of a development team. In fact, the manager was just trying to gain information, but the team interpreted the question as a statement suggesting “this thing needs to be done.” It’s important to recognize this interaction for what it is.

Set Appropriate Expectations

In truth, the relationship between teams is not equivalent to the relationship between actual customers and the business. You may depend on another team in order to provide a certain feature or to build a certain product. If the business is lagging, the customer might take their money elsewhere. If the team you depend on is lagging, you might not have the same liberty. This leads to the dangerous “us versus them” trap teams fall in as an organization grows. The larger a company gets, the more fingers get pointed because “they’re no longer us, they’re them.” There are more teams, they are more isolated, and there are more dependencies. It doesn’t matter how great your culture is, changing human nature is hard. And when pressure builds from above, the finger-pointing only intensifies.

Therefore, it’s critical to align yourself with the teams you depend on. Likewise, align yourself with the teams that depend on you, don’t alienate them. In part, this means have a realistic sense of urgency, have realistic expectations, and plan accordingly. It’s not reasonable to submit a work item to another team and turn around and call it a blocker. Doing so means you failed to plan, but now to outside observers, it’s the other team which is the problem. As we prioritize the work precipitated by our customers, so do the rest of our teams. With few exceptions, you cannot expect a team to drop everything they’re doing to focus on your needs. This is the aforementioned “us versus them” mentality. Instead, align. Speak with the team you depend on, understand where your needs fit within their current priorities, and if it’s a risk, be willing to roll up your sleeves and help out. This is exactly what Allspaw was getting at when he described what a “software engineer” is.

Setting realistic expectations is vital. Just as products ship with bugs, so does everything else in the stack. Granted, some bugs are worse than others, but no amount of QA will fully prevent them from going to production. Bugs will only get worked out if the code actually gets used. You cannot wait until something is perfect before adopting it. You will wait forever. Remember that Agile is micro failure on a macro level. Adopt quickly, deploy quickly, fail quickly, adjust quickly. As Jay Kreps once said, “The only way to really know if a system design works in the real world is to build it, deploy it for real applications, and see where it falls short.”

While it’s important to set appropriate expectations downward, it’s also important to communicate upward. Ensure that the teams relying on you have the correct expectations. Establish what the team’s short-term and long-term goals are and make them publicly available. Enable those teams to plan accordingly, and empower them so that they can help out when needed. Provide adequate documentation such that another engineer can jump in at any time with minimal handoff.

Be Curious

This largely gets back to the quote by John Allspaw. The point is that we want to hire and develop software engineers, not programmers. Being an engineer should mean having an innate curiosity. Figure out what you don’t know and push beyond it.

Understand, at least on some level, the things that you depend on. Own everything. Similarly, if you built it and it’s running in production, it’s on you to support it. Throwing code over the wall is no longer acceptable. When there’s a problem with something you depend on, don’t just throw up your hands and say “not my problem.” Investigate it. If you’re certain it’s a problem in someone else’s system, bring it to them and help root cause it. Provide context. When did it start happening? What were the related events? What were the effects? Don’t just send an error message from the logs.

This is the engineering culture that gets you the rest of the way there. The people are important, especially early on, but it’s the core values and practices that will carry you. The Innovator’s Dilemma again provides further intuition:

In the start-up stages of an organization, much of what gets done is attributable to resources—people, in particular. The addition or departure of a few key people can profoundly influence its success. Over time, however, the locus of the organization’s capabilities shifts toward its processes and values. As people address recurrent tasks, processes become defined. And as the business model takes shape and it becomes clear which types of business need to be accorded highest priority, values coalesce. In fact, one reason that many soaring young companies flame out after an IPO based on a single hot product is that their initial success is grounded in resources—often the founding engineers—and they fail to develop processes that can create a sequence of hot products.

Summary

There will always be gravity. As such, shit will always roll downhill. It’s important to embrace this structure, to understand the relationships, and to set appropriate expectations. Equally important is fostering an engineering culture—a culture of curiosity, ownership, and mutual understanding. Having the right people is essential, but it’s only half the problem. The other half is instilling the right values and practices. Shit rolls downhill, but if you have the right people, values, and practices in place, that manure might just grow something amazing.

Product Development is a Trust Fall

A couple weeks ago, Marty Cagan gave an outstanding talk at CraftConf on why products fail despite having great engineering teams. In it, he calls out many of the common mistakes made by teams, and I think there is an underlying theme: trust.

Product development is a trust fall. In order to be successful, a chain of trust must be established from the business all the way down to the engineers. If any point in that chain is compromised, the integrity of the product—and specifically its success—is put in jeopardy.

Engineers will innovate. Trust them. Engineers will discover requirements. Trust them. Engineers will identify risks. Trust them.

This trust must be symmetrical. The business must trust its product managers, who must trust the business. Product managers must trust the engineers, who must trust the product managers. Each level in this hierarchy must act as its own trust anchor. Trust is assumed, not derived. To say the opposite would imply your system of hiring and firing is fundamentally flawed. If you do not trust your teams, your teams will not trust you.

Engineers will prioritize. Trust them. Engineers will deliver. Trust them. Engineers will fail. Let them.

Product development is a trust fall. When you let go, your team will catch you. And when they don’t, they’ll pick you up, dust you off, and say, “we’ll make an adjustment.” Fail fast but recover faster. The more times you fall and hit the ground, the more adjustments we make. The quicker we repeat this process, the less time you spend on the ground. Shame on the teams that spend days, weeks, months planning their fall, ensuring everything is in place, only to find the ground has moved.

It’s inexcusable to say you fail fast when it’s really a slow, prolonged death in product, in technology, or in execution, yet it’s surprisingly common. In order to innovate, you have to fail first. In order to build an effective team, you have to fail first. In order to produce a successful product, you have to fail first. The number one fatal mistake teams make is not recognizing when they’ve failed or being too proud to admit it. This is what Agile is actually about. It’s not about roadmaps or requirements gathering or user stories or stand-ups. It’s about failing and adjusting, failing and adjusting, failing and adjusting. Agile is micro failure on a macro level. As Cagan remarks, the biggest visible distinguisher of a great team is no roadmap.

A roadmap is essentially dooming your team to get out a small number of things that will almost certainly—most of them—not work.

Developers need to be part of the ideation process from day one. The customer is usually wrong. They often don’t know what they want or what’s possible. Developers are invested in the technology and understand its capabilities and limitations. Trust them.

If you’re just using your developers to code, you’re only getting about half their value.

I’ve said it before but without a focused vision, a product will fail. Without embracing new ideas and technology, a company will become irrelevant. Developers must be closely involved with both aspects in order to be successful. Innovate, fail, adjust, deliver. Repeat.

Product development is a trust fall. The key is letting go.